In computer vision, your input data is usually a grid of numbers, a form that neural networks can process directly. Maybe you’ll do some preprocessing, but it still produces a grid of numbers.
In chemistry, your input is often a molecule or set of molecules. That’s a physical object, not a grid of numbers. How do you represent a molecule in a form you can process with machine learning? It isn’t obvious. In fact, people have come up with many different representations. Here are some examples.
- A SMILES string, a text string that lists the atoms and indicates the bonds between them in a particular way.
- A fingerprint, a set of bits where each bit indicates whether a particular local chemical motif exists somewhere in the molecule.
- A distance matrix, an NxN matrix giving the distance between every pair of atoms, where N is the number of atoms in the molecule.
- A Coulomb matrix, similar to a distance matrix but instead giving the Coulomb (electrostatic) interaction energy between the nuclei of each pair of atoms.
- A one-hot encoded list of elements, plus an Nx3 matrix with the position of each atom.
- And more!
Before you can apply machine learning to molecules, you need to decide how to represent them. That’s what featurization is.
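To make a few of these representations concrete, here is a minimal sketch that builds three of them by hand for a single water molecule: the distance matrix, the Coulomb matrix, and the one-hot element list with its Nx3 position matrix. Everything in it is an assumption made for illustration: the coordinates are an approximate water geometry in angstroms, the function names are hypothetical, and the Coulomb matrix follows one common convention (off-diagonal entries Z_i * Z_j / r_ij, diagonal entries 0.5 * Z_i^2.4). It uses only the Python standard library, so it is not how a featurization library would implement these.

```python
import math

# Illustrative input: an approximate water geometry in angstroms (an assumption,
# not data from the text). Each row of `positions` is one atom's (x, y, z).
symbols = ["O", "H", "H"]
positions = [
    (0.0, 0.0, 0.0),
    (0.9572, 0.0, 0.0),
    (-0.2400, 0.9266, 0.0),
]
atomic_numbers = {"H": 1, "O": 8}


def distance_matrix(positions):
    """Return the NxN matrix of Euclidean distances between atoms."""
    n = len(positions)
    return [[math.dist(positions[i], positions[j]) for j in range(n)]
            for i in range(n)]


def coulomb_matrix(symbols, positions):
    """Return an NxN Coulomb matrix under one common convention.

    Off-diagonal: Z_i * Z_j / r_ij. Diagonal: 0.5 * Z_i ** 2.4.
    Distances are left in angstroms for simplicity; the usual definition
    works in atomic units, so the values differ by a constant factor.
    """
    z = [atomic_numbers[s] for s in symbols]
    d = distance_matrix(positions)
    n = len(symbols)
    return [[0.5 * z[i] ** 2.4 if i == j else z[i] * z[j] / d[i][j]
             for j in range(n)]
            for i in range(n)]


def one_hot_elements(symbols, vocabulary):
    """Return one row per atom with a 1 in the column of its element."""
    return [[1 if s == v else 0 for v in vocabulary] for s in symbols]


D = distance_matrix(positions)
C = coulomb_matrix(symbols, positions)
one_hot = one_hot_elements(symbols, vocabulary=["H", "O"])
```

All three results describe the same molecule, but they have different shapes and expose different information: `D` and `C` are 3x3 and do not change if the molecule is rotated or translated, while `one_hot` paired with `positions` keeps the raw coordinates and leaves it to the model to handle that.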