What is a featurizer?

exnx · November 4, 2022, 9:08am

Hi everyone, I’m new to the DeepChem community and I’m working my way through the tutorials. I come from a computer vision and deep learning background, but for the life of me, I don’t quite understand what the “featurizer” is. In computer vision, we usually have preprocessing and then maybe an encoder, but featurizer seems vague to me.

Is anything learned at this stage (like nn.Embedding or a linear layer)?

Thanks!

Eric

peastman · November 4, 2022, 4:48pm

In computer vision, your input data is usually a grid of numbers, a form that neural networks can process directly. Maybe you’ll do some preprocessing, but it still produces a grid of numbers.

In chemistry, your input is often a molecule or set of molecules. That’s a physical object, not a grid of numbers. How do you represent a molecule in a form you can process with machine learning? It isn’t obvious. In fact, people have come up with many different representations. Here are some examples.

A SMILES string, a text string that lists the atoms and indicates the bonds between them in a particular way.
A fingerprint, a set of bits where each bit indicates whether a particular local chemical motif exists somewhere in the molecule.
A distance matrix, an NxN matrix giving the distance between every pair of atoms.
A Coulomb matrix, similar to a distance matrix but instead giving the Coulomb interaction energy between each pair of atoms.
A one-hot encoded list of elements, plus an Nx3 matrix with the position of each atom.
And more!
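
To make the geometric representations above concrete, here is a minimal sketch in plain Python. It is not DeepChem's API. The water coordinates, the four-element vocabulary, and the function names are all assumptions made for illustration, and the Coulomb matrix diagonal uses the common 0.5 * Z^2.4 convention, which the list above does not specify.

```python
import math

# Hypothetical example molecule: water, approximate coordinates in angstroms.
symbols = ["O", "H", "H"]
positions = [
    (0.000, 0.000, 0.000),
    (0.757, 0.586, 0.000),
    (-0.757, 0.586, 0.000),
]

# Assumed lookup tables, kept tiny for the example.
atomic_numbers = {"H": 1, "C": 6, "N": 7, "O": 8}
element_vocab = ["H", "C", "N", "O"]


def one_hot_elements(symbols, vocab):
    """One row per atom, with a single 1 in the column of its element."""
    return [[1 if s == v else 0 for v in vocab] for s in symbols]


def distance_matrix(positions):
    """NxN matrix of Euclidean distances between every pair of atoms."""
    return [[math.dist(a, b) for b in positions] for a in positions]


def coulomb_matrix(symbols, positions):
    """Off-diagonal: Z_i * Z_j / r_ij. Diagonal: 0.5 * Z_i ** 2.4.

    Distances are used as given (no unit conversion), for simplicity.
    """
    z = [atomic_numbers[s] for s in symbols]
    d = distance_matrix(positions)
    n = len(symbols)
    return [
        [0.5 * z[i] ** 2.4 if i == j else z[i] * z[j] / d[i][j] for j in range(n)]
        for i in range(n)
    ]


elements = one_hot_elements(symbols, element_vocab)  # N x 4
distances = distance_matrix(positions)               # N x N
coulomb = coulomb_matrix(symbols, positions)         # N x N
```

Each function turns the same molecule into a different array of numbers, and any of those arrays could be handed to a model.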

Before you can apply machine learning to molecules, you need to decide how to represent them. That’s what featurization is.
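
On the question of whether anything is learned: for fixed representations like the ones listed above, the featurizer is a deterministic function with no trainable parameters, closer to preprocessing than to an nn.Embedding. The sketch below illustrates that idea with a toy bit-vector built by hashing short substrings of a SMILES string. It is only an illustration: real chemical fingerprints hash substructures of the molecular graph rather than characters, and `toy_fingerprint`, `featurize`, and their parameters are hypothetical names, not DeepChem functions.

```python
import zlib


def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Set one bit per substring of length 1..max_len found in the string.

    zlib.crc32 is used because it gives the same value on every run,
    unlike Python's built-in hash() for strings.
    """
    bits = [0] * n_bits
    for k in range(1, max_len + 1):
        for i in range(len(smiles) - k + 1):
            fragment = smiles[i:i + k]
            index = zlib.crc32(fragment.encode("utf-8")) % n_bits
            bits[index] = 1
    return bits


def featurize(smiles_list, n_bits=64):
    """Map a list of SMILES strings to a list of fixed-length bit vectors."""
    return [toy_fingerprint(s, n_bits) for s in smiles_list]


# Ethanol, benzene, and acetic acid written as SMILES strings.
features = featurize(["CCO", "c1ccccc1", "CC(=O)O"])
```

The output has the same length for every molecule and there are no weights to fit. Any learning happens later, in the model that consumes these vectors.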

exnx · November 4, 2022, 6:19pm

Hi @peastman, thanks so much, that is incredibly useful! I get it now.

Eric