Noob question about featurizing molecules

HenriqueCSJ · February 21, 2020, 11:43am

As chemists, of course we have to deal with molecules that have different numbers of atoms. So, forgive my dumb question: how can we featurize molecules (using the Coulomb Matrix, BoB, or whatever)? Doesn’t the neural network need a representation that always has the same size in order to compare them?

bharath · February 21, 2020, 8:36pm

This is a really good question! It’s a subtle point that comes up a lot. There are a few ways this is handled. A common one is to zero-pad everything up to some maximum number of atoms, so it’s actually all the same size under the hood. This means in practice all your Coulomb matrices are the same size, with the rows and columns beyond each molecule’s real atoms filled with zeros.
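
To make the zero-padding idea concrete, here is a minimal NumPy sketch. It is only an illustration, not how DeepChem implements it: the helper names `coulomb_matrix` and `pad_matrix` and the `max_atoms` parameter are made up for this example, and the coordinates are rough values in Bohr. It uses the standard Coulomb matrix definition (0.5 * Z^2.4 on the diagonal, Z_i * Z_j / distance off the diagonal).

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Coulomb matrix for one molecule (nuclear charges, coordinates in Bohr)."""
    charges = np.asarray(charges, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Pairwise distances between all atoms, shape (n_atoms, n_atoms).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, 1.0)  # placeholder to avoid dividing by zero
    cm = np.outer(charges, charges) / dists
    np.fill_diagonal(cm, 0.5 * charges ** 2.4)  # diagonal term
    return cm

def pad_matrix(cm, max_atoms):
    """Embed an (n, n) matrix in the top-left corner of a zero (max_atoms, max_atoms) matrix."""
    n = cm.shape[0]
    if n > max_atoms:
        raise ValueError(f"molecule has {n} atoms, but max_atoms is {max_atoms}")
    padded = np.zeros((max_atoms, max_atoms))
    padded[:n, :n] = cm
    return padded

# Two molecules with different numbers of atoms (approximate geometries).
h2 = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]])
water = coulomb_matrix(
    [8, 1, 1],
    [[0.0, 0.0, 0.0], [0.0, 1.43, 1.11], [0.0, -1.43, 1.11]],
)

# After padding, both fit in one fixed-size batch that a network can consume.
batch = np.stack([pad_matrix(m, max_atoms=4) for m in (h2, water)])
print(batch.shape)
```

The one thing you have to choose up front is `max_atoms`: it needs to be at least as large as the biggest molecule you ever expect to feed the model.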

I think it is possible nowadays to have truly dynamic deep learning code that natively handles different sizes. I don’t think we currently do this, but it ought to be possible with tools like PyTorch or JAX (or maybe TF 2.X).
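
As a rough idea of what handling different sizes natively can look like, here is a toy NumPy sketch (not something DeepChem does as far as this thread says, and not a trained model). It assumes each atom already has a small feature vector, applies the same weights to every atom, and sums over atoms, so the output size does not depend on the atom count. The name `embed_molecule`, the 3 input features, and the 8 hidden units are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights applied to every atom: 3 input features -> 8 hidden units.
W = rng.normal(size=(3, 8))

def embed_molecule(atom_features, weights):
    """Map an (n_atoms, 3) array to a fixed-size (8,) vector for any n_atoms."""
    hidden = np.tanh(atom_features @ weights)  # per-atom transform, (n_atoms, 8)
    return hidden.sum(axis=0)                  # sum over atoms, (8,)

small = rng.normal(size=(2, 3))    # a 2-atom "molecule" with random features
large = rng.normal(size=(11, 3))   # an 11-atom "molecule" with random features

print(embed_molecule(small, W).shape, embed_molecule(large, W).shape)
```

Because the sum runs over atoms, the result also does not change if you reorder the atoms, which is usually what you want for molecules.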

peastman · February 21, 2020, 9:06pm

There are also featurizations specifically designed to convert variable-sized molecules to fixed-size vectors. Fingerprint algorithms like ECFP are a common example. The tutorial at https://deepchem.io/docs/notebooks/seqtoseq_fingerprint.html shows another way of doing it using recurrent networks. Then there’s RdkitGridFeaturizer, which represents the molecule with a spatial distribution of physical properties.

This is definitely one of the big open questions in the field though. If you want to handle molecules of different sizes with a single model, how do you pick the best way of doing it for a particular application?

HenriqueCSJ · February 22, 2020, 12:15pm

Thank you so much for your kind replies. This is the first time I’ve had something to go on, so I can start working less blindly.