Noob question about featurizing molecules

As chemists, we of course have to deal with molecules that have different numbers of atoms. So, forgive my dumb question: how can we featurize molecules (using the Coulomb Matrix, BoB, or whatever)? Doesn’t the neural network need a representation that always has the same size in order to compare them?

This is a really good question! It’s a subtle point that comes up a lot. There are a few ways to handle it. A common one is to zero-pad everything up to a fixed maximum size, so that under the hood all the inputs really are the same size. In practice, this means all your Coulomb matrices have the same dimensions, with the unused rows and columns filled with zeros.
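As a rough sketch of what the zero-padding looks like, here is a plain NumPy version. It is not DeepChem's actual code: the helper names `coulomb_matrix` and `pad_matrix`, the toy geometries, and `max_atoms = 5` are all made up for illustration.

```python
import numpy as np


def coulomb_matrix(charges, coords):
    """Standard Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal,
    Z_i * Z_j / |R_i - R_j| off the diagonal."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise distances between all atoms, shape (n_atoms, n_atoms).
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Avoid dividing by zero; the diagonal is overwritten just below.
    np.fill_diagonal(dist, 1.0)
    cm = np.outer(z, z) / dist
    np.fill_diagonal(cm, 0.5 * z ** 2.4)
    return cm


def pad_matrix(cm, max_atoms):
    """Embed an (n, n) matrix in the top-left corner of a zero matrix."""
    n = cm.shape[0]
    if n > max_atoms:
        raise ValueError(f"molecule has {n} atoms, but max_atoms is {max_atoms}")
    padded = np.zeros((max_atoms, max_atoms))
    padded[:n, :n] = cm
    return padded


# Toy molecules as (nuclear charges, coordinates); the geometries are illustrative only.
h2 = ([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]])
water = ([8, 1, 1], [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])

max_atoms = 5  # hypothetical upper bound on molecule size in the dataset
batch = np.stack([pad_matrix(coulomb_matrix(z, r), max_atoms) for z, r in (h2, water)])
print(batch.shape)  # (2, 5, 5)
```

Both molecules end up as 5x5 arrays, so they can be stacked into one batch even though one has two atoms and the other three.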

I think it is possible nowadays to write truly dynamic deep learning code that natively handles inputs of different sizes. I don’t think we currently do this, but it ought to be possible with tools like PyTorch or JAX (or maybe TF 2.x).

There are also featurizations specifically designed to convert variable-sized molecules to fixed-size vectors. Fingerprint algorithms like ECFP are a common example. The tutorial at https://deepchem.io/docs/notebooks/seqtoseq_fingerprint.html shows another way of doing it using recurrent networks. Then there’s RdkitGridFeaturizer, which represents the molecule as a spatial distribution of physical properties.
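To give a feel for why a fingerprint always comes out the same length, here is a toy sketch of the hash-and-fold idea. This is not the ECFP algorithm itself: the function `toy_fingerprint`, the fragment strings, and `n_bits=64` are invented for illustration, and a real fingerprint would enumerate atom environments with a cheminformatics toolkit rather than take a hand-written list.

```python
import zlib

import numpy as np


def toy_fingerprint(fragments, n_bits=64):
    """Hash each fragment label to one position in a fixed-length bit vector."""
    bits = np.zeros(n_bits, dtype=np.uint8)
    for frag in fragments:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        idx = zlib.crc32(frag.encode("utf-8")) % n_bits
        bits[idx] = 1
    return bits


# Hand-written fragment lists standing in for a small and a larger molecule.
fp_small = toy_fingerprint(["C", "O", "CO"])
fp_large = toy_fingerprint(["C", "C", "O", "CC", "CO", "CCO"])

print(fp_small.shape, fp_large.shape)  # (64,) (64,)
```

However many fragments go in, the output length is fixed by `n_bits`. The price is that different fragments can collide on the same bit.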

This is definitely one of the big open questions in the field though. If you want to handle molecules of different sizes with a single model, how do you pick the best way of doing it for a particular application?

Thank you so much for your kind replies. This is the first time I’ve had something that lets me start working less blindly.
