Unsupervised Data Augmentation Methods for Molecules

There have been considerable advances in data augmentation recently. This recent paper from Google Research shows very impressive results through unsupervised data augmentation (UDA), which combines augmentation with a new loss function that encourages consistent labeling of transformed versions of the original sample.
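To make the idea of a consistency loss concrete, here is a minimal sketch in plain Python. It assumes the model outputs class probabilities and that consistency is measured with a KL divergence between the prediction on the original sample and the prediction on its augmented version. The function names and the `weight` parameter are illustrative, not taken from the paper's code.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(orig_probs, aug_probs):
    """Mean KL(original || augmented) over a batch of unlabeled samples.

    In a real training loop the predictions on the original samples would be
    treated as fixed targets (no gradient); that detail is not modeled here.
    """
    pairs = list(zip(orig_probs, aug_probs))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

def total_loss(supervised_loss, orig_probs, aug_probs, weight=1.0):
    """Supervised loss on labeled data plus a weighted consistency term."""
    return supervised_loss + weight * consistency_loss(orig_probs, aug_probs)
```

The consistency term is zero when the model predicts the same distribution for a sample and its augmentation, and grows as the two predictions diverge, which is why the augmentation has to preserve the label.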

How could we adapt these techniques to molecular data? It feels like it ought to be possible to create augmented versions of molecules and use the unsupervised data augmentation loss to reach higher accuracy on small molecular datasets.

What kind of augmentation would you use? Their method assumes the output for augmented samples should be similar to the samples they were generated from. For images it’s easy to augment the data in ways that satisfy that. For molecules it’s a lot harder. Adding or removing one atom can radically change the molecule’s properties.


One possibility is to augment by transforming molecules into chemically valid alternative states that would exist in an equilibrium distribution. For example, different augmented versions of the molecule could have different choices of hydrogenation. Or for benzene, we could choose different representations of the resonance. If we have 3D representations, we can apply the usual set of spatial symmetries or use different conformers.

In general, though, we should of course only consider augmentations that are physically realistic.

My concern is with how much properties vary when we move to other chemically valid states in an equilibrium distribution, and whether that variation can invalidate the assumed similarity between the augmentations and the original data. For different resonance structures of benzene, this should not be a problem. But if there are multiple positions for hydrogenation or for the addition of any other functional group, then some positions are more favorable because of steric constraints, and property values could be markedly different.

For models operating on SMILES strings, we can just permute the atom order and convert the result back into a SMILES string, which effectively gives us an augmented molecule.
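As a rough sketch of that atom-order trick, assuming RDKit is available: shuffle the atom indices, renumber the molecule, and write a non-canonical SMILES. The function name `randomized_smiles` and its parameters are made up for illustration and are not from any particular repo.

```python
import random
from rdkit import Chem

def randomized_smiles(smiles, n_variants=5, seed=0):
    """Return up to n_variants distinct SMILES strings for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    rng = random.Random(seed)
    order = list(range(mol.GetNumAtoms()))
    variants = set()
    # Bounded number of attempts, since small molecules have few distinct strings.
    for _ in range(n_variants * 10):
        rng.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        # canonical=False keeps the shuffled atom order in the output string.
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
        if len(variants) >= n_variants:
            break
    return sorted(variants)
```

Every variant should canonicalize back to the same string as the input, so the label is unchanged by construction. That makes this augmentation a safe fit for a consistency loss, although it only varies the representation and not the chemistry.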


I’ve been experimenting with SMILES-based augmentation. Specifically, I’ve been using sequence-to-sequence models for retrosynthesis reaction prediction (repo).

I’ve found that SMILES augmentation (using the implementation here) has a strong impact on Top 1 prediction accuracy. Interestingly, it can have a detrimental effect on Top K prediction accuracy. When you use a large number of augmented variants, the Top K predictions from your model turn out to just be different SMILES permutations of the same molecule.
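One possible way to check for (or work around) this is to collapse the ranked predictions by canonical form before taking the Top K. The sketch below is only an illustration: `dedupe_top_k` is a hypothetical helper, and the `canonicalize` argument is whatever canonicalization you trust, for example a function that round-trips through RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
def dedupe_top_k(predictions, k, canonicalize=lambda s: s):
    """Keep the first k predictions that are distinct after canonicalization.

    predictions: SMILES strings ordered from most to least likely.
    canonicalize: maps a SMILES string to a canonical key, or None if invalid.
    """
    seen = set()
    unique = []
    for smi in predictions:
        key = canonicalize(smi)
        if key is None or key in seen:
            continue  # skip invalid strings and repeats of the same molecule
        seen.add(key)
        unique.append(smi)
        if len(unique) == k:
            break
    return unique
```

Comparing Top K accuracy with and without this step would show how much of the drop comes from duplicate permutations rather than from genuinely worse predictions.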
