Transfer Learning for Molecular Property Prediction

While chemistry has become a more data-driven discipline in recent years, the amount of data available for training deep learning models is limited compared to its imaging counterparts (ImageNet, for example). Transfer learning is a strategy that leverages representations learned by training deep learning models on large datasets with available labels, then fine-tuning the trained network on smaller datasets that are expensive to label.

There has been growing interest in using transfer learning for chemistry, with this recent paper describing one such protocol. The authors take a large (~1.8 million molecules), unlabelled ChEMBL dataset and use RDKit to compute approximate molecular descriptor values for all the molecules. The molecules are featurized depending on the model to be used, and the (featurized molecule, molecular descriptors) pairs become a training set. The model is trained on this large dataset and can then be fine-tuned on smaller datasets. They show that this protocol improves property prediction and toxicity classification performance on the Tox21, FreeSolv and HIV datasets (all part of MoleculeNet).
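
The descriptor-labeling step can be sketched with RDKit. This is a minimal illustration: the function name and the three descriptors chosen here are my own, and the paper uses a much larger descriptor set.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptor_labels(smiles_list, descriptor_fns=None):
    """Compute RDKit descriptor values to serve as pre-training labels."""
    if descriptor_fns is None:
        # A small illustrative subset of RDKit descriptors.
        descriptor_fns = [Descriptors.MolWt, Descriptors.MolLogP, Descriptors.TPSA]
    labels = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        labels.append([fn(mol) for fn in descriptor_fns])
    return labels

# Ethanol and benzene as toy inputs.
labels = descriptor_labels(["CCO", "c1ccccc1"])
```

Pairing these label vectors with featurized molecules gives the self-generated training set used for pre-training.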

As a Google Summer of Code participant with DeepChem, my project revolves around reproducing the results in the paper and putting together a transfer learning API for DeepChem. To that end, my progress over the last month and a half has been:

  • Preprocessing the ChEMBL dataset and writing MolNet-style loaders for it
  • Implementing the featurizers used in the paper, with tests and documentation
  • Implementing the ChemCeption and Smiles2Vec models, with tests and documentation
  • Pretraining ChemCeption models for the five parameter settings mentioned in the paper
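
To give a flavor of what a SMILES-based featurizer does, here is a minimal character-level one-hot encoder in the spirit of the Smiles2Vec input. This is a hypothetical sketch, not DeepChem's actual featurizer API; the charset here is a toy subset and would normally be built from the training corpus.

```python
import numpy as np

def one_hot_smiles(smiles, charset, max_len):
    """One-hot encode a SMILES string, padding/truncating to max_len.

    Column 0 is reserved for padding/unknown characters.
    """
    char_to_idx = {c: i + 1 for i, c in enumerate(charset)}
    encoding = np.zeros((max_len, len(charset) + 1), dtype=np.float32)
    for pos, char in enumerate(smiles[:max_len]):
        encoding[pos, char_to_idx.get(char, 0)] = 1.0
    return encoding

# Toy charset: common organic-subset SMILES characters.
charset = "CNOclnos()[]=#123456789"
x = one_hot_smiles("CCO", charset, max_len=10)
print(x.shape)  # (10, 24)
```

A sequence model like Smiles2Vec consumes one such matrix per molecule, while ChemCeption instead takes an image-like encoding of the molecule.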

So far the results look similar to those described in the paper. Over the next few weeks, I plan to carry out the fine-tuning experiments and put together a structured API around them.

Note: This blog post will be updated with more results and progress, over the course of the Google Summer of Code project.


I’ve found a big part of transfer learning is just creating good vectorized representations of whatever your data is. To that end, “toy” tasks or self-supervised tasks work really well. Following that line of thinking, I would expect models like graph autoencoders (for graph representations) or language models (for SMILES representations) to be great starting points for transfer learning. The benefit of a self-supervised approach is that you don’t introduce bias from an automated labeling process.

For example in the paper you link, they generate labels from RDKit. That means that at the end of the day, their base model is just a very computationally expensive reflection of the algorithms RDKit is using. This of course is fine for transfer learning, but I wonder if bias from mimicking RDKit’s algorithms would impact results when applied to experimental data.

At the same time, I’ve been surprised at how you can get really good transfer learning results from “stupid” models. An example:

I was doing some work applying ULMFiT, a transfer learning protocol from NLP, to genomic data. The idea is you first train a self-supervised language model on genomic data, then transfer the learned weights to a classification model. The language model can be trained on a large unlabeled corpus of genomic data, and shows really good results when transferring to tiny labeled genomic datasets.

I represented genomic data as a sequence of k-mers with a set stride between k-mers, usually with some overlap between k-mers. You could tokenize the sequence ATGCATGCA with a k-mer length of 3 and a stride of 1 into ATG TGC GCA CAT ATG TGC GCA. You would then train a language model to process the sequence and predict the next k-mer after each token.
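
The tokenization above is a few lines of Python (a sketch; the function name is mine):

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a sequence into k-mers, stepping by stride."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATGCATGCA", k=3, stride=1)
print(tokens)  # ['ATG', 'TGC', 'GCA', 'CAT', 'ATG', 'TGC', 'GCA']
```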

You’ll notice that each k-mer has a 2 base pair overlap to the previous k-mer, which imposes a strong constraint on the organization of the sequence. If you see the k-mer ACC, you know the next k-mer must be one of CCA, CCT, CCG, or CCC. When you train a model on this data, it quickly learns this pattern and places roughly equal probability density over the four k-mers.
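
That constraint is easy to enumerate: with stride smaller than k, the next k-mer must begin with the current k-mer's suffix. A sketch (the helper name is mine):

```python
from itertools import product

def possible_next_kmers(kmer, stride=1, alphabet="ACGT"):
    """Enumerate the k-mers that can legally follow `kmer` in a
    tokenization with the given stride: each must start with the
    (k - stride)-character suffix of the current k-mer."""
    suffix = kmer[stride:]
    return [suffix + "".join(tail) for tail in product(alphabet, repeat=stride)]

print(possible_next_kmers("ACC"))  # ['CCA', 'CCC', 'CCG', 'CCT']
```

The language model's first "achievement" is just recovering this enumeration.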

The model at this point knows basically nothing. It has learned that for each k-mer, it should guess between the four most likely following k-mers based on the self-imposed structure of the tokenization (which is totally arbitrary and only there because I decided to tokenize the data in that way). The model knows nothing about the larger structure of genomic sequences. And yet, when you transfer this model to a classification problem, it performs significantly better than training from scratch. There’s no reason why this should work, but it does.

It seems that the most crucial period in training a model is at the start. This is where your weights are random and meaningless, your activations/gradients are erratic and have no structure. This is where your model can accidentally jump to a bad area of the parameter space. This is where your model quickly overfits to a small dataset trying to crawl its way out of the unstructured initialization. Providing some initial structure through pre-training (even if that structure is stupid and simple, like the example above) has a huge impact on your model’s ability to train on small datasets without overfitting.

Overall I think pre-training is extremely underused with respect to chemical/biological data. This is a shame considering many domains have a large amount of unlabeled data and a small amount of labeled data, a scenario where transfer learning is highly effective. I would expect applying transfer learning to chemical/biological deep learning problems will lead to performance gains. It’s really low hanging fruit.