While chemistry has become a more data-driven discipline in recent years, the amount of data available for training deep learning models is limited compared to imaging counterparts such as ImageNet. Transfer learning is a strategy that leverages representations learnt by training a deep learning model on a large dataset with readily available labels, and then fine-tunes the trained network on smaller datasets that are expensive to label.
There has been growing interest in using transfer learning for chemistry, with this recent paper describing one such protocol. The authors take a large (~1.8 million molecules), unlabelled ChEMBL dataset and compute approximate molecular descriptor values for every molecule using RDKit. The molecules are featurized according to the model being trained, and the (featurized molecule, molecular descriptors) pairs become the pretraining set. The model is trained on this large dataset and can then be fine-tuned on smaller datasets. They show that this protocol improves property prediction and toxicity classification performance on the Tox21, FreeSolv and HIV datasets (which are part of MoleculeNet).
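To make the pretraining labels concrete, here is a minimal sketch of the descriptor-computation step using RDKit. The four descriptors below are chosen purely for illustration and are not necessarily the ones used in the paper.

```python
# Rough sketch of computing cheap RDKit descriptors to use as pretraining
# labels. The descriptors here are illustrative, not the paper's exact set.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as a toy example
mol = Chem.MolFromSmiles(smiles)

labels = {
    "MolWt": Descriptors.MolWt(mol),           # molecular weight
    "LogP": Descriptors.MolLogP(mol),          # octanol-water partition coefficient
    "TPSA": Descriptors.TPSA(mol),             # topological polar surface area
    "NumHDonors": Descriptors.NumHDonors(mol)  # hydrogen bond donors
}
print(labels)
```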
As a Google Summer of Code participant with DeepChem, my project revolves around reproducing the results of the paper and putting together a transfer learning API for DeepChem. To this end, my progress over the last month and a half has been:
- Preprocessing the ChEMBL dataset, and writing MolNet-style loaders for it
- Implementing the featurizers used in the paper, with tests and documentation
- Implementing the ChemCeption and Smiles2Vec models, with tests and documentation (see the usage sketch after this list)
- Pretraining ChemCeption models for the five parameter settings mentioned in the paper
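To give a feel for how the featurizer and model pieces above fit together, here is a hedged pretraining sketch. It assumes the SMILES-to-image featurizer and the ChemCeption model end up exposed as `dc.feat.SmilesToImage` and `dc.models.ChemCeption`; the argument names are indicative rather than final.

```python
# Illustrative pretraining sketch; class and argument names are assumptions
# about where the new featurizers and models live in DeepChem.
import deepchem as dc

# ChemCeption consumes 2D "images" rendered from SMILES strings.
featurizer = dc.feat.SmilesToImage(img_size=80, res=0.5)

# One regression task per RDKit descriptor used as a pretraining label.
model = dc.models.ChemCeption(n_tasks=4, mode="regression")

# Pretraining on the featurized ChEMBL dataset (loaded via the MolNet-style
# loader mentioned above) would then be a single fit call:
# model.fit(chembl_train, nb_epoch=50)
```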
So far, the results look similar to those reported in the paper. Over the next few weeks, I plan to carry out the fine-tuning experiments and put together a structured transfer learning API.
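As a rough, hypothetical sketch of what a fine-tuning run could look like (the checkpoint directory is a placeholder, and the handling of the mismatch between the pretraining descriptor tasks and the downstream label is exactly what the planned API still needs to address):

```python
# Hypothetical fine-tuning sketch; paths and API details are placeholders.
import deepchem as dc

# Model for the downstream task, e.g. FreeSolv has a single regression target.
# In practice the final output layer must be swapped out when the pretraining
# and fine-tuning task counts differ; that is part of what the planned
# transfer learning API should handle.
model = dc.models.ChemCeption(n_tasks=1, mode="regression",
                              model_dir="chemception_chembl_pretrained")
model.restore()  # load ChEMBL-pretrained weights from model_dir

# model.fit(freesolv_train, nb_epoch=20)
# scores = model.evaluate(freesolv_test,
#                         [dc.metrics.Metric(dc.metrics.rms_score)])
```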
Note: This blog post will be updated with more results and progress over the course of the Google Summer of Code project.