Summary of GSoC 21: DeepChem Retrosynthesis

Over the past couple of months, I have been contributing to DeepChem as a Google Summer of Code student. My project aimed to bring support for retrosynthesis models to DeepChem. In this post, I summarize the work done over the summer and the project's current status.

Background

Retrosynthesis is a stepwise approach to planning the synthesis of a molecule. It involves disconnecting a target molecule into smaller, easier-to-synthesize units, which can then be used to build up the desired molecule. The value of the method lies in making the right disconnections, which lets chemists find the most efficient synthesis of a given molecule. This approach is the standard for synthesizing complex organic molecules and is vital in areas such as medicinal chemistry.

The goal of the project was to provide a platform in DeepChem for training models that make retrosynthetic disconnections. Previous work in computer-aided synthesis planning using deep learning has borrowed models from NLP to make these inferences. The forward step (reaction prediction) is recast as a machine translation problem, with the goal of predicting the products (target) given the reactants and reagents (source). Retrosynthesis is the reverse task: predicting the precursors given the product. This framing allowed models that were conventionally good at NLP tasks to enter the field. A prominent example is Philippe Schwaller’s work on the Molecular Transformer.
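To make the translation framing concrete, here is a minimal sketch of how a single reaction SMILES could be turned into a (source, target) pair for either direction. It assumes the common `reactants>reagents>products` layout; the function name `translation_pair` and the example reaction are hypothetical and are not part of DeepChem.

```python
from typing import Tuple


def translation_pair(rxn_smiles: str, direction: str = "forward") -> Tuple[str, str]:
    """Return a (source, target) pair from a 'reactants>reagents>products' string.

    Hypothetical helper for illustration only, not a DeepChem function.
    """
    # A reaction SMILES has exactly two '>' separators; unpacking raises
    # ValueError if the string does not have three parts.
    reactants, reagents, products = rxn_smiles.split(">")

    if direction == "forward":
        # Reaction prediction: reactants and reagents in, products out.
        precursors = ".".join(part for part in (reactants, reagents) if part)
        return precursors, products
    if direction == "retro":
        # Retrosynthesis: product in, reactants out.
        return products, reactants
    raise ValueError(f"unknown direction: {direction!r}")


# Illustrative reaction: acid-catalysed esterification of acetic acid with ethanol.
rxn = "CC(=O)O.OCC>[H+]>CC(=O)OCC"
forward_pair = translation_pair(rxn, "forward")
retro_pair = translation_pair(rxn, "retro")
```

Here `forward_pair` holds the precursors as the source and the product as the target, while `retro_pair` swaps the roles and drops the reagent from the target. Whether reagents should be predicted in the retro direction is a modelling choice that this sketch does not settle.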

For this project, we wanted to implement the functionality of the Molecular Transformer while opening the door to improvement through large-scale pre-training. To do this, we took an approach similar to this paper and implemented an encoder-decoder model. This is achieved by adding support for HuggingFace transformers, which already has the necessary infrastructure to construct and train the model. The encoder and decoder can be pre-trained on the USPTO dataset, but for now the plan is to construct the model from the ChemBERTa weights.

I have posted updates as I was working on the project here: Google Summer of Code 2021: DeepChem Retrosynthesis

Main Contributions

The GSoC period primarily helped set up the infrastructure necessary for putting together the model.
I added support for loading the USPTO dataset and its subsets, which contain the reaction data used to train the model (see PR #2546). I also added support for pre-processing the reactions through the DummyFeaturizer and the RxnSplitTransformer.

The DummyFeaturizer (see #2570) can also be used more generally to load datasets that are already featurized. The RxnSplitTransformer (see #2597) lets the user separate the source and target molecules from the reaction SMILES. It also supports mixed training, where the reactants and reagents are not separated.
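The difference between separated and mixed training is easiest to see on an example. The sketch below is plain Python and is not the RxnSplitTransformer implementation; the function name `split_source_target`, the `sep_reagent` flag, and the exact output formatting are assumptions made for illustration.

```python
from typing import List, Tuple


def split_source_target(
    reactions: List[str], sep_reagent: bool = True
) -> Tuple[List[str], List[str]]:
    """Split 'reactants>reagents>products' strings into sources and targets.

    Hypothetical helper; the output formatting is an assumption and may
    differ from what DeepChem produces.
    """
    sources, targets = [], []
    for rxn in reactions:
        reactants, reagents, products = rxn.split(">")
        if sep_reagent:
            # Separated: keep '>' so the model can tell reagents from reactants.
            source = f"{reactants}>{reagents}"
        else:
            # Mixed: reagents are pooled with the reactants as one '.'-joined set.
            source = ".".join(part for part in (reactants, reagents) if part)
        sources.append(source)
        targets.append(products)
    return sources, targets


# Illustrative reactions; the second one has no reagent.
reactions = [
    "CC(=O)O.OCC>[H+]>CC(=O)OCC",
    "CCBr.[OH-]>>CCO",
]
separated_sources, targets = split_source_target(reactions, sep_reagent=True)
mixed_sources, _ = split_source_target(reactions, sep_reagent=False)
```

In the separated case the model is told which molecules are reagents; in the mixed case it has to work that out itself, which is the harder but more realistic setting.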

I also added the RxnFeaturizer (see #2656), a wrapper for the RobertaTokenizerFast that can be used to tokenize the reactions as required by the model. Some documentation and tests for it are still pending.

To make retrosynthesis predictions, the tokenized data will be passed to the HuggingFace models. In the next few weeks after GSoC, I will work on building a HuggingFace dataset from a DeepChem dataset and on adding the EncoderDecoderModel, which will complete the project.

Bugs and miscellaneous fixes

  1. #2641 fixes the splitter error encountered on MolNet datasets that do not have a labels column.
  2. #2651 makes small additions to the docs on running unit tests and doc tests.
  3. #2587 adds a small section to the docs on setting up a symbolic link for running tests locally.

Acknowledgements

I would like to thank Bharath, Vignesh, Nathan and Seyone for their constant help and feedback with this project.
