Google Summer of Code 2021: DeepChem Retrosynthesis

Hey everyone!

I am Ashwin Murali. I will be working with DeepChem this summer as a GSoC student.

About me

I am a sophomore at Birla Institute of Technology and Science, Pilani pursuing a double major in Chemistry and Computer Science. I am passionate about Chemistry and Machine Learning and am interested in building Machine Learning models for scientific applications. I have been a part of DeepChem’s community since February, and am incredibly excited to work with all of you!

About the Project

My project is about bringing support for retrosynthesis tasks to DeepChem. The popular models for this task rely on either a template-free or a template-based approach to make predictions. I will primarily be bringing in support for the template-free models, specifically by adding the Molecular Transformer model. A big part of the project also involves interfacing DeepChem with HuggingFace Transformers, as I plan to use their EncoderDecoderModel for inference. I will also be adding support for loading the USPTO dataset, the largest open-source chemical reaction dataset available, so that these models can be trained using DeepChem’s infrastructure.

I shall be updating my progress here over the summer on a weekly basis. Stay tuned!

My project description is available here.

Week 1: [7th June - 14th June]

I have been working on bringing the USPTO dataset to Molnet. Getting the USPTO datasets loaded in DeepChem is the first step towards adding support for reaction prediction tasks. The dataset is quite different from most other datasets on Molnet in that it does not have tasks, has splits already computed, and does not need featurization. This caused a number of issues for the loader implementation. The first attempt was to add a toggle to the _MolnetLoader base class and a new original_split_loader method that loads the train, test and valid splits directly. However, this approach ended up making significant changes to the load_dataset method, so it was dropped.

An alternative approach was taken: merging the splits into a single dataset. The idea is to use the SpecifiedSplitter by passing it the valid and test indices. This ensures that the loader relies on a more robust and tested design. The current progress on the loader can be viewed in #2546.
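The merge-then-split idea can be sketched in plain Python. This is only an illustration under my own assumptions: `merge_splits` and `split_by_indices` are hypothetical helper names, not DeepChem functions, and the data is a toy list of strings.

```python
from typing import List, Sequence, Tuple


def merge_splits(
    train: Sequence[str], valid: Sequence[str], test: Sequence[str]
) -> Tuple[List[str], List[int], List[int]]:
    """Concatenate precomputed splits and record where valid/test rows land."""
    merged = list(train) + list(valid) + list(test)
    valid_start = len(train)
    test_start = valid_start + len(valid)
    valid_indices = list(range(valid_start, test_start))
    test_indices = list(range(test_start, len(merged)))
    return merged, valid_indices, test_indices


def split_by_indices(
    merged: Sequence[str], valid_indices: List[int], test_indices: List[int]
) -> Tuple[List[str], List[str], List[str]]:
    """Recover the splits; every row not listed as valid/test goes to train."""
    held_out = set(valid_indices) | set(test_indices)
    train = [row for i, row in enumerate(merged) if i not in held_out]
    valid = [merged[i] for i in valid_indices]
    test = [merged[i] for i in test_indices]
    return train, valid, test
```

In the loader, the two index lists would be what gets handed to the SpecifiedSplitter, so the original USPTO splits are reproduced without a special code path.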

Weeks 2 - 5: [14th June - 5th July]

The last few weeks were spent resolving various issues that came up while getting the USPTO dataset to load. The first issue was featurization. Since the raw SMILES string will be used without modification for the retrosynthesis task, and since there was no way to bypass the featurizer in the Molnet loader, there was a need to implement a no-op featurizer that steps through the dataset and returns each training example unchanged. This was done in PR #2570.
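A no-op featurizer of this kind is very small. The sketch below is a standalone illustration: the class name `PassThroughFeaturizer` is hypothetical, and it does not subclass DeepChem's featurizer base class, so it should not be read as the code in the PR.

```python
from typing import Iterable, List


class PassThroughFeaturizer:
    """Hypothetical no-op featurizer: returns each datapoint unchanged."""

    def _featurize(self, datapoint: str) -> str:
        # No transformation: the raw (reaction) SMILES string is the feature.
        return datapoint

    def featurize(self, datapoints: Iterable[str]) -> List[str]:
        # Step through the dataset one example at a time.
        return [self._featurize(d) for d in datapoints]
```

The point of keeping the featurize interface is that the loader can treat it like any other featurizer, even though nothing is computed.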

Once that was out of the way, the main challenge was to load the dataset in the source and target form required for machine translation tasks. An initial attempt was to split the dataset beforehand and then pass a List of feature fields to the CSVLoader. This worked as expected, but there were type mismatch errors because the feature_field of the CSVLoader expects a single string and not a List. The loader worked despite this because each shard is just a pandas dataframe, which can load a list of columns without trouble.

To preserve the consistency of the loaders, the split dataset was merged back to the reaction SMILES and it is loaded in directly as per PR #2546. The way forward now would be to implement a transformer that splits this dataset into the appropriate source/target format.

This approach works well for loading the 50K and MIT datasets. However, the STEREO and FULL datasets have over one million examples and run out of memory while loading. One thing to note is that the datasets load much more quickly and do not have any memory issues when the feature_field is passed as a List. Hence, changing the type of the feature fields seems more reasonable, given this performance difference and because it generalizes the loaders. This is now being worked on in PR #2583.
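To make the "string or List" idea concrete, here is a minimal sketch using only the standard library. The function `read_feature_columns` and the column names `source` and `target` are assumptions for illustration; this is not how the CSVLoader is implemented.

```python
import csv
import io
from typing import List, Union


def read_feature_columns(
    csv_text: str, feature_field: Union[str, List[str]]
) -> List[List[str]]:
    """Return one list of feature values per CSV record."""
    # Normalise a single column name into a one-element list so that
    # both call styles share the same code path.
    if isinstance(feature_field, str):
        fields = [feature_field]
    else:
        fields = list(feature_field)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[record[name] for name in fields] for record in reader]


# Toy CSV with a pre-split reaction (hypothetical column names).
csv_text = "source,target\nCC(=O)O.OCC>,CC(=O)OCC\n"
single = read_feature_columns(csv_text, "source")
paired = read_feature_columns(csv_text, ["source", "target"])
```

Accepting both types keeps existing single-column loaders working while allowing the source/target pair to be read in one pass.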

Once these issues with the loader are ironed out, work on tokenizing the datasets and integration with the HuggingFace models can begin.

Weeks 6 - 7: [5th July - 19th July]

The loading issues mentioned previously resolved themselves, and I am not sure why. The dataset loaded much more slowly when I was running a local version (before the merge), but the loader now works as expected. There is a new issue with the loader, which I encountered while trying to apply the RandomSplitter. I will be working on resolving this in the coming week.

The previous week was spent getting the RxnSplitTransformer ready. This will be used in the loader to split a reaction SMILES string into the source and target strings. The transformer has been implemented (#2597) and works well with the USPTO loader. Once a few issues with the documentation and the tests are addressed, it should be ready to merge.

The transformer also implements the separate_reagent functionality for the loader. This mirrors the separated and mixed preprocessing done in Schwaller’s paper: in the mixed setting, the model is not told which molecules in the source string are reagents. Reagents are molecules whose atoms are usually not involved in the reaction directly, so the additional layer of complexity introduced by mixing the reagents and reactants helps the model learn a more general mapping.

This week I will be working on getting a seq2seq model training on the USPTO datasets, and will start by ensuring that the existing SMILES tokenizer is able to handle the datasets properly.
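The source/target split and the reagent toggle described above can be sketched as follows. This is a simplified stand-in, not the RxnSplitTransformer itself: the function name `split_reaction`, the `sep_reagent` flag name, and the exact output format of the source string are my assumptions. It expects reaction SMILES of the form `reactants>reagents>products`.

```python
from typing import Tuple


def split_reaction(rxn_smiles: str, sep_reagent: bool = True) -> Tuple[str, str]:
    """Split 'reactants>reagents>products' into (source, target)."""
    reactants, reagents, products = rxn_smiles.split(">")
    if sep_reagent:
        # Separated: reagents keep their own field after the '>' token.
        source = f"{reactants}>{reagents}"
    elif reagents:
        # Mixed: reagents are appended to the reactants as ordinary molecules.
        source = f"{reactants}.{reagents}>"
    else:
        # Mixed, but the reaction has no reagents at all.
        source = f"{reactants}>"
    return source, products


# Toy example: esterification of acetic acid and ethanol with sulfuric acid.
rxn = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
separated = split_reaction(rxn, sep_reagent=True)
mixed = split_reaction(rxn, sep_reagent=False)
```

In the mixed form the source string no longer marks which molecule is the reagent, which is the extra difficulty the model has to learn around.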

Weeks 8 - 9: [19th July - 2nd August]

Once the RxnSplitTransformer was merged, I updated the loader to use it (#2628). This lets the user apply the transformer to the dataset directly, and also provides a toggle to enable reagent separation when loading the dataset.

The splitter error mentioned in the last update had arisen because the USPTO dataset did not have a labels column. This meant that the y column of the dataset was None, which caused problems while calling the get_shard_size method under the hood. This error was common to datasets that do not have labels, and cropped up in the Swiss-Prot dataset too. A fix for this is being worked on now.
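The failure mode can be illustrated with a toy example. I am assuming here that the shard size is derived from the length of the labels; the function names are hypothetical and the fallback shown is just one possible fix, not necessarily the one that will land in DeepChem.

```python
from typing import Optional, Sequence


def shard_size_from_labels(y: Optional[Sequence]) -> int:
    """Size lookup that assumes labels exist; fails when y is None."""
    return len(y)  # raises TypeError if y is None


def shard_size(X: Sequence, y: Optional[Sequence]) -> int:
    """Fall back to the features when the dataset has no labels."""
    return len(y) if y is not None else len(X)
```

With an unlabeled reaction dataset, the first function raises a TypeError, while the second one still reports the number of examples in the shard.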

I also worked on using the SMILES tokenizer to tokenize the USPTO datasets. The initial attempt used the existing SMILES tokenizer in DeepChem. However, this did not fit into the infrastructure for featurizers, as it could only be used after loading the dataset. There was also an issue with the vocabulary, as it did not recognise a single ‘>’ token. I was also unable to get the tokenizer to tokenize a pair of strings at the same time, which will cause problems when we have to collate the input_ids and attention_masks of the source and target strings while training the model.

Later, I tried to do the same using Seyone’s RobertaFeaturizer, which had similar issues with the vocabulary. A proposed way to resolve the vocabulary issues is to append the separation token to the vocab JSON. The RobertaFeaturizer does tokenize a pair of strings correctly, so it resolves the data collation challenge mentioned above.
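The vocabulary problem and the proposed fix can be shown with a toy tokenizer. Everything here is an assumption for illustration: the regex is a commonly used SMILES tokenization pattern rather than one copied from DeepChem, the vocabulary is a made-up four-entry dict, and `with_token` is a hypothetical helper.

```python
import re
from typing import Dict, List

# A commonly used regex for splitting SMILES into tokens (assumed here).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> List[str]:
    """Split a (reaction) SMILES string into tokens."""
    return SMILES_PATTERN.findall(smiles)


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, sending anything unknown to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


def with_token(vocab: Dict[str, int], token: str) -> Dict[str, int]:
    """Return a copy of the vocab with `token` appended if it is missing."""
    extended = dict(vocab)
    extended.setdefault(token, len(extended))
    return extended


toy_vocab = {"[UNK]": 0, "C": 1, "O": 2, "=": 3}  # toy vocab without '>'
tokens = tokenize("CCO>>CC=O")
before = encode(tokens, toy_vocab)
after = encode(tokens, with_token(toy_vocab, ">"))
```

Before the separator is added, both ‘>’ tokens collapse to the unknown id, so the boundary between source and target is lost; after appending it, they get their own id.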

This week, I will be working on getting the RobertaFeaturizer to tokenize the USPTO dataset through the loader. This should hopefully let us get a bare-bones implementation of the EncoderDecoderModel in soon.