Google Summer of Code 2021: Make Deepchem more Robust by implementing cutting-edge models and other features

Hey everyone!

I’m Atreya Majumdar, a pre-final year undergraduate engineering student from the National Institute of Technology Karnataka, Surathkal, India. My work has mostly been in Deep Learning (Speech Processing); thus, I am familiar with attention models, which I will be using in my project.

I am a GSoC student under Deepchem for the summer. My work will primarily revolve around the addition of a few new models to Deepchem, the most important of them being the Molecule Attention Transformer.

I also plan on adding more models such as the Molecular Autoencoder and upgrading existing algorithms if time permits.


Update as of June 27

Almost 3 weeks have passed since the coding period officially began for GSoC '21!

Being new to Deepchem and chemistry-based Deep Learning in general, I spent the first 2 weeks familiarizing myself with the Deepchem ecosystem and the theory behind my project. I also sent out a PR containing the molecular featurizer for my project, the implementation of the Molecule Attention Transformer, linked above.

For week 3, I have been working on the data loader. I have a skeleton structure, which I am now attempting to add to the MolNet loader. I will also be adding a few datasets to MolNet for the purpose of this project; however, since these are hormonal datasets, the community may find them useful for their own needs as well.

Update as of July 12

I have uploaded the FreeSolv dataset to the Deepchem AWS S3 bucket and have also written a loader function so that the dataset can be downloaded, featurized and subsequently used! Understanding the dataset and its featurized version took more time than I expected, but I am glad that I took the time to understand how it works. I now have a clearer idea of how to go about sending similar PRs in the future.

Check the PR here
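
As a rough illustration of what a loader like this does once the file has been downloaded, here is a minimal sketch in plain Python. It is not the Deepchem loader itself: `load_rows`, `toy_featurizer`, the column names `smiles` and `expt`, and the two inline rows are all assumptions made for this example, and the numeric values are placeholders rather than real FreeSolv measurements.

```python
import csv
import io

# Placeholder CSV standing in for the downloaded file.
# Column names and values are assumptions, not real FreeSolv data.
TOY_CSV = """smiles,expt
CCO,-1.0
CC,2.0
"""


def toy_featurizer(smiles):
    """Hypothetical featurizer: a few character counts from the SMILES string."""
    return [smiles.count("C"), smiles.count("O"), len(smiles)]


def load_rows(csv_text, featurizer, smiles_col="smiles", target_col="expt"):
    """Parse CSV text and return (features, targets, ids) as parallel lists."""
    features, targets, ids = [], [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        features.append(featurizer(row[smiles_col]))
        targets.append(float(row[target_col]))
        ids.append(row[smiles_col])
    return features, targets, ids


features, targets, ids = load_rows(TOY_CSV, toy_featurizer)
```

Passing the featurizer in as an argument keeps the parsing step independent of how molecules are encoded, so a different featurizer can be swapped in without touching the loader.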

I am currently working on the layers for the MAT Model. While I was delayed due to some unforeseen circumstances, I should be able to send the MAT Layers PR within a day or two.

Update as of July 27

The layer PRs are currently under review. The model PR will essentially be a wrapper that ties everything together. I am also exploring options for including pretrained weights.

Update as of August 4

The layer PRs are up:
The Attention module has been merged.
The Encoder module and the Generator module PRs are up as well.
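
For readers unfamiliar with what the attention layer computes, here is a minimal pure-Python sketch of molecule self-attention as described in the Molecule Attention Transformer paper: the usual scaled dot-product attention weights are blended with a distance term and the adjacency matrix before being applied to the values. This is not the code from the merged PR. The function names, the default lambda values, and the choice of a row-wise softmax over negative distances are assumptions for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def molecule_attention(q, k, v, adjacency, distance,
                       lambda_attention=0.33, lambda_distance=0.33):
    """Blend dot-product attention with distance and adjacency terms.

    The lambda defaults are illustrative assumptions; the adjacency weight
    is whatever remains so that the three weights sum to 1.
    """
    lambda_adjacency = 1.0 - lambda_attention - lambda_distance
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    attn = [softmax(row) for row in scores]
    # Row-wise softmax of negative distances: closer atoms get more weight.
    dist = [softmax([-d for d in row]) for row in distance]
    weights = [
        [lambda_attention * a + lambda_distance * g + lambda_adjacency * adj
         for a, g, adj in zip(a_row, g_row, adj_row)]
        for a_row, g_row, adj_row in zip(attn, dist, adjacency)
    ]
    return matmul(weights, v), weights
```

A call such as `molecule_attention(q, k, v, adjacency, distance)` returns both the attended values and the blended weight matrix, which makes it easy to inspect how much each of the three terms contributes for a given pair of atoms.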

I’m currently working on the wrapper (the MATModel class). I am also thinking of adding the pretrained weights released by the original authors, although given the time constraints, this will likely only be possible in the post-GSoC period.

Update as of August 22

I submitted a fix to the MATFeaturizer, as I have now implemented a form of batching with a custom data class, MATEncoding. Link
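
One way to picture this kind of batching is sketched below. The class name `EncodingSketch`, its three fields, and the `pad_matrix` and `pad_batch` helpers are assumptions for illustration and are not taken from the PR. The sketch only shows the general idea of zero-padding per-molecule matrices to a common size so that molecules with different atom counts can be grouped together.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class EncodingSketch:
    """Hypothetical per-molecule container (not the actual MATEncoding)."""
    node_features: Matrix      # one row of features per atom
    adjacency_matrix: Matrix   # n_atoms x n_atoms
    distance_matrix: Matrix    # n_atoms x n_atoms


def pad_matrix(matrix, n_rows, n_cols, fill=0.0):
    """Return a copy of `matrix` padded with `fill` to n_rows x n_cols."""
    padded = [list(row) + [fill] * (n_cols - len(row)) for row in matrix]
    padded += [[fill] * n_cols for _ in range(n_rows - len(matrix))]
    return padded


def pad_batch(encodings):
    """Pad every encoding to the largest atom count in the batch."""
    max_atoms = max(len(e.node_features) for e in encodings)
    n_feats = len(encodings[0].node_features[0])
    return [
        EncodingSketch(
            node_features=pad_matrix(e.node_features, max_atoms, n_feats),
            adjacency_matrix=pad_matrix(e.adjacency_matrix, max_atoms, max_atoms),
            distance_matrix=pad_matrix(e.distance_matrix, max_atoms, max_atoms),
        )
        for e in encodings
    ]


# Two toy "molecules" with 1 and 2 atoms and 2 features per atom.
small = EncodingSketch([[1.0, 0.0]], [[0.0]], [[0.0]])
large = EncodingSketch([[1.0, 0.0], [0.0, 1.0]],
                       [[0.0, 1.0], [1.0, 0.0]],
                       [[0.0, 1.5], [1.5, 0.0]])
batch = pad_batch([small, large])
```

The fill value is a design choice: zero is used here for simplicity, but a real implementation might pad the distance matrix differently so that padded positions do not look like nearby atoms.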

I have also submitted a PR for the Molecule Attention Transformer. It is currently intended to be a regression model running on the FreeSolv dataset. Link

