Google Summer of Code 2021: Make DeepChem more Robust by implementing cutting-edge models and other features

atreyamaj · June 20, 2021, 8:41pm

Hey everyone!

I’m Atreya Majumdar, a pre-final-year undergraduate engineering student at the National Institute of Technology Karnataka, Surathkal, India. My work has mostly been in deep learning for speech processing, so I am familiar with attention models, which I will be using in my project.

I am a GSoC student under DeepChem for the summer. My work will primarily revolve around adding a few new models to DeepChem, the most important being the Molecule Attention Transformer.

I also plan on adding more models such as the Molecular Autoencoder and upgrading existing algorithms if time permits.

atreyamaj · June 27, 2021, 11:22am

Update as of June 27

Almost 3 weeks have passed since the coding period officially began for GSoC '21!

Being new to DeepChem and chemistry-based deep learning in general, I spent the first 2 weeks familiarizing myself with the DeepChem ecosystem and the theory behind my project. I also sent out a PR containing the molecular featurizer for my project, an implementation of the Molecule Attention Transformer (linked above).

For week 3, I have been working on the data loader. I have a skeleton structure that I am now attempting to add to the MolNet loader. I will also be adding a few datasets to MolNet for the purpose of this project; however, since these are hormonal datasets, the community may find them useful for their own needs as well.

atreyamaj · July 11, 2021, 8:36pm

Update as of July 12

I have uploaded the FreeSolv dataset to the DeepChem AWS S3 bucket and written a loader function so that the dataset can be downloaded, featurized, and subsequently used! Understanding the dataset and its featurized version took more time than I expected, but I am glad I took the time to understand how it works. I now have a clearer idea of how to go about sending similar PRs in the future.

Check the PR here
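
To illustrate the download, featurize, use flow described above, here is a minimal sketch in plain Python. It is not the DeepChem loader from the PR: the function names, the column names, the placeholder URL, and the toy featurizer are all hypothetical and exist only for illustration.

```python
import csv
import urllib.request
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

# Hypothetical placeholder; the real S3 URL is not given in this post.
DATASET_URL = "https://example.com/freesolv.csv"


def load_csv_dataset(
    data_dir: Path,
    featurize: Callable[[str], Sequence[float]],
    smiles_col: str = "smiles",   # assumed column name
    target_col: str = "y",        # assumed column name
    url: str = DATASET_URL,
    filename: str = "freesolv.csv",
) -> Tuple[List[Sequence[float]], List[float]]:
    """Download the CSV once, then featurize every SMILES string in it."""
    path = Path(data_dir) / filename
    if not path.exists():
        # Only touch the network when there is no cached copy.
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(path))

    features, targets = [], []
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            features.append(featurize(row[smiles_col]))
            targets.append(float(row[target_col]))
    return features, targets


def toy_featurizer(smiles: str) -> List[float]:
    """Stand-in featurizer: naive character counts (not the MATFeaturizer)."""
    return [float(smiles.count(symbol)) for symbol in ("C", "O", "N")]
```

Calling `load_csv_dataset(some_dir, toy_featurizer)` returns a list of feature vectors and a list of regression targets. Because the file is cached on disk, repeated calls skip the download step.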

I am currently working on the layers for the MAT Model. While I was delayed due to some unforeseen circumstances, I should be able to send the MAT Layers PR within a day or two.

atreyamaj · July 27, 2021, 8:24am

Update as of July 27

The layer PRs are currently under review. The model PR will essentially be a wrapper that ties everything together. I am also exploring options for including pretrained weights.

atreyamaj · August 4, 2021, 5:07pm

Update as of August 4

The layer PRs are up:
The Attention module has been merged.
The Encoder module and the Generator module PRs are up as well.
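
For context, the attention layer in the Molecule Attention Transformer paper augments standard self-attention with the molecule's inter-atomic distance matrix and its adjacency matrix. As I understand the paper (this is a summary of the published formulation, not a transcription of the merged DeepChem code), each attention head computes roughly:

```latex
\mathcal{A}^{(i)} = \left( \lambda_a \,\mathrm{softmax}\!\left( \frac{Q_i K_i^{\top}}{\sqrt{d_k}} \right) + \lambda_d \, g(D) + \lambda_g \, A \right) V_i
```

Here $Q_i$, $K_i$, $V_i$ are the usual query, key, and value matrices of head $i$, $D$ is the inter-atomic distance matrix, $A$ is the adjacency matrix of the molecular graph, $g$ is a function applied to the distances (the paper considers a row-wise softmax of $-D$ or an element-wise $\exp(-D)$), and $\lambda_a$, $\lambda_d$, $\lambda_g$ are scalar weights for the three terms.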

I’m currently working on the wrapper (the MATModel class). I am also thinking of adding the pretrained weights released by the original authors, although, given the time constraints, this will probably have to wait until the post-GSoC period.

atreyamaj · August 22, 2021, 4:08pm

Update as of August 22

I submitted a fix to the MATFeaturizer, as I have now implemented a form of batching with a custom data class, MATEncoding. Link
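
To give a rough idea of what batching with a per-molecule data class can look like, here is a small sketch in plain Python. The class `MoleculeEncoding` is a hypothetical stand-in for MATEncoding, its field names are my assumption, and zero-padding to the largest molecule is just one common approach rather than a description of the PR.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class MoleculeEncoding:
    """Hypothetical stand-in for MATEncoding; field names are assumptions."""
    node_features: Matrix      # shape (n_atoms, n_features)
    adjacency_matrix: Matrix   # shape (n_atoms, n_atoms)
    distance_matrix: Matrix    # shape (n_atoms, n_atoms)


def pad_matrix(matrix: Matrix, rows: int, cols: int, fill: float = 0.0) -> Matrix:
    """Pad a 2D list with `fill` on the right and bottom to (rows, cols)."""
    padded = [list(row) + [fill] * (cols - len(row)) for row in matrix]
    padded += [[fill] * cols for _ in range(rows - len(matrix))]
    return padded


def pad_batch(batch: List[MoleculeEncoding]) -> List[MoleculeEncoding]:
    """Pad every molecule in the batch to the size of the largest one."""
    max_atoms = max(len(enc.node_features) for enc in batch)
    n_features = len(batch[0].node_features[0])
    return [
        MoleculeEncoding(
            node_features=pad_matrix(enc.node_features, max_atoms, n_features),
            adjacency_matrix=pad_matrix(enc.adjacency_matrix, max_atoms, max_atoms),
            distance_matrix=pad_matrix(enc.distance_matrix, max_atoms, max_atoms),
        )
        for enc in batch
    ]
```

After padding, all encodings in a batch share the same shapes and can be stacked into tensors. A real model would typically also carry a mask so that padded atoms are ignored in attention.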

I have also submitted a PR for the Molecule Attention Transformer model. It is currently intended to be a regression model for the FreeSolv dataset. Link