Google Summer of Code 2021: Large Scale Protein Modeling in DeepChem

Hello everyone!

I am Alana Xiang. This summer, I am joining DeepChem as a Google Summer of Code student developer. I will be working on expanding DeepChem’s support for modeling biological molecules by integrating new models (such as RostLab’s ProtBert model) into our codebase, along with the necessary changes to our infrastructure (e.g., featurizers and loaders) to make the addition worthwhile for DeepChem’s users.

A bit about me: I’m an incoming freshman at Stanford excited about software, biology, and informatics. I have been part of the DeepChem community since September, and I’m incredibly eager to make some larger contributions to the project this summer.

I will be posting weekly updates on this thread in case you’re interested in following my work over the next few months.

Edit:

My original project proposal is available here.


Weekly Updates!

Here is what I have been up to for the first month of GSoC.

Week 1
This week, for my Google Summer of Code project with DeepChem/Open Chemistry, I made significant headway on the new protein tokenizer class and on modifying the existing FASTA loader class. For the protein tokenizer, I first wrote unit tests and then implemented a tokenizer that parses FASTA-format strings and converts them into integer tokens. For the FASTA loader, I modified it to accept arbitrary tokenizers for loading FASTA files. These changes should make DeepChem far more capable of parsing arbitrary protein sequences. By the end of the summer, my plan is for these tokenizers to process FASTA files directly and feed their output into a Protein BERT model for predictive modeling. My project’s primary goal is to make DeepChem better at modeling proteins, and this week was a big step in that exciting direction.
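To make the idea concrete, here is a minimal sketch of what such a tokenizer does, using a simple character-level mapping (hypothetical names and vocabulary; not the actual DeepChem class):

```python
# Minimal sketch of a character-level protein tokenizer (hypothetical, not the
# actual DeepChem implementation): map each amino-acid letter in a FASTA
# sequence to an integer ID so the sequence can be fed to a downstream model.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(AMINO_ACIDS)}  # 0 reserved for unknown/padding

def tokenize_sequence(sequence: str) -> list:
    """Convert a single FASTA sequence (header line already stripped) into integer tokens."""
    return [CHAR_TO_ID.get(residue, 0) for residue in sequence.strip().upper()]

print(tokenize_sequence("MKTAYIAK"))  # [11, 9, 17, 1, 20, 8, 1, 9]
```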

Week 2
It’s been a very cool week! I have managed to make a lot of progress on the FASTA loader.
I hadn’t previously appreciated all the intricacies of dealing with FASTA files (such as the fact that they can use different character sets) and of data loaders.
The program is now capable of loading and featurizing fairly nontrivial FASTA files.
The process of getting this seemingly simple change ready to merge has taught me a lot about the importance of considering diverse users and use cases.
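For a sense of what this looks like in practice, here is a rough usage sketch, assuming the updated FASTALoader accepts an arbitrary featurizer and exposes the standard create_dataset method (exact signatures may differ from what was ultimately merged):

```python
import deepchem as dc

# Rough usage sketch (assumes the updated FASTALoader takes a featurizer
# argument; exact signatures may differ from the merged version).
featurizer = dc.feat.OneHotFeaturizer()            # featurizes each sequence
loader = dc.data.FASTALoader(featurizer)           # loader now accepts an arbitrary featurizer
dataset = loader.create_dataset("proteins.fasta")  # returns a DeepChem Dataset
print(dataset.X.shape)
```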

Week 3
This week, I was out for a short while after receiving the COVID-19 vaccine. However, I was still able to get the changes to the FASTA loader ready to merge!
In order to maintain legacy support in the loader, I needed to make some changes that I hadn’t previously anticipated. After resolving many review comments and overcoming a number of bugs, I’m quite happy with the final product.

Over the next week, I will work on a new PR for the FASTA loader that will bring sharding support into the loader, which will allow users to generate datasets from much bigger FASTA files.
I will also continue to work on developing the ProteinTokenizer.
Hopefully, I will be at a stage where I can begin training HuggingFace models on FASTA data by the end of next week.

Week 4
FASTA loader sharding is already functioning, and should only require minor modifications before being ready to merge.

The creation of a Protein BERT Tokenizer took slightly longer than initially anticipated, because I wanted to make sure that I got the design absolutely correct. However, after some discussions with my mentor Seyone and some planning, I am now feeling ready to implement it.


Week 5 (Technically more than a week, since I’m writing this update on Monday instead of Thursday!)

Things are speeding up!

After my first FASTA loader PR was merged, I spent a lot of time fighting cryptic bugs related to enabling sharding. In short, adding unit tests that tested sharding with larger files caused unanticipated failures. Bharath has suggested that I put together a “minimum failing test case” to iterate with, and I am optimistic about solving the bugs plaguing this change.
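For illustration, a minimum failing test case for sharding might look something like the sketch below (hypothetical file contents and shard size; not the actual test from the PR):

```python
import deepchem as dc

def test_fasta_loader_sharding():
    # Hypothetical minimum failing test case: write a small multi-record FASTA
    # file, then load it with a shard size smaller than the number of records
    # so that the sharding code path is actually exercised.
    path = "tiny.fasta"
    with open(path, "w") as f:
        for i in range(10):
            f.write(f">seq{i}\nMKTAYIAKQR\n")

    loader = dc.data.FASTALoader(dc.feat.OneHotFeaturizer())
    dataset = loader.create_dataset(path, shard_size=3)  # forces multiple shards
    assert len(dataset) == 10
```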

My mentors have encouraged me to slightly de-prioritize the sharding PR and speed up work on the protein tokenizer. Over the weekend, I figured out a reasonably elegant way to use multiple inheritance to add a BERT tokenizer wrapper to DeepChem without duplicating code from Seyone and Walid’s RobertaFeaturizer.

Behind the scenes, I have been trying to improve the way that I approach debugging based on advice from my mentors. Right now, I lean heavily on using existing unit tests, and I am working on being more flexible by not only experimenting with “minimum failing test cases” as previously mentioned, but also interfacing directly with classes and methods to learn their behavior.


Another Update!

I have been continuing to work on adding the BertFeaturizer to DeepChem. Building on Seyone and Walid’s work, it is now possible to use the HuggingFace BertTokenizer without leaving the DeepChem interface. There are still some minutiae to sort through, but we are getting much closer to being able to conduct training with HuggingFace infrastructure using purely “DeepChemic” code!
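Here is a rough sketch of the intended usage, assuming the featurizer wraps a HuggingFace BertTokenizerFast (class and argument names may differ slightly from the merged version):

```python
import deepchem as dc
from transformers import BertTokenizerFast

# Rough usage sketch: wrap a HuggingFace tokenizer so it can be used like any
# other DeepChem featurizer. Names may differ from the merged BertFeaturizer.
tokenizer = BertTokenizerFast.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
featurizer = dc.feat.BertFeaturizer(tokenizer)

# ProtBert expects spaces between residues.
features = featurizer.featurize(["M K T A Y I A K"])
print(features)
```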


Final Update of Official GSoC

This project deviated significantly from the original proposal, largely because of our initial decision to adopt HuggingFace libraries instead of incorporating Facebook code. We also hit numerous roadblocks along the way, including many inscrutable bugs. However, I am very happy about the progress that we’ve made.

This Week

BertFeaturizer is now marked as ready for review. All CI failures are unrelated, and I am fairly confident that this PR will be merged quite soon. https://github.com/deepchem/deepchem/pull/2642

In this PR, I also made changes to the RobertaFeaturizer. Most notably, both BertFeaturizer and RobertaFeaturizer no longer use the dual-inheritance model originally present in RobertaFeaturizer. It turns out that dual inheritance is a problematic model in our context, because certain DeepChemic methods get overridden by the inherited HuggingFace class. This issue became apparent when __call__() did not behave as expected.
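A simplified, self-contained illustration of the issue and the composition-based fix (toy classes standing in for the real DeepChem and HuggingFace ones):

```python
# Toy illustration (not the real classes). With dual inheritance, Python's MRO
# resolves __call__ to the HuggingFace-style parent, so the DeepChem-style
# __call__ (which should route through featurize) is silently bypassed.

class MolecularFeaturizer:           # stands in for DeepChem's featurizer base
    def __call__(self, datapoints):
        return self.featurize(datapoints)

    def featurize(self, datapoints):
        return [self._featurize(dp) for dp in datapoints]

class HFTokenizer:                   # stands in for a HuggingFace tokenizer
    def __call__(self, text):
        return {"input_ids": [ord(c) for c in text]}

class DualInheritanceFeaturizer(HFTokenizer, MolecularFeaturizer):
    # Calling an instance now hits HFTokenizer.__call__ first via the MRO,
    # so the featurize() path above never runs.
    def _featurize(self, datapoint):
        return {"input_ids": [ord(c) for c in datapoint]}

# Composition avoids the collision: the featurizer *holds* a tokenizer instead
# of inheriting from it, so the DeepChem __call__ -> featurize path stays intact.
class ComposedFeaturizer(MolecularFeaturizer):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _featurize(self, datapoint):
        return self.tokenizer(datapoint)

print(ComposedFeaturizer(HFTokenizer())(["MKTA"]))  # [{'input_ids': [77, 75, 84, 65]}]
```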

This week, I also created PRs for HuggingFace model wrappers: one that adds HuggingFace’s BertModel to DeepChem (https://github.com/deepchem/deepchem/pull/2667) and one that creates a MoleculeNet data loader for FASTA data (https://github.com/deepchem/deepchem/pull/2666).

The Project

In my opinion, the biggest accomplishment of this GSoC project is the creation of a pattern for incorporating HuggingFace code into DeepChem. This pattern should extend naturally to models from other libraries, making future efforts to incorporate external code into DeepChem noticeably more efficient.

I believe that this will enable DeepChem to become a more powerful tool for interoperability in scientific machine learning, making users’ lives easier by crossing the moats between different libraries.

PR Highlights From My GSoC Project

Future Work

I am currently still putting the finishing touches on some of the PRs above.

Bharath informed me that DeepChem offers an optional GSoC “extension,” which allows GSoC students to continue working on their projects in the week after GSoC officially ends. I plan to participate in this extension with the following contributions:

  • As of this writing, two of the four PRs mentioned above still need a little additional testing before they are ready to merge. In the week after the GSoC project is completed, I plan not only to get these PRs ready to merge, but also to embark on an additional project for the summer.
  • Based on a suggestion from Seyone, I will be adding a tutorial to DeepChem that attempts to replicate the attention visualization and t-SNE projections from Elnaggar et al.’s ProtTrans paper (a rough sketch of the embedding-projection step is shown below).
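For context, here is a rough sketch of what the t-SNE half of that tutorial might look like, using the ProtBert checkpoint from the ProtTrans paper (illustrative only; the real tutorial will use many more sequences and DeepChem wrappers):

```python
import torch
from sklearn.manifold import TSNE
from transformers import BertModel, BertTokenizerFast

# Illustrative sketch: embed a few protein sequences with ProtBert and project
# the embeddings to 2D with t-SNE. A real tutorial would use far more sequences.
tokenizer = BertTokenizerFast.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

sequences = ["M K T A Y I A K", "G S H M A D E E", "A C D E F G H I"]  # spaces between residues
inputs = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states into one embedding per sequence,
# then project down to 2D (perplexity must be smaller than the sample count).
embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
coords = TSNE(n_components=2, perplexity=2).fit_transform(embeddings)
print(coords)
```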

Please view my slide deck for a better idea of my progress throughout the summer! https://docs.google.com/presentation/d/1kWraLm6SPC-beZxBjZZbf8jw48AiAN8qqgHDCKKd02M/edit#slide=id.p
