Google Summer of Code 2021: Large Scale Protein Modeling in DeepChem

Hello everyone!

I am Alana Xiang. This summer, I am joining DeepChem as a Google Summer of Code student developer. I will be working on expanding DeepChem’s support for modeling biological molecules by integrating new models (such as RostLab’s ProtBert model) into our codebase, along with the infrastructure changes (e.g., featurizers, loaders) needed to make the addition worthwhile for DeepChem’s users.

A bit about me: I’m an incoming freshman at Stanford excited about software, biology, and informatics. I have been part of the DeepChem community since September, and I’m incredibly eager to make some larger contributions to the project this summer.

I will be posting weekly updates on this thread in case you’re interested in following my work over the next few months.

Edit:

My original project proposal is available here.


Weekly Updates!

Here is what I have been up to for the first month of GSoC.

Week 1
This week, for my Google Summer of Code project with DeepChem/Open Chemistry, I made significant headway on the new protein tokenizer class and on modifying the existing FASTA loader class. For the protein tokenizer, I first wrote unit tests, then implemented a tokenizer that parses FASTA-format strings and converts them into integer tokens. For the FASTA loader, I modified it to accept arbitrary tokenizers when loading FASTA files. These changes should make DeepChem far more capable of parsing arbitrary protein sequences. By the end of the summer, my plan is for these tokenizers to process FASTA files directly and feed their output into a Protein BERT model for predictive modeling. My project’s primary goal is to make DeepChem better at modeling proteins, and this week was a big step in that exciting direction.
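
To give a flavor of what the tokenizer does, here is a minimal sketch of the core idea: mapping each amino-acid character of a sequence to an integer ID. The class and attribute names here are illustrative only, not the final DeepChem API.

```python
# Minimal sketch of a character-level protein tokenizer. Names
# (ProteinTokenizer, VOCAB, unk_id) are illustrative, not the final API.

class ProteinTokenizer:
    # Standard one-letter amino-acid codes.
    VOCAB = "ACDEFGHIKLMNPQRSTVWY"

    def __init__(self):
        self.char_to_id = {c: i for i, c in enumerate(self.VOCAB)}
        self.unk_id = len(self.VOCAB)  # token for unrecognized characters

    def tokenize(self, sequence: str) -> list:
        """Convert a protein sequence into a list of integer tokens."""
        return [self.char_to_id.get(c, self.unk_id) for c in sequence.upper()]

tokenizer = ProteinTokenizer()
print(tokenizer.tokenize("MKTAYIAK"))  # [10, 8, 16, 0, 19, 7, 0, 8]
```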

Week 2
It’s been a very cool week! I have managed to make a lot of progress on the FASTA loader.
I didn’t previously understand all the intricacies of dealing with FASTA files (such as the fact that FASTA files can use different character sets) and data loaders.
The program is now capable of loading and featurizing fairly nontrivial FASTA files.
The process of getting this seemingly simple change ready to merge has taught me a lot about the importance of considering diverse users and use cases.
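
For anyone curious what this looks like in practice, the sketch below shows the kind of interface the updated loader is aiming for: pass in a featurizer, point the loader at a FASTA file, and get a DeepChem dataset back. The exact constructor arguments were still evolving while this work was under review, and the file path is a placeholder.

```python
import deepchem as dc

# Illustrative usage of the updated FASTA loader: supply a featurizer
# and create a dataset from a FASTA file. "proteins.fasta" is a
# placeholder path; the exact arguments may differ from the merged code.
featurizer = dc.feat.OneHotFeaturizer()
loader = dc.data.FASTALoader(featurizer=featurizer)
dataset = loader.create_dataset("proteins.fasta")
print(dataset.X.shape)
```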

Week 3
This week, I was out for a short while after getting my COVID-19 vaccine. However, I was still able to get the changes to the FASTA loader ready to merge!
In order to maintain legacy support in the loader, I needed to make some changes that I hadn’t anticipated. After resolving many review comments and working through a number of bugs, I’m quite happy with the final product.

Over the next week, I will work on a new PR for the FASTA loader that will bring sharding support into the loader, which will allow users to generate datasets from much bigger FASTA files.
I will also continue to work on developing the ProteinTokenizer.
Hopefully, I will be at a stage where I can begin training HuggingFace models on FASTA data by the end of next week.

Week 4
FASTA loader sharding is now functioning and should require only minor modifications before it is ready to merge.
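
Assuming sharding is exposed the same way as in DeepChem’s other loaders (through the shard_size argument of create_dataset), usage should look roughly like this; the file name and shard size are placeholders.

```python
import deepchem as dc

# With sharding, a large FASTA file is featurized in fixed-size chunks
# ("shards") rather than all at once, keeping memory usage bounded.
# shard_size counts sequences per shard; the values here are placeholders.
loader = dc.data.FASTALoader(featurizer=dc.feat.OneHotFeaturizer())
dataset = loader.create_dataset("large_proteins.fasta", shard_size=1024)
```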

The creation of a Protein BERT tokenizer took slightly longer than initially anticipated because I wanted to make sure I got the design right. However, after some discussions with my mentor Seyone and some planning, I now feel ready to implement it.


Week 5 (Technically more than a week, since I’m writing this update on Monday instead of Thursday!)

Things are speeding up!

After my first FASTA loader PR was merged, I spent a lot of time fighting cryptic bugs related to enabling sharding. In short, adding unit tests that exercised sharding with larger files caused unanticipated failures. Bharath suggested that I put together a “minimum failing test case” to iterate with, and I am optimistic about solving the bugs plaguing this change.
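
For context, a “minimum failing test case” here means a tiny, self-contained test that reproduces the sharding failure without the rest of the test suite. Something along these lines, where the file path and expected sequence count are hypothetical:

```python
import deepchem as dc

def test_fasta_sharding_minimal():
    # Hypothetical minimal failing test: shard_size is deliberately
    # smaller than the number of sequences in the file, so the loader
    # must produce multiple shards without dropping any sequences.
    loader = dc.data.FASTALoader()
    dataset = loader.create_dataset("tests/data/example.fasta", shard_size=2)
    assert len(dataset) == 6  # expected sequence count for this file
```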

My mentors have encouraged me to slightly de-prioritize the sharding PR and speed up work on the protein tokenizer. Over the weekend, I figured out a reasonably elegant way to use multiple inheritance to add a BERT tokenizer wrapper to DeepChem without duplicating the code in Seyone and Walid’s RobertaFeaturizer.
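
The pattern mirrors the RobertaFeaturizer: the wrapper class inherits from both the HuggingFace tokenizer and a DeepChem featurizer base class, so featurization can delegate to the tokenizer’s own __call__ rather than reimplementing it. Below is a simplified sketch of that design, not the merged implementation.

```python
from deepchem.feat import MolecularFeaturizer
from transformers import BertTokenizerFast

# Sketch of the multiple-inheritance pattern: the class is simultaneously
# a HuggingFace tokenizer and a DeepChem featurizer, so no tokenization
# logic needs to be duplicated. Simplified; not the merged implementation.
class BertFeaturizer(BertTokenizerFast, MolecularFeaturizer):

    def _featurize(self, datapoint):
        # Delegate to the inherited HuggingFace tokenizer's __call__.
        return self(datapoint, truncation=True, padding="max_length")
```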

Behind the scenes, I have been trying to improve the way that I approach debugging based on advice from my mentors. Right now, I lean heavily on using existing unit tests, and I am working on being more flexible by not only experimenting with “minimum failing test cases” as previously mentioned, but also interfacing directly with classes and methods to learn their behavior.


Another Update!

I have been continuing to work on adding the BertFeaturizer to DeepChem. Building on Seyone and Walid’s work, it is now possible to use the HuggingFace BertTokenizer without leaving the DeepChem interface. There are still some minutiae to sort through, but we are getting much closer to being able to conduct training with HuggingFace infrastructure using purely “DeepChemic” code!
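
As a rough usage sketch, assuming the multiple-inheritance design described above and ProtBert’s space-separated residue convention:

```python
from deepchem.feat import BertFeaturizer  # name per the in-progress work

# Illustrative only: load ProtBert's vocabulary through the DeepChem
# wrapper and tokenize a protein sequence. ProtBert expects residues to
# be space-separated; the exact call may differ from the merged version.
featurizer = BertFeaturizer.from_pretrained("Rostlab/prot_bert",
                                            do_lower_case=False)
encoding = featurizer("M K T A Y I A K", padding=True, truncation=True)
print(encoding["input_ids"])
```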
