Here is what I have been up to for the first month of GSoC.
This week for my Google Summer of Code project with DeepChem/Open Chemistry, I made significant headway on the new protein tokenizer class and on modifying the existing FASTA loader class. For the protein tokenizer, I wrote unit tests first and then a tokenizer that parses FASTA-format strings and converts them into integer tokens. I modified the FASTA loader so that it accepts arbitrary tokenizers when loading FASTA files. These changes should make DeepChem far more capable of parsing arbitrary protein sequences. By the end of the summer, I plan for these tokenizers to process FASTA files directly and feed their output into a Protein BERT model for predictive modeling. My project’s primary goal is to make DeepChem better able to model proteins, and this week has been a big step in that exciting direction.
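To make the idea concrete, here is a minimal sketch of turning a FASTA-format string into integer tokens, with a loader function that accepts any tokenizer. This is not the actual DeepChem code: the names `VOCAB`, `parse_fasta`, `tokenize_protein`, and `load_fasta`, the special tokens, and the vocabulary order are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical vocabulary: four BERT-style special tokens followed by the
# 20 standard amino acid letters. The real tokenizer's vocabulary may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB: Dict[str, int] = {
    token: index for index, token in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))
}


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Split a FASTA-format string into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may wrap over several lines, so collect the pieces.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def tokenize_protein(sequence: str) -> List[int]:
    """Map each residue to an integer ID, wrapped in [CLS] ... [SEP]."""
    unknown = VOCAB["[UNK]"]
    ids = [VOCAB["[CLS]"]]
    ids.extend(VOCAB.get(residue, unknown) for residue in sequence.upper())
    ids.append(VOCAB["[SEP]"])
    return ids


def load_fasta(text: str, tokenizer: Callable[[str], object] = tokenize_protein) -> list:
    """Apply an arbitrary tokenizer callable to every sequence in the text."""
    return [tokenizer(sequence) for _, sequence in parse_fasta(text)]
```

Passing the tokenizer in as a plain callable is one way to keep the loader independent of any particular vocabulary. For example, `load_fasta(text, tokenizer=len)` would return sequence lengths instead of token IDs.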
It’s been a very cool week! I have managed to make a lot of progress on the FASTA loader.
I didn’t previously understand all the intricacies of working with FASTA files (such as the fact that FASTA files can use different character sets) and data loaders.
The loader is now capable of loading and featurizing fairly nontrivial FASTA files.
The process of getting this seemingly simple change ready to merge has taught me a lot about the importance of considering diverse users and use cases.
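As an illustration of the character set issue mentioned above, here is a rough sketch of guessing whether a sequence uses a nucleotide or a protein alphabet. The function name `guess_alphabet` and the simplified character sets are my own assumptions, not what the loader actually does, and the heuristic is deliberately naive: a short protein made up only of A, C, G, and T would be reported as a nucleotide sequence.

```python
# Simplified alphabets (assumption): real FASTA files may also contain
# ambiguity codes such as R or Y for nucleotides and B, Z, or X for proteins.
NUCLEOTIDE_CHARS = set("ACGTUN")
PROTEIN_CHARS = set("ACDEFGHIKLMNPQRSTVWY")
IGNORED_CHARS = set("-*")  # gap and stop symbols


def guess_alphabet(sequence: str) -> str:
    """Return 'nucleotide', 'protein', 'unknown', or 'empty' for a sequence."""
    chars = set(sequence.upper()) - IGNORED_CHARS
    if not chars:
        return "empty"
    # Check the smaller alphabet first, because most nucleotide letters
    # are also valid amino acid letters.
    if chars <= NUCLEOTIDE_CHARS:
        return "nucleotide"
    if chars <= PROTEIN_CHARS:
        return "protein"
    return "unknown"
```

Since the alphabets overlap, a guess like this cannot be fully reliable, which is one reason it helps to let the user choose the tokenizer explicitly.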
This week, I was out for a short while because of my COVID-19 vaccination. However, I was still able to get the changes to the FASTA loader ready to merge!
To maintain legacy support in the loader, I needed to make some changes that I hadn’t anticipated. After resolving lots of review comments and fixing a lot of bugs, I’m quite happy with the final product.
Over the next week, I will work on a new PR for the FASTA loader that will bring sharding support into the loader, which will allow users to generate datasets from much bigger FASTA files.
I will also continue to work on developing the ProteinTokenizer.
Hopefully, I will be at a stage where I can begin training HuggingFace models on FASTA data by the end of next week.
FASTA loader sharding is already functioning and should only require minor modifications before being ready to merge.
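For readers unfamiliar with sharding, the general idea can be sketched as follows: instead of reading the whole FASTA file into memory, the records are read lazily and grouped into fixed-size chunks. This is only an illustration under my own assumptions, not the code in the PR; the names `iter_fasta_records` and `iter_shards` and the `shard_size` parameter are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (header, sequence)


def iter_fasta_records(handle: TextIO) -> Iterator[Record]:
    """Lazily yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def iter_shards(handle: TextIO, shard_size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most shard_size records each."""
    if shard_size < 1:
        raise ValueError("shard_size must be at least 1")
    shard: List[Record] = []
    for record in iter_fasta_records(handle):
        shard.append(record)
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:
        # The last shard may hold fewer records than shard_size.
        yield shard


# Usage with an in-memory file standing in for a large FASTA file on disk.
example = io.StringIO(">a\nAC\n>b\nDE\n>c\nFG\n")
for shard in iter_shards(example, shard_size=2):
    print(len(shard), [header for header, _ in shard])
```

Because both functions are generators, only one shard needs to be held in memory at a time, which is what makes much bigger files workable.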
The creation of a Protein BERT Tokenizer took slightly longer than initially anticipated, because I wanted to make sure that I got the design absolutely correct. However, after some discussions with my mentor Seyone and some planning, I am now feeling ready to implement it.