GSoC ’24 final report: Integration of protein language model ESM-2 with DeepChem for predicting binding sites

GSoC project

Project goals

The primary goal of this project was to document and integrate ESM-2, a state-of-the-art protein language model available through Hugging Face, with DeepChem to enhance the prediction of protein binding sites. The integration aimed to leverage the powerful protein representations learned by ESM-2, broadening DeepChem’s capabilities and making it more accessible to the drug discovery community. These goals were met through several contributions, including a comprehensive tutorial on using the Hugging Face ESM-2 model for protein binding site prediction, with evaluation, inference, and visualizations of the true and false positives and negatives for an illustrative use case. Building a DeepChem abstraction for ESM-2 proved more challenging than initially anticipated, however, and the code for this class is still under review.

DeepChem is a comprehensive toolkit for modeling chemical compounds, and I hope to see both its user base and codebase expand as a result of the valuable contributions that my GSoC peers and I have made.

Special thanks to the mentors and DeepChem’s experienced contributors who have advised and supported my work during these months: Rakshit Singh, Stanley Bishop, David Figueroa, and Bharath Ramsundar.

Contributions:

  • Introductory tutorial on protein binding site prediction

  • Tutorial on data gathering and preprocessing from UniProt for binding site prediction

  • Dataset added to the DeepChem datasets suite in AWS storage

  • Tutorial for predicting binding sites using ESM-2

  • Tutorial for protein language models (with GSoC fellow Dhuvi)

  • DeepChem wrapper for the ESM-2 model

- Introductory tutorial on protein binding site prediction:

Developed a comprehensive tutorial (Jupyter notebook) on the science behind binding sites and their identification, covering basic concepts, types of binding sites, mechanisms, methods, applications, and an illustrative case study. The tutorial was merged into the DeepChem repository (see resulting tutorial and PR).

- Tutorial on data gathering and preprocessing from UniProt for binding site prediction:

Developed a tutorial (Jupyter notebook) for downloading and preprocessing protein data from the UniProt website. It accounts for special symbols in the binding/active site annotations and converts the data into training and test sets with binary labels based on protein families. Protein sequences (and their corresponding labels) are chunked to fit the context window of the ESM-2 model. The tutorial was merged into the DeepChem repository (see resulting tutorial and PR).
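For illustration, the sketch below shows the kind of chunking step this preprocessing involves: a sequence and its per-residue binary labels are split into pieces that fit within the ESM-2 context window. The helper name, chunk size, and toy sequence are assumptions for the example, not the exact values used in the tutorial.

```python
# Illustrative sketch: chunking a protein sequence and its per-residue
# labels so they fit within the ESM-2 context window. Names, chunk size,
# and the toy sequence are assumptions, not the tutorial's exact values.

from typing import List, Tuple

MAX_LEN = 1022  # leave room for the special <cls> and <eos> tokens


def chunk_sequence(seq: str, labels: List[int],
                   max_len: int = MAX_LEN) -> List[Tuple[str, List[int]]]:
    """Split a protein sequence and its binary per-residue labels into
    non-overlapping chunks of at most `max_len` residues."""
    assert len(seq) == len(labels), "expected one label per residue"
    chunks = []
    for start in range(0, len(seq), max_len):
        end = start + max_len
        chunks.append((seq[start:end], labels[start:end]))
    return chunks


# Example: a toy sequence with a single hypothetical binding residue.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
labels = [0] * len(seq)
labels[10] = 1  # hypothetical binding-site position
for sub_seq, sub_labels in chunk_sequence(seq, labels, max_len=16):
    print(sub_seq, sub_labels)
```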

- Tutorial for predicting binding sites using ESM-2:

Built a tutorial (Jupyter notebook) implementing ESM-2 for the downstream task of protein binding site prediction. It covers data processing, fine-tuning, evaluation, and inference on an illustrative case study, along with visualizations highlighting correctly predicted residues as well as false positives and false negatives. The tutorial was merged into the DeepChem repository (see resulting tutorial and PR).
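As a rough illustration of the setup this tutorial walks through, the sketch below loads a small ESM-2 checkpoint for token classification and produces per-residue predictions. The checkpoint, label mapping, and toy sequence are assumptions for the example; the tutorial itself additionally covers data processing, fine-tuning, and evaluation.

```python
# Minimal sketch (not the tutorial code): per-residue binding site
# prediction with an ESM-2 checkpoint via Hugging Face transformers.
# The checkpoint and label mapping are assumptions for illustration.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=2)  # 0 = non-binding residue, 1 = binding residue

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_tokens, 2)

# Per-token predictions; the first and last positions are special tokens.
predictions = logits.argmax(dim=-1)[0, 1:-1]
print(predictions.tolist())
```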

- Tutorial for protein language models:

Collaborated with another GSoC fellow, Dhuvi, on an introductory tutorial for protein language modeling. The tutorial was merged into the DeepChem repository (see resulting tutorial and PR).

- DeepChem wrapper:

Developed a DeepChem wrapper around the ESM-2 model from Hugging Face. ESM-2 can be used for both sequence-level tasks and amino acid (token)-level tasks. A PR has been submitted with an ESM-2 model class that supports fine-tuning for sequence-level tasks, although it is still under review (see PR). Token-level modeling requires modifications to other classes such as DeepChem’s NumpyDataset or DiskDataset and would be a good avenue for follow-up work beyond the GSoC timeline.
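To give a sense of the sequence-level functionality the wrapper targets, here is a minimal sketch of the underlying Hugging Face setup it builds on. Since the DeepChem class is still under review, no DeepChem API is shown; the checkpoint and the two-class setup are assumptions for illustration.

```python
# Hypothetical sketch of the sequence-level setup such a wrapper would
# encapsulate; the DeepChem class itself is under review, so only the
# underlying Hugging Face pieces are shown. Checkpoint is an assumption.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # e.g. binder vs. non-binder at sequence level

# A DeepChem wrapper would typically own the tokenizer and model, tokenize
# batches coming from a DeepChem dataset, and expose the usual fit/predict
# interface on top of this.
batch = tokenizer(["MKTAYIAKQRQISFVKSHFSRQ", "GAVLIPFMWST"],
                  padding=True, return_tensors="pt")
logits = model(**batch).logits  # shape: (2, 2), one row per sequence
print(logits.shape)
```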

Lessons learned, future plans and conclusions:

GSoC has been the setting for my first contributions to open source. I started with documentation tutorials, moved on to more complex guides, and eventually began a contribution for a new ESM-2 model class. I have learned best practices for industry, production-level code, how scientific codebases are developed and maintained, and, more generally, how to organize project tasks incrementally, going from simple to more complex work. Although the GSoC program is coming to an end, I plan to address the review feedback on the sequence-level ESM-2 model PR and to extend the wrapper class to token-level tasks such as binding site prediction.

GSoC has been a very valuable learning experience that has helped me develop new skills and build my confidence as a developer. I gained a better understanding of open-source development, including the importance of testing, code reviews, and collaboration. In addition, I made new connections and had opportunities to meet other developers and mentors. My takeaways include starting small, being realistic about projects and timelines, and recognizing the importance of code quality and regular communication in open-source projects.