DeepChem meets Hugging Face pLM ESM-2: Predicting protein binding sites | GSoC 2024

Hey everyone!

I’m Elisa Gómez de Lope, and I’m excited to join the DeepChem community as a contributor through the Google Summer of Code 2024 program.

The project: I’ll be working on integrating the pre-trained ESM-2 protein representations from Hugging Face transformers with DeepChem’s framework capabilities for predicting binding sites. The goal is to develop a straightforward way to predict protein binding sites in DeepChem, leveraging the strengths of both libraries, and documenting the process through a tutorial.

About me: My interests are rooted in applying machine learning to biological data. I recently completed my PhD and am now a researcher at the University of Luxembourg, where I study graph representation learning methods for modeling omics data in the context of Parkinson’s disease. I’ve been following the developments in LLMs and am particularly fascinated by the potential of these models for predicting protein functions and properties.

I’m eager to contribute to the DeepChem open source library by developing a tutorial that helps researchers and enthusiasts make fuller use of protein language models, not only for binding site prediction but also for other use cases (e.g., generating peptide binders, predicting the effect of mutations, or predicting protein-protein interactions), ultimately enabling wider adoption and accelerating drug discovery efforts.

I’ll update this thread weekly with the latest progress on the project, so stay tuned if you’d like to follow my work over the next few months.

You can also find me on…
GitHub: elisagdelope
Twitter: @elisagdelope

Progress Report Week 1: Obtaining data from UniProt

This week I put together a Jupyter notebook that downloads data from the UniProt website and pre-processes it into train/test splits ready for modeling with ESM-2. The notebook covers:

  • :inbox_tray: Downloading Data: Retrieve information from the UniProt website, including details on protein families, binding sites, active sites, and amino acid sequences.

  • :hammer_and_wrench: Processing Data: Handle special symbols (angle brackets and question marks) in binding/active site information and convert this data into binary labels. Each amino acid position in the protein sequences is marked as 1 (binding or active site) or 0 (neither).

  • :scissors: Splitting Data: Divide amino acid sequences and their labels into stratified train/test sets based on UniProt protein families.

  • :arrows_counterclockwise: Chunking Sequences: Split sequences and their labels into non-overlapping chunks of a specified length to define a context window for the ESM-2 model.
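
For readers who want a feel for the processing and chunking steps above, here is a minimal sketch. It is not the notebook's code: the function names `site_labels` and `chunk_pairs` are hypothetical, and it assumes site positions arrive as simple strings such as `"3..5"`, `"<8"`, or `"?"` (1-based, inclusive), which simplifies how UniProt reports them.

```python
import re


def site_labels(seq_len, site_spans):
    """Return one 0/1 label per residue from 1-based, inclusive position strings.

    Assumed input format (a simplification): "12", "3..5", "<8", "?..10", "?".
    """
    labels = [0] * seq_len
    for span in site_spans:
        # Drop the uncertainty markers ('<', '>', '?') before parsing numbers.
        cleaned = re.sub(r"[<>?]", "", span)
        bounds = [int(part) for part in cleaned.split("..") if part]
        if not bounds:
            continue  # position entirely unknown, nothing to label
        # If one end of a range is unknown, only the known position is marked.
        start, end = bounds[0], bounds[-1]
        for pos in range(max(start, 1), min(end, seq_len) + 1):
            labels[pos - 1] = 1
    return labels


def chunk_pairs(sequence, labels, size):
    """Split a sequence and its labels into aligned, non-overlapping chunks."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    return [
        (sequence[i:i + size], labels[i:i + size])
        for i in range(0, len(sequence), size)
    ]


# Toy example with a made-up 10-residue sequence.
sequence = "MKTAYIAKQR"
labels = site_labels(len(sequence), ["3..5", "<8", "?"])
chunks = chunk_pairs(sequence, labels, size=4)
print(labels)
print(chunks)
```

In this sketch the last chunk is simply shorter than the others; whether to pad or drop it depends on how the model's tokenizer and data collator are set up.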

Link to notebook

To-do:

  • Work with David to set up a dataset in DeepChem’s AWS storage
  • Notebook with the basic science of the project: predicting binding sites & protein language model ESM-2

Progress Report Week 2: Starting the binding sites tutorial

I started a Jupyter notebook covering the science behind binding sites, with the following planned outline:
1. Introduction to Binding Sites
2. Basic Concepts
3. Types of Binding Sites
4. Mechanisms of Binding
5. Methods to Study Binding Sites
6. Applications and Relevance
7. Case Studies

Sections 1-3 are drafted; next week I plan to complete the notebook.

Progress Report Week 3: PR Introductory binding sites + ESM-2 vanilla implementation

The Jupyter notebook covering the science behind binding sites is finished; here’s the PR. The tutorial walks through the following sections:
1. Introduction to Binding Sites
2. Basic Concepts
3. Types of Binding Sites
4. Computational Methods to Study Binding Sites
5. A case for protein language models
6. How does a binding pocket look like?
7. Further reading

In addition, I’m developing a prototype, vanilla implementation of ESM-2 for predicting binding sites in this notebook. It’s work in progress.

Progress Report Week 4: ESM-2 prototype + PR feedback

Last week I raised a PR with an intro tutorial for binding sites; this week I addressed some feedback (mostly adding visuals) to make it more beginner-friendly and easy to follow.

The prototype implementation of ESM-2 for predicting binding sites is now up and running (see notebook). However, current performance is poor, likely due to the small sample size used for fine-tuning and the lack of regularization.

For next week, I plan to download a larger training dataset from UniProt, possibly adding it to the DeepChem datasets suite. In parallel, I will start an intro tutorial for protein language models. Stay tuned!

Progress Report Week 5: PR merged + intro tutorial on protein language models

The first PR, with the introductory tutorial on binding sites, was merged! I have submitted a second PR to add a couple of modifications to this tutorial.

I made no progress this week on the ESM-2 prototype for predicting protein binding sites.

I joined forces with another GSoC fellow, Dhuvi, and we have drafted an introductory tutorial for protein language models. We will raise the PR in the next few days.

For next week, I plan to download a larger training dataset from UniProt to improve the prototype’s ESM-2 predictions.