DeepChem meets Hugging Face pLM ESM-2: Predicting protein binding sites | GSoC 2024

Hey everyone!

I’m Elisa Gómez de Lope, and I’m excited about joining the DeepChem community as a contributor, as part of the Google Summer of Code 2024 program.

The project: I’ll be working on integrating the pre-trained ESM-2 protein representations from Hugging Face transformers with DeepChem’s framework capabilities for predicting binding sites. The goal is to develop a straightforward way to predict protein binding sites in DeepChem, leveraging the strengths of both libraries, and documenting the process through a tutorial.

About me: My interests are rooted in applying machine learning to biological data. I recently completed my PhD and am now a researcher at the University of Luxembourg, where I study graph representation learning methods for modeling omics data in the context of Parkinson’s disease. I’ve been following the developments in LLMs and am particularly fascinated by the potential of these models for predicting protein functions and properties.

I’m eager to contribute to the DeepChem open source library by developing a tutorial that guides researchers and enthusiasts to further exploit protein language models, not only for binding site prediction, but for other use cases and functionalities as well (e.g., generating peptide binders, predicting the effect of mutations, or protein-protein interactions), ultimately enabling larger adoption and accelerating drug discovery efforts.

This thread will be updated weekly with the latest progress on the project, so stay tuned if you’re interested in following my work over the next few months.

You can also find me on…
Github: elisagdelope
Twitter: @elisagdelope


Progress Report Week 1: Obtaining data from Uniprot

A Jupyter notebook for downloading data from the UniProt website and pre-processing it to generate train/test splits ready for modeling with ESM-2. This notebook covers:

  • :inbox_tray: Downloading Data: Retrieve information from the UniProt website, including details on protein families, binding sites, active sites, and amino acid sequences.

  • :hammer_and_wrench: Processing Data: Handle special symbols (angle brackets and question marks) in binding/active site information and convert this data into binary labels. Each amino acid position in the protein sequences is marked as 1 (binding/active site) or 0 (non-binding/active site).

  • :scissors: Splitting Data: Divide amino acid sequences and their labels into stratified train/test sets based on UniProt protein families.

  • :arrows_counterclockwise: Chunking Sequences: Split sequences and their labels into non-overlapping chunks of a specified length to define a context window for the ESM-2 model.
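The labeling and chunking steps above can be sketched as follows. This is a minimal illustration with hypothetical helper names (the actual notebook may organize the code differently): binding/active-site annotations become per-residue binary labels, and sequences plus labels are split into non-overlapping fixed-length chunks.

```python
# Hypothetical helpers sketching the week-1 preprocessing steps:
# per-residue binary labels, then non-overlapping fixed-length chunks.

def make_binary_labels(seq_len, site_positions):
    """Mark each residue 1 if annotated as a binding/active site, else 0."""
    labels = [0] * seq_len
    for pos in site_positions:  # positions assumed 0-based in this sketch
        labels[pos] = 1
    return labels

def chunk_sequence(sequence, labels, chunk_size):
    """Split a sequence and its labels into non-overlapping chunks."""
    assert len(sequence) == len(labels)
    return [
        (sequence[i:i + chunk_size], labels[i:i + chunk_size])
        for i in range(0, len(sequence), chunk_size)
    ]

seq = "MKTAYIAKQR"                      # toy 10-residue sequence
labels = make_binary_labels(len(seq), site_positions=[2, 3, 7])
chunks = chunk_sequence(seq, labels, chunk_size=4)
# chunks[0] == ("MKTA", [0, 0, 1, 1]); the last chunk may be shorter.
```

The chunk size here plays the role of the ESM-2 context window; the last chunk of each protein is simply left shorter rather than padded at this stage.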

Link to notebook

To-do:

  • Working with David to set up a dataset in DeepChem AWS storage
  • Notebook with the basic science of the project: predicting binding sites & protein language model ESM-2

Progress Report Week 2: Getting started on the binding sites tutorial

A Jupyter notebook covering the science behind binding sites was started:
1. Introduction to Binding Sites
2. Basic Concepts
3. Types of Binding Sites
4. Mechanisms of Binding
5. Methods to Study Binding Sites
6. Applications and Relevance
7. Case Studies

Sections 1-3 have been covered; the goal for next week is to finish the notebook.

Progress Report Week 3: PR Introductory binding sites + ESM-2 vanilla implementation

The Jupyter notebook covering the science behind binding sites was finished, here’s the PR. The tutorial walks through the following sections:
1. Introduction to Binding Sites
2. Basic Concepts
3. Types of Binding Sites
4. Computational Methods to Study Binding Sites
5. A case for protein language models
6. What does a binding pocket look like?
7. Further reading

In addition, I’m developing a prototype, vanilla implementation of ESM-2 for predicting binding sites in this notebook. It’s a work in progress.
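For readers curious what such a vanilla setup looks like: predicting binding sites per residue is a token classification task, for which Hugging Face transformers provides `EsmForTokenClassification`. The sketch below instantiates a tiny, randomly initialized ESM-2 from a config so it runs without downloading weights; the actual prototype would presumably fine-tune a pre-trained checkpoint instead (e.g., `EsmForTokenClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=2)` — the checkpoint choice is an assumption).

```python
import torch
from transformers import EsmConfig, EsmForTokenClassification

# Tiny randomly initialized ESM-2 so this sketch runs without downloading
# weights; a real run would load a pre-trained checkpoint instead.
config = EsmConfig(
    vocab_size=33,          # ESM-2 amino-acid + special-token vocabulary
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    pad_token_id=1,         # ESM-2 convention: <pad> = 1
    mask_token_id=32,       # ESM-2 convention: <mask> = 32
    num_labels=2,           # 0 = non-binding residue, 1 = binding residue
)
model = EsmForTokenClassification(config)

input_ids = torch.randint(4, 24, (1, 12))   # fake tokenized sequence
with torch.no_grad():
    logits = model(input_ids=input_ids).logits
# One (non-binding, binding) score pair per token position.
print(logits.shape)   # torch.Size([1, 12, 2])
```

Fine-tuning then amounts to minimizing cross-entropy between these per-token logits and the binary labels produced during preprocessing.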

Progress Report Week 4: ESM-2 prototype + PR feedback

Last week I raised a PR with an intro tutorial for binding sites; this week I addressed some feedback (mostly adding visuals) to make it more beginner-friendly and easy to follow.

The prototype implementation of ESM-2 for predicting binding sites is up and running now (see notebook); however, current performance is poor, likely due to the small sample size for fine-tuning and the lack of regularization.

For next week, I plan to download a larger dataset for training from UniProt, possibly adding it to the DeepChem datasets suite. In parallel, I will start an intro tutorial for protein language models. Stay tuned!

Progress Report Week 5: PR merged + intro tutorial on protein language models

The first PR, the tutorial introducing binding sites, was merged! A second PR has been submitted to add a couple of modifications to this tutorial.

No updates with regards to the prototype for ESM-2 predictions on protein binding sites were made this week.

I joined forces with another GSoC fellow, Dhuvi, and an introductory tutorial for protein language models has been drafted. PR will be raised in the next few days.

For next week, I plan to download a larger dataset for training from UniProt to improve the prototype ESM-2 predictions.

Progress Report Week 6: Off-week (LOGML 2024 summer school)

The now-overdue update for last week (week 6 of GSoC) is that I attended the LOGML summer school and hence could not contribute to GSoC. I will catch up in the next few days!

Progress Report Week 7: Larger dataset from UniProt and integration in tutorial

In week 7 (slightly overdue update), I worked on:

  • Reviewing and finalizing Dhuvi’s PR on the introductory tutorial for protein language models.
  • Downloading a larger dataset with sequences and binding sites from UniProt for fine-tuning -> to be added to DeepChem AWS storage (by David Figueroa)
  • Adapting the prototype tutorial to use the dataset downloaded from DeepChem AWS storage.

For next week, all these points should be finalized and a PR should be raised for the tutorial prototype.

Progress Report Week 8: Larger dataset from UniProt and integration in tutorial

The second PR with some modifications on the tutorial for binding sites (see week 5) was merged.

With regards to the prototype for ESM-2 predictions on protein binding sites:

  • Dataset was added to DeepChem AWS storage
  • Accompanying tutorial on UniProt data processing is ready
  • I received some feedback on the tutorial to add some figures and visuals

For next week, I plan to address the feedback on the tutorial prototype and check the data download from the AWS storage. If time allows, I will start prototyping a wrapper of the ESM-2 model from Hugging Face based on the DeepChem HuggingFaceModel class.

Progress Report Week 9: Fine-tuning ESM-2 for binding site prediction tutorial & UniProt data preprocessing tutorial PR-ed

The tutorial for fine-tuning ESM-2 for predictions on protein binding sites was polished and finished; visualizations of a use-case protein binding pocket were added in the inference section. In addition, the accompanying tutorial on processing UniProt data for the binding site prediction task was committed. Here is the link to the PR.

In addition, the joint tutorial with Dhuvi on protein language models is ready to merge.

Next week I will start prototyping a “deepchemized” wrapper of the ESM-2 model based on the DeepChem HuggingFaceModel class.

Progress Report Week 10: Updated PR for fine-tuning ESM-2 tutorial and DeepChem wrapper for ESM-2

This week I worked on:

  • Prototype for ESM-2 predictions:
    • PR updated with CSV file for website renders.
  • New section for DeepChem website tutorials:
    • Investigated how tutorials render on the website, in order to create a new section (new CSV) for language models.
    • New section CSV to be PR-ed next week.
  • DeepChem wrapper for ESM-2 pLM:
    • Started the wrapper for ESM-2.
    • First use case: token classification (e.g., binding site prediction).
    • Padding will be used for labels of different lengths.
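On that last point, padding per-residue label lists is a small but easy-to-get-wrong step. A minimal sketch (hypothetical helper name) using `-100`, the conventional PyTorch/Hugging Face ignore index, so padded positions are excluded from the loss:

```python
# Hypothetical collate helper: pad per-residue label lists to the batch
# maximum length. -100 is the conventional PyTorch / Hugging Face ignore
# index, so padded positions do not contribute to the loss.

IGNORE_INDEX = -100

def pad_labels(label_lists):
    max_len = max(len(labels) for labels in label_lists)
    return [
        labels + [IGNORE_INDEX] * (max_len - len(labels))
        for labels in label_lists
    ]

batch = pad_labels([[0, 1, 1], [1], [0, 0, 1, 0, 1]])
# Every row now has length 5; shorter rows end in -100.
```

Padding labels with `-100` (rather than 0) matters because 0 is itself a valid class here (non-binding residue), and padded positions must not be counted as such.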

Progress Report Week 11: Updated PR for fine-tuning ESM-2 tutorial and DeepChem wrapper for ESM-2

This week I worked on:

  • Prototype for ESM-2 predictions:
    • PR updated with footer and website render assigned for data preprocessing tutorial.
    • PR updated with visualization of inference protein.
  • Deepchem wrapper for ESM-2 pLM:
    • Decided to start with the sequence classification use case instead of token classification, as it’s more straightforward to implement.
    • The implementation for sequence classification is analogous to ChemBERTa’s, and it works locally.
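To illustrate the sequence classification use case: on the Hugging Face side it corresponds to `EsmForSequenceClassification`, which pools the per-residue representations into one prediction per protein. The sketch below uses a tiny randomly initialized config so it runs without downloading weights; the actual wrapper would presumably load a pre-trained checkpoint and plug the model into DeepChem’s HuggingFaceModel machinery, much as the ChemBERTa wrapper does for its RoBERTa backbone (that parallel is an assumption based on the analogy stated above).

```python
import torch
from transformers import EsmConfig, EsmForSequenceClassification

# Tiny randomly initialized ESM-2 for illustration; real code would load
# a pre-trained checkpoint and wrap it in a DeepChem model class.
config = EsmConfig(
    vocab_size=33,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    pad_token_id=1,
    mask_token_id=32,
    num_labels=2,           # one label per whole sequence
)
model = EsmForSequenceClassification(config)

input_ids = torch.randint(4, 24, (2, 16))   # fake batch of two sequences
with torch.no_grad():
    logits = model(input_ids=input_ids).logits
print(logits.shape)   # torch.Size([2, 2]): one score pair per sequence
```

Compared with the token classification head, this setup avoids the label padding and per-residue alignment issues entirely, which is why it is the simpler starting point for the wrapper.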