Target Conditioned Antibody Sequence Generation Using Protein Language Models | GSOC 2024 Project

Hi DeepChem Community!

My name is Dhuvi and I am really excited to be contributing to DeepChem as part of the Google Summer of Code program with Aaron Menezes. The project I will be working on is wiring together a protein language model with the DeepChem ecosystem and training it to produce antigen-specific antibody sequences. Currently, an open-source model of this nature does not exist. If you would like to learn more about the project, check out the official GSOC project page: https://summerofcode.withgoogle.com/programs/2024/projects/drmMTzsC

This forum post and subsequent thread updates will serve as part of DeepChem’s effort to build in public and keep the community apprised of successes, failures, blockers, and everything in between as much as possible. It is my hope that in doing so the community can:

  • Hold me accountable to the high bar that the community has held in DC
  • Provide feedback in real time on what is most interesting, useful, or confusing
  • Take the things I learned from negative results and prune them from their own processes
  • Stress test the model and report back with gross failure modes

I’m excited for the journey ahead. Stay tuned for updates :slight_smile:

Hi DeepChem,

This forum post will be the first update to my project for the summer. This past week I was traveling for my PhD to the AIRR-C conference on adaptive immune receptors, where I was able to present a little bit of our research on combinatorial optimization with ESM-2 as a heuristic. At the conference I had the pleasure of meeting many experts in the field of T-cell receptor and antibody engineering. It was truly a special experience. There, I learned quite a bit about some of the limitations of language models for antibodies in particular, given their trouble with learning non-germline sequences that have undergone affinity maturation via somatic hypermutation. This will be a key challenge in the months ahead.

In terms of deliverables, I have been working on a tutorial to introduce the concept of protein languages and antibodies to the DeepChem community. It's ~95% complete and is available here: https://colab.research.google.com/drive/18XN_8H0Bs7F_2sSY8yO6s8gbm7x2l_Xe#scrollTo=9V9VIy5xjm5D

Would love any and all feedback as I finish up the final iterations with my mentors/colleagues.

See you all next week!

Hi all,

This week's updates include rounding out the tutorial and sending it up to the PR gauntlet. I was able to find some time with my mentor Aaron to go through some of the changes that were made and find new areas to improve the tutorial further. Upon review, it looked like there were a couple of areas where some additional explanation could have helped, and I made those additions later this afternoon. Then I worked on a summarizing figure tying antibody design together with protein language models. I tried doing this in Inkscape, an open-source alternative to Illustrator, and I was pretty impressed with its functionality. Highly recommend it if you’re looking for an open-source vector graphics studio. Lastly, I got to test out the tutorial in the wild with a fellow deepchemmer who was curious about CAR-T and antibodies for the auto-immune space, and they seemed to have a good grasp of the material afterwards, so potentially a successful tutorial? I will keep making tweaks and see what the other mentors have to say on the PR.

TO-DOs for Next Week:

  • Continue work on exploratory data analysis of Antibody:Target pairs
  • Try to come up with a reasonable training set and test set split
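
One possible way to approach the split, sketched here as an assumption rather than the method the project settled on: group the Ab:Target pairs by target so that no target appears in both the training and test sets. The function name `split_by_target`, the tuple layout, and the toy identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_target(pairs, test_fraction=0.2, seed=0):
    """Split (antibody, target) pairs so that no target is shared
    between the train and test sets (hypothetical helper)."""
    by_target = defaultdict(list)
    for antibody, target in pairs:
        by_target[target].append((antibody, target))

    # Shuffle the unique targets reproducibly, then hold out a fraction of them.
    targets = sorted(by_target)
    random.Random(seed).shuffle(targets)
    n_test = max(1, round(len(targets) * test_fraction))
    test_targets = set(targets[:n_test])

    train = [p for t in targets if t not in test_targets for p in by_target[t]]
    test = [p for t in targets if t in test_targets for p in by_target[t]]
    return train, test


# Toy placeholder identifiers, not real data.
pairs = [
    ("AB1", "T1"), ("AB2", "T1"), ("AB3", "T2"), ("AB4", "T3"),
    ("AB5", "T3"), ("AB6", "T4"), ("AB7", "T5"),
]
train, test = split_by_target(pairs)
print(len(train), "train pairs;", len(test), "test pairs")
```

Splitting on exact target identity is the simplest version of this idea; a stricter split might cluster similar targets together first, which this sketch does not attempt.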

See you all next week!

Greetings DeepChem,

Pleased to report that this week has been hugely productive. Monday started off with a review from Bharat, who suggested that we break the tutorial up into multiple tutorials (potentially up to three), each with the following scope:

  1. Adaptive vs. Innate Immune System (Broad Strokes Tutorial)
  2. Introduction to Protein Language Models (High Level Introduction)
  3. Ab Design via Directed Evolution (Take the end of the first tutorial's immunology background on B-cells and antibodies, then refer to the PLM tutorial and go into the Ab stuff)

Additionally, I was able to work in parallel on prototyping a Hugging Face pipeline object with a DeepChem HF model. Finally, I was able to explore the data some more and split apart some of the paired Ab:Epitope data to see what proportion of them were linear epitopes. Interestingly, of the ~9k examples pulled from IEDB, roughly 2k were linear epitopes, 5k were discontinuous (with gaps), and the remainder were small molecules. This could prove to be a bit of a challenge depending on how well the models accept the distribution of gaps.
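
As a rough illustration of that kind of tally, here is a minimal sketch. It assumes, purely for illustration, that a linear epitope is written as a plain amino-acid string and a discontinuous one as a comma-separated list of residue-plus-position tokens such as `K12, G45`; the actual export format and the helper name `classify_epitope` are assumptions, not something taken from the project code.

```python
import re
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# One residue letter followed by its position, e.g. "K45" (assumed format).
RESIDUE_AT_POSITION = re.compile(r"^[ACDEFGHIKLMNPQRSTVWY]\d+$")


def classify_epitope(description):
    """Label an epitope description as linear, discontinuous, or other."""
    text = description.strip().upper()
    # A contiguous run of amino-acid letters is treated as a linear epitope.
    if text and all(ch in AMINO_ACIDS for ch in text):
        return "linear"
    # Several residue+position tokens are treated as a discontinuous epitope.
    tokens = [t.strip() for t in text.split(",")]
    if len(tokens) > 1 and all(RESIDUE_AT_POSITION.match(t) for t in tokens):
        return "discontinuous"
    # Everything else (e.g. small-molecule names) falls through.
    return "other"


# Toy examples, not real dataset rows.
examples = ["SIINFEKL", "K12, G45, E47", "biotin"]
counts = Counter(classify_epitope(e) for e in examples)
print(counts)
```

A string heuristic like this can mislabel short chemical names made only of amino-acid letters, so if the data carries an explicit epitope-type field, that field would be the safer thing to rely on.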

Hopefully next week we can make some final tweaks to each of the tutorials and potentially a prototype Ab model with the pipeline object working as well.

Looking forward to more exciting updates next week :slight_smile:

See y'all soon!

Hello DeepChemeleons,

This week's updates kind of cancel out the previous week’s updates. Unfortunate, but I suppose it is a part of the process. After pulling the original tutorial apart into three different tutorials, it struck me that the innate vs. adaptive tutorial was slightly under-developed to be a standalone tutorial. In addition, with the lack of immunology code in the rest of DeepChem, it didn’t make a ton of sense to write more about the immune system here, so my mentor Aaron and I decided to consolidate the first and third tutorials and pull out the second into its own piece. This leaves us with the following structure for the tutorials:

  1. A single standalone tutorial for protein language models:
     • Covers the basic intuition of language modelling (causal and masked) as well as how this captures co-evolutionary information of protein sequences.
     • Looks at a highly conserved protein (hemoglobin) to see if the language model captures the correct signals at both ends of homology
  2. A well-encapsulated tutorial with an in-depth background for Ab design via pLMs:
     • Developed background from the innate vs. adaptive tutorial
     • Hie et al. 2023 Ab stuff
     • Potential to expand for property prediction
In addition I’ve been playing around with prototyping the models, so expect a number of PRs this upcoming week.

Hope you have a great weekend ahead!

Hi DeepChem,

Not a ton of updates this week, as I was a little bit caught up with PhD work. However, I was able to open the PR for the Ab tutorial and serendipitously discovered that a fellow GSOC contributor was planning a PLM tutorial as well. I was able to reach out to them and set up a scope for us to collaborate. Open science FTW. I am holding off on creating the PR for the PLM tutorial, but the community is free to check it out and provide feedback:

https://colab.research.google.com/drive/13eXPgZpzTOL3c_S7OM7uR6btP8ZWw7zn#scrollTo=kidvMM-11Yjl

The antibody tutorial can be found here:

https://colab.research.google.com/drive/1yCDg77PxhyzEWl0yUcFaSUwr07g2_QUa#scrollTo=aZszIKEF8AG9

Looking forward to hearing your guys’ thoughts!

Dhuvi

Hello DeepChem-ites,

With the first tutorial accepted and merged and the second tutorial currently in review, we are cooking with fire. This week I was able to get back into the code, open up hf_models.py, and see where to slot in the antibody model. As of right now its exact form remains a little fluid, as over the course of the summer I’ve met a few experts who have helped refine the functional form of the model. I won’t say too much about it for now because it is still changing, but the EDA thus far has been immensely helpful, as was the tutorial. Looking forward to the weekend and early next week, when I should have some prototypes ready for discussion with my mentor and the rest of the GSOC people who are also working on pLM-related work.

Looking forward,

Dhuvi