Target Conditioned Antibody Sequence Generation Using Protein Language Models | GSOC 2024 Project

dxk_23 · May 31, 2024, 3:31pm

Hi DeepChem Community!

My name is Dhuvi and I am really excited to be contributing to DeepChem as part of the Google Summer of Code program with Aaron Menezes. The project I will be working on is wiring together a protein language model with the DeepChem ecosystem and training it to produce antigen-specific antibody sequences. Currently, an open-source model of this nature does not exist. If you would like to learn more about the project, check out the official GSOC project page: https://summerofcode.withgoogle.com/programs/2024/projects/drmMTzsC

This forum post and subsequent thread updates will serve as part of DeepChem’s effort to build in public and keep the community apprised of successes, failures, blockers, and everything in between as much as possible. It is my hope that in doing so the community can:

Hold me accountable to the high bar that the community has held in DC
Provide feedback in real time on what is most interesting, useful, or confusing
Take the things I learned from negative results and prune them from their own processes
Stress test the model and report back with gross failure modes

I’m excited for the journey ahead. Stay tuned for updates!

dxk_23 · June 8, 2024, 3:46pm

Hi DeepChem,

This forum post will be the first update to my project for the summer. This past week I was traveling for my PhD to the AIRR-C conference on adaptive immune receptors, where I was able to present a little bit of our research on combinatorial optimization with ESM-2 as a heuristic. At the conference I had the pleasure of meeting many experts in the field of T-cell receptor and antibody engineering. It was truly a special experience. There, I learned quite a bit about some of the limitations of language models for antibodies in particular: they have trouble learning non-germline sequences that have undergone affinity maturation via somatic hypermutation. This will be a key challenge in the months ahead.

In terms of deliverables, I have been working on a tutorial to introduce the concept of protein languages and antibodies to the DeepChem community. It’s ~95% complete and is available here: https://colab.research.google.com/drive/18XN_8H0Bs7F_2sSY8yO6s8gbm7x2l_Xe#scrollTo=9V9VIy5xjm5D

Would love any and all feedback as I finish up the final iterations with my mentors/colleagues.

See you all next week!

dxk_23 · June 15, 2024, 1:53am

Hi all,

This week’s updates include rounding out the tutorial and sending it up to the PR gauntlet. I was able to find some time with my mentor Aaron to go through some of the changes that were made and find new areas to improve the tutorial further. Upon review, there were a couple of areas where some additional explanation could have helped, and I added those later in the afternoon. Then I worked on a summarizing figure tying antibody design to protein language models. I tried doing this in Inkscape, an open-source alternative to Illustrator, and I was pretty impressed with its functionality. Highly recommend it if you’re looking for an open-source vector graphics studio. Lastly, I got to test out the tutorial in the wild with a fellow deepchemmer who was curious about CAR-T and antibodies for the autoimmune space, and they seemed to have a good grasp of the tutorial, so potentially a successful tutorial? I will keep making tweaks and see what the other mentors have to say on the PR.

TO-DOs for Next Week:

Continue work on exploratory data analysis of Antibody:Target pairs
Try to come up with a reasonable training set and test set split

See you all next week!

dxk_23 · June 21, 2024, 4:44pm

Greetings DeepChem,

Pleased to report that this week has been hugely productive. Monday started off with a review from Bharat, who suggested that we break the tutorial up into multiple tutorials (potentially up to three), each with the following scope:

Adaptive vs. Innate Immune System (Broad Strokes Tutorial)
Introduction to Protein Language Models (High Level Introduction)
Ab Design via Directed Evolution (Take the end of the first tutorial’s immunology background on B-cells and antibodies, then refer to the PLM tutorial and go into the Ab material)

Additionally, I was able to work in parallel on prototyping a Hugging Face pipeline object with a DeepChem HF model. Finally, I was able to explore the data some more and split apart some of the paired Ab:Epitope data to see what proportion of them were linear epitopes. Interestingly, of the ~9k examples pulled from IEDB, roughly 2k were linear epitopes, 5k were discontinuous (with gaps), and the remainder were small molecules. This could prove to be a bit of a challenge depending on how well the models accept the distribution of gaps.
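A minimal sketch of the kind of split described above, under stated assumptions: the input format here is hypothetical (a contiguous peptide string for linear epitopes, a comma-separated list of residue-plus-position entries for discontinuous ones), and `classify_epitope` is an illustrative helper, not part of DeepChem or IEDB tooling. A real pipeline should rely on the epitope type recorded in the export itself rather than on string patterns.

```python
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# A contiguous run of standard residues, e.g. "SIINFEKL".
LINEAR_RE = re.compile(rf"^[{AMINO_ACIDS}]+$")
# One residue letter followed by its position, e.g. "K417".
RESIDUE_POS_RE = re.compile(rf"^[{AMINO_ACIDS}]\d+$")


def classify_epitope(description: str) -> str:
    """Label an epitope description as linear, discontinuous, or other.

    Caveat: a short chemical name made only of residue letters would be
    mislabelled as linear, so this is only a first-pass heuristic.
    """
    text = description.strip().upper()
    if LINEAR_RE.match(text):
        return "linear"
    parts = [p.strip() for p in text.split(",")]
    if len(parts) > 1 and all(RESIDUE_POS_RE.match(p) for p in parts):
        return "discontinuous"
    return "other"


# Toy inputs for illustration only; these are not rows from the dataset.
examples = ["SIINFEKL", "K417, E484, N501", "penicillin G"]
counts = Counter(classify_epitope(e) for e in examples)
print(counts)
```

Counting the labels this way gives the proportions per class, which is the number that matters when deciding how much gap handling the model needs.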

Hopefully next week we can make some final tweaks to each of the tutorials and potentially a prototype Ab model with the pipeline object working as well.

Looking forward to more exciting updates next week

See y’all soon!

dxk_23 · June 28, 2024, 9:47pm

Hello DeepChemeleons,

This week’s updates kind of cancel out the previous week’s updates. Unfortunate, but I suppose a part of the process. After pulling the original tutorial apart into three different tutorials, it struck me that the innate vs. adaptive tutorial was slightly underdeveloped as a standalone tutorial. In addition, given the lack of immunology code in the rest of DeepChem, it didn’t make a ton of sense to write more about the immune system here, so my mentor Aaron and I decided to consolidate the first and third tutorials and pull out the second into its own piece. This leaves us with the following structure for the tutorials:

A single standalone tutorial for protein language models:

Covers the basic intuition of language modelling (causal and masked) as well as how this captures co-evolutionary information of protein sequences.
Looks at a highly conserved protein (hemoglobin) to see if the language model captures the correct signals at both ends of homology

A well encapsulated tutorial with an in-depth background for Ab design via pLMs:

Developed background from the innate vs adaptive tutorial
Hie et al 2023 Ab stuff
Potential to expand for property prediction

In addition I’ve been playing around with prototyping the models, so expect a number of PRs this upcoming week.

Hope you have a great weekend ahead!

dxk_23 · July 5, 2024, 3:47pm

Hi DeepChem,

This week, not a ton of updates as I was a little bit caught up with PhD work. However I was able to open the PR for the Ab tutorial and serendipitously discovered that a fellow GSOC contributor was planning a PLM tutorial as well. Was able to reach out to them and set up a scope for us to collaborate. Open science FTW. Holding off on creating the PR for the PLM tutorial but the community is free to check it out and provide feedback:

https://colab.research.google.com/drive/13eXPgZpzTOL3c_S7OM7uR6btP8ZWw7zn#scrollTo=kidvMM-11Yjl

The antibody tutorial can be found here:

https://colab.research.google.com/drive/1yCDg77PxhyzEWl0yUcFaSUwr07g2_QUa#scrollTo=aZszIKEF8AG9

Looking forward to hearing your guys’ thoughts!

Dhuvi

dxk_23 · July 12, 2024, 4:03pm

Hello DeepChem-ites,

With the first tutorial accepted as a merge and the second tutorial currently in review, we are cooking with fire. This week I was able to get back into the code, open up hf_models.py, and see where to slot in the Antibody model. As of right now its exact form remains a little fluid, as over the course of the summer I’ve met a few experts who have helped refine the functional form of the model. I won’t speak too much on it for now because it remains fluid, but the EDA thus far has been immensely helpful, as was the tutorial. Looking forward to the weekend and early next week, when I should have some prototypes ready for discussion with my mentor and the rest of the GSOC people who are also working on pLM-related work.

Looking forward,

Dhuvi

dxk_23 · July 21, 2024, 11:25pm

Hi all,

This week’s updates were unfortunately slightly lackluster. I was hoping to get to the stage of setting up the training runs, but a paper submission for ICML consumed a majority of the week.

I was able to clean up the code that I have been working on and share it with the GSOC students to hopefully finalize a prototype that is ready for training. The plan has two stages: first implement the fill-mask functionality in DC, then instantiate a specific antibody model as its own class (similar to Shiva’s ProtBERT).

While I’ll be away for the latter part of this coming week attending ICML, I’ll try to get some feedback on the implementation and think about how to refactor and merge changes before I leave, and enjoy some coding from 30,000 ft.

Looking Forward,
Dhuvi

dxk_23 · July 26, 2024, 4:05pm

Hi all,

Posting this week’s updates from Vienna! I was able to do some offline async coordination with Elisa on the tutorial and got some great feedback on the masking pipeline from Shiva. I also got a chance to meet David Zhang, a fellow DeepChem GSOC student, here, and we got to catch up on our work outside of DC and talk a little bit about our projects too!

Linking David’s work for those interested in single cell featurizers using LLMs:

For next week, I’m looking forward to opening a PR with the updated HF models page and then using that as a springboard to implement the Ab design functionality included in the tutorial.

Auf Wiedersehen from Vienna,
Dhuvi

dxk_23 · August 2, 2024, 4:07pm

Hello DC Fam,

This week’s updates include syncing with Elisa to troubleshoot an issue with NBReviewer so we can get our joint tutorial up on DC. I think that some of the table rendering is a little bit shoddy, but when I sent the file over to Elisa it rendered fine. I will try to get that merged today. Other than that, I was also able to meet with Shiva and hash out what the best set of abstractions would be for the protein language models that will be integrated into the DC codebase. We landed on a task-based distinction of classes that will probably look something like the following:

ProteinLMFeaturizer(HF_Model):
- Wrapper for encoder models that take sequence to vector space and can be finetuned for classification/regression purposes
ProteinMLMDesign(HF_Model):
- Wrapper for models with masked language modeling training to re-design proteins. Will implement functions for masking specific residues and designing proteins for unmasking. With a property prediction model, can be expanded to do genetic algorithm based optimization
ProteinLMForSequenceGeneration(HF_Model):
- Wrapper for decoder only or encoder:decoder models for autoregressive generation (unconditional or conditional).
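As a rough illustration of that three-way split, here is a plain-Python skeleton. Everything in it is an assumption for discussion: `HFModelStub` is a hypothetical stand-in for the shared Hugging Face wrapper rather than DeepChem’s actual class, and the method names (`featurize`, `mask_residues`, `design`, `generate`) and the default mask token are placeholders. Only `mask_residues` is implemented, since it is a pure string operation.

```python
from typing import List, Sequence


class HFModelStub:
    """Hypothetical stand-in for the shared Hugging Face wrapper class."""

    def __init__(self, model_name: str):
        self.model_name = model_name


class ProteinLMFeaturizer(HFModelStub):
    """Encoder models: sequence -> vector, fine-tunable downstream."""

    def featurize(self, sequences: Sequence[str]) -> List[List[float]]:
        raise NotImplementedError("would run the encoder and pool hidden states")


class ProteinMLMDesign(HFModelStub):
    """Masked-LM models: mask chosen residues, then redesign by unmasking."""

    def mask_residues(self, sequence: str, positions: Sequence[int],
                      mask_token: str = "<mask>") -> List[str]:
        # Replace the residues at the given 0-based positions with the mask.
        tokens = list(sequence)
        for pos in positions:
            tokens[pos] = mask_token
        return tokens

    def design(self, sequence: str, positions: Sequence[int]) -> str:
        raise NotImplementedError("would fill each mask with a model prediction")


class ProteinLMForSequenceGeneration(HFModelStub):
    """Decoder-only or encoder-decoder models for autoregressive generation."""

    def generate(self, prompt: str = "", max_length: int = 128) -> str:
        raise NotImplementedError("would sample residues left to right")


designer = ProteinMLMDesign("placeholder-model-name")
print(designer.mask_residues("EVQLV", [1, 3]))
```

Keeping one class per task means each wrapper only has to expose the methods that make sense for its model family, at the cost of some duplicated loading logic in the shared base.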

Once we set up the codebase, I look forward to starting training.

See you all next week,
Dhuvi

dxk_23 · August 9, 2024, 4:14pm

Hello DeepChemites!

After adjusting to the jet lag and the fact that the summer’s close to ending, I’m pleased to report that last week was quite a productive week :). There were exciting updates on multiple fronts. In tutorials, I got the PR for the Language Model tutorial updated to adopt a more formal tone, reflective of the textbook that the DeepChem tutorials are shaping up to be. I also opened a PR for the fill-mask pipeline and included the accompanying tests for the code, which passed locally. Hopefully both of these get merged soon so that I can move forward with the next prototype. Finally, the last update I have is regarding the data. I’ve been reading more antibody papers and it appears that even the sequence-based methods use the SAbDab dataset, so I pulled that and extracted the single-chain antigens, which offer a stronger field for sequence models to shine. This allowed us to nearly 8x the data.
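A minimal sketch of the single-chain antigen filter, with assumptions stated up front: it assumes a tab-separated summary table with `antigen_chain` and `antigen_type` columns in which multi-chain antigens are written with a `|` separator and missing antigens as `NA`. The rows below are made-up toy entries, not real structures, and `single_chain_protein_antigens` is an illustrative helper name.

```python
import csv
import io

# Toy rows for illustration only; the identifiers are not real PDB entries.
TOY_SUMMARY = (
    "pdb\tHchain\tLchain\tantigen_chain\tantigen_type\n"
    "toy1\tH\tL\tA\tprotein\n"
    "toy2\tH\tL\tA | B\tprotein | protein\n"
    "toy3\tH\tL\tNA\tNA\n"
    "toy4\tH\tL\tC\tHapten\n"
    "toy5\tH\tL\tP\tpeptide\n"
)


def single_chain_protein_antigens(handle):
    """Keep rows whose antigen is exactly one protein or peptide chain."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        chains = [c.strip() for c in row["antigen_chain"].split("|")]
        types = {t.strip().lower() for t in row["antigen_type"].split("|")}
        has_one_chain = len(chains) == 1 and chains[0] not in ("", "NA")
        if has_one_chain and types <= {"protein", "peptide"}:
            kept.append(row)
    return kept


rows = single_chain_protein_antigens(io.StringIO(TOY_SUMMARY))
print([r["pdb"] for r in rows])
```

Restricting to single-chain protein or peptide antigens keeps every example expressible as one antibody sequence paired with one antigen sequence, which is the form a sequence-only model can consume directly.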

Looking forward to merging in the code updates and discussing what the future of the model abstractions looks like.

Cheers,
Dhuvi

dxk_23 · August 17, 2024, 12:55am

Hi DeepChem,

Pleased to report that this week was another great week for work! This week saw a few rounds of review, tons of learning about Git processes in the wild (which I’m especially thankful for) and also a reformatted class abstraction for the Ab language model that works with a variety of language models. Currently it implements the protein re-design from Hie et al and I’m curious to see which directions it makes sense taking once training is underway. Sorry for the terse update, it’s time to build.
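For readers who have not seen the re-design idea, here is a toy sketch in the spirit of Hie et al. 2023, not the class’s actual implementation: each model proposes single-residue substitutions that it scores as more likely than the wild-type residue, and only substitutions proposed by enough models are kept. The scorers below are hand-written probability tables standing in for real language models, and `recommend_substitutions` and `min_votes` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A scorer maps (sequence, position) -> {residue: probability}.
Scorer = Callable[[str, int], Dict[str, float]]
Mutation = Tuple[int, str, str]  # (0-based position, wild type, substitution)


def recommend_substitutions(sequence: str, scorers: List[Scorer],
                            min_votes: int = 2) -> List[Mutation]:
    """Return substitutions that at least `min_votes` scorers prefer."""
    votes: Counter = Counter()
    for scorer in scorers:
        for pos, wild_type in enumerate(sequence):
            probs = scorer(sequence, pos)
            wt_prob = probs.get(wild_type, 0.0)
            for residue, prob in probs.items():
                # A scorer "votes" for a residue it ranks above wild type.
                if residue != wild_type and prob > wt_prob:
                    votes[(pos, wild_type, residue)] += 1
    return sorted(m for m, n in votes.items() if n >= min_votes)


def toy_scorer_a(seq: str, pos: int) -> Dict[str, float]:
    if pos == 1:
        return {"A": 0.6, "G": 0.3, "S": 0.1}
    return {seq[pos]: 1.0}


def toy_scorer_b(seq: str, pos: int) -> Dict[str, float]:
    if pos == 1:
        return {"A": 0.5, "G": 0.4, "S": 0.1}
    if pos == 2:
        return {"S": 0.7, "T": 0.3}
    return {seq[pos]: 1.0}


mutations = recommend_substitutions("QGT", [toy_scorer_a, toy_scorer_b])
print(mutations)
```

With these toy tables, both scorers prefer A over G at position 1, while only one prefers S over T at position 2, so only the first substitution survives the vote.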

Till next time,
Dhuvi

dxk_23 · September 17, 2024, 2:57pm

Hi DeepChemilia,

I have a few exciting updates on the antibody modeling work. I haven’t posted in a while, so I will try to break them up into a pseudo-linear time frame for the illusion of continuity of weekly updates.

Week 08/19-08/23

This week I spent some time with the DeepAbLLM class abstraction and really broke it down into a tiered development trajectory. Breaking down the desired functionality and weighing the utility, novelty, and expected effort and yield has been an entirely new way of thinking, and I can see why these meta design practices exist in the SWE space. Instead of going in guns blazing and accumulating the risk of getting lost in the weeds, I found the minimal viable code implementation that simply does the thing and, in the notes section of the docs, laid out the currently implemented, WIP, and planned functionality. The think-like-a-SWE element I’ve begun to pick up has had a significantly higher ROI than I would have appraised it if you had asked me earlier this summer. Currently the abstraction works with the ProtTrans models (ProtBERT, ProtT5, etc.) as well as the IgBERT and AbLang models. I hope to expand this to the ESM models as well.

Week 08/26-08/30

This “week” I was able to make a simple change to the code that alters the masking and tokenization scheme to allow for handling the ESM class of models. I tried ESM-1b, ESM-1v, and ESM-2 successfully, so the current class should be a very beginner-friendly entry point to recreate and extend the methods laid out in Hie et al. 2023 for the evolution of antibody sequences with large language models. The next steps on this front are further finetuning an arbitrary model and investigating the plan to incrementally bake in epitope information and evaluate performance gains along the way.
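To make the tokenization difference concrete, here is a small sketch, not the code in the class itself. It relies on two conventions of the public Hugging Face checkpoints: ProtBERT-style tokenizers expect space-separated residues with a `[MASK]` token, while ESM-style tokenizers take the contiguous sequence with a `<mask>` token. The function name and the `style` argument are hypothetical.

```python
from typing import Iterable


def format_masked_input(sequence: str, positions: Iterable[int], style: str) -> str:
    """Build the masked input string for a given tokenizer convention."""
    if style == "bert":    # ProtBERT-style: spaced residues, [MASK]
        mask_token, separator = "[MASK]", " "
    elif style == "esm":   # ESM-style: contiguous residues, <mask>
        mask_token, separator = "<mask>", ""
    else:
        raise ValueError(f"unknown tokenizer style: {style!r}")
    masked = set(positions)
    tokens = [mask_token if i in masked else aa for i, aa in enumerate(sequence)]
    return separator.join(tokens)


print(format_masked_input("EVQLV", [2], "bert"))  # E V [MASK] L V
print(format_masked_input("EVQLV", [2], "esm"))   # EV<mask>LV
```

Isolating this step in one function means the rest of the masking pipeline can stay identical across model families.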

Week 09/02-09/06

This “week” I spent my time implementing a very simple prototype of an epitope-conditioned model in a non-deepchemmy way, just to see what the pain points in the data and model training dynamics looked like without the extra machinery of the DeepChem training code involved. DeepChem has a nice abstraction for Seq2Seq models that inherits from TorchModel but not from the HuggingFace model, and it appears that an overhaul of this codebase might be worthwhile if the model’s performance is competitive. Hence the need for a quick and dirty 0-1 exploration script to see if the model class produces coherent output.

Week 09/09-09/13

Last week I got feedback to make the training scripts deepchemic if I wanted to avoid the OOM issues and train using the DFS hardware, so I went back to the DeepAbLLM abstraction and tweaked the init function so that it resembles the ChemBERTa function call a little more. Specifically, the init now takes a parameter that selects the model_type and takes the hyperparameters for the model config as a dictionary, which allows the user to pretrain a much smaller network from scratch. In the meantime I’m still juggling the non-deepchemmy version, just to see if we can generate any antibodies for clinically relevant but not overly saturated target spaces. Looking forward to next week.
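A hedged sketch of what such an init could look like; this is not the actual DeepAbLLM signature. The class name `AntibodyLMSketch`, the default values (BERT-base-like sizes chosen only for illustration), and the key names are assumptions. The one real constraint encoded is that a transformer’s hidden size must be divisible by its number of attention heads.

```python
from typing import Any, Dict, Optional

# Illustrative defaults only; not the values used by the real class.
DEFAULT_CONFIG: Dict[str, Any] = {
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
}


class AntibodyLMSketch:
    """Hypothetical init: a model_type plus a dict of config overrides."""

    def __init__(self, model_type: str = "bert",
                 config: Optional[Dict[str, Any]] = None):
        self.model_type = model_type
        # User-supplied hyperparameters override the defaults.
        self.config = {**DEFAULT_CONFIG, **(config or {})}
        if self.config["hidden_size"] % self.config["num_attention_heads"] != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")


# A much smaller network to pretrain from scratch.
small = AntibodyLMSketch(
    model_type="bert",
    config={"hidden_size": 128, "num_hidden_layers": 2, "num_attention_heads": 4},
)
print(small.config)
```

Passing the hyperparameters as a dictionary keeps the init signature stable while still letting a user shrink the network enough to fit on limited hardware.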