GSoC '24 Final Report | Target Conditioned Antibody Sequence Generation Using Protein Language Models

Final Report

Project Title: Target Conditioned Antibody Design via Protein Language Models

Contributor: Dhuvarakesh (Dhuvi) Karthikeyan
Mentor: Aaron Rock Menezes

1. Project Overview

Broadly, the primary objective of the project was to integrate code enabling antibody design into the DeepChem codebase. Specifically, I wanted to carry some of my previous work on target-conditioned T-cell receptor design into the open-source space, for a less experimental and more widely used therapeutic modality: antibodies. Monoclonal antibodies (mAbs), a subset of Adaptive Immune Receptors (AIRs), precisely target molecular surfaces at sub-protein resolution. They are used in cancer checkpoint blockade and treatments for viral infection, and even serve as antivenom for snakebites. Existing antibody discovery methods are time-consuming and resource-intensive, motivating the recent spike in papers on computational design. This project explores the use of protein language models (pLMs) for epitope-specific antibody design.

2. Summer in Review

When I applied to GSoC, one of the most important things I hoped to gain was the ability to write production-level code and, even before that, to think in systems. As a PhD student studying computational biology, I had built up significant domain knowledge and intuition for how to approach problems, as well as the ability to build, train, and debug large models. However, I entered the summer a complete novice at writing code meant to work with other code. Within a few weeks, I was surprised by the fluency with which I could navigate the DeepChem codebase, not only to find tools and functions, but more importantly, to determine a reasonable plan of attack for making a contribution. Dr. Ramsundar can testify to this. Compared to the initial proposal, the actual contributions were far more harmonious with the existing code.

Beyond developing my skills as a developer (you're welcome for the smile), I got a chance to learn best practices for breaking up functionality and thinking in abstractions. One of my notorious weaknesses is writing code in whatever style makes sense to me. A key lesson from GSoC was writing code in the style that fits a particular repository, adapting my own style to match a larger organization's.

Something I had to unlearn over the course of the summer was chasing the SoTA on a particular metric. While prevalent in academia, on the production side I learned that simpler methods with demonstrated real-world use cases are preferred over complicated methods with unverified real-world performance. This changed the trajectory of the project: instead of endeavoring to build an encoder-decoder transformer and training it piecemeal, I started with the simplest yet still effective approach and iteratively added complexity. This not only helped build intuition, but also made explicit the performance gained from each additional layer of complexity, and it brought contributions into the codebase sooner!

3. Key Contributions

  • Motivating Tutorial: A tutorial breaking down the essential biology of antibodies and why they are useful, along with a simple demonstration of the Hie et al. 2023 antibody redesign method. PR
  • Extending Hugging Face Functionality: Implemented the Hugging Face fill-mask pipeline object from scratch in DeepChem to enable antibody redesign via iterative unmasking (see the sketch after this list). PR
  • Protein Language Model Tutorial: Created a more in-depth tutorial on the inductive biases that help protein language models succeed across a wide range of tasks. Worked closely with fellow GSoC contributor Elisa in a fun cross-over contribution! PR
  • Antibody Modeling Class: Added the Antibody Modeling class abstraction to DeepChem, giving users the ability to optimize antibody sequences by masking and unmasking residues, following Hie et al.'s method. PR
  • (In progress): Extending the class abstraction to more complex design methods with decreasing dependence on seed sequences and increasing dependence on epitope sequences.
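To make the masking-and-unmasking idea concrete, below is a minimal sketch of Hie et al.-style redesign written directly against the Hugging Face transformers API with a masked protein language model. The ESM-2 checkpoint, the selection rule, and the example sequence are all illustrative assumptions; this is not the exact API of the DeepChem class.

```python
# A minimal, illustrative sketch of masked-LM antibody redesign in the spirit
# of Hie et al. (2023). The checkpoint, thresholding rule, and example
# sequence are assumptions for demonstration, not DeepChem's exact API.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def suggest_mutations(seq: str, top_k: int = 3):
    """Mask each residue in turn and collect substitutions that the pLM
    assigns higher probability than the wildtype residue."""
    suggestions = []
    for i in range(len(seq)):
        masked = seq[:i] + tokenizer.mask_token + seq[i + 1:]
        inputs = tokenizer(masked, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = logits[0, i + 1].softmax(-1)  # +1 skips the prepended CLS token
        wt_prob = probs[tokenizer.convert_tokens_to_ids(seq[i])]
        for tok_id in probs.topk(top_k).indices.tolist():
            aa = tokenizer.convert_ids_to_tokens(tok_id)
            # Keep single-letter amino acids that outscore the wildtype.
            if len(aa) == 1 and aa.isalpha() and aa != seq[i] and probs[tok_id] > wt_prob:
                suggestions.append((i, seq[i], aa, round(probs[tok_id].item(), 3)))
    return suggestions

# Hypothetical heavy-chain framework fragment, purely for illustration.
print(suggest_mutations("EVQLVESGGGLVQPGG"))
```

The DeepChem class wraps equivalent logic behind a cleaner abstraction, so users do not have to manage tokenizer offsets or masking loops themselves.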

4. Current Status

  • Antibody LLM Integration: The model class is functional and currently under review by the DeepChem maintainers. It can already be used for the masked language modeling task, and it can also be trained and fine-tuned on specific datasets (a fine-tuning sketch follows this list).
  • Documentation: All necessary documentation has been included in the RST docs. An updated tutorial demonstrating the new class is the next step once the code is merged.
  • Dataset: (In progress) A curated dataset of antibody-epitope pairs is ready for DeepChem integration, which will allow members of the OSS community to try out their own design methods in this space.
  • Testing: All merged PRs pass unit tests and have been vetted extensively by the maintainers.
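As a concrete illustration of the training and fine-tuning path mentioned above, here is a hedged sketch of an MLM fine-tuning loop, again written directly against the Hugging Face transformers API. The checkpoint and the placeholder sequences are assumptions; the DeepChem class is meant to wrap equivalent logic.

```python
# Illustrative MLM fine-tuning loop; the antibody fragments below are
# placeholders and the checkpoint is an assumption, not DeepChem's API.
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

MODEL = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

sequences = ["EVQLVESGGGLVQPGG", "QVQLQQSGAELARPGA"]  # placeholder fragments
encodings = tokenizer(sequences, padding=True, return_tensors="pt")
dataset = [{"input_ids": ids, "attention_mask": mask}
           for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])]

# The collator randomly masks 15% of residues and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
loader = DataLoader(dataset, batch_size=2, collate_fn=collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss  # collator supplies masked inputs and labels
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The intent of the DeepChem class is to hide this boilerplate so that users can go from a dataset of antibody sequences to a fine-tuned model with far fewer moving parts.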

5. Future Directions

  • Additional Task Support: Currently working on integrating additional, more complex tasks such as autoregressive generation of antibody (Ab) sequences and featurization of antibodies for classification tasks.
  • Large-Scale Fine-Tuning: Fine-tuning a large-scale model.
  • Long-term Maintenance: As the number of Hugging Face models in DeepChem grows (we added three this summer), I would love to help maintain proper class abstractions, code organization, and functionality, especially in light of updates to the dependencies.