Google Summer of Code 2024 Ideas

davidRFB · March 8, 2024, 2:44pm

Protein Language Models

Hugging Face models
For some protein generation applications, training these models is an expensive task. Transformer-type architectures are usually trained on large datasets of around 150 million sequences or more. DeepChem offers a Hugging Face wrapper that can be used to load pre-trained models, and it is now possible to use masked language modeling to fill in parts of a sequence. However, fine-tuning needs a considerable amount of RAM; see the example in https://github.com/deepchem/deepchem/issues/3838#issuecomment-1956970012
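To make the masked language modeling idea concrete, here is a minimal sketch of how a protein sequence could be prepared for it. This is not the DeepChem or Hugging Face implementation: the `<mask>` token, the 15% masking rate, and the helper name `mask_sequence` are assumptions for illustration, and real tokenizers define their own mask token and masking scheme.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: the real mask token depends on the tokenizer


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a fraction of residues; return (tokens, labels).

    tokens holds the sequence with some residues replaced by MASK_TOKEN.
    labels holds the hidden residue at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    # Mask at least one residue, but never more than the sequence contains.
    n_masked = min(len(sequence), max(1, round(len(sequence) * mask_fraction)))
    masked_positions = set(rng.sample(range(len(sequence)), n_masked))

    tokens, labels = [], []
    for i, residue in enumerate(sequence):
        if i in masked_positions:
            tokens.append(MASK_TOKEN)
            labels.append(residue)  # what the model should predict here
        else:
            tokens.append(residue)
            labels.append(None)  # position ignored by the training loss
    return tokens, labels


tokens, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(" ".join(tokens))
```

A model trained this way learns to predict the hidden residues from the surrounding context, which is the same mechanism used at inference time to fill in positions of a sequence.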
Another nice use of LLMs is conditional generation (https://www.nature.com/articles/s41587-023-02115-w), which would be a nice application. Models such as ProtGPT2 on Hugging Face offer this capability: https://huggingface.co/nferruz/ProtGPT2
Finally, another application of these LLMs is extracting embeddings to be used as features in ML models. One example is "Low-N protein engineering with data-efficient deep learning" (https://www.nature.com/articles/s41592-021-01100-y), which uses UniRep, a large protein language model, to generate features for a supervised learning task. A Hugging Face model with that option is ProteinBERT: https://huggingface.co/GrimSqueaker/proteinBERT https://github.com/nadavbra/protein_bert/tree/master
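As a rough illustration of the embedding use case, one common way to turn per-residue embeddings into a fixed-length feature vector is to average them over the real (non-padding) positions. The sketch below assumes the language model has already produced the embeddings; the toy numbers and the helper name `mean_pool` are hypothetical, and whether a particular model pools this way should be checked against its documentation.

```python
def mean_pool(residue_embeddings, attention_mask):
    """Average per-residue vectors over positions where the mask is 1."""
    dim = len(residue_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(residue_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no positions")
    return [total / count for total in totals]


# Toy stand-in for model output: 4 positions, 3 dimensions each.
# The last position is padding and must not affect the result.
embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],
]
mask = [1, 1, 1, 0]

features = mean_pool(embeddings, mask)
print(features)
```

The resulting vector has the same length for every protein, so it can be passed directly to a standard supervised model.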

Other models that are not available on Hugging Face but may be of interest: https://www.nature.com/articles/s41587-022-01618-2#code-availability
Other examples, for protein structure generation: https://huggingface.co/spaces/simonduerr/ProteinMPNN

Antibody support.
A couple of definitions to start the discussion:

Antibody: an immunoprotein responsible for specifically recognizing and binding to potentially pathogenic molecules.
Antigen: the molecule that the antibody targets.

Some problems that can be studied in antibody design are structural, for example the accurate modeling of antibody-antigen pairs, especially at the interaction sites (https://www.sciencedirect.com/science/article/pii/S0959440X22000586). For this kind of task it is important to have structural databases such as:

https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/about
Once the dataset is selected, the models used for this task should be sequence-to-structure or structure-to-structure. The featurization can be based on processing sequences or coordinates.
The other problem that can be studied is the binding affinity between antigen and antibody. This supervised problem requires an affinity value for each antibody-antigen pair.
One of the largest databases is https://life.bsc.es/pid/skempi2/database/summary. However, other articles use processed versions, for example https://biosig.lab.uq.edu.au/csm_ab/datasets (CSM-AB: graph-based antibody-antigen binding affinity prediction and docking scoring function), which uses a graph signature as the feature. Other examples are https://www.nature.com/articles/s42004-023-01037-7#Sec11
Finally, other models use only sequences and are built to optimize other properties such as aggregation or pharmacokinetics. A full overview of models, databases, and future perspectives can be found at:
https://www.cell.com/trends/pharmacological-sciences/fulltext/S0165-6147(22)00279-6
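
For the binding affinity problem mentioned above, the regression target is often expressed as a free energy rather than a raw affinity. The sketch below assumes the affinity is available as a dissociation constant Kd in mol/L and uses the standard relation ΔG = RT ln(Kd); the function names are hypothetical and the default temperature of 298.15 K is an assumption that should be replaced by the experimental value when a dataset reports one.

```python
import math

GAS_CONSTANT = 1.98720425864083e-3  # kcal / (mol * K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT * temperature_k * math.log(kd_molar)


def delta_delta_g(kd_mutant, kd_wildtype, temperature_k=298.15):
    """Change in binding free energy on mutation (positive means weaker binding)."""
    return (binding_free_energy(kd_mutant, temperature_k)
            - binding_free_energy(kd_wildtype, temperature_k))


# Hypothetical example: a 1 nM wild-type binder and a 10 nM mutant.
wildtype_dg = binding_free_energy(1e-9)
mutation_ddg = delta_delta_g(1e-8, 1e-9)
print(round(wildtype_dg, 2), round(mutation_ddg, 2))
```

Working on the free-energy scale puts affinities that span many orders of magnitude on a roughly linear scale, which tends to be easier for a supervised model to fit.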

New Emerging Drug Modalities.
For this type of functionality, datasets are crucial. For PROTACs and macrocycles, some featurizers already work. Therefore, some databases of interest are:
PROTAC-DB

http://cadd.zju.edu.cn/protacdb/help
Macrocycles.
I found this article, which analyzes the existing literature:
https://pubs.acs.org/doi/epdf/10.1021/acs.jmedchem.3c00134

pranjalverma78 · March 10, 2024, 10:41am

Hi, I’m Pranjal Verma, a 3rd year student from IIT Delhi. I have gone through the description of Protein Language Modelling and found it really fascinating. I have done projects with LLMs and Hugging Face. I would be very excited to contribute to this project and would highly appreciate your guidance. I am ready to get started. Please let me know if I can communicate with you all regarding this project.

bharath · March 11, 2024, 8:38am

A quick reminder this thread is only for discussions about project ideas and not introductions. Please introduce yourselves on The Introductions Thread!

sherry · March 11, 2024, 11:22pm

Hello,
My name is Sherry, and I’m currently pursuing my Ph.D. at the School of Computer Science at Zhejiang University, where I’m in my second year. My research is focused on AI4Science, a fascinating field that blends artificial intelligence with scientific discovery.
I have hands-on experience in protein design, which is reflected in my ongoing research and a manuscript I’m preparing for submission. My previous work was grounded in the SE3Diffusion framework, a model I’m intimately familiar with, both in terms of its codebase and underlying principles.
I wholeheartedly agree with the significance of equivariance in biomolecular representation, a concept that resonates with my research interests. I’m particularly excited about the project aimed at improving equivariance support, and I’m eager to contribute to the DeepChem community.
I’m looking forward to exploring how my expertise can align with the goals of this project and to potentially collaborate with like-minded individuals who share a passion for pushing the boundaries of scientific research through AI.
Best regards,
Sherry

keyakk · March 13, 2024, 4:49am

Hi, I am a PhD student interested in two projects: Protein Language Modeling and Improving New Drug Modality Support. Please let me know what I need to do.
Thanks.
Regards,
Keya

rishi0110 · March 13, 2024, 10:00am

Hi, I am a third-year undergraduate student from IIT Kharagpur interested in two projects: Protein Language Modeling and Improving New Drug Modality Support. How should I go about contributing?
Thanks.

bharath · March 14, 2024, 11:15pm

We recommend joining the Discord (https://discord.gg/RYTrUY8Ssn) and coming by the office hours (see "Announcing the DeepChem Office Hours"). We can answer general questions on how to get started there.

marija-stanojevic · March 17, 2024, 12:58pm

Hello, I am Marija. I am a Machine Learning Researcher, working full time, who would like to join an open source initiative. I have been familiar with GSoC since my time as a PhD student, but I never participated in it. I looked there now to find a suitable project to begin my journey in open source. I would like to implement a wishlist model for your library. Based on what I’ve read in your GitHub issues and documentation, you are looking to implement several algorithms which I am also interested in: 1) E3, 2) DeepONet, 3) Neural ODEs, 4) GenAI for Proteins and Sequences (https://www.nature.com/articles/s41587-023-02115-w), and 5) LLM for proteins (https://www.nature.com/articles/s41587-022-01618-2). Could you let me know which of these projects are still available? I am most interested in the last two projects. Additionally, I would like to know if I am eligible to participate through GSoC. If not, is it possible to get the support of a mentor and to create a timeline for this project? I’ll come to the office hours tomorrow to discuss more details.

sriphani · March 22, 2024, 12:06am

Hi guys, I’m adding a few important links for people who are interested in the project Support of Equivariance in DeepChem.

Equivariant Transformer using the e3nn library
https://docs.e3nn.org/en/latest/guide/transformer.html
Equivariant GNN and experiments on QM9 dataset.
https://projects.volkamerlab.org/teachopencadd/talktorials/T036_e3_equivariant_gnn.html

Possible further steps.

Training an equivariant transformer on the QM9 dataset and getting comparable results.
Adding the module to DeepChem
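
For anyone new to the topic, the property such a module should satisfy can be checked numerically: rotating the input and then applying the function must give the same result as applying the function and then rotating the output. The sketch below is not e3nn code. It uses two toy functions (centering coordinates, which is rotation-equivariant, and pairwise distances, which are rotation-invariant) plus hypothetical helper names, purely to show the kind of test that could accompany a new layer.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def center(points):
    """Subtract the centroid; commutes with rotations (equivariant)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in points]


def pairwise_distances(points):
    """Distances between all point pairs; unchanged by rotations (invariant)."""
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]


def allclose(rows_a, rows_b, tol=1e-9):
    """Compare two lists of equal-shaped numeric rows elementwise."""
    flat_a = [v for row in rows_a for v in row]
    flat_b = [v for row in rows_b for v in row]
    return len(flat_a) == len(flat_b) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(flat_a, flat_b)
    )


coords = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (1.5, 1.0, 0.3), (-0.4, 0.9, -0.7)]
angle = 0.7

# Equivariance: f(R x) == R f(x)
is_equivariant = allclose(center(rotate_z(coords, angle)),
                          rotate_z(center(coords), angle))
# Invariance: g(R x) == g(x)
is_invariant = allclose([pairwise_distances(rotate_z(coords, angle))],
                        [pairwise_distances(coords)])
print(is_equivariant, is_invariant)
```

A full test would sample random 3D rotations rather than a single rotation about z, and would apply the same check to the model's outputs.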

LaibaKhan1 · March 27, 2024, 5:20am

Hello, I am Laiba Khan. I am currently an undergraduate student in computer science and a Research Intern at Micro Electronic Research Lab (MERL). I am an early researcher in the area of neural networks and neuromorphic computing, and I am interested in working on AI/ML projects. I have also worked with Python.

I am interested in two projects: 1. Protein Language Modeling and 2. PyTorch Porting.

ayushi · March 28, 2024, 11:54am

Hello, I’m Ayushi Awasthi, currently an undergraduate student in computer science. I’m interested in contributing to Layer tutorials. I have working knowledge of Python as well as Jupyter/Colab, and this project aligns perfectly with my knowledge and interests.

rsaketh002 · March 28, 2024, 5:36pm

My name is Gandeed Saketh Reddy. I am currently in the final year of my B.Tech at the National Institute of Technology Warangal (NIT Warangal), and I have a keen interest in the “Protein Language Modeling” project listed on the potential mentor list. I have very good experience working with Hugging Face, NLP, PyTorch, and deep learning. I also have experience fine-tuning models using Hugging Face libraries. I would greatly appreciate the chance to discuss this project further and explore how I can contribute to its success.

bharath · March 28, 2024, 8:06pm

As a reminder, this thread is only for discussion of specific ideas. Please save introductions for the introductions thread.

koussayinsat12 · March 31, 2024, 9:15am

I’m Kousai Ghaouari, a fourth-year Computer Science Engineering student at the National Institute of Applied Science and Technology in Tunisia. The Wishlist Model project caught my attention, and I’m eager to contribute, given my passion for research papers and building models from scratch. You can find examples of my previous projects on my GitHub account [https://github.com/koussayinsat12]. I would greatly appreciate your guidance on how to get started. Please let me know the best way to communicate further. Thank you for considering my interest.