Protein Language Models
- Hugging Face models
For applications such as protein generation, training these models from scratch is expensive. Transformer-type architectures are usually trained on large datasets of around 150 million sequences or more. DeepChem offers a Hugging Face wrapper that can load pre-trained models, so it is already possible to use masked language modeling to fill in masked positions of a sequence. However, fine-tuning needs a considerable amount of RAM; an example is in https://github.com/deepchem/deepchem/issues/3838#issuecomment-1956970012
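As a rough illustration of the masked language modeling setup mentioned above, the sketch below only prepares a masked input and its labels; it does not call DeepChem or any Hugging Face model. The mask token string, the 15% mask fraction, and the toy sequence are assumptions, and the real token depends on the tokenizer of the chosen model.

```python
import random

# Assumption: "<mask>" is the mask token. The real token depends on the
# tokenizer of the chosen model (for example "[MASK]" for BERT-style models).
MASK_TOKEN = "<mask>"


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Replace a random fraction of residues with the mask token.

    Returns the masked tokens and a dict mapping each masked position
    to the original residue (the labels the model has to recover).
    """
    rng = random.Random(seed)
    # Mask at least one residue, but never more than the sequence holds.
    n_masked = min(len(sequence), max(1, round(len(sequence) * mask_fraction)))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, labels


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # toy sequence, for illustration only
masked_tokens, labels = mask_sequence(sequence)
print(" ".join(masked_tokens))
print(labels)
```

A pre-trained model would then be asked to predict the residues stored in `labels` from `masked_tokens`.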
Another good use of LLMs is conditional generation (https://www.nature.com/articles/s41587-023-02115-w), which would be a nice application. Models such as ProtGPT2 on Hugging Face offer generation capabilities: https://huggingface.co/nferruz/ProtGPT2
Finally, another application of these LLMs is the extraction of embeddings to be used as features in ML models. One example is "Low-N protein engineering with data-efficient deep learning" (https://www.nature.com/articles/s41592-021-01100-y), which uses UniRep, a large protein language model, to generate features for a supervised learning task. A Hugging Face model with that option is ProteinBERT: https://huggingface.co/GrimSqueaker/proteinBERT https://github.com/nadavbra/protein_bert/tree/master
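To make the "embeddings as features" idea concrete, here is a minimal sketch of one common way to turn per-residue embeddings into a fixed-length feature vector, namely mean pooling. The embedding values are made up for illustration; in practice they would come from a pre-trained model, and the pooling strategy is an assumption, not necessarily what the cited work uses.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length feature vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


# Hypothetical embeddings: 3 residues, 4 dimensions each.
toy_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
features = mean_pool(toy_embeddings)
print(features)  # [2.0, 1.0, 2.0, 2.0]
```

The resulting vector has the same length for every protein, so it can be fed to any standard supervised model regardless of sequence length.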
Other models that are not on Hugging Face but may be of interest: https://www.nature.com/articles/s41587-022-01618-2#code-availability
Another example, for structure-based protein design (ProteinMPNN): https://huggingface.co/spaces/simonduerr/ProteinMPNN
Antibody support.
A couple of definitions to start the discussion:
- Antibody: an immune protein that specifically recognizes and binds potentially pathogenic molecules.
- Antigen: the molecule that the antibody targets.
Some problems that can be studied in antibody design are structural, for example the accurate modeling of antibody-antigen pairs, especially at the interaction sites: https://www.sciencedirect.com/science/article/pii/S0959440X22000586. For this kind of task, it is important to have structural databases such as:
- SAbDab: https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/about
Once the dataset is selected, the models used for this task should be sequence-to-structure or structure-to-structure. The featurization can operate either on sequences or on coordinates.
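As a simple instance of sequence featurization, the sketch below one-hot encodes an amino acid sequence over the 20 standard residues. It is only an illustration of the idea, not an existing DeepChem featurizer; the choice to encode unknown residues as all-zero rows is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return one row of 20 zeros/ones per residue in the sequence."""
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        idx = INDEX.get(residue)
        if idx is not None:  # unknown residues (e.g. "X") stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot("EVQLVX")  # toy fragment, for illustration only
print(len(encoded), len(encoded[0]))
```

A coordinate-based featurization would instead start from the atom positions in the structure files, for example as distance matrices or graphs.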
The other problem that can be studied is the binding affinity between antigen and antibody. This supervised problem requires an affinity value for each antibody-antigen pair.
One of the biggest databases is SKEMPI 2.0: https://life.bsc.es/pid/skempi2/database/summary. However, other articles use processed versions, for example https://biosig.lab.uq.edu.au/csm_ab/datasets (CSM-AB: graph-based antibody-antigen binding affinity prediction and docking scoring function), which uses graph-based signatures as features. Another example is https://www.nature.com/articles/s42004-023-01037-7#Sec11
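When affinities are reported as dissociation constants, a regression target can be derived with the standard relation ΔG = RT ln(Kd). The sketch below assumes Kd values in molar units and defines ΔΔG as mutant minus wild type; the sign convention and units should be checked against the dataset actually used, and the Kd values here are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def delta_g(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in M."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_delta_g(kd_mutant, kd_wild_type, temperature_k=298.15):
    """Change in binding free energy on mutation (mutant minus wild type)."""
    return delta_g(kd_mutant, temperature_k) - delta_g(kd_wild_type, temperature_k)


# Hypothetical values: the mutant binds ten times more weakly.
ddg = delta_delta_g(kd_mutant=1e-8, kd_wild_type=1e-9)
print(f"ddG = {ddg:.2f} kcal/mol")
```

With this convention, a positive value means the mutation weakens binding.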
Finally, other models use only sequences and are built to optimize other properties such as aggregation or pharmacokinetics. A full overview of models, databases, and future perspectives can be found at:
https://www.cell.com/trends/pharmacological-sciences/fulltext/S0165-6147(22)00279-6
New Emerging Drug Modalities.
For this type of functionality, datasets are crucial. For PROTACs and macrocycles, some featurizers already work. Therefore, some databases of interest are:
PROTAC-DB
http://cadd.zju.edu.cn/protacdb/help
Macrocycles.
An article with an analysis of the existing literature:
https://pubs.acs.org/doi/epdf/10.1021/acs.jmedchem.3c00134