Integrating Protein Language Modeling into DeepChem | GSOC 2024 Project

Hello Fellow Developers,

This post will serve as a time capsule documenting the development of the project. For more details, you can find information about the project here.

Week 1 Update

  • Added preliminary classifier heads for ProtBERT.
  • PR for integrating ProtBERT is done.

Week 2 Update

  • Fixed the overfit test for ProtBERT.
  • Made changes based on initial review feedback.

Week 3 Update

  • ProtBERT PR
    • Added detailed explanation for the model
    • Fixed type annotations
    • Merged!
  • ProtBERT sequence classification
    • Parsed the DeepLoc data used for testing membrane solubility and cellular localization.
    • Trained the model on a sample dataset, yielding a test accuracy of 82% (authors reported ~85%).
    • Experimented with the ProtBERT sequence classification models released by the authors, through the HuggingFace pipeline object.
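
For anyone trying the released models on their own sequences, here is a minimal sketch of the input formatting the ProtBERT model cards describe: residues separated by single spaces, with the rare amino acids U, Z, O and B mapped to X. The helper name `prepare_protbert_input` is made up for this post and is not part of DeepChem or HuggingFace.

```python
import re


def prepare_protbert_input(sequence: str) -> str:
    """Format a raw amino-acid string for a ProtBERT tokenizer.

    Hypothetical helper for illustration; not a DeepChem or HuggingFace API.
    """
    # Drop any whitespace (including line breaks) and normalise to upper case.
    cleaned = "".join(sequence.split()).upper()
    # Map the rare or ambiguous residues U, Z, O and B to the unknown token X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # ProtBERT tokenizes per residue, so residues are separated by spaces.
    return " ".join(cleaned)


example = prepare_protbert_input("MKTUAYB")
print(example)  # M K T X A Y X
```

The formatted string is what would be handed to the tokenizer or pipeline; the model download and the pipeline call itself are left out of this sketch.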

Week 4 Update

  • ProtBERT Complete PR
    • Restructured the original ProtBERT code to support additional features with minimal changes to the user-level API.
    • Added support for BFD pretrained ProtBERT model.
    • Added support for membrane and cell localization trained classification models.
    • ProtBERT now supports custom classification heads from torch.nn.Module.
    • Fixed type annotations.
    • PR in review.
  • ProtBERT tutorial
    • Started an initial draft of ProtBERT tutorial.
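
To give an idea of what a custom classification head can look like, here is a minimal sketch of a `torch.nn.Module` that mean-pools per-residue embeddings and maps them to class logits. The class name `MeanPoolClassifierHead`, its arguments, and the pooling strategy are assumptions for illustration only; they are not the heads used in the PR, and how a head is passed to the DeepChem ProtBERT wrapper is not shown here.

```python
import torch
from torch import nn


class MeanPoolClassifierHead(nn.Module):
    """Hypothetical head: masked mean pooling followed by a small MLP."""

    def __init__(self, hidden_size: int = 1024, num_labels: int = 10,
                 dropout: float = 0.1) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        # Sum only over real (non-padding) positions, then divide by their count.
        summed = (hidden_states * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)
        pooled = summed / counts
        x = torch.tanh(self.dense(self.dropout(pooled)))
        return self.out(self.dropout(x))


# Small random tensors stand in for encoder output in this sketch.
head = MeanPoolClassifierHead(hidden_size=32, num_labels=10)
head.eval()  # disable dropout so the output is deterministic
hidden = torch.randn(2, 5, 32)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
logits = head(hidden, mask)
print(logits.shape)  # torch.Size([2, 10])
```

The defaults of 1024 hidden units and 10 labels are chosen to match the ProtBERT embedding size and the ten DeepLoc localization classes; a membrane-versus-soluble head would use `num_labels=2`.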

Week 5 Update

  • ProtBERT Complete PR
    • Fixed failing JAX unit tests.
    • After review: reverting to the previous implementation for more flexibility.
    • Will add classification datasets to DeepChem.
  • ProtBERT tutorial
    • ProtBERT tutorial done.
    • Making changes as per review suggestions.

Week 6 Update

  • ProtBERT Complete PR
    • Reverting to the previous implementation for more flexibility.
    • Under review.
  • ProtBERT tutorial
    • ProtBERT tutorial done.
    • Waiting for the ProtBERT Complete PR to be merged.
  • DeepLoc data
    • Parsed data into CSV files.
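
The FASTA-to-CSV step can be sketched roughly as below, using only the standard library. This assumes DeepLoc-style headers of the form `>ID Location-M` with an optional trailing `test` flag; that layout, the function names, and the column names are assumptions for illustration, and the sample records are made up rather than taken from the real dataset.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

FIELDS = ["id", "location", "membrane", "split", "sequence"]


def _to_record(header: str, chunks: List[str]) -> Dict[str, str]:
    # Assumed header layout: "<id> <location>-<membrane flag> [test]"
    fields = header.split()
    location, _, membrane = fields[1].rpartition("-")
    return {
        "id": fields[0],
        "location": location,
        "membrane": membrane,
        "split": "test" if "test" in fields[2:] else "train",
        "sequence": "".join(chunks),
    }


def parse_deeploc_fasta(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse FASTA lines into one dict per protein (hypothetical helper)."""
    records: List[Dict[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_to_record(header, chunks))
            header, chunks = line[1:], []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append(_to_record(header, chunks))
    return records


def records_to_csv(records: List[Dict[str, str]], handle: TextIO) -> None:
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


# Made-up sample in the assumed format, not real DeepLoc entries.
sample = """>P00001 Cell.membrane-M test
MKTAYIAK
QRQISFVK
>P00002 Nucleus-S
MSDNELLQ
"""
records = parse_deeploc_fasta(sample.splitlines())
buffer = io.StringIO()
records_to_csv(records, buffer)
print(buffer.getvalue())
```

Writing to an `io.StringIO` keeps the sketch free of file paths; passing an open file handle instead would produce the CSV on disk.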