Integrating Protein Language Modeling into DeepChem | GSoC 2024 Project

Hello Fellow Developers,

This post will serve as a time capsule documenting the development of the project. For more details, you can find information about the project here.

Week 1 Update

  • Added preliminary classifier heads for ProtBERT.
  • PR for integrating ProtBERT is done.

Week 2 Update

  • Fixed overfit test for ProtBERT
  • Amended changes based on initial reviews.

Week 3 Update

  • ProtBERT PR
    • Added detailed explanation for the model
    • Fixed type annotations
    • Merged!
  • ProtBERT sequence classification
    • Parsed the DeepLoc data used for testing membrane solubility and cellular localization.
    • Trained the model on a sample dataset, yielding a test accuracy of 82% (authors reported ~85%).
    • Experimented with the ProtBERT sequence classification models released by the authors, via the HuggingFace pipeline object.
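
For anyone reproducing the sequence classification experiments above, here is a minimal sketch of the input formatting that the ProtBERT tokenizer expects, going by the Rostlab model cards: residues separated by single spaces, with the rare amino acids U, Z, O and B mapped to X. The helper name `format_for_protbert` is hypothetical and is not part of DeepChem or HuggingFace.

```python
import re


def format_for_protbert(sequence: str) -> str:
    """Return a protein sequence in the space-separated form ProtBERT expects.

    Hypothetical helper for illustration only.
    """
    # Normalise case and drop surrounding whitespace.
    cleaned = sequence.strip().upper()
    # Map the rare residues U, Z, O and B to the unknown residue X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # ProtBERT's vocabulary is per-residue, so put a space between residues.
    return " ".join(cleaned)


if __name__ == "__main__":
    print(format_for_protbert("mktuzlla"))
```

The resulting string is what would be handed to the tokenizer, or to a HuggingFace text-classification pipeline loaded with one of the authors' checkpoints.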

Week 4 Update

  • ProtBERT Complete PR
    • Restructured the original ProtBERT code to support additional features with minimal changes to the user-level API.
    • Added support for BFD pretrained ProtBERT model.
    • Added support for membrane and cell localization trained classification models.
    • ProtBERT now supports custom classification heads from torch.nn.Module.
    • Fixed type annotations.
    • PR in review.
  • ProtBERT tutorial
    • Started an initial draft of ProtBERT tutorial.

Week 5 Update

  • ProtBERT Complete PR
    • Fixed failing JAX unit tests.
    • After review: reverting to the previous implementation for more flexibility.
    • Will add classification datasets to DeepChem.
  • ProtBERT tutorial
    • ProtBERT tutorial done.
    • Making changes as per review suggestions.

Week 6 Update

  • ProtBERT Complete PR
    • Reverting to the previous implementation for more flexibility.
    • Under review.
  • ProtBERT tutorial
    • ProtBERT tutorial done.
    • Waiting for ProtBERT Complete PR to be merged in
  • DeepLoc data
    • Parsed data into CSV files.
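
As a rough illustration of the DeepLoc parsing step, the sketch below converts FASTA text into CSV rows. It assumes headers of the form `>ID Location-M test`, where the letter after the hyphen is the membrane label and a trailing `test` marks the held-out split; that layout, the column names, and the function names are assumptions for illustration, not the code used in the project.

```python
import csv
import io

# Hypothetical column layout for the output CSV.
FIELDS = ["id", "location", "membrane", "split", "sequence"]


def _make_record(header: str, sequence: str) -> dict:
    """Split an assumed '>ID Location-M [test]' header into labelled fields."""
    fields = header.split()
    label = fields[1] if len(fields) > 1 else ""
    # 'Cell.membrane-M' -> location 'Cell.membrane', membrane label 'M'.
    location, _, membrane = label.partition("-")
    split = "test" if "test" in fields[2:] else "train"
    return {
        "id": fields[0] if fields else "",
        "location": location,
        "membrane": membrane,
        "split": split,
        "sequence": sequence,
    }


def parse_deeploc_fasta(text: str) -> list:
    """Parse FASTA text into a list of record dicts, joining wrapped lines."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, "".join(chunks)))
    return records


def records_to_csv(records: list) -> str:
    """Serialise the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

A CSV in this shape keeps the sequence and both labels on one row, so the localization and membrane tasks can be loaded from the same file by selecting a different label column.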

Week 7 Update

  • ProtBERT extension PR
    • Done, ready to be merged in.
  • ProtBERT tutorial
    • Modified ProtBERT tutorial to the new model.
  • DeepLoc data
    • Waiting to upload.

Week 8 Update

  • ProtBERT extension PR
    • Merged.
  • ProtBERT tutorial
    • Adapted tutorial to the extended ProtBERT.
    • Replaced dummy data with DeepLoc data.
    • Got two LGTMs. Waiting to be merged.
  • Reviewed other PLM tutorials

Week 9 Update

  • Discussed the structure of PLMs in DeepChem with Dhuvi.
    • Settled on designing class functions around common protein tasks/modules.
    • Yet to discuss with Elisa and Anamika.
  • Fixed a naming bug in ProtBERT tests.
  • Merged ProtBERT tutorial.
  • Fixed recent CI breakage due to ProtBERT and torchdata update.

Week 10 Update

  • In discussion with Elisa regarding integrating ESM-2 into DeepChem.
  • Fixed a linting issue in CI.