GSoC ’24 | Protein Language Modeling | Final Report

Google Summer of Code 2024: Final Report

Project Title: Protein Language Modeling

Contributor: Shivasankaran Vanaja Pandi
Mentor: Nimisha Dey

1. Project Overview

The primary goal of this project was to integrate ProtBERT, a protein language model (PLM), into the DeepChem library using PyTorch and Hugging Face. The integration gives researchers access to a state-of-the-art protein language model from within the DeepChem ecosystem, supporting work in computational biology and chemistry.

2. Key Contributions

  • Model Integration: Successfully integrated ProtBERT into DeepChem, making it the first protein language model added to the library. This integration sets a precedent and serves as a template for future PLM integrations into DeepChem.
  • Hugging Face Compatibility: Made the integration compatible with Hugging Face, so that pre-trained models and tokenizers from the Hugging Face model hub can be used directly.
  • Documentation and Tutorials: Wrote detailed documentation and a tutorial showing how to use ProtBERT within DeepChem for applications such as protein sequence classification and pre-training.
  • DeepLoc Data Integration: Added the DeepLoc dataset to DeepChem, giving users an additional resource for protein subcellular localization tasks.
  • Code Contribution: Submitted multiple pull requests (PRs) to the DeepChem repository, focusing on the integration of ProtBERT and ensuring its smooth operation within the DeepChem framework.
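
As an illustration of the Hugging Face compatibility listed above, the sketch below shows the input format that the publicly released ProtBERT tokenizer expects: upper-case residues separated by single spaces, with the rare amino acids U, Z, O and B mapped to X. This follows the ProtBERT model card on the Hugging Face hub, not the DeepChem code. The helper name `format_for_protbert` is hypothetical, and this report does not state whether the DeepChem wrapper performs this step internally.

```python
import re

# Rare or ambiguous residues that the ProtBERT model card maps to "X".
_RARE_RESIDUES = re.compile(r"[UZOB]")


def format_for_protbert(sequence: str) -> str:
    """Return `sequence` in the space-separated form the ProtBERT tokenizer expects.

    Hypothetical helper for illustration only; it is not part of DeepChem.
    """
    # Drop any existing whitespace and normalise to upper case.
    residues = "".join(sequence.split()).upper()
    # Replace rare residues with the unknown-residue symbol.
    residues = _RARE_RESIDUES.sub("X", residues)
    # Put a single space between residues so each one becomes its own token.
    return " ".join(residues)


print(format_for_protbert("MKTUAYZ"))
```

The formatted string could then be passed to a tokenizer loaded from the hub, for example with `transformers.BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)` as in the model card.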

3. Current Status

  • ProtBERT Integration: The integration is complete and functional. The model can be used for various tasks, including training from scratch and fine-tuning on specific datasets.
  • Documentation: The user guide and tutorial have been merged into the DeepChem documentation, making them accessible to the broader community.
  • DeepLoc Dataset: The DeepLoc dataset is now integrated into DeepChem, offering new opportunities for research in protein subcellular localization.
  • Testing: The integration has passed all unit tests and has been verified by the DeepChem maintainers.

4. Future Directions

  • Additional Model Support: Future work could include integrating other protein language models into DeepChem to further expand its capabilities, using ProtBERT as a foundational example.
  • Additional Task Support: Future work could extend the ProtBERT integration to token classification and to use as a feature extractor.
  • Long-term Maintenance: Ensuring the integration remains compatible with future updates to DeepChem, PyTorch, and Hugging Face.

5. Merged Contributions

  • Model PRs:
    • Preliminary integration of ProtBERT into DeepChem. [PR]
    • Added a ProtBERT tutorial demonstrating its usage. [PR]
    • Added support for custom classifiers. [PR]
    • Fixed a minor typo in unit tests. [PR]
    • Added the DeepLoc dataset.
  • Additional PRs:
    • Fixed CI doctest issues. [PR]
    • Fixed CI breakage due to torchdata update. [PR]
    • Fixed CI linting issue. [PR]

6. Challenges and Learnings

  • Navigating Complex Codebases: Understanding and contributing to the DeepChem codebase was challenging, especially ensuring that the ProtBERT integration did not disrupt existing functionality. This required extensive code review, testing, and coordination with other contributors.
  • Dependency Management: Encountered dependency issues, most notably when an update to the torchdata package broke compatibility with DeepChem and caused CI failures.
  • Collaboration: Learned the importance of communication and collaboration with the open-source community, especially when working on complex integrations that impact multiple parts of the library.