DeepChem GSoC 2025 Project Ideas

This post lists draft ideas for potential GSoC 2025 projects. (Note that we are planning to submit an application, but we will not know whether we have been accepted until later.) We have divided projects into beginner, intermediate, and advanced categories. Each project is also listed with a suggested GSoC project size (small, medium, or large; see https://developers.google.com/open-source/gsoc/faq).

Beginner Friendly Projects

Beginner projects should be accessible to new developers and don’t require significant engineering sophistication.

  • Layer Tutorials
    • Length: Small (90 hours)
    • Description: DeepChem has been moving towards first-class layers and now has a collection of general-purpose layers. We still need to improve the documentation for existing layers to make them more useful to the community. This project should add tutorials on using existing layers to the DeepChem tutorial series, and should also plan to add a few new layers that would be useful to the community (a small layer sketch appears after this list).
    • Educational Value: Students will improve their technical communication skills and learn how to construct useful Jupyter/Colab tutorials. Layers are easier to add than full models since they are effectively functions.
    • Potential Mentors: Aryan, Jose, Riya, Maithili, Nimisha, Shreyas
    • Note: This was also a 2024 project, but there remains more work to be done for 2025 expanding tutorials/ideas.
  • Improving New Drug Modality Support
    • Length: Small (90 hours)
    • Description: DeepChem at present doesn’t have much tooling or support for working with emerging drug modalities, including PROTACs, antibody-drug conjugates, macrocycles, oligonucleotides, and more. This project would add new tutorials introducing these drug modalities and provide examples of how to work with them in DeepChem (a small featurization sketch appears after this list). It would also be useful to identify and process relevant datasets.
    • Educational Value: New drug modalities drive many emerging startups in the space. Improving DeepChem’s support for these new modalities of therapeutics could help drive discovery of new medicine at the cutting edge. It would prepare students to potentially find jobs at these up-and-coming biotech firms as well.
    • Potential Mentors: Jose, David, Bharath
    • Note: This was also a 2024 project, but there remains more work to be done for 2025 expanding support for new modalities.
  • Improving support for drug formulations
    • Length: Small (90 hours)
    • Description: Drug formulations are a rich area of industrial study that is often critical for actually bringing a drug to patients (see, for example, the guide at https://drughunter.com/resource/the-modern-medicinal-chemist-s-guide-to-formulations/). In this project, you will build a tutorial introducing readers to the study of drug formulations, along with DeepChem examples of how you can computationally help design a potential formulation (a small descriptor-based sketch appears after this list).
    • Educational Value: Formulations are critical for bringing drugs to patients. Improving DeepChem’s support for formulation design could help bring new medicines to patients at the cutting edge. It will also prepare students to find jobs at large biotech/pharma firms.
    • Potential Mentors: Jose, David, Bharath
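
For the Layer Tutorials project, here is an illustrative sketch (not the actual DeepChem API; the class and names below are made up) of the kind of snippet such a tutorial might walk through: a tiny PyTorch layer in the style used by deepchem.models.torch_models.layers, plus a short usage demo.

```python
# Hypothetical sketch of a tutorial-style layer example. The layer below is a
# toy illustration, not an existing DeepChem layer.
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Toy layer that applies a learnable elementwise scale and shift."""

    def __init__(self, num_features: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_features))
        self.shift = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale + self.shift

# A tutorial would then show the layer in action on a small batch.
layer = ScaleShift(num_features=8)
x = torch.randn(4, 8)   # batch of 4 feature vectors
print(layer(x).shape)   # torch.Size([4, 8])
```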
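
For the new drug modality project, a minimal sketch of how an emerging-modality molecule could be handled with existing DeepChem tooling (assuming deepchem and rdkit are installed; the SMILES string is just a toy stand-in for a real macrocycle or PROTAC dataset):

```python
# Minimal sketch: featurizing a toy macrocycle with an existing DeepChem
# featurizer. A real tutorial would use actual PROTAC / macrocycle datasets.
import deepchem as dc

macrocycle_smiles = "C1CCCCCCCCCCC1"  # cyclododecane, a toy 12-membered ring

featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
features = featurizer.featurize([macrocycle_smiles])
print(features.shape)  # (1, 2048)
```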
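
For the formulations project, a tutorial might start from simple physicochemical descriptors that are relevant to formulation work. A minimal RDKit sketch (an assumption about one possible starting point, not a prescribed approach):

```python
# Minimal sketch (assuming rdkit is installed): compute a few descriptors that
# commonly matter for formulation decisions (molecular weight, lipophilicity,
# polar surface area). A tutorial would connect these to formulation choices.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin as an example
print("MolWt:", Descriptors.MolWt(mol))
print("LogP:", Descriptors.MolLogP(mol))
print("TPSA:", Descriptors.TPSA(mol))
```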

Intermediate Projects

These projects require some degree of hacking, but likely won’t raise major engineering challenges.

  • Improve Equivariance Support

    • Length: Medium (175 hours)
    • Description: DeepChem has limited support for equivariant models. This project would extend DeepChem’s support for equivariance and add additional equivariant models, such as tensor field networks (a small equivariance check is sketched after this list).
    • Educational Value: Equivariance is one of the most interesting ideas in modern machine learning and underpins powerful systems like AlphaFold2. Contributors will learn more about this field and could potentially write a research paper about their work on this project.
    • Potential Mentors: Aryan, Riya, Nimisha, Bharath, Shreyas
    • Note: This was also a 2024 project, but this project was not taken up by a student last year.
  • NumPy 2.0 Upgrade

    • Length: Medium (175 hours)
    • Description: DeepChem is currently pinned to NumPy < 2.0. The upgrade to NumPy 2.0 is not backwards compatible, so we need to identify and fix any broken compatibilities across the codebase (examples of typical fixes are sketched after this list).
    • Educational Value: Complex version upgrades take a lot of sophistication and will teach students challenging debugging skills.
    • Potential Mentors: Bharath
  • Conversion of SMILES to IUPAC Names and IUPAC Names to SMILES

    • Length: Large (300 hours) / Medium (175 hours)
    • Description: This project focuses on developing tools within DeepChem to enable accurate, bidirectional conversion between SMILES (Simplified Molecular Input Line Entry System) strings and IUPAC (International Union of Pure and Applied Chemistry) names. The final deliverables will include user-friendly APIs, thorough documentation, and comprehensive testing to facilitate reliable molecular representation transformations. A hypothetical API sketch appears after this list.
    • Educational Value: Students will deepen their understanding of chemical data structures, optimize algorithms for molecular conversions, and contribute to the DeepChem ecosystem.
    • Potential Mentors: Shreyas, Bharath
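
For the equivariance project, the core property is easiest to see numerically. The sketch below uses a toy equivariant function (displacements from the centroid), not a DeepChem model, and assumes numpy and scipy are installed:

```python
# Toy illustration of rotation equivariance: f(x @ R.T) == f(x) @ R.T.
# An equivariant DeepChem layer would need to satisfy the same kind of identity.
import numpy as np
from scipy.spatial.transform import Rotation

def f(coords: np.ndarray) -> np.ndarray:
    """Displacement of each atom from the centroid (an equivariant map)."""
    return coords - coords.mean(axis=0)

coords = np.random.randn(5, 3)     # 5 "atoms" in 3D
R = Rotation.random().as_matrix()  # random rotation matrix

lhs = f(coords @ R.T)              # rotate inputs, then apply f
rhs = f(coords) @ R.T              # apply f, then rotate outputs
print(np.allclose(lhs, rhs))       # True: f is rotation-equivariant
```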
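
For the NumPy 2.0 upgrade, fixes typically look like the before/after below; the aliases shown (np.float_, np.NaN, np.infty) are among those removed in NumPy 2.0 according to its migration guide:

```python
# Typical NumPy 2.0 migration fixes of the kind this project would make
# throughout the DeepChem codebase.
import numpy as np

# Before (NumPy 1.x only):
#   x = np.array([1.0, np.NaN], dtype=np.float_)
#   big = np.infty

# After (works on NumPy 2.x and recent 1.x):
x = np.array([1.0, np.nan], dtype=np.float64)
big = np.inf
print(x.dtype, big)
```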
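
For the SMILES/IUPAC conversion project, here is a hypothetical sketch of what the user-facing API could look like. Neither function exists in DeepChem today; the real names, backend (e.g. a learned sequence-to-sequence model or an external naming tool), and error handling would be decided during the project:

```python
# Hypothetical API sketch only: these functions do NOT exist in DeepChem today.
# They illustrate the intended user experience, not an implementation.

def smiles_to_iupac(smiles: str) -> str:
    """Return the IUPAC name for a SMILES string (backend to be implemented)."""
    raise NotImplementedError("To be implemented as part of the project.")

def iupac_to_smiles(name: str) -> str:
    """Return a canonical SMILES for an IUPAC name (backend to be implemented)."""
    raise NotImplementedError("To be implemented as part of the project.")

# Intended usage once implemented:
#   smiles_to_iupac("CC(=O)OC1=CC=CC=C1C(=O)O")  -> "2-acetoxybenzoic acid"
#   iupac_to_smiles("2-acetoxybenzoic acid")     -> "CC(=O)OC1=CC=CC=C1C(=O)O"
```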

Advanced Projects

These projects raise considerable technical and engineering challenges. We recommend that students who want to tackle these projects have past experience working in large codebases and tackling code reviews for complex code.

  • Implement a Wishlist Model

    • Length: Large (300 hours)
    • Description: DeepChem has an extensive wishlist of models (https://github.com/deepchem/deepchem/issues/2680). Pick a model from the wishlist and implement it in DeepChem. We suggest tackling a model that will improve DeepChem’s physics support, such as Hamiltonian or Lagrangian neural networks or physics-informed neural operators (PINO); a minimal Hamiltonian neural network sketch appears after this list.
    • Educational Value: Implementing a machine learning model from scratch or from an academic reference into a production grade library like DeepChem is a challenging task. Doing so requires understanding the base model, dealing with numerical issues in implementation, and benchmarking the model correctly. Multiple past GSoC contributors have leveraged their implementations to write papers on their work and have gained skills that they have used subsequently in industry or in academia.
    • Potential Mentors: Depends on model.
  • PyTorch Porting

    • Length: Medium (175 hours)
    • Description: DeepChem has mostly shifted to PyTorch as its primary backend, but a couple of models are still implemented in TensorFlow, in particular our Chemception implementation. This project would port Chemception and do final testing to fix remaining PyTorch/DeepChem compatibility issues (a parity-check sketch appears after this list). See https://github.com/deepchem/deepchem/issues/2863.
    • Educational Value: Porting models while preserving numerical properties requires a strong understanding of deep learning implementations. It serves as a test of machine learning know-how that will serve students well in future machine learning positions in academia or industry.
    • Potential Mentors: Aryan, Jose, Riya, Nimisha, Bharath, Shreyas
  • Easy HuggingFace-Style Pretrained Model Loading

    • Length: Large (300 hours)
    • Description: DeepChem currently requires you to know the parameters used to train a model in order to reload it from disk, which is unfriendly for distributing pretrained models. In this project, you will implement an easy HuggingFace-style function call that loads weights from disk without requiring knowledge of the training parameters. To do this, you will define a standard metadata format for saving model parameters that can be used behind the scenes to autoload models from disk (a hypothetical sketch appears after this list).
    • Educational Value: This is a technically challenging project which will require understanding metadata formats and changing saving/reloading for existing models.
    • Potential Mentors: Aryan, Bharath
  • Model-Parallel DeepChem Model Training

    • Length: Large (300 hours)
    • Description: DeepChem now has good support for training LLMs through HuggingFace. At present, though, these models must fit on a single GPU, which limits their size. In this project, you will implement basic support for model-parallel training so that models whose weights don’t fit on a single GPU can be trained (a minimal illustration appears after this list).
    • Educational Value: This is a technically challenging project which will require understanding multi-GPU training methods. You may need to explore existing PyTorch frameworks for model-parallel training and adapt them to DeepChem.
    • Potential Mentors: Aryan, Bharath
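
For the wishlist model project, here is a minimal sketch of the core idea behind one of the suggested models, a Hamiltonian neural network: a network predicts a scalar H(q, p) and the dynamics follow from its gradients. A real contribution would wrap this in a proper DeepChem model class with tests and benchmarks.

```python
# Minimal Hamiltonian neural network sketch: predict scalar H(q, p) and obtain
# dynamics as (dq/dt, dp/dt) = (dH/dp, -dH/dq). Illustrative only.
import torch
import torch.nn as nn

class HNN(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        qp = torch.cat([q, p], dim=-1).requires_grad_(True)
        H = self.net(qp).sum()
        dH = torch.autograd.grad(H, qp, create_graph=True)[0]
        dHdq, dHdp = dH.chunk(2, dim=-1)
        return torch.cat([dHdp, -dHdq], dim=-1)  # (dq/dt, dp/dt)

model = HNN(dim=1)
q = torch.randn(8, 1)
p = torch.randn(8, 1)
print(model(q, p).shape)  # torch.Size([8, 2])
```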
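
For the PyTorch porting project, the key correctness tool is a numerical parity check: load identical weights into the TensorFlow and PyTorch implementations and compare outputs on the same input. The sketch below illustrates the pattern on a plain dense layer (not Chemception itself) and assumes both tensorflow and torch are installed:

```python
# Illustrative TF -> PyTorch parity check. A Chemception port would apply this
# layer-by-layer and end-to-end.
import numpy as np
import tensorflow as tf
import torch

rng = np.random.default_rng(0)
kernel = rng.standard_normal((16, 8)).astype(np.float32)  # TF stores (in, out)
bias = rng.standard_normal(8).astype(np.float32)
x = rng.standard_normal((4, 16)).astype(np.float32)

tf_layer = tf.keras.layers.Dense(8)
tf_layer.build((None, 16))
tf_layer.set_weights([kernel, bias])
tf_out = tf_layer(x).numpy()

torch_layer = torch.nn.Linear(16, 8)
with torch.no_grad():
    torch_layer.weight.copy_(torch.from_numpy(kernel.T))  # torch stores (out, in)
    torch_layer.bias.copy_(torch.from_numpy(bias))
torch_out = torch_layer(torch.from_numpy(x)).detach().numpy()

print(np.allclose(tf_out, torch_out, atol=1e-5))  # True if the port matches
```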
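
For the HuggingFace-style loading project, here is a purely hypothetical sketch of the kind of metadata file and loader the project could introduce; neither the schema nor load_pretrained() exists in DeepChem today:

```python
# Hypothetical sketch only: the metadata schema and load_pretrained() below are
# illustrations of the idea, not existing DeepChem APIs. save writes constructor
# arguments alongside the weights so loading needs no training parameters.
import json
from pathlib import Path

def save_metadata(model_dir: str, model_class: str, init_kwargs: dict) -> None:
    meta = {"model_class": model_class, "init_kwargs": init_kwargs, "version": 1}
    Path(model_dir, "deepchem_metadata.json").write_text(json.dumps(meta))

def load_pretrained(model_dir: str):
    meta = json.loads(Path(model_dir, "deepchem_metadata.json").read_text())
    # A real implementation would look up meta["model_class"] in a registry,
    # instantiate it with meta["init_kwargs"], and restore the saved weights.
    return meta

# Intended usage once implemented:
#   model = load_pretrained("gcn_tox21/")   # no constructor args needed
```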
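
For the model-parallel training project, the simplest illustration of the idea is placing different parts of a network on different GPUs and moving activations between them. A real solution would more likely build on established PyTorch tooling (e.g. FSDP or pipeline parallelism); this sketch assumes two CUDA devices:

```python
# Minimal illustration of naive model parallelism: half of the network lives on
# each GPU and activations are moved between devices in forward().
import torch
import torch.nn as nn

class TwoGPUMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # move activations across devices

if torch.cuda.device_count() >= 2:
    model = TwoGPUMLP()
    out = model(torch.randn(32, 512))
    print(out.shape)  # torch.Size([32, 10])
```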

Community members, please add more suggestions!