This post lists potential ideas for GSoC 2024 projects. (Note that we are planning to submit an application, but we will not know whether we have been accepted until later.) We have divided projects into beginner, intermediate, and advanced categories. Each project is also listed with a suggested GSoC project size (small, medium, or large; see https://developers.google.com/open-source/gsoc/faq).
Edit: We are pleased to announce we have been selected for GSoC 2024! https://summerofcode.withgoogle.com/programs/2024/organizations/deepchem.
Beginner Friendly Projects
Beginner projects should be accessible to new developers and focus on work that doesn’t require engineering sophistication.
-
Layer Tutorials
- Length: Small (90 hours)
- Description: DeepChem has been moving towards first-class layers and now has a collection of general layers. We still need to improve the documentation for existing layers to make them more useful for the community. This project should add tutorials on using existing layers to the DeepChem tutorial series, and should plan to add a few new layers that would be useful to the community.
- Educational Value: Students will learn to improve their technical communication skills and learn how to construct useful Jupyter/Colab tutorials. Layers are easier to add than full models since they are effectively functions.
- Potential Mentors: Aryan, Jose, Riya, Maithili, Nimisha, Shreyas
-
Improving Antibody Support
- Length: Small (90 hours)
- Description: DeepChem at present doesn’t have much tooling or support for working with antibodies. This project would add suitable antibody datasets to MoleculeNet and create a tutorial walking users through antibody design and modeling with DeepChem. If time permits, students may add antibody-specific models as well.
- Educational Value: Antibodies make up an increasing fraction of newly approved medicines. Learning the basics of antibody machine learning will enable a student to work on antibody informatics in industry or in graduate school. Students will learn to improve their technical communication skills and learn how to construct useful Jupyter/Colab tutorials.
- Potential Mentors: Jose, David, Bharath
-
Improving New Drug Modality Support
- Length: Small (90 hours)
- Description: DeepChem at present doesn’t have much tooling or support for working with emerging drug modalities. These include PROTACs, antibody-drug conjugates, macrocycles, oligonucleotides, and more. This project would add a new tutorial introducing these drug modalities and providing examples of how to work with them in DeepChem. It would also be useful to identify and process relevant datasets.
- Educational Value: New drug modalities drive many emerging startups in the space. Improving DeepChem’s support for these new modalities of therapeutics could help drive discovery of new medicine at the cutting edge. It would prepare students to potentially find jobs at these up-and-coming biotech firms as well.
- Potential Mentors: Jose, David, Bharath
-
PK/PD Tutorials
- Length: Small (90 hours)
- Description: Pharmacokinetics and pharmacodynamics (PK/PD) are critical for modeling how drugs move through and act on the human body. This project will focus on building a tutorial for PK/PD modeling and, optionally, on adding tooling to DeepChem to improve PK/PD modeling support.
- Educational Value: Students will learn to improve their technical communication skills and learn how to construct useful Jupyter/Colab tutorials. Students will learn the basics of PK/PD computational modeling and science.
- Potential Mentors: Bharath
-
Improving support for drug formulations
- Length: Small (90 hours)
- Description: Drug formulations are a rich area of industrial study that is often critical for actually bringing a drug to patients. See https://drughunter.com/resource/the-modern-medicinal-chemist-s-guide-to-formulations/ for an example guide. In this project, you will build a tutorial that introduces readers to the study of drug formulations, along with DeepChem examples of how computation can help design a potential formulation.
- Educational Value: Formulations are critical for bringing drugs to patients. Improving DeepChem’s support for formulation design could help new medicines reach patients. It will prepare students to find jobs at large biotech/pharma firms as well.
- Potential Mentors: Jose, David, Bharath
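As a taste of what the PK/PD Tutorials project above might cover, here is a minimal sketch of a one-compartment intravenous bolus model, one of the simplest pharmacokinetic models. It is plain Python rather than DeepChem code; the function names and the example drug parameters are illustrative assumptions, not part of any existing DeepChem API.

```python
import math


def concentration(dose_mg: float, volume_l: float, k_el_per_h: float, t_h: float) -> float:
    """Plasma concentration (mg/L) t_h hours after an IV bolus.

    One-compartment model with first-order elimination:
        C(t) = (dose / V) * exp(-k_el * t)
    """
    return (dose_mg / volume_l) * math.exp(-k_el_per_h * t_h)


def half_life(k_el_per_h: float) -> float:
    """Elimination half-life in hours: t_1/2 = ln(2) / k_el."""
    return math.log(2) / k_el_per_h


def auc_trapezoid(times_h, concs) -> float:
    """Area under the concentration-time curve by the trapezoidal rule."""
    points = list(zip(times_h, concs))
    return sum(
        (t1 - t0) * (c0 + c1) / 2
        for (t0, c0), (t1, c1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Hypothetical drug: 100 mg dose, 50 L volume of distribution, k_el = 0.1 per hour.
    dose, volume, k_el = 100.0, 50.0, 0.1
    times = [0.5 * i for i in range(97)]  # 0 to 48 hours in 30-minute steps
    concs = [concentration(dose, volume, k_el, t) for t in times]
    print(f"half-life: {half_life(k_el):.2f} h")
    print(f"AUC(0-48h): {auc_trapezoid(times, concs):.2f} mg*h/L")
```

For this model the area under the curve out to infinite time has the closed form dose / (V * k_el), which gives a quick sanity check on the numerical estimate. A tutorial could start from a model like this before moving on to fitting parameters from data.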
Intermediate Projects
These projects require some degree of hacking, but likely won’t raise challenging engineering difficulties.
-
Protein Language Modeling
- Length: Large (300 hours)
- Description: DeepChem now has integrated HuggingFace support for using language models with chemistry applications. Extend this support to add a protein language model integrated with DeepChem. This can first be done in a standalone tutorial and then as direct contributions to the main library.
- Educational Value: Protein machine learning is a hot field with a host of academic groups and startups working on it. Students will gain an introduction to this rapidly growing area and will learn to work with HuggingFace tools, which are increasingly standard for open source transformers. Students who continue working with us may be able to publish a research paper based on their work.
- Potential Mentors: Arun, David, Sriphani, Bharath
-
Improve Equivariance Support
- Length: Medium (175 hours)
- Description: DeepChem has limited support for equivariant models. This project would extend that support and add an equivariant model, such as an SE(3)-transformer or a tensor field network, to DeepChem, possibly by integrating with e3nn.
- Educational Value: Equivariance is one of the most interesting ideas in modern machine learning and underpins powerful systems like AlphaFold2. Contributors will learn more about this field and could potentially write a research paper about their work on this project.
- Potential Mentors: Aryan, Riya, Nimisha, Sriphani, Bharath, Shreyas
-
Torch compile and PyTorch 2.2.0
- Length: Medium (175 hours)
- Description: torch.compile offers the potential to significantly speed up DeepChem models and to make serving them easier. In this project, you will experiment with torch.compile in PyTorch 2.2.0 to speed up DeepChem models and benchmark the resulting speedups.
- Educational Value: Learning to optimize models is a powerful skill that will teach GPU basics, numerical methods, and potentially distributed programming.
- Potential Mentors: Arun, Bharath
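For the Torch compile project, a benchmark needs warm-up calls, because the first call to a compiled model triggers compilation and would otherwise dominate the timing. The sketch below shows one way to set that up. The helper names (time_call, speedup, compare_eager_and_compiled) are hypothetical, and a plain torch.nn.Module stands in for a DeepChem model; only torch.compile, torch.no_grad, and torch.cuda.synchronize are real PyTorch calls.

```python
import statistics
import time
from typing import Any, Callable, Dict, Optional


def time_call(
    fn: Callable[..., Any],
    *args: Any,
    warmup: int = 3,
    repeats: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the median wall-clock seconds of fn(*args) after warm-up calls."""
    for _ in range(warmup):
        fn(*args)  # warm-up: absorbs one-time costs such as compilation
    if sync is not None:
        sync()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio above 1.0 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


def compare_eager_and_compiled(model: Any, example_input: Any) -> Dict[str, float]:
    """Time a torch.nn.Module in eager mode and after torch.compile."""
    import torch  # imported lazily so the timing helpers work without PyTorch

    compiled = torch.compile(model)
    sync = torch.cuda.synchronize if torch.cuda.is_available() else None
    with torch.no_grad():
        eager_s = time_call(model, example_input, sync=sync)
        compiled_s = time_call(compiled, example_input, sync=sync)
    return {
        "eager_s": eager_s,
        "compiled_s": compiled_s,
        "speedup": speedup(eager_s, compiled_s),
    }
```

Reporting the median rather than the mean keeps a single slow run from skewing the result. A real benchmark for this project would also vary batch size and model type, since the benefit of compilation depends heavily on both.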
Advanced Projects
These projects raise considerable technical and engineering challenges. We recommend that students who want to tackle these projects have past experience working in large codebases and tackling code reviews for complex code.
-
Implement a Wishlist Model
- Length: Large (300 hours)
- Description: DeepChem has an extensive wishlist of models (https://github.com/deepchem/deepchem/issues/2680). Pick a model from the wishlist and implement it in DeepChem. We suggest tackling some of the neural differential equation models such as Neural ODEs or Fourier Neural Operators to improve DeepChem’s physiological modeling toolkit.
- Educational Value: Implementing a machine learning model from scratch or from an academic reference into a production grade library like DeepChem is a challenging task. Doing so requires understanding the base model, dealing with numerical issues in implementation, and benchmarking the model correctly. Multiple past GSoC contributors have leveraged their implementations to write papers on their work and have gained skills that they have used subsequently in industry or in academia.
- Potential Mentors: Depends on model.
-
PyTorch Porting
- Length: Medium (175 hours)
- Description: DeepChem has mostly shifted to PyTorch as its primary backend, but a few models are still implemented in TensorFlow. A good project would be to pick a TensorFlow model or two and port each model and its layers to PyTorch, along with suitable unit tests. See https://github.com/deepchem/deepchem/issues/2863.
- Educational Value: Porting models while preserving numerical properties requires a strong understanding of deep learning implementations. It serves as a test of machine learning know-how that will serve students well in future machine learning positions in academia or industry.
- Potential Mentors: Aryan, Jose, Riya, Nimisha, Bharath, Shreyas
-
ModularTorchModel Implementation Work
- Length: Large (300 hours)
- Description: DeepChem recently added support for ModularTorchModel, which enables systematic pretraining by breaking models into modular components that can be pieced together. In this project, you will extend DeepChem’s library of models by porting more models that currently use TorchModel, our older Torch infrastructure, onto ModularTorchModel.
- Educational Value: Software support for self-supervised learning and systematic pretraining will teach very useful machine learning engineering skills and will push students to learn benchmarking skills and CI handling.
- Potential Mentors: Arun, Jose, Bharath
-
Target Validation
- Length: Large (300 hours)
- Description: DeepChem currently has no infrastructure to help with target selection. A target is a protein or other biomolecule/system believed to be implicated in a disease. This project will aim to build infrastructure for target selection and validation. For example, a protein language model could be used to classify proteins as suitable/druggable targets or not. Alternatively, a new tutorial could be added introducing the problem of target validation.
- Educational Value: This is a scientifically challenging project which will require reading papers and forming hypotheses on good directions. If this project is executed well, it could reasonably lead to a solid publication.
- Potential Mentors: Jose, David, Bharath
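For the PyTorch Porting project, the "suitable unit tests" usually include a parity check: feed the original layer and the ported layer the same weights and inputs, then compare outputs within a tolerance. The sketch below illustrates the idea in plain Python so it runs anywhere. Both dense layers are hypothetical stand-ins; a real port would load shared weights into the TensorFlow and PyTorch layers and compare their array outputs in the same way.

```python
import math
import random


def reference_dense(x, weights, bias):
    """Stand-in for the original layer: y_j = sum_i x_i * W[i][j] + b_j."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def ported_dense(x, weights, bias):
    """Stand-in for the port: same math, accumulated in a different order."""
    out = list(bias)
    for xi, row in zip(x, weights):
        for j, w in enumerate(row):
            out[j] += xi * w
    return out


def make_case(seed, n_in=4, n_out=3, n_samples=5):
    """Build reproducible random inputs and shared parameters for both layers."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    inputs = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_samples)]
    return inputs, weights, bias


def check_parity(reference, port, inputs, weights, bias, rel_tol=1e-6, abs_tol=1e-9):
    """Raise AssertionError if the port disagrees with the reference on any input."""
    for x in inputs:
        expected = reference(x, weights, bias)
        actual = port(x, weights, bias)
        assert len(expected) == len(actual), "output shapes differ"
        for e, a in zip(expected, actual):
            assert math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol), (e, a)


if __name__ == "__main__":
    inputs, weights, bias = make_case(seed=0)
    check_parity(reference_dense, ported_dense, inputs, weights, bias)
    print("port matches reference within tolerance")
```

The tolerances matter because two correct implementations can differ slightly in floating-point rounding when they accumulate sums in a different order. Fixing the random seed keeps any failure reproducible.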
Community members, please add on more suggestions!