Google Summer of Code 2024 Ideas

This post lists potential ideas for GSoC 2024 projects. (Note that we are planning to submit an application, but we will not know whether we have been accepted until later.) We have divided projects into beginner, intermediate, and advanced categories. Each project is also listed with a suggested GSoC project size (small, medium, or large; see https://developers.google.com/open-source/gsoc/faq).

Edit: We are pleased to announce we have been selected for GSoC 2024! https://summerofcode.withgoogle.com/programs/2024/organizations/deepchem.

Beginner Friendly Projects

Beginner projects should be accessible to new developers and focus on work that doesn’t require engineering sophistication.

  • Layer Tutorials
    • Length: Small (90 hours)
    • Description: DeepChem has been moving towards first class layers and now has a collection of general layers. We still need to improve the documentation for existing layers to make them more useful for the community. This project should add tutorials for using existing layers to the DeepChem tutorial series, and should plan to add a few new layers that would be useful to the community.
    • Educational Value: Students will learn to improve their technical communication skills and learn how to construct useful Jupyter/Colab tutorials. Layers are easier to add than full models since they are effectively functions.
    • Potential Mentors: Aryan, Jose, Riya, Maithili, Nimisha, Shreyas
  • Improving Antibody Support
    • Length: Small (90 hours)
    • Description: DeepChem at present doesn’t have much tooling or support for working with antibodies. This project would add suitable antibody datasets to MoleculeNet and create a tutorial walking users through antibody design and modeling with DeepChem. If time permits, students may add antibody-specific models as well.
    • Educational Value: Antibodies make up an increasing fraction of newly approved medicines. Learning the basics of antibody machine learning will enable a student to work on antibody informatics in industry or in graduate school. Students will learn to improve their technical communication skills and learn how to construct useful Jupyter/Colab tutorials.
    • Potential Mentors: Jose, David, Bharath
  • Improving New Drug Modality Support
    • Length: Small (90 hours)
    • Description: DeepChem at present doesn’t have much tooling or support for working with emerging drug modalities. These include PROTACs, antibody-drug conjugates, macrocycles, oligonucleotides, and more. This project would add a new tutorial that introduces these drug modalities and provides examples of how to work with them in DeepChem. It would also be useful to identify and process relevant datasets.
    • Educational Value: New drug modalities drive many emerging startups in the space. Improving DeepChem’s support for these new modalities of therapeutics could help drive discovery of new medicine at the cutting edge. It would prepare students to potentially find jobs at these up-and-coming biotech firms as well.
    • Potential Mentors: Jose, David, Bharath
  • PK/PD Tutorials
    • Length: Small (90 hours)
    • Description: Pharmacokinetics and pharmacodynamics are critical for modeling the presence of drugs in the human body. This project will focus on building a tutorial for PK/PD modeling and optionally on adding tooling to DeepChem to improve PK/PD modeling support.
    • Educational Value: Students will learn to improve their technical communication skills and learn how to construct useful Jupyter/Colab tutorials. Students will learn the basics of PK/PD computational modeling and science.
    • Potential Mentors: Bharath
  • Improving support for drug formulations
    • Length: Small (90 hours)
    • Description: Drug formulations are a rich area of industrial study that is often critical for actually bringing a drug to patients. For an example guide, see https://drughunter.com/resource/the-modern-medicinal-chemist-s-guide-to-formulations/. In this project, you will build a tutorial introducing readers to the study of drug formulations, along with DeepChem examples of how you can computationally help design a potential formulation.
    • Educational Value: Formulations are critical for bringing drugs to patients. Improving DeepChem’s support for formulation design could help new medicines reach patients. It will also prepare students to find jobs at large biotech/pharma firms.
    • Potential Mentors: Jose, David, Bharath
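
For anyone wondering what a first PK/PD tutorial cell might look like, here is a minimal sketch, assuming a one-compartment model with an intravenous bolus dose (first-order elimination) and a simple Emax effect model. The function names and parameter values are made up for illustration and are not part of DeepChem.

```python
import math


def concentration(dose_mg, volume_l, k_elim_per_h, t_h):
    """Plasma concentration (mg/L) at time t_h after an IV bolus.

    One-compartment model: C(t) = (dose / V) * exp(-k * t).
    """
    return (dose_mg / volume_l) * math.exp(-k_elim_per_h * t_h)


def half_life(k_elim_per_h):
    """Elimination half-life (hours) for a first-order rate constant."""
    return math.log(2) / k_elim_per_h


def emax_effect(conc, emax, ec50):
    """Emax pharmacodynamic model: effect saturates at emax as conc grows."""
    return emax * conc / (ec50 + conc)


# Illustrative (hypothetical) parameters: 100 mg dose, 10 L volume, k = 0.1 / h.
for t in (0, 2, 4, 8, 24):
    c = concentration(100.0, 10.0, 0.1, t)
    print(f"t={t:>2} h  conc={c:.3f} mg/L  effect={emax_effect(c, 1.0, 2.0):.3f}")
```

A real tutorial would go on to fit parameters such as the elimination rate to measured concentration data; this sketch only evaluates the closed-form curves.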

Intermediate Projects

These projects require some degree of hacking, but likely won’t raise challenging engineering difficulties.

  • Protein Language Modeling
    • Length: Large (300 hours)
    • Description: DeepChem now has integrated HuggingFace support for using language models with chemistry applications. Extend this support to add a protein language model integrated with DeepChem. This can first be done in a standalone tutorial and then as direct contributions to the main library.
    • Educational Value: Protein machine learning is a hot field with a host of academic groups and startups working on it. Students will gain an introduction to this rapidly growing area and will learn to work with HuggingFace tools, which are increasingly standard for open source transformers. Students who continue working with us may be able to publish a research paper based on their work.
    • Potential Mentors: Arun, David, Sriphani, Bharath
  • Improve Equivariance Support
    • Length: Medium (175 hours)
    • Description: DeepChem has limited support for equivariant models. This project would extend support for equivariance to DeepChem and add an equivariant model such as SE(3)-transformers or tensor field networks to DeepChem possibly by integrating with e3nn.
    • Educational Value: Equivariance is one of the most interesting ideas in modern machine learning and underpins powerful systems like AlphaFold2. Contributors will learn more about this field and could potentially write a research paper about their work on this project.
    • Potential Mentors: Aryan, Riya, Nimisha, Sriphani, Bharath, Shreyas
  • Torch compile and PyTorch 2.2.0
    • Length: Large (175 hours)
    • Description: Torch compile offers the potential for significantly speeding up DeepChem models and facilitating model serving. In this project, you will experiment with using Torch compile in PyTorch 2.2.0 to speed up DeepChem models and benchmark the speedups.
    • Educational Value: Learning to optimize models is a powerful skill that will teach GPU basics, numerical methods, and potentially distributed programming.
    • Potential Mentors: Arun, Bharath
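
To make the equivariance idea above concrete, here is a minimal sketch of how one might check numerically that a function commutes with rotations, i.e. that f(Rx) = R f(x). It uses only the standard library and rotations about the z axis, which is an assumption made for brevity; it does not use e3nn or any DeepChem API, and the toy "layer" (centroid subtraction) is purely illustrative.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def centered(points):
    """Toy layer: subtract the centroid from every point."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return [tuple(p[i] - centroid[i] for i in range(3)) for p in points]


def pairwise_distances(points):
    """Rotation-invariant features: all interatomic distances."""
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]


def is_equivariant(layer, points, angle, tol=1e-9):
    """Check layer(rotate(points)) == rotate(layer(points)) within tol."""
    rotate_first = layer([rotate_z(p, angle) for p in points])
    rotate_last = [rotate_z(p, angle) for p in layer(points)]
    return all(
        math.isclose(a, b, abs_tol=tol)
        for u, v in zip(rotate_first, rotate_last)
        for a, b in zip(u, v)
    )


# Hypothetical coordinates for a three-atom system.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.0, 2.0, -1.0)]
print("centered is equivariant:", is_equivariant(centered, coords, 0.7))
```

A test like this is only a sanity check at a few sampled rotations, but the same pattern can be applied to a real equivariant layer once one is implemented.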

Advanced Projects

These projects raise considerable technical and engineering challenges. We recommend that students who want to tackle these projects have past experience working in large codebases and tackling code reviews for complex code.

  • Implement a Wishlist Model
    • Length: Large (300 hours)
    • Description: DeepChem has an extensive wishlist of models (https://github.com/deepchem/deepchem/issues/2680). Pick a model from the wishlist and implement it in DeepChem. We suggest tackling some of the neural differential equation models such as Neural ODEs or Fourier Neural Operators to improve DeepChem’s physiological modeling toolkit.
    • Educational Value: Implementing a machine learning model from scratch or from an academic reference into a production grade library like DeepChem is a challenging task. Doing so requires understanding the base model, dealing with numerical issues in implementation, and benchmarking the model correctly. Multiple past GSoC contributors have leveraged their implementations to write papers on their work and have gained skills that they have used subsequently in industry or in academia.
    • Potential Mentors: Depends on model.
  • PyTorch Porting
    • Length: Medium (175 hours)
    • Description: DeepChem has mostly shifted to PyTorch as its primary backend, but a few models are still implemented in TensorFlow. A good project could be to pick a TensorFlow model or two, then port its layers and model into PyTorch along with suitable unit tests. See https://github.com/deepchem/deepchem/issues/2863.
    • Educational Value: Porting models while preserving numerical properties requires a strong understanding of deep learning implementations. It serves as a test of machine learning know-how that will serve students well in future machine learning positions in academia or industry.
    • Potential Mentors: Aryan, Jose, Riya, Nimisha, Bharath, Shreyas
  • ModularTorchModel Implementation Work
    • Length: Large (300 hours)
    • Description: DeepChem recently added support for ModularTorchModel, which enables systematic pretraining by breaking models into modular components that can be pieced together. In this project, you will port more models that currently use TorchModel, our older Torch infrastructure, onto ModularTorchModel.
    • Educational Value: Building software support for self-supervised learning and systematic pretraining will teach very useful machine learning engineering skills and will push students to learn benchmarking skills and CI handling.
    • Potential Mentors: Arun, Jose, Bharath
  • Target Validation
    • Length: Large (300 hours)
    • Description: DeepChem currently has no infrastructure to help with target selection. A target is a protein or other biomolecule/system believed to be implicated in a disease. This project will aim to build infrastructure for target selection and validation. For example, a protein language model could be used to classify proteins as suitable/druggable targets or not. Alternatively, a new tutorial could be added introducing the problem of target validation.
    • Educational Value: This is a scientifically challenging project which will require reading papers and forming hypotheses on good directions. If this project is executed well, it could reasonably lead to a solid publication.
    • Potential Mentors: Jose, David, Bharath

Community members, please add on more suggestions!

Hi, I’m Ronan Coutinho, a second-year student pursuing a BTech in Computer Science Engineering at BITS Pilani, India. I found the description of the Protein Language Modeling project really interesting and am eager to contribute to it. I have worked with HuggingFace before and have experience in fine-tuning LLMs. Before diving into the project, I would greatly appreciate your guidance on how best to get started. Please let me know how I can communicate with you all.

Hi, I am Sangameshwar Narayanan, a student pursuing a B.Tech in CSE. I found the description of the Protein Language Modelling project fascinating and am ready to contribute to it. I have worked with LangChain, HuggingFace, and LLMs. I would like some guidance on how to start and what you expect from the project. Also, what is the mode of communication with your team?

Please join the Discord at https://discord.gg/FKh47UEctV and we would be happy to discuss more with you there. You can also join the office hours (see “Announcing the DeepChem Office Hours”).

Hello,
My name is Priya Yadav, and I am currently a sophomore pursuing a degree in computer science and engineering. I am familiar with Python and have experience working with Jupyter Lab. While I don’t have professional experience, I have engaged in coding, particularly in areas such as data analysis, machine learning, and Android development. So, I would like to contribute to the beginner-friendly project, Improving Antibody Support.
I’d appreciate any advice on getting started and approaching the project. Excited about the possibility of contributing.
Thanks

Dear Bharath,

I am Shashank Shekhar Singh, a sophomore from IIT BHU, India, with a keen interest in AI and machine learning. Intrigued by the Protein Language Modeling project, I’m eager to contribute my expertise in Hugging Face, TensorFlow, PyTorch, and LLMs. Before diving in, I’d appreciate your guidance on getting started.

I’m also interested in the PyTorch Porting and ModularTorchModel Implementation Work projects. Could you advise on the best ways to proceed further? Excited to collaborate with you and @deep_chem!

Hi, I am Param Parekh, a second-year B.Tech student at VJTI, Mumbai, India (GMT +5:30). The “Implement a Wishlist Model” project fascinates me with its extensive use of deep neural network models as well as its profound educational value. I have more than one year of experience with basic machine learning algorithms. I have developed and optimized deep learning models while implementing projects on object detection and segmentation using convolutional neural networks and time series forecasting using recurrent neural networks. I work on Ubuntu 22.04 and I am familiar with the VS Code, Jupyter Notebook, and PyCharm IDEs. Kindly shed light on the desired project outcomes and how I should begin contributing.

Hi, I’m Arya, a third year B.Tech CSE Student from VIT Vellore, and a Research and Development Intern at IIT Kanpur, in the AI-ML Field. I’m extremely intrigued by Improving Antibody Support, Protein language model, and Improve Equivariance support. I am extremely familiar with Hugging Face, TensorFlow, Pytorch, LLMs and almost all existing python libraries in Machine Learning using Jupyter Notebooks, IDEs and VSC, with almost 3+ years of experience with Projects, Research Papers in foreign conferences and Internships, with hands on experience in live projects.

Would love to contribute to GSOC @DeepChem

For potential applications, please introduce yourselves at “The Introductions Thread!” and not on this thread. Let’s keep this focused on GSoC ideas and discussions.

Hello, my name is Amit Subhash Chejara and I am learning machine learning and PyTorch. I completed my BSc last year. I am interested in two ideas: Protein Language Modeling and Torch compile and PyTorch 2.2.0. Since I am currently learning PyTorch and have some experience with linear regression and classification models in PyTorch, I want to ask whether I can contribute to these projects or whether I need a deeper knowledge of PyTorch.
Please Guide me!

Hello DeepChem Team,
I am Suraj Mahapatra, a third-year B.Tech student at SRM Institute of Science and Technology. Currently engaged in a research and development program focusing on Large Language Models (LLMs) at NIT Rourkela, I am eager to leverage my skills in the field of Deep Learning and develop successful models out of it. I am writing to express my strong interest in contributing to the development of the Protein Language Model and Improving Equivariance Support.

Previously I have worked with Hugging Face models, Vision Transformers, TensorFlow, and PyTorch, and I am excited to share my expertise by contributing to your project. Before taking the plunge, I would value your suggestions and mentoring on the project.

Hello! I am Divyanshu Rana, a first-year student at Graphic Era University, where I am pursuing a Master of Computer Applications (MCA) degree. I am really excited about Improving Antibody Support and Layer Tutorials. I am familiar with deep learning frameworks such as PyTorch, which would be essential for implementing antibody-specific models.

Thank you for your attention, consideration, and ongoing support. Together, let us continue to strive for excellence and make a positive difference in the lives of individuals.

With warm regards,
Divyanshu Rana

Hello Bharath,

I am Sasidharan, a final-year B.E. student studying Computer Science. I am intrigued by Layers Tutorials, Improving Antibody Support, and Improving New Drug Modality Support, and I’m enthusiastic about contributing my expertise to @DeepChem. Before delving deeper, I would greatly appreciate your guidance on how to get started.

Thank you!

To repeat, folks, please don’t introduce yourselves on this channel. It’s not the right place. Please use the introductions channel: “The Introductions Thread!”

Hello @bharath,
I have been studying the DeepChem codebase recently. The community has made quite an advance on the software side of this domain, but I think there is a gap in the polymers domain. I could not find much code or many contributions in this field; even in MoleculeNet I could not find much regarding polymers. If there is code for studying monomer or polymer behaviour, or any dataset, please help me find it. Otherwise, if it suits you, we could come up with a project proposal for GSoC’24. I have a few ideas we could discuss.

Hey there! :wave: I’m Aparna, a third year student at IIT BHU and I’m super excited about the DeepChem Layer Tutorials Enhancement project! I’ve been working on Jupyter Notebook and Colab projects, diving into different machine learning concepts. The project “Layer Tutorials” aligns perfectly with my interests, as I’m eager to improve technical communication skills and contribute to the community. Could you guide me on the best way to get started? I’m ready to dive in and make some meaningful contributions! :rocket:

A proposal about polymers could be very welcome. You should try to center it around applications to drug discovery. Come by office hours to discuss with us!

Come by office hours and we would be glad to give guidance!

I am thrilled to express my interest in the project “Protein Language Modeling” for GSOC 2024. My name is Awnish Singh, a fourth-year undergraduate student at BITS Pilani, where I have been deeply involved in research projects under the guidance of Dr. S. Murugesan, focusing on target drug prediction. My experience spans various domains, from computer vision to software development, and I have actively contributed to GSOC in the past.

I am particularly excited about the opportunity to extend DeepChem’s support for using language models with chemistry applications to include protein language modeling. Given the growing importance of protein machine learning in both academia and startups, I believe this project offers a unique opportunity to make a meaningful contribution to the field.

Currently, I am gaining valuable experience through an internship where I am working on power automation with MS Azure and the EasyOCR library for text extraction. My previous involvement in a Genetic Algorithm-based project focused on implementing deep learning models for identifying features of protein coding genes has provided me with a strong foundation in this area.

I am eager to collaborate with the DeepChem community and contribute to the development of a protein language model. I am committed to attending office hours and engaging in discussions to ensure the success of this project.

@awnish10-scs Please post in the introductions thread, “The Introductions Thread!”. This channel is only for general GSoC questions about topics.