DeepChem GSoC 2026 Potential Ideas

We have a lot of newcomers joining us here. Welcome to the community! I am scoping out potential projects for GSoC 2026 (remember we have to apply to get in, so there is no guarantee DeepChem will be selected yet). Here are some tentative project directions (I will update this forum post as we get new ideas). The expected outcomes are the first portion of each entry:

  • Symbolic machine learning
    Description: We want to implement a symbolic regression capability in PyTorch that we can use in DeepChem (we prefer not to call out to a Julia backend, as PySR does). Think of something like https://arxiv.org/abs/2305.01582, except implemented in PyTorch.
    Skills Required: PyTorch experience, some mathematical background
    Expected Outcomes: (1) A robust implementation of a symbolic regression system using PyTorch within DeepChem, (2) comprehensive benchmarks comparing the proposed system against standard tools like PySR.
    Potential Mentors: Aryan, Shreyas, Bharath
    Expected Size: Medium
    Expected Difficulty: Medium
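As a rough starting point for this direction, symbolic regression can be prototyped as sparse linear regression over a library of candidate symbolic terms, trained by gradient descent in PyTorch. This is only an illustrative sketch under assumed toy data (none of these names are DeepChem APIs), not the full end-to-end system the project asks for:

```python
# Sketch: gradient-based symbolic regression as sparse regression over a
# term library (SINDy-style), in plain PyTorch. Illustrative only.
import torch

torch.manual_seed(0)
x = torch.linspace(-2, 2, 200).unsqueeze(1)
y = 3.0 * x**2 + 0.5 * torch.sin(x)          # hidden target expression

# Library of candidate symbolic terms evaluated on x
library = torch.cat([x, x**2, x**3, torch.sin(x), torch.cos(x)], dim=1)
names = ["x", "x^2", "x^3", "sin(x)", "cos(x)"]

coef = torch.zeros(library.shape[1], 1, requires_grad=True)
opt = torch.optim.Adam([coef], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    pred = library @ coef
    # L1 penalty encourages a sparse, human-readable expression
    loss = torch.mean((pred - y) ** 2) + 1e-3 * coef.abs().sum()
    loss.backward()
    opt.step()

# Keep only terms with non-negligible weight
expr = [(n, round(c, 2)) for n, c in
        zip(names, coef.detach().squeeze().tolist()) if abs(c) > 0.1]
print(expr)
```

A full project would grow this into expression-tree search or differentiable operator selection, plus the benchmarks against PySR.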
  • MLIP support
    Description: Implement a machine-learned interatomic potential (MLIP) model as a TorchModel in DeepChem. Make sure to leverage DeepChem's existing equivariance tools rather than just calling out to an external framework. For reference, see https://github.com/instadeepai/mlip, but we want a full implementation in PyTorch.
    Expected Outcomes: (1) A robust implementation of an MLIP such as NequIP or MACE using PyTorch within DeepChem, leveraging DeepChem's existing equivariant architecture, (2) comprehensive benchmarks testing the MLIP on standard tests like force correctness or stability of small molecular dynamics simulations.
    Skills Required: PyTorch, some mathematical background
    Potential Mentors: Jose
    Expected Size: Medium or Large
    Expected Difficulty: Hard
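A core pattern behind MLIPs like NequIP and MACE is worth internalizing early: the network predicts a scalar, invariant energy, and forces come from automatic differentiation. Here is a minimal, illustrative sketch using a toy pairwise-distance model (not an equivariant DeepChem architecture; all names are placeholders):

```python
# Sketch: the force-from-energy pattern used by MLIPs. Illustrative only.
import torch

class ToyEnergyModel(torch.nn.Module):
    """Invariant pairwise-distance MLP; a real MLIP would use equivariant
    message passing (e.g. NequIP/MACE-style tensor products)."""
    def __init__(self):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(1, 16), torch.nn.SiLU(), torch.nn.Linear(16, 1))

    def forward(self, pos):
        n = pos.shape[0]
        iu = torch.triu_indices(n, n, offset=1)       # unique atom pairs
        dist = (pos[iu[0]] - pos[iu[1]]).norm(dim=-1, keepdim=True)
        return self.mlp(dist).sum()                   # total scalar energy

pos = torch.randn(5, 3, requires_grad=True)           # toy atom positions
energy = ToyEnergyModel()(pos)
forces = -torch.autograd.grad(energy, pos)[0]         # F = -dE/dR
print(forces.shape)                                   # (5, 3)
```

Benchmarking force correctness, as the outcomes above ask, then amounts to comparing these autograd forces against reference data.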
  • LLM support for 7B models in DeepChem
    Description: Add an OLMo model to DeepChem (https://huggingface.co/allenai/OLMo-7B) using the HuggingFaceModel wrapper in DeepChem. It should be possible to train and run inference with OLMo. Make sure you support generation, regression, and classification, and the ability to continue pretraining on molecular data.
    Expected Outcomes: (1) A robust implementation of OLMo in DeepChem using the HuggingFaceModel wrapper, (2) demonstration of how to do fine-tuning for classification/regression/generation, and (3) the ability to perform additional pretraining on molecular datasets.
    Skills Required: HuggingFace, PyTorch
    Potential Mentors: Riya, Harindhar
    Expected Size: Medium or Large
    Expected Difficulty: Medium
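To see what the wrapper needs to support, here is a toy illustration of one shared backbone driving generation, regression, and classification heads, written in plain PyTorch with a tiny stand-in transformer (the real project would wrap allenai/OLMo-7B through DeepChem's HuggingFaceModel; every name below is illustrative):

```python
# Sketch: one backbone, three task heads. A stand-in for the OLMo wrapper.
import torch

vocab, d = 100, 32
backbone = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=1)
embed = torch.nn.Embedding(vocab, d)
lm_head = torch.nn.Linear(d, vocab)      # generation: next-token logits
reg_head = torch.nn.Linear(d, 1)         # regression: scalar property
cls_head = torch.nn.Linear(d, 2)         # classification: 2 classes

tokens = torch.randint(0, vocab, (4, 10))      # (batch, seq_len)
h = backbone(embed(tokens))                    # shared hidden states
pooled = h.mean(dim=1)                         # simple mean pooling

gen_logits = lm_head(h)         # (4, 10, vocab)
reg_out = reg_head(pooled)      # (4, 1)
cls_logits = cls_head(pooled)   # (4, 2)
print(gen_logits.shape, reg_out.shape, cls_logits.shape)
```

Continued pretraining on molecular data is then just the generation head trained with a language-modeling loss on SMILES or related corpora.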
  • Implement RFDiffusion/RFDiffusion2
    Description: Implement RFDiffusion/RFDiffusion2 or other protein design models in DeepChem. Implementations should be end-to-end in PyTorch and interface with standard DeepChem abstractions such as TorchModel and DeepChem datasets.
    Expected Outcomes: (1) A clean implementation of RFDiffusion/RFDiffusion2 in DeepChem using TorchModel, leveraging existing equivariance primitives, (2) benchmarking of the model on standard protein generation tests/tasks to validate performance.
    Skills Required: PyTorch, some background in biology and mathematics
    Potential Mentors: Rishi, Jose, Bharath
    Expected Size: Medium or Large
    Expected Difficulty: Hard
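To give a sense of the core machinery, here is a minimal sketch of the forward (noising) process that diffusion-based protein generators are built around, applied to toy 3D coordinates (illustrative only; RFDiffusion additionally diffuses residue frames/orientations, which this sketch omits):

```python
# Sketch: DDPM-style forward noising of toy backbone coordinates.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

coords = torch.randn(64, 3)                      # toy C-alpha coordinates
t = 500
noise = torch.randn_like(coords)
# q(x_t | x_0) = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
x_t = alphas_bar[t].sqrt() * coords + (1 - alphas_bar[t]).sqrt() * noise
# a denoiser network is trained to predict `noise` (or x_0) from (x_t, t)
print(x_t.shape)
```

The bulk of the project is the reverse process: an equivariant denoiser (hence "leverage existing equivariance primitives") plus sampling and benchmarking.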
  • Improve DFT support in DeepChem
    Description: DeepChem has preliminary density functional theory support (https://arxiv.org/abs/2309.15985). Build on this! Can you solve new systems, make it scale better, or implement other XC-functionals? Can you model more complex systems like reactions?
    Expected Outcomes: (1) Add a new concrete capability to DeepChem's DFT support, for example a new XC-functional (you may suggest reasonable alternatives), (2) benchmark this new capability on an appropriate choice of system and validate against existing DFT tools like GPAW or PySCF.
    Skills Required: PyTorch, some background in chemistry or quantum mechanics
    Potential Mentors: Rakshit, Aryan, Bharath
    Expected Size: Small, Medium, or Large
    Expected Difficulty: Medium
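For a flavor of what implementing an XC-functional involves, here is a minimal differentiable sketch of the LDA exchange functional on a uniform grid, E_x[ρ] = -(3/4)(3/π)^(1/3) ∫ ρ^(4/3) dr, in plain PyTorch (illustrative only; DeepChem's DFT module defines its own functional interfaces, which a real contribution must plug into):

```python
# Sketch: LDA exchange as a differentiable PyTorch function.
import math
import torch

def lda_exchange_energy(rho, dvol):
    """LDA exchange energy for a density sampled on a uniform grid:
    E_x = -(3/4) * (3/pi)^(1/3) * sum(rho^(4/3)) * dvol."""
    c = -0.75 * (3.0 / math.pi) ** (1.0 / 3.0)
    return c * (rho ** (4.0 / 3.0)).sum() * dvol

dvol = 0.01                                   # grid volume element
rho = torch.rand(1000) + 0.1                  # toy positive density
rho.requires_grad_(True)
e_x = lda_exchange_energy(rho, dvol)
# autograd yields the exchange potential v_x = dE_x/drho at each grid point
v_x = torch.autograd.grad(e_x, rho)[0] / dvol
print(e_x.item(), v_x.shape)
```

Differentiability is the point: it gives the potential for free and lets the functional sit inside a learned or self-consistent loop.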
  • Improve materials machine learning in DeepChem
    Description: DeepChem has simple crystal graph convolutions and lattice adsorption model support from a few years ago. Test these models on real systems and improve them. We encourage you to explore generative models such as https://arxiv.org/abs/2110.06197, and possibly implement newer papers such as MACE. Please make sure to do implementations in DeepChem using standard tools like TorchModel.
    Expected Outcomes: (1) Implement a new model for materials machine learning such as MACE or Crystal Diffusion Variational Autoencoders in DeepChem using TorchModel. Please leverage DeepChem’s existing equivariance utilities as needed. (2) Benchmark this system on a suitable scientific dataset.
    Skills Required: PyTorch, some background in materials science
    Potential Mentors: Aryan, Bharath
    Expected Size: Medium or Large
    Expected Difficulty: Medium
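One small but essential building block for crystal models is constructing graph edges under periodic boundary conditions. Here is a minimal sketch using the minimum-image convention for a cubic cell (illustrative only; a real featurizer must handle general lattices and multiple periodic images):

```python
# Sketch: periodic minimum-image distances for crystal-graph edges.
import torch

cell = 5.0                                  # cubic lattice parameter
frac = torch.rand(8, 3)                     # fractional atom coordinates
delta = frac.unsqueeze(0) - frac.unsqueeze(1)
delta = delta - delta.round()               # minimum-image wrap to [-0.5, 0.5]
dist = (delta * cell).norm(dim=-1)          # (8, 8) pairwise distances

cutoff = 3.0
src, dst = torch.where((dist < cutoff) & (dist > 0))   # graph edge list
print(dist.shape, src.shape)
```

These edges then feed a crystal graph convolution or an equivariant model such as MACE.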
  • Single Cell and DNA Foundation Models
    Description: Implement a single cell or DNA foundation model in DeepChem, using the existing ChemBERTa and MolFormer models as a guide: use HuggingFace as a backend, but make sure to integrate fully with DeepChem's pretraining and fine-tuning infrastructure. The model also needs to inherit from TorchModel and work with DeepChem datasets.
    Expected Outcomes: (1) Implement a single cell or DNA foundation model in DeepChem leveraging HuggingFaceModel, (2) implement tokenizers or other needed utilities in DeepChem, (3) benchmark this model on a suitable dataset.
    Skills Required: HuggingFace, PyTorch
    Potential Mentors: Rishi, Harindhar
    Expected Size: Medium or Large
    Expected Difficulty: Medium
  • Differentiable FEM/FVM
    Description: Implement a differentiable finite element method (FEM) or finite volume method (FVM) in DeepChem. Here are a couple of potential references: https://arxiv.org/abs/2506.18427, https://arxiv.org/abs/2307.02494. Make sure to benchmark against standard FEM/FVM solvers, and to use standard DeepChem abstractions such as TorchModel and DeepChem datasets.
    Expected Outcomes: (1) Implement a finite element method (or finite volume method) as a well-designed utility function; this must use PyTorch and be end-to-end differentiable, (2) provide an implementation of a mesh data structure for use in the finite element method, (3) run benchmarks to demonstrate differentiability.
    Skills Required: PyTorch, numerical methods
    Potential Mentors: Rakshit, Abhay
    Expected Size: Medium or Large
    Expected Difficulty: Hard
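As a tiny proof of concept of "end-to-end differentiable," here is a 1D Poisson problem (-u'' = f, with u(0) = u(1) = 0) discretized and solved in PyTorch, with a gradient taken through the linear solve (illustrative only; the actual project needs real mesh data structures and more general PDEs):

```python
# Sketch: differentiable 1D Poisson solve via backprop through linalg.solve.
import torch

n = 50
h = 1.0 / (n + 1)
x = torch.linspace(h, 1 - h, n)             # interior grid points

# Standard tridiagonal stiffness matrix for -u'' with Dirichlet BCs
A = (2 * torch.eye(n) - torch.diag(torch.ones(n - 1), 1)
     - torch.diag(torch.ones(n - 1), -1)) / h**2

amp = torch.tensor(1.0, requires_grad=True)  # differentiable source amplitude
f = amp * torch.sin(torch.pi * x)            # exact solution: sin(pi x)/pi^2
u = torch.linalg.solve(A, f)

loss = u.pow(2).sum()
loss.backward()                              # gradient flows through the solve
print(u.shape, amp.grad.item())
```

The same pattern (assemble, solve, backpropagate) carries over to FEM stiffness matrices on unstructured meshes, which is where the proposed mesh data structure comes in.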
  • Robust Bi-directional Translation Between SMILES and IUPAC Nomenclature
    Description: Accurate conversion between systematic IUPAC names and SMILES strings is a fundamental requirement for many chemistry-AI workflows. While current state-of-the-art models like Claude show promise in understanding chemical structures, they are prohibitively expensive for processing millions of molecules in research databases. Furthermore, general-purpose models often lack the necessary precision for complex chemical structures, frequently hallucinating names or failing to correctly interpret stereochemistry and complex ring systems. This project seeks a robust and scalable solution for the bi-directional translation of these identifiers. This is an open-ended challenge where contributors are encouraged to propose and evaluate various architectures. (Potential directions include sequence-to-sequence (Seq2Seq) transformers, specialized graph-to-string architectures, or hybrid rule-based and machine learning approaches.) The primary focus is on achieving high chemical fidelity and computational efficiency compared to proprietary frontier LLM models.
    Expected Outcomes: (1) A robust implementation of a SMILES-to-IUPAC and IUPAC-to-SMILES translation system within DeepChem, (2) comprehensive accuracy benchmarks comparing the proposed solution against existing tools and general-purpose LLMs, (3) support for a wide range of chemical entities, including those with complex branching and stereocenter definitions, (4) a pre-trained model or set of weights available for the DeepChem community.
    Skills Required: Strong Python programming skills, experience in machine learning (particularly sequence modeling or NLP), knowledge of cheminformatics tools such as RDKit, OpenBabel, or the PubChem APIs, and an understanding of the rules governing IUPAC nomenclature and SMILES syntax.
    Expected Difficulty: Medium to Hard
    Potential Mentors: Shreyas, Bharath
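Any seq2seq approach to this problem starts with tokenization. As one concrete starting point, here is the SMILES tokenization regex widely used in the reaction/translation transformer literature (illustrative; the IUPAC-name side needs its own tokenizer, which this sketch does not cover):

```python
# Sketch: regex-based SMILES tokenization for seq2seq models.
import re

# Splits bracket atoms, two-letter elements (Br, Cl), aromatic atoms,
# bonds, branches, and ring-closure digits into individual tokens.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|"
    r"\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])")

def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # sanity check: every character must be consumed by some token
    assert "".join(tokens) == smiles, "untokenized characters present"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
```

Chemical fidelity can then be checked round-trip with RDKit (canonicalize the predicted SMILES and compare), which is a natural core of the benchmark suite.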

If you are looking to apply this year, please start scoping out these directions. The more work you do up front, the more likely we are to pick you!

I will restart office hours in a limited format (at least one day a week) by the start of next year, once I am fully back from paternity leave.

IMPORTANT AI Policy: We allow students to use AI for research and for initial prototyping, but we expect students to be fully responsible for having reviewed their code and tests. Any PRs with obvious signs of unedited AI usage (e.g., dead code, unused variables, nonsensical usages) will be rejected without review. All code must meet standard DeepChem review standards and testing policies. Please do not use ChatGPT to respond to questions from maintainers on chat channels like Discord; doing this is very disrespectful of the time of maintainers who are trying to give live feedback. Violating the AI policy will get a warning to start, but if repeated, we will no longer invest time in giving feedback or reviewing PRs.

This post is mirrored at https://github.com/deepchem/deepchem/discussions/4703. Please feel free to discuss at either location.


Been working with LLMs for quite a while. Excited to contribute!

Hi Bharath, thanks for sharing these directions — this is super exciting, and congratulations on the new addition to your family!
I’m very interested in applying for GSoC 2026 with DeepChem, and I’ve started scoping out how I could contribute meaningfully before the application phase.

Among the listed ideas, I’m particularly drawn to:

  • LLM support for 7B models in DeepChem (e.g. OLMo)

  • Symbolic / physics-inspired ML

As a starting step, I’m planning to:

  • Study DeepChem’s current model/training abstractions and identify what’s missing for large-scale transformer models

  • Prototype a minimal PyTorch-based integration path for loading + running inference on an existing 7B model (initially CPU / small-GPU focused)

  • Open an exploratory issue or PR once I have a clearer design proposal

I’m also going through the DeepChem codebase and relevant papers you linked (especially the symbolic ML work) to better understand where contributions would be most impactful.

I’d really appreciate any guidance on:

  • Which of these directions you feel is most valuable / feasible for a GSoC-scale project

  • Whether you’d prefer early design docs, benchmarks, or small PRs as a signal of readiness

Looking forward to the office hours when they resume, and I’ll keep sharing progress as I dig deeper. Thanks again for supporting newcomers!

Best,
Divakar Daya

Can you put the expected number of hours for each project?

Each project could support different hour counts; we have left them a little open-ended this year based on our experience last year. We recommend specifying your planned hours in your application.