DeepChem GSoC 2026 Potential Ideas

We have a lot of newcomers joining us here. Welcome to the community! I am scoping out potential projects for GSoC 2026 (remember, we have to apply to get in, so there is no guarantee DeepChem will be selected yet). Here are some tentative project directions (I will update this forum post as we get new ideas):

  • Symbolic machine learning (think https://arxiv.org/abs/2305.01582, but in Python)
  • MLIP support (like https://github.com/instadeepai/mlip, but we want to do this in PyTorch)
  • LLM support for 7B models in DeepChem (e.g., make an OLMo model in DeepChem https://huggingface.co/allenai/OLMo-7B). Should be able to train and run inference with these models
  • Implement RFDiffusion, RFDiffusion-2, or other protein design models in DeepChem
  • Improve DFT support in DeepChem: DeepChem has preliminary density functional theory support (https://arxiv.org/abs/2309.15985). Build on this! Can you solve new systems, make this scale better, or implement other exchange-correlation functionals?
  • Improve materials machine learning in DeepChem: DeepChem has simple crystal graph convolutions and lattice adsorption model support from a few years ago. Test these models on real systems and improve them. Possibly implement new papers from the last few years.

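To give a flavor of the symbolic ML direction, here is a toy sketch: a brute-force search over tiny expression trees that recovers y = x**2 + x from data. Everything here (the operator set, the tuple encoding, the search loop) is illustrative scaffolding I made up for this post, not DeepChem API or the approach from the linked paper.

```python
import itertools

# Toy symbolic regression: search small expression trees for one that
# fits data generated by y = x**2 + x. Expressions are nested tuples,
# e.g. ("add", "x", ("sq", "x")) means x + x**2.

UNARY = {"sq": lambda a: a * a}
BINARY = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(expr, x):
    """Evaluate a nested-tuple expression at a point x."""
    if expr == "x":
        return x
    if expr == "1":
        return 1.0
    op = expr[0]
    if op in UNARY:
        return UNARY[op](evaluate(expr[1], x))
    return BINARY[op](evaluate(expr[1], x), evaluate(expr[2], x))

def candidates():
    """Enumerate all shallow expressions over x, 1, sq, add, and mul."""
    leaves = ["x", "1", ("sq", "x")]
    for op in BINARY:
        for left, right in itertools.product(leaves, repeat=2):
            yield (op, left, right)

def fit(xs, ys):
    """Return the candidate with the smallest squared error on the data."""
    best, best_err = None, float("inf")
    for expr in candidates():
        err = sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys))
        if err < best_err:
            best, best_err = expr, err
    return best, best_err

xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x + x for x in xs]
expr, err = fit(xs, ys)
print(expr, err)  # recovers an expression equivalent to x**2 + x
```

A real project would replace the brute-force enumeration with a learned or genetic search and plug into DeepChem's dataset abstractions, but the core loop (propose expression, score against data, keep the best) stays the same.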
If you are looking to apply this year, please start scoping out these directions. The more work you do up front, the more likely we will pick you!

I will restart office hours in a limited format (at least one day a week) by the start of next year, once I am fully back from paternity leave.


Been working with LLMs for quite a while. Excited to contribute!

Hi Bharath, thanks for sharing these directions — this is super exciting, and congratulations on the new addition to your family!
I’m very interested in applying for GSoC 2026 with DeepChem, and I’ve started scoping out how I can contribute meaningfully before the application phase.

Among the listed ideas, I’m particularly drawn to:

  • LLM support for 7B models in DeepChem (e.g. OLMo)

  • Symbolic / physics-inspired ML

As a starting step, I’m planning to:

  • Study DeepChem’s current model/training abstractions and identify what’s missing for large-scale transformer models

  • Prototype a minimal PyTorch-based integration path for loading + running inference on an existing 7B model (initially CPU / small-GPU focused)

  • Open an exploratory issue or PR once I have a clearer design proposal
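For the inference prototype, I’m imagining something like the thin wrapper below. The class name and `predict()` interface are my own placeholders, not DeepChem or Hugging Face API; the duck-typed model/tokenizer arguments keep the sketch testable without downloading 7B weights.

```python
class HFCausalLMWrapper:
    """Placeholder sketch: wrap a Hugging Face-style causal LM and its
    tokenizer behind a single batch predict() call. Illustrative only."""

    def __init__(self, model, tokenizer):
        # model: any object with .generate(input_ids, max_new_tokens=...)
        # tokenizer: callable returning an object with .input_ids, plus a
        #            .decode(ids, skip_special_tokens=...) method
        self.model = model
        self.tokenizer = tokenizer

    def predict(self, prompts, max_new_tokens=20):
        """Generate a continuation for each prompt and decode it to text."""
        outputs = []
        for prompt in prompts:
            encoded = self.tokenizer(prompt, return_tensors="pt")
            generated = self.model.generate(
                encoded.input_ids, max_new_tokens=max_new_tokens
            )
            outputs.append(
                self.tokenizer.decode(generated[0], skip_special_tokens=True)
            )
        return outputs
```

In a real run, `model` and `tokenizer` would come from `transformers`' `AutoModelForCausalLM.from_pretrained(...)` and `AutoTokenizer.from_pretrained(...)` pointed at something like allenai/OLMo-7B; how best to fold this into DeepChem’s existing model abstractions is exactly what I want to scope out.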

I’m also going through the DeepChem codebase and relevant papers you linked (especially the symbolic ML work) to better understand where contributions would be most impactful.

I’d really appreciate any guidance on:

  • Which of these directions you feel is most valuable / feasible for a GSoC-scale project

  • Whether you’d prefer early design docs, benchmarks, or small PRs as a signal of readiness

Looking forward to the office hours when they resume; I’ll keep sharing progress as I dig deeper. Thanks again for supporting newcomers!

Best,
Divakar Daya