Towards a Deep Science Suite

I’ve been really happy to see the growth of DeepChem over the last year. The codebase is increasingly stable and supports an ever-growing set of features and capabilities. That said, DeepChem has a bit of a marketing problem today. I frequently have conversations that go something like:

  • Potential User: I’m working on X problem in a field outside of chemistry
  • Me: DeepChem can help! Check out deepchem.X.Y
  • Potential User: Oh I had no idea DeepChem wasn’t just for chemistry!

DeepChem has grown useful scientific capabilities in multiple fields outside chemistry. Notably, DeepChem has support and tutorials for:

  • structural biology (protein/ligand interaction datasets, fingerprints, and models)
  • materials science (datasets, featurizers, models)
  • bioinformatics (datasets, featurizers, and preliminary models)

I anticipate that over the coming year we will have growing support for:

  • PDE/ODE modeling
  • fluid dynamics

There are considerable advantages to developing DeepChem in a monolithic unified repo. We’ve invested a lot of time in CI/CD, build, and testing infrastructure so I don’t anticipate that we want to move any core development work outside the main deepchem repo. At the same time, I think it may be time for us to start some auxiliary repos. In particular, I think we should start the following repos:

  • deepmat: For deep learning driven materials science
  • deepbio: For deep learning driven bioinformatics

I anticipate that these will be virtual repos. That is, the “code” in these repos will consist only of import statements from deepchem. For example, deepmat/ would be something like:

import deepchem
from deepchem.models import MEGNet
from deepchem.feat import ...
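
One way to picture such a virtual package is a thin facade that re-exports existing objects under a new namespace. The sketch below is only an illustration under stated assumptions: `make_facade` and the `deepmat_demo` module name are hypothetical, and standard-library objects stand in for deepchem classes so that it runs without deepchem installed.

```python
import importlib
import sys
import types


def make_facade(name, exports):
    """Build a module whose attributes are re-exported from other modules.

    `exports` maps a public name to a (source module path, attribute name)
    pair. This helper is hypothetical and not part of deepchem.
    """
    facade = types.ModuleType(name)
    for public_name, (module_path, attr) in exports.items():
        source = importlib.import_module(module_path)
        # Re-export the same object; nothing is copied or reimplemented.
        setattr(facade, public_name, getattr(source, attr))
    facade.__all__ = sorted(exports)
    # Register the facade so that `import <name>` resolves to it.
    sys.modules[name] = facade
    return facade


# Assumption: standard-library objects are used as stand-ins here.
# A real deepmat would point these entries at deepchem modules instead.
demo = make_facade(
    "deepmat_demo",
    {
        "sqrt": ("math", "sqrt"),
        "OrderedDict": ("collections", "OrderedDict"),
    },
)
```

Because the facade holds references to the original objects, any fix or feature landed in the source package is visible through the facade with no extra work, which is the property that would make a virtual repo cheap to maintain.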

My hope is that these auxiliary repos will help lower barriers to other scientific communities and avoid the friction where scientists in other communities don’t realize that deepchem is actually relevant for their work. We may also choose to have focused publications about these new auxiliary repos to highlight their capabilities to the relevant scientific audience. Over time, we will likely want to start other auxiliary repos as our support for other scientific application areas grows.

I’ve discussed this idea with a few community members on the developer calls already, but I wanted to write it up on our forums so we could solicit broader feedback. Let me know your thoughts or comments below!


One thought might even be to create an umbrella project (“DeepScience”) and gradually promote that: existing users can continue on undisturbed, and new users can find a subproject more easily. I would imagine projects like OpenCV (which has now vastly outgrown computer vision) run into the same issues.

Another observation is that “MoleculeNet” has grown to have datasets for materials science, structural biology, and bioinformatics. It would be more apt to say we have a general “ScienceNet” of datasets.