Figuring out DeepChem's Scaling Story

bharath · July 30, 2021, 1:33am

A few months ago I wrote up a post with ideas for expanding DeepChem into a better framework for AI-driven science (Making DeepChem a Better Framework for AI-Driven Science). In the months since, we’ve made progress toward that goal: notably, our support for materials science and bioinformatics applications has continued to improve. However, we haven’t made much progress on building infrastructure to scale DeepChem. Here are a few types of infrastructure that I would like to see us build over the coming months:

Multi-GPU Training Support: DeepChem lacks tooling to scale its models across multiple GPUs. Tentatively, I think we could leverage either PyTorch Lightning or PyTorch distributed directly (https://github.com/deepchem/deepchem/issues/2594).
Distributed Featurization/Transformation: DeepChem currently lacks tools to easily run large-scale featurization/transformation jobs. For example, multiple folks have asked about featurizing 1 billion datapoints with DeepChem. Better integration with tools like Ray could make this easier.
Distributed Hyperparameter Search: If you have access to multiple nodes, it should be easy to run a large-scale hyperparameter search on DeepChem models across those nodes. In general, our hyperparameter tuning infrastructure could use more work. It’s good for simple applications but not yet robust or scalable enough for industrial use.
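
To make the featurization item a bit more concrete, here is a minimal sketch of the shard/map/gather pattern that such a job would probably follow. It is an illustration under assumptions, not a proposed DeepChem API: `toy_featurize`, `make_shards`, `featurize_shard`, and `featurize_distributed` are hypothetical names, the featurizer is a toy that just counts characters, and a standard-library thread pool stands in for a real cluster scheduler such as Ray.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def toy_featurize(smiles: str) -> List[int]:
    """Hypothetical stand-in for a real featurizer.

    Counts the characters C, N, and O in the string. This is a toy,
    not real chemistry (for example, "Cl" would count as a carbon).
    """
    return [smiles.count("C"), smiles.count("N"), smiles.count("O")]


def make_shards(items: Sequence[str], shard_size: int) -> List[Sequence[str]]:
    """Split the inputs into consecutive shards of at most shard_size items."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [items[i:i + shard_size] for i in range(0, len(items), shard_size)]


def featurize_shard(
    shard: Sequence[str], featurize: Callable[[str], List[int]]
) -> List[List[int]]:
    """Featurize one shard. Shards are independent, so each could run anywhere."""
    return [featurize(item) for item in shard]


def featurize_distributed(
    items: Sequence[str],
    featurize: Callable[[str], List[int]] = toy_featurize,
    shard_size: int = 2,
    max_workers: int = 4,
) -> List[List[int]]:
    """Map featurize_shard over all shards and gather rows in input order.

    Assumption: the thread pool is only a placeholder. A real job would
    submit each shard to a process pool or a cluster scheduler instead.
    """
    shards = make_shards(items, shard_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the shards were submitted.
        shard_results = pool.map(lambda s: featurize_shard(s, featurize), shards)
        return [row for rows in shard_results for row in rows]


if __name__ == "__main__":
    molecules = ["CCO", "CN", "O", "CC(=O)O", "NCCO"]
    print(featurize_distributed(molecules, shard_size=2))
```

The same pattern carries over to hyperparameter search: replace the shards with candidate hyperparameter settings and the featurizer with a train-and-evaluate function. The hard parts at scale (writing shards to disk rather than holding them in memory, retrying failed shards, scheduling across nodes) are exactly what this sketch leaves out.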

All of these are complex engineering challenges, so I’m laying them out as challenge problems rather than concrete plans. If you have ideas or thoughts on any of these scaling challenges, please chime in!