Making DeepChem/DeepScience a Better Tool For Researchers

bharath · June 22, 2022, 2:22am

DeepChem/DeepScience has many features and a lot of documentation that make it (I believe) a good tool for students. Our focus on production engineering has also increasingly made DeepChem suitable for production use. DeepChem is also useful for researchers doing applied machine learning, since it lets them apply and optimize a large battery of machine learning models on their datasets.

However, DeepChem/DeepScience has not yet succeeded in winning mindshare as a tool for ML researchers (beyond its use as a source of benchmarking baselines, where it has a healthy niche). I believe there are two main reasons for this:

DeepChem is not easily composable: DeepChem primitives can’t be chained together to create new algorithms (the way, say, PyTorch tensors chain together to create new architectures).
DeepChem’s featurizers and models are not interoperable: Most models have a uniquely associated featurizer. This makes it hard to mix and match different featurizations and architectures.
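
To make the second point concrete, here is a rough sketch of what interoperability could look like when featurizers and models agree only on a shared data class. Everything in it is a hypothetical stand-in written for illustration (the `Graph` class, both toy featurizers, both toy "models"); none of it is DeepChem code, and the inputs are plain strings treated as linear chains rather than real molecules.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass
class Graph:
    """Hypothetical shared data class (a stand-in, not DeepChem's GraphData)."""
    node_features: List[List[float]]
    edges: List[Tuple[int, int]]


def _chain_edges(n: int) -> List[Tuple[int, int]]:
    # Connect node i to node i + 1, treating the input as a linear chain.
    return [(i, i + 1) for i in range(n - 1)]


def ordinal_featurizer(text: str) -> Graph:
    # One feature per node: the character's code point.
    return Graph([[float(ord(c))] for c in text], _chain_edges(len(text)))


def onehot_featurizer(text: str) -> Graph:
    # One-hot features over the characters that occur in the input.
    vocab = sorted(set(text))
    feats = [[1.0 if c == v else 0.0 for v in vocab] for c in text]
    return Graph(feats, _chain_edges(len(text)))


def mean_feature_model(g: Graph) -> float:
    # Toy "model": the average of all node feature values.
    values = [x for row in g.node_features for x in row]
    return sum(values) / len(values)


def mean_degree_model(g: Graph) -> float:
    # Toy "model": the average node degree (each edge touches two nodes).
    return 2 * len(g.edges) / len(g.node_features)


FEATURIZERS = {"ordinal": ordinal_featurizer, "onehot": onehot_featurizer}
MODELS = {"mean_feature": mean_feature_model, "mean_degree": mean_degree_model}


def run_all(text: str) -> Dict[Tuple[str, str], float]:
    # Every featurizer pairs with every model because both sides
    # depend only on the shared Graph class.
    return {
        (f_name, m_name): model(featurize(text))
        for (f_name, featurize), (m_name, model)
        in product(FEATURIZERS.items(), MODELS.items())
    }


results = run_all("CCO")
for pair, value in results.items():
    print(pair, value)
```

Adding a third featurizer or model here gives you every new combination for free, which is the property we don't have today when each model is tied to its own featurizer.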

Here are two proposed changes that I believe will partially address these problems:

Port DeepChem Layers to PyTorch: Our layers are currently split between PyTorch, TensorFlow, and JAX. This fragmentation makes it hard to mix and match layers. Porting all layers to PyTorch means our layers will readily interoperate with one another. Improving layer documentation and tutorials will also help with interoperability.
Standardize featurizer data classes: We should have all future graph convolutional models use GraphData as their data class. Over time, we should deprecate ConvMol, WeaveMol, etc. and standardize on GraphData. This will enable graph featurizations to be used across all graph convolutional architectures. We should also remove or reduce custom transformation code (for example, the code in GraphConvModel that loads ConvMol into TensorFlow tensors). With GraphData, we can have one standard function that loads GraphData into Torch tensors.
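
As a sketch of what "one standard function" might look like, the snippet below batches several graphs into a single set of arrays by offsetting edge indices. It is only an illustration under stated assumptions: `SimpleGraph` is a hypothetical stand-in loosely modeled on GraphData's `node_features` and `edge_index` fields, `collate_graphs` is a made-up name, and the code uses NumPy so it runs without a deep learning framework. Each returned array could then be wrapped with `torch.from_numpy` to get Torch tensors.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class SimpleGraph:
    """Hypothetical stand-in loosely modeled on GraphData."""
    node_features: np.ndarray  # shape (num_nodes, num_node_features)
    edge_index: np.ndarray     # shape (2, num_edges), integer node indices


def collate_graphs(
    graphs: List[SimpleGraph],
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Merge graphs into one disconnected graph (hypothetical helper).

    Returns stacked node features, edge indices shifted so they point at
    the stacked rows, and a per-node index of the source graph.
    """
    xs, edges, membership = [], [], []
    offset = 0
    for i, g in enumerate(graphs):
        n = g.node_features.shape[0]
        xs.append(g.node_features)
        # Shift node ids by the number of nodes already placed in the batch.
        edges.append(g.edge_index + offset)
        membership.append(np.full(n, i, dtype=np.int64))
        offset += n
    return (
        np.concatenate(xs, axis=0),
        np.concatenate(edges, axis=1),
        np.concatenate(membership),
    )


g1 = SimpleGraph(np.ones((2, 3)), np.array([[0, 1], [1, 0]]))
g2 = SimpleGraph(np.zeros((3, 3)), np.array([[0, 1], [1, 2]]))
x, edge_index, graph_ids = collate_graphs([g1, g2])
print(x.shape)
print(edge_index)
print(graph_ids)
```

The point of the design is that this function knows nothing about any particular model, so every graph convolutional architecture could share it instead of carrying its own ConvMol-style loading code.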

Both of these efforts are already informally underway, but I wanted to document the general push so the community can suggest ideas to improve our efforts and make DeepChem a better tool for researchers.