Making DeepChem a Better Framework for AI-Driven Science

Over the last several years DeepChem has evolved from a very limited set of scripts (my first implementation could only train multitask networks on chemoinformatic data) into a sophisticated system for scientific machine learning. At the same time, DeepChem has a number of limitations that prevent it from addressing many important scientific problems. The core of DeepChem is rigid; it isn’t easy for users to compose DeepChem tools to make custom models and architectures without diving into the internals of DeepChem. DeepChem is also designed to run on single devices, so it’s hard to tackle large scale scientific problems. This note studies DeepChem’s current design and highlights awkward APIs, rigid constructions, and scaling issues. I then suggest improvements to DeepChem’s core to address these challenges.

I want DeepChem to evolve into a powerful framework for AI-driven science and engineering and enable users to solve hard problems in other fields of science (such as semiconductors, energy, climate and more) in addition to our core strengths in open source drug discovery. This transition will require us to generalize and expand DeepChem’s API for building sophisticated and scalable scientific programs.

The proposals here are first sketches which will require major work to turn into actual implementations. This write-up draws on a number of discussions that we’ve had at DeepChem developer calls on DeepChem flexibility, model pretraining, model hubs, and many more topics. The developer calls are a continuing source of inspiration and my sincere thanks to all developers and community members for bringing your time and energy!

DeepChem’s Current Design

The diagram below lays out the design of DeepChem’s API today. DeepChem programs flow from inputs to outputs and perform a sequence of transformations. There are many different choices to take at each step of this flowchart (for example, there are dozens of different models, many different featurizers, many different transformers and so on), so the flowchart can be expanded into thousands of different systems. The different arrows show different paths that users can take; for example a Dataset can be fed directly into a model or routed through a Splitter and Transformer and then fed into a model. A program corresponds to a path from an input to an output following the arrows.

In effect, DeepChem provides a domain-specific-language (DSL) for scientific machine learning. A sample DeepChem program can be represented in condensed DSL pseudocode as follows (the code below is a conceptual simplification, not actual DeepChem code).

# Conceptual DSL code; Not actual DeepChem code!
input = "filename.csv"
featurizer = Featurizer()
dataset = Loader(input, featurizer)
splitter = Splitter()
train, valid, test = splitter(dataset)
transformer = Transformer()
train, valid, test = transformer(train, valid, test)
model = Model()
metric = Metric()
hyper = HyperparameterTuner()
model = hyper(model, valid, metric)
output = model(test)

Depending on the specific choice of Featurizer/Loader/Splitter/Transformer/Model/Metric/HyperparameterTuner we end up with a different program. This basic architecture is very broad and has allowed DeepChem to be applied to a very wide range of different applications (see DeepChem Papers and Discoveries List). At the same time though, there are obvious limitations to this basic DSL structure. To start, DeepChem’s DSL flow is very linear; inputs flow through a series of transformations to produce outputs. We don’t have good support for more complex iterative flows such as pretraining, active learning, or generative models in which the line between model and data is blurred. There are a number of other major limitations with DeepChem’s current design which we outline below.
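To make the linear flow concrete, here is a minimal runnable sketch in plain Python. None of these names are DeepChem APIs: `featurize`, `load`, `split`, `make_transformer`, `MeanModel` and `mae` are hypothetical toy stand-ins for the Featurizer/Loader/Splitter/Transformer/Model/Metric stages, and the hyperparameter tuning step is left out for brevity.

```python
from statistics import mean

# Toy stand-ins for each pipeline stage; none of these are DeepChem APIs.


def featurize(raw):
    """Featurizer: map a raw string to a single numeric feature (its length)."""
    return float(len(raw))


def load(rows, featurizer):
    """Loader: turn (raw_input, label) rows into (feature, label) pairs."""
    return [(featurizer(raw), label) for raw, label in rows]


def split(dataset, n_train, n_valid):
    """Splitter: deterministic, order-based train/valid/test split."""
    return (dataset[:n_train],
            dataset[n_train:n_train + n_valid],
            dataset[n_train + n_valid:])


def make_transformer(train):
    """Transformer: center labels using statistics of the training set only."""
    mu = mean(y for _, y in train)
    return lambda ds: [(x, y - mu) for x, y in ds]


class MeanModel:
    """Model: always predicts the mean training label."""

    def fit(self, train):
        self.value = mean(y for _, y in train)
        return self

    def predict(self, ds):
        return [self.value for _ in ds]


def mae(model, ds):
    """Metric: mean absolute error of the model on a dataset."""
    preds = model.predict(ds)
    return mean(abs(p - y) for p, (_, y) in zip(preds, ds))


# The program is one straight path: input -> dataset -> splits -> model -> output.
rows = [("C" * i, float(i)) for i in range(1, 11)]
dataset = load(rows, featurize)
train, valid, test = split(dataset, n_train=6, n_valid=2)
transform = make_transformer(train)
train, valid, test = transform(train), transform(valid), transform(test)
model = MeanModel().fit(train)
valid_error = mae(model, valid)
output = model.predict(test)
```

Every stage consumes the output of the previous one and nothing feeds back upstream, which is exactly the linearity discussed above.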

Single Machine Design
DeepChem at present is designed to run on a single machine. We have some limited support for multiprocessing, but no support for working with multiple GPUs or running larger jobs across clusters. The python ecosystem for distributed workloads has improved tremendously with libraries like dask and ray, but DeepChem doesn’t yet leverage these tools to enable large scale featurization, transformation, splitting, hyperparameter optimization and model training.

API Misfits
Referring back to the diagram above, we also see that the dc.dock, dc.metalearning, and dc.rl modules don’t cleanly fit into the existing DeepChem flow and fit awkwardly with the other modules. dc.dock is very special purpose for drug-discovery applications, while dc.metalearning and dc.rl are split into their own submodules, but conceptually are just types of models. Having these dangling API parts doesn’t hurt DeepChem but does make the entire system less conceptually unified.

Lack of Model Composability
DeepChem models are black boxes. There isn’t a convenient method for users to mix and match primitives to construct new models without diving into the internals of model implementation. The situation is worsened by the fact that DeepChem uses a variety of different backend implementations for its models; we currently have models implemented using scikit-learn, xgboost, lightgbm, TensorFlow, PyTorch, and DGL. We will likely add HuggingFace and Jax models as well. This list of backend technologies is unlikely to shrink anytime soon given the sustained investment into new machine learning tools by the broader community. But due to the multiple backends, we have siloed model code that can only be used for one backend, worsening our issues with composability. For example, we have a large collection of Keras model layers which can’t be used with PyTorch/DGL models.

Making DeepChem More Flexible and Scalable

As DeepChem developers, our goal is to broaden the range of useful programs that our users can construct. With each new release, we expand the flexibility and power of the underlying DSL, enabling our users to construct richer scientific programs. Over the last year, we’ve worked to expand DeepChem past its roots as a chemoinformatic library by making splitters, transformers, models and metrics handle non-chemical datatypes. This work is still in progress, but DeepChem 2.5.0 is much better suited to writing general purpose scientific machine learning programs than past releases.

The diagram below lays out a potentially expanded design for DeepChem that provides a more powerful framework for scientific programs. This is a preliminary architecture I’m posting to invite review and discussion. I suggest a few major changes to DeepChem’s architecture. First, API misfits like dock/reinforcement learning/metalearning should be integrated more tightly into DeepChem. Docking is the first example of an eventual DeepChem API for simulation. Free energy perturbation could be the next simulation type we support. Constructing a more general framework for running simulations will enable DeepChem programs to implement and use simulations to generate training data. I also propose the addition of a model hub, and subsuming reinforcement learning and metalearning into DeepChem’s models (conceptually, reinforcement learning and metalearning are already models so this unification would fit naturally). Finally, following the lead of popular projects like DGL, I propose creating a DeepChem tensor API with different backends to unify our deep model implementations.

As before, a DeepChem program is a path from input to output following the arrows, but unlike the previous diagram, we’ve added some additional backwards connections. Models can generate datasets (using generative models) and models can power simulations (for example, neural force fields for molecular dynamics).

Here is a sample DSL for the expanded vision of DeepChem that runs a simulation to generate a dataset and uses a pretrained model from the ModelHub for training.

# Conceptual DSL code; Not actual DeepChem code!
input = "filename.csv"
# Run simulation to generate data starting from input
sim_results = Simulation(input)
featurizer = Featurizer()
dataset = Loader(sim_results, featurizer)
splitter = Splitter()
train, valid, test = splitter(dataset)
transformer = Transformer()
train, valid, test = transformer(train, valid, test)
# Load pretrained weights from ModelHub
model = Model(ModelHub())
metric = Metric()
hyper = HyperparameterTuner()
model = hyper(model, valid, metric)
output = model(test)

You can imagine making more sophisticated DSL examples that leverage generative models, or use training to improve simulations to create sophisticated feedback loops. We now explain proposed API changes in more depth.

Adding a Simulations Module
Whole ranges of scientific applications depend on simulation. Docking, free energy perturbation, molecular dynamics, fluid simulations, systems biology and many other simulation tools could all be useful to DeepChem users trying to build scientific AI applications. For us to enable these broader use cases, we need a broader framework for using simulation tools with DeepChem. I propose that we add a new dc.sim module that provides tooling for simulations and move dc.dock to dc.sim.dock to become DeepChem’s first simulation API. Over time, DeepChem should grow to support other classes of simulations, starting with free energy perturbation support. At present, we shouldn’t try to mandate a common API for DeepChem simulations, but as our support for new types of simulations grows, we should try to build a sensible common API.

Distributed Computation Support for Featurization/Transformation
DeepChem should more tightly integrate with tools like ray/dask to enable larger jobs. It should become straightforward for a user to kick off a large-scale featurization job (for example) using DeepChem. Underneath the hood, DeepChem should leverage ray/dask or other tools to perform the work necessary for these computations. This functionality will be critical if we want to enable AlphaFold-esque applications in which running featurizations (for protein structure prediction tasks, that means multiple sequence alignment) requires major computational effort.
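As a rough illustration of the shape such a featurization job could take, the sketch below shards the inputs and featurizes the shards concurrently. It deliberately uses the standard library's `concurrent.futures` thread pool as a stand-in so that it runs anywhere; it is not ray or dask code, and threads will not speed up CPU-bound work in practice. The names `featurize_one`, `featurize_shard`, `make_shards` and `parallel_featurize` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def featurize_one(sequence):
    """Toy featurizer: count occurrences of each letter of a fixed alphabet."""
    alphabet = "ACGT"
    return [sequence.count(ch) for ch in alphabet]


def featurize_shard(shard):
    """Featurize one shard (a list of raw inputs) and return its features."""
    return [featurize_one(seq) for seq in shard]


def make_shards(items, shard_size):
    """Cut the inputs into consecutive shards of at most shard_size items."""
    return [items[i:i + shard_size] for i in range(0, len(items), shard_size)]


def parallel_featurize(items, shard_size=2, nworkers=4):
    """Featurize shards concurrently and concatenate results in input order."""
    shards = make_shards(items, shard_size)
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        # pool.map yields results in the same order as the submitted shards.
        results = list(pool.map(featurize_shard, shards))
    return [feat for shard_feats in results for feat in shard_feats]


sequences = ["ACGT", "AAAA", "GGCC", "TTAG", "CATG"]
features = parallel_featurize(sequences)
```

The important design point is that the unit of work is a shard rather than a single datapoint, so the same structure could be handed to a process pool or a cluster scheduler without changing the featurizer itself.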

Multi-GPU and Distributed Training Support
DeepChem models should be trainable with multiple GPUs. New packages and tools like PyTorch Lightning and Keras have made it easy to train models with multiple GPUs. We should aim to have all DeepChem models be able to leverage multi-GPU training. More ambitious would be to support distributed training for very large models. This may require tighter integration with tools like Kubernetes or RaySGD in the longer run.

Standardizing Model APIs
At present, DeepChem’s model infrastructure is mostly geared towards classifiers and regressors. Our core abstract methods in Model are designed for classifiers/regressors, and we don’t have a unified API for other types of models. I propose that we introduce the following abstract subclasses of Model to standardize APIs for other important classes of models:

  • Estimator: The base class for classifiers/regressors
  • Generator: The base class for generative models
  • Metalearner: The base class for metalearners (moved from dc.metalearning)
  • StructuredOutputLearner: The base class for models with structured output (like Seq2Seq).
  • Policy: The base class for reinforcement learning (moved from dc.rl).
  • UnsupervisedLearner: The base class for unsupervised learning methods (such as ChemBERTa pretraining or other language modeling methods)
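A minimal sketch of how a few of these abstract subclasses could be expressed with Python's `abc` module follows. The method names (`fit`, `predict`, `sample`, `select_action`) and the `hyperparameters` attribute are assumptions made for illustration; the proposal above does not fix any of them, and `ConstantRegressor` is a hypothetical toy subclass.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical common base holding only what every model type shares."""

    def __init__(self, **hyperparameters):
        self.hyperparameters = dict(hyperparameters)


class Estimator(Model):
    """Base class for classifiers/regressors."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


class Generator(Model):
    """Base class for generative models."""

    @abstractmethod
    def sample(self, n):
        ...


class Policy(Model):
    """Base class for reinforcement learning policies."""

    @abstractmethod
    def select_action(self, state):
        ...


class ConstantRegressor(Estimator):
    """Toy concrete Estimator that predicts the mean training label."""

    def fit(self, X, y):
        self.value = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.value for _ in X]


regressor = ConstantRegressor(learning_rate=0.1).fit([1, 2, 3], [2.0, 4.0, 6.0])
predictions = regressor.predict([10, 20])
```

Abstract classes cannot be instantiated directly, so a subclass that forgets to implement a required method fails at construction time rather than deep inside a training run.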

Standardizing Model Pretraining and Building a Model Hub
Pretrained models are increasingly important for modern machine learning applications. DeepChem should build model hub infrastructure with standard APIs for model pretraining/loading that enables users to easily leverage pretrained models. To start with, I suggest we follow the MoleculeNet model of limiting uploads of pretrained models to DeepChem developers, but it may be possible to open up model uploading to all users in time by partnering with Deep Forest Sciences or other companies.

Unifying our Model Infrastructure
At present, we share no code between our models for different backends. We have large amounts of code for Keras that isn’t of any use for constructing PyTorch models. We also lack a clear API for users to compose models/layers. In earlier releases of DeepChem, we supported a custom model building framework called TensorGraph. This framework offered capabilities similar to Keras, but with some additional flexibility. We eventually decided to remove this framework and just support Keras directly since the maintenance overhead was too high. At the time, TensorFlow was our only deep learning framework and Keras had been chosen as the standard API for TensorFlow models, so it seemed like supporting Keras would be good enough for our applications.

In the last couple of years though, the deep learning library situation has continued to evolve. We now support a number of PyTorch models and will likely continue to support both TensorFlow and PyTorch models for several major releases to come. It is also likely that we will support Jax models within the next few releases, and entirely possible that we will support other frameworks as well. DeepChem’s deep learning infrastructure appears set to remain multi-framework for the foreseeable future.

These changes in the broader ecosystem suggest that it might be useful to revisit our earlier goals with TensorGraph. DeepChem now has a larger community base of developers more able to support custom infrastructure. We also suffer from a fragmentation problem, with an inability to share infrastructure between TensorFlow/PyTorch/Numpy/Jax. Other libraries such as DGL have addressed similar issues by constructing a common tensor API along with different backend implementations for tensor operations (see https://github.com/dmlc/dgl/tree/master/python/dgl/backend). I propose that we consider establishing a common DeepChem tensor API along the lines of DGL’s.

A new DeepChem tensor API wouldn’t change user-facing APIs for DeepChem, but would require us to migrate models to use a common DeepChem tensor standard behind the scenes. This migration can be done gradually over several releases. Once completed, users will be able to use the DeepChem tensor API for their own models and build custom DeepChem architectures using the same tools as developers. (Users can of course continue to use KerasModel and TorchModel if they prefer.) As developers, we would have to maintain multiple backends, but would gain the advantage of a unified model codebase without backend fragmentation. A unified representation may also make backend upgrades easier; rather than having to migrate 30 different models to new versions of TensorFlow/PyTorch, we would need only migrate the backend tensor implementation. All that said, it’s worth noting up front that some DeepChem models (like DGL wrapped models or scikit-learn/lightgbm models) will likely not be possible to migrate to a shared DeepChem tensor implementation. But enough models should be migratable that we could potentially create a large repository of reusable model infrastructure for our users.

One intriguing possibility is that we can use a common DeepChem tensor API to also power simulations. Jax-md demonstrates how differentiable simulators can be built with Jax. We could potentially support differentiable DeepChem simulators in the future. At present, Jax-md is still quite a bit slower than traditional molecular dynamics engines, but as Jax and the underlying XLA compiler speed up, this may start to change over the coming years. Combined with improved distributed computing support through Ray/Dask, we could envision using DeepChem for large-scale differentiable scientific applications.

A DeepChem Maintenance Policy

As DeepChem continues to grow as a library, we want to maintain the fine balance between being a stable platform for production releases and continuing to evolve and grow to address new scientific application areas. I propose that we take the following steps to achieve this goal:

  • Breaking Changes Policy: Any API changes in the public-facing DeepChem from 2.5.0 onwards must maintain backwards API compatibility with deprecation warnings for any API changes. A deprecated API must be kept in place for at least a year and can only be removed as part of a DeepChem major version release.
  • Long Term Support: As DeepChem continues evolving, we may find large companies that have to use older versions of DeepChem. I propose that the last non-major release before a major version release becomes a long term release. That is, if 2.X is the last version 2 release, 2.X becomes a long term release that will be supported by maintainers for 2 years. This will require us to maintain the 2.X branch on the main repo and perhaps make minor bugfixes if required.
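One common way to implement the deprecation half of such a policy in Python is a decorator that emits a `DeprecationWarning` while forwarding to the old behavior. The sketch below uses only the standard library; the decorator name `deprecated`, the functions `load_data` and `load_dataset`, and the removal version "3.0.0" are hypothetical examples, not existing DeepChem names or a committed release plan.

```python
import functools
import warnings


def deprecated(since, removal, alternative):
    """Mark a function as deprecated while keeping it fully functional."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated since {since} and will be "
                f"removed in {removal}; use {alternative} instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def load_dataset(path):
    """The new, supported entry point (toy implementation)."""
    return f"dataset from {path}"


@deprecated(since="2.5.0", removal="3.0.0", alternative="load_dataset")
def load_data(path):
    """The old entry point, kept alive for backwards compatibility."""
    return load_dataset(path)
```

Existing user code that calls `load_data` keeps working and sees a warning that names the replacement, which gives users the adaptation window described in the policy.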

The changes outlined in this document would require major work, likely spanning multiple DeepChem minor and major releases.

Discussion

The ecosystem of scientific machine learning is exploding with hosts of new models and applications. I’d like DeepChem to grow and evolve with its community to support rich ranges of new models and applications. These changes will require DeepChem to shed some of its current rigidity and move towards a more flexible future. In this note, I’ve suggested a number of possible design changes for making DeepChem more flexible.

Part of my inspiration has come from interactions with the Julia community, which has built powerful tools for scientific machine learning (see https://diffeq.sciml.ai/v2.0/, https://github.com/SciML/ModelingToolkit.jl, https://turing.ml/stable/, https://github.com/SciML/DiffEqFlux.jl, https://fluxml.ai/Flux.jl/stable/, https://yaoquantum.org/, https://github.com/NREL-SIIP/PowerSimulations.jl, https://discourse.julialang.org/t/large-scale-hpc-project-on-probabilistic-programming-at-scale-in-conjunction-with-scientific-simulators/39416 and many other efforts). Julia’s scientific developers are often also Julia language maintainers which gives Julia considerable flexibility to evolve its language to better support important scientific applications. In contrast, DeepChem is based on Python and will be for the foreseeable future since that allows us to leverage the vast Python machine learning ecosystem. But, DeepChem can continue to evolve the flexibility and design of its underlying scientific DSL to enable richer classes of DeepChem programs for new applications.

I’d like for DeepChem, working with the broader Python ML community, to enable cutting edge scientific AI research. One day, I’d like for DeepChem to help solve problems in semiconductors, materials, battery design, fluid modeling, systems biology and many other application areas in addition to our core strengths in chemoinformatics and drug discovery. I personally have found DeepChem a powerful tool for my work and I hope we can make it into an even better tool for scientific discovery by making it more scalable, flexible and user-friendly.

Thanks for reading along! Feel free to respond below or on new threads. I anticipate we will need a lot of discussion before we start executing on some of these ideas.


I have mixed views on these changes.

I totally agree about supporting distributed and multi-GPU training. Those are important features we should add.

Regarding unifying the model classes, I’m not sure exactly what you’re suggesting. Currently the Model class is specific to classifiers and regressors. So presumably most of its features would move into the Estimator class. What would that leave in Model? A classifier and a RL policy don’t have much in common. What do we gain by forcing them into a common inheritance hierarchy? And what would be in the other abstract classes? For example, what do all generative models have in common? Or all unsupervised learning methods?

(Also note that Model currently inherits from sklearn’s BaseEstimator. I’m not sure there’s really a good reason for that anymore, though.)

Similar for the suggested simulation module. I’m just not sure what you’re suggesting. The term “simulation” is incredibly generic and can include all sorts of things. What do all of them have in common that can be supported with common code? (On the other hand, docking is not a kind of simulation in my opinion. It’s mostly just a scoring function that takes an input and produces an output, much like any other model.)

I really don’t think we want to create our own tensor API. It’s a huge maintenance burden that just creates barriers for other people to get involved. That’s why we got rid of TensorGraph. We had our own modelling API that no one else in the world used. On the other hand, we could consider making use of some other existing tensor abstraction. (Maybe the one in DGL? I don’t know anything about it.) We would need to make sure it was something widely used and that we could rely on it not to be abandoned in the future.

For the maintenance policy, I don’t think we have the resources to make that kind of commitment. A company with reliable revenue can hire someone just to keep fixing bugs in two year old versions of the code. We don’t have money to do that, and volunteer contributors can’t be expected to take on a commitment like that. We can say that we’ll do our best not to break public APIs, and that we’ll try to keep them as deprecated for a reasonable time to give people a chance to adapt. Any firm commitment beyond that would be very hard to make.


Awesome! This will take some more research to figure out sensible APIs for us to follow. I’m planning to learn more about Ray/Dask/distributed Keras/PyTorch Lightning and I can take a cut at suggesting some potential APIs for distributed workflows once I’ve processed what I learn. If you have thoughts on APIs for these, please feel free to suggest them as well!

I think the core for all Model classes is a collection of hyperparameters alongside a collection of weights and some method for accessing/updating weights. I’d also add a standard method for instantiating from a pre-trained model. For example, here’s a sketch of an abstract Model superclass

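from typing import Any, Dict


class Model:

  def __init__(self, *args, **kwargs):
    # The same as the __init__ method we have now
    ...

  def get_weights(self) -> Dict[str, Any]:
    # Returns a dictionary mapping weight names to weights
    raise NotImplementedError

  def set_weight(self, name: str, value: Any) -> None:
    # Set specified weight to specified value
    raise NotImplementedError

  def get_parameters(self) -> Dict[str, Any]:
    # Returns a dict of hyperparameters
    raise NotImplementedError

  @staticmethod
  def from_pretrained(parameters: Dict[str, Any], weights: Dict[str, Any]) -> "Model":
    # A constructor method that creates a Model from specified pretrained weights
    raise NotImplementedError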

This should generalize across the different types of models including RL/metalearning. The advantage here is we gain a standard API for dealing with pretrained models that generalizes across all model types. I have a few more thoughts here on how to support pretraining and work towards a model hub that I’ll post once my thoughts are a little more ordered :)
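To check that this interface is enough to round-trip a model, here is a toy concrete class that follows the sketch above. `LinearModel` and its `predict` method are hypothetical and exist only for illustration. One deliberate deviation: `from_pretrained` is written as a `classmethod` rather than a `staticmethod`, so that subclasses get back an instance of their own type.

```python
from typing import Any, Dict


class LinearModel:
    """Toy model y = w * x + b implementing the sketched interface."""

    def __init__(self, learning_rate: float = 0.1):
        self._parameters = {"learning_rate": learning_rate}
        self._weights = {"w": 0.0, "b": 0.0}

    def get_weights(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate internal state by accident.
        return dict(self._weights)

    def set_weight(self, name: str, value: Any) -> None:
        if name not in self._weights:
            raise KeyError(f"unknown weight: {name}")
        self._weights[name] = value

    def get_parameters(self) -> Dict[str, Any]:
        return dict(self._parameters)

    @classmethod
    def from_pretrained(cls, parameters: Dict[str, Any],
                        weights: Dict[str, Any]) -> "LinearModel":
        # Rebuild the model from hyperparameters, then restore each weight.
        model = cls(**parameters)
        for name, value in weights.items():
            model.set_weight(name, value)
        return model

    def predict(self, x: float) -> float:
        return self._weights["w"] * x + self._weights["b"]


original = LinearModel(learning_rate=0.01)
original.set_weight("w", 2.0)
original.set_weight("b", 1.0)
clone = LinearModel.from_pretrained(original.get_parameters(),
                                    original.get_weights())
```

The pair (`get_parameters()`, `get_weights()`) is all that has to be stored to recreate the model, which is the property a model hub would rely on.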

You make a good point that we don’t really know what a standard API for generative models or unsupervised models looks like, so this idea may be better dropped. We already have abstract classes for GANs and NFs and can add more base classes as needed.

Organizationally, I think that dc.metalearning and dc.rl shouldn’t really be their own top level modules (GANs and Seq2Seq models aren’t classifiers/regressors and comfortably live in dc.models, so it’s awkward that metalearning and rl live outside when they’re conceptually also learning algorithms).

Fair point that a simulation is a very generic concept that can mean many things. (On Docking, I was thinking of pose generation as a sort of molecular simulation, but agree that may not be the best description!)

When I think of simulations, I’m viewing them as computational methods for data generation that draw on existing knowledge about systems. Docking generates synthetic binding poses and can be used to create new datasets for training/prediction by interaction fingerprint models or atomic convs. Similarly, perhaps we run free energy perturbation methods to gain estimates of binding free energies (relative or absolute) which we use as training data for a downstream model. As DeepChem extends into new scientific domains, I anticipate we’ll likely want to leverage similar sources of synthetic data, for example DFT calculations for materials models, or maybe whole cell models for cell biology applications.

Organizationally, I think dc.dock shouldn’t be a top level module in DeepChem; it’s too specific an application. I think it’s an instance of a more general pattern of synthetic data generation that we’ll see more of though, which may merit a top level DeepChem module (dc.sim). This could very reasonably be named something else (dc.synthetic for synthetic data? dc.app for applications?) For now, I think there isn’t much call for common code, but that could change with time as we start to see common patterns emerge. For example, perhaps a standard API for synthetic data generation?

This was the most speculative part of the proposals above. After writing the post above, I learned about the growing Python array standard https://data-apis.org/blog/array_api_standard_release/. This is trying to unify arrays in different python libraries (but isn’t a codebase in its own right). One possibility for an external library is https://github.com/tensorly/tensorly which allows for constructing tensors with different backends (Jax/TF/etc). I think DGL’s backend isn’t part of its public API so we probably can’t use it directly either. It might be worth an experiment to see if we can make tensorly work for us.

I’m certainly not deeply tied to this particular proposal. It was more of a thought experiment trying to solve the more general issue that our models are pretty black box right now; it isn’t easy to build variants of DeepChem models without diving into backend internals. Are there other ways we can make it easy for users to do things like build their own weave variants or atomic conv variants?

I’ve been using DeepChem a good bit for Deep Forest Sciences work so I may eventually be able to provide some funds for longer term maintenance, but agree that a soft commitment may well be all we can offer for now!


I got a pointer to look at https://github.com/jonasrauber/eagerpy. This library provides a way to write tensor functions that work for numpy/jax/pytorch/tensorflow tensors. The API looks really slick and usable.


We had a good discussion about the ideas presented here at the last DeepChem developer call on Friday. @peastman raised the question, “What is DeepChem?” This has been a challenging question for us historically since the library’s scope has spread and changed considerably over the years.

My answer to the question (cleaned up for the record) was that “DeepChem is a scientific DSL embedded in Python that empowers its users to do AI-driven science.”

@peastman suggested that if we were thinking of DeepChem as a DSL, it would be useful to put together some sample programs to guide the design of the system. I’ve put together a number of different examples below of different types of programs that we could build with DeepChem. Some of these can almost be implemented in DeepChem 2.5.0, while others are more exploratory.

Large scale featurization for multiple sequence alignment

This program spawns a featurization workload for multiple sequence alignment on a large protein sequence dataset.

## Specify DeepChem cluster. Syntax TBD
feat = MultipleSequenceAlignmentFeaturizer()
# Spawn a workload across 256 cores
dataset = FASTALoader(files, feat, ncores=256)

We’ve done various experiments with extending DeepChem to protein-folding applications and a continual roadblock is the difficulty of running these featurization jobs. With a ray backend, and access to cloud, this featurization should be as simple as running the code above.

Neural Force Field Testing

This example creates a sample neural force field. This forcefield is used to run sample molecular dynamics calculations which are then used to update the force field parameters.

## Specify DeepChem GPU cluster
# some neural force-field formula represented on tensors
formula = (xi-xj)^2+.... 
forcefield = ForceField(formula)
# Simulate on 20 GPUs for 1 millisecond sampling
trajectories = forcefield.simulate("protein.pdb", ngpu=20, time="1 millisecond")
# Use simulation trajectories to refine force field parameters
forcefield.refine(trajectories)

OpenMM and other packages like Jax-md already offer support for neural force fields. DeepChem could potentially help grow this ecosystem by enabling experimentation with different functional forms for force fields.

Pretrain ChemBERTa

This example constructs a ChemBERTa model and trains it on multiple GPUs on a single machine.

feat = SmilesTokenizer()
dataset = molnet.load_zinc15(feat, ncores=96)
model = ChemBERTaLanguageModel()
model.fit(dataset, ngpus=8)

With this hypothetical syntax, a ChemBERTaLanguageModel could be pretrained on multiple GPUs with a short script.

Use pretrained ChemBERTa from ModelHub to Fit Model

At present, we use HuggingFace’s hub to store ChemBERTa pretrained weights. This example shows how we could do this with a future DeepChem ModelHub.

# Run the following shell command to set model hub location
# export DEEPCHEM_MODEL_HUB=...
model = ChemBERTaModel.from_pretrained("zinc15")
# Downstream assay
dataset = load_assay()
model.fit(dataset)
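As a sketch of how the `DEEPCHEM_MODEL_HUB` environment variable from the example could be resolved, the code below treats the hub as a plain local directory holding one JSON file per model. This layout, the file name `model.json`, and the helpers `hub_root`, `save_pretrained` and `load_pretrained` are all assumptions made for illustration; no ModelHub storage format has been decided.

```python
import json
import os
import tempfile
from pathlib import Path

HUB_ENV_VAR = "DEEPCHEM_MODEL_HUB"


def hub_root() -> Path:
    """Resolve the hub location from the environment variable."""
    root = os.environ.get(HUB_ENV_VAR)
    if not root:
        raise RuntimeError(f"Set {HUB_ENV_VAR} to the model hub location")
    return Path(root)


def save_pretrained(name, parameters, weights):
    """Store hyperparameters and weights under <hub>/<name>/model.json."""
    model_dir = hub_root() / name
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / "model.json", "w") as f:
        json.dump({"parameters": parameters, "weights": weights}, f)


def load_pretrained(name):
    """Load (parameters, weights) for a named pretrained model."""
    with open(hub_root() / name / "model.json") as f:
        record = json.load(f)
    return record["parameters"], record["weights"]


# Demo against a throwaway directory standing in for a real hub.
with tempfile.TemporaryDirectory() as tmp:
    os.environ[HUB_ENV_VAR] = tmp
    save_pretrained("zinc15", {"n_layers": 2}, {"w": [0.5, -1.0]})
    parameters, weights = load_pretrained("zinc15")
os.environ.pop(HUB_ENV_VAR, None)
```

A `from_pretrained("zinc15")` constructor could then be a thin wrapper that calls `load_pretrained` and rebuilds the model from the returned hyperparameters and weights.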

Pretrain Protein BERT model

Pretrain a protein BERT model on multiple GPUs.

## Specify DeepChem cluster
# A featurizer to one-hot encode codons
feat = CodonFeaturizer()
dataset = molnet.load_uniprot(feat, ncores=192)
model = ProteinBERT()
model.fit(dataset, ngpus=16)

This example is similar to the earlier ChemBERTa one, but we combine distributed featurization and multi-GPU training.

Differentiable Cell simulation

A differentiable cell model would enable us to study the behavior of cells using simulations and machine learning. This example specifies a differentiable cell model and simulates a cell life cycle.

# Building a differentiable cell model would be a large undertaking. Assume this has been done in a separate class
simulator = DifferentiableCellModel()
trajectory = simulator.simulate(time="8 hours", ncores=24)
simulator.refine(trajectory)

This example is similar to those we’ve seen already, but with a more complex simulation.

Differentiable lattice quantum chromodynamics

Lattice quantum chromodynamics is a powerful tool to investigate quantum field theory numerically. This example builds a sampler for field states based on normalizing flows (see https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.125.121601).

# This is implemented in a separate class
sampler = LatticeChromodynamicsNormalizingFlow()
# Sample a lattice state
lattice = sampler.sample()

PINN model for fluid dynamics

This example constructs a physics informed neural network for fluid flow (see https://arxiv.org/abs/2002.10558).

simulator = PINNFluidSimulator()
simulator.simulate(time="1 minute", ngpus=16)

Fluid simulations are outside our current scope, but they are very important for a whole range of applications (in climate and elsewhere) that I hope to see us support.

Large Scale Docking

This example docks 100 million compounds against 1000 sampled protein conformations and picks the highest scoring compounds from the lot.

dataset = load_zinc15(...)
protein = PDBLoader("protein.pdb")
# A high temperature protein conformation sampler
conformations = HighTemperatureSampler(protein).sample(nposes=1000)
# Distribute the large scale docking job across 256 cores
docker = Docker()
docker.dock(conformations, dataset, ncores=256)
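The pseudocode above stops at the docking call; the step of picking the highest scoring compounds could look roughly like the sketch below. Everything here is a toy: `toy_score` is a hypothetical stand-in for a real docking score, conformations are represented by a single integer "pocket size", and compounds are plain strings.

```python
import heapq
from itertools import product


def toy_score(conformation, compound):
    """Hypothetical stand-in for a docking score (higher is better)."""
    return -abs(len(compound) - conformation)


def best_scores(conformations, compounds):
    """For each compound, keep its best score over all conformations."""
    best = {}
    for conformation, compound in product(conformations, compounds):
        score = toy_score(conformation, compound)
        if compound not in best or score > best[compound]:
            best[compound] = score
    return best


def top_k(best, k):
    """Return the k (compound, score) pairs with the highest scores."""
    # heapq.nlargest avoids fully sorting a very large score table.
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


conformations = [2, 5]
compounds = ["C", "CC", "CCO", "CCCCC", "CCCCCCCC"]
hits = top_k(best_scores(conformations, compounds), k=2)
```

At the scale described above, the per-compound maximum would be computed independently on each worker, with only the small top-k lists merged at the end.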

Microscopy

Load a pretrained ImageNet model and fine-tune it on a microscopy dataset.

dataset = load_microscopy()
model = ResNet.from_pretrained("imagenet")
model.fit(dataset, ngpus=4)

Retrosynthesis Calculations

Load a pretrained retrosynthesis model and use it to find a synthetic pathway from the starting points to a desired target.

solver = RetrosynthesisPlanner.from_pretrained("uspto")
# source and target are the starting materials and the desired product, defined elsewhere
route = solver.plan(source, target)

Being able to solve retrosynthetic routes has been a longstanding goal for DeepChem.

Robotic simulator for training a walking robot

This example uses reinforcement learning and a differentiable simulator to jointly train a robotic walking system

simulator = DifferentiableRoboticsSimulator()
walker = ReinforcementLearningWalker()
walker.fit(simulator, ngpus=16)

Integrated Circuit Simulator

This example simulates the design for an integrated circuit on a differentiable semiconductor simulator.

simulator = DifferentiableSemiconductorSimulator()
circuit = ...  # placeholder: some specification of the circuit
trajectories = simulator.simulate(circuit)
simulator.refine(trajectories)

Parting Thoughts

Some of these examples are very light on details, especially the complex simulators, but I hope they give a sense of the type of DeepChem programs our users could be writing over the next few years. The code samples above make liberal use of the proposed model hub and of distributed featurization and training support. I’d also really like to see us make differentiable simulations easier, but I don’t yet have any proposed syntax for doing this.

Another thought is that I’d like to see a future DeepChem ecosystem grow. Right now, there are relatively few packages out there that depend on DeepChem. This is perhaps due to the immaturity of DeepChem as a production codebase. As DeepChem matures as a library, we should aim to bootstrap an ecosystem of other packages that can leverage DeepChem to solve downstream problems.

Thanks, these are good examples.

The most prominent new feature I see is the ncores and ngpus options. Those should be easy to support for parallelizing on a single computer. Distributing computation on a cluster is a bit more difficult, but it should also be possible. Actually, I’d suggest that the ncores option should usually not be needed. The default behavior should be to use all cores on your local machine. You should only need to specify something different if you want to restrict it to fewer cores, or if you’re distributing it to multiple machines.
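
A minimal sketch of that default, for the single-machine case only. The helper name resolve_ncores is hypothetical and not part of DeepChem; clamping to the local core count is an assumption that would not apply once work is distributed across machines.

```python
import os


def resolve_ncores(ncores=None):
    """Return the worker count: all local cores unless the caller restricts it."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    if ncores is None:
        return available
    if ncores < 1:
        raise ValueError("ncores must be a positive integer")
    # Single-machine assumption: never ask for more cores than exist locally.
    return min(ncores, available)
```

With this shape, simulate(...) and dock(...) could accept ncores=None and callers would only pass a value when they want fewer cores.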

Some of these examples don’t make sense. For example, your code for training a force field:

# Simulate on 20 GPUs for 1 millisecond sampling
trajectories = forcefield.simulate("protein.pdb", ngpu=20, time="1 millisecond")
# Use simulation trajectories to refine force field parameters
forcefield.refine(trajectories)

Force fields aren’t trained on trajectories. They’re trained on datasets of either experimental data or quantum chemistry calculations. A force field is just a function that takes positions as inputs and produces forces and energy as outputs. You train it much like you would any other model, trying to make its outputs match the training data. Sometimes people use active learning approaches where you identify the samples for which the results are least accurate, generate new conformations in that area, invoke a quantum chemistry code to compute forces and energies for them, and add them to the dataset. Either way, it doesn’t look like the code you wrote.
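
To make the "function from positions to forces and energy, fit to reference data" picture concrete, here is a toy sketch: a one-dimensional harmonic spring whose two parameters are fit by least squares to reference forces. Everything here is illustrative. The function names are made up, and the reference data is synthetic rather than coming from experiment or a quantum chemistry code.

```python
def harmonic_force_field(k, x0):
    """Return a function mapping a position to (energy, force) for a 1D spring."""
    def evaluate(x):
        energy = 0.5 * k * (x - x0) ** 2
        force = -k * (x - x0)
        return energy, force
    return evaluate


def fit_to_reference_forces(positions, forces):
    """Least-squares fit of k and x0 so that -k * (x - x0) matches the forces."""
    n = len(positions)
    mean_x = sum(positions) / n
    mean_f = sum(forces) / n
    sxx = sum((x - mean_x) ** 2 for x in positions)
    sxf = sum((x - mean_x) * (f - mean_f) for x, f in zip(positions, forces))
    slope = sxf / sxx                    # slope of F(x) equals -k
    intercept = mean_f - slope * mean_x  # intercept equals k * x0
    k = -slope
    return k, intercept / k


# Synthetic reference data generated from k = 2.0 and x0 = 1.5.
reference = harmonic_force_field(2.0, 1.5)
positions = [0.0, 0.5, 1.0, 2.0, 3.0]
forces = [reference(x)[1] for x in positions]

k_fit, x0_fit = fit_to_reference_forces(positions, forces)
```

An active learning loop would wrap this fit: evaluate the fitted model on new positions, find where it is least reliable, obtain reference forces there, and refit on the enlarged dataset.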

The same is true for some of the other examples, like the differentiable cell simulation and the integrated circuit simulator. You’re missing any mention of the data you’re training the model on, and I don’t know what it means to “refine” a simulator based on a simulation it just generated.

Some of this feels to me a bit like a grab-bag of unrelated features. You assume the existence of classes like LatticeChromodynamicsNormalizingFlow, RetrosynthesisPlanner, and DifferentiableSemiconductorSimulator. What do QCD, retrosynthesis, and semiconductor simulation have to do with each other? Why would they all go in the same software package? What ties them together, such that they would be features of DeepChem rather than unrelated software packages used by unrelated user communities?

Good suggestion!

My bad here! I was going for something like an active learning optimization loop, but agree that the current code snippet doesn’t do a good job of capturing this flow at all. I’ll do some thinking and put together a better code snippet.

Will rework these code snippets to be more sensible as well!

Part of this is driven by my own recent work. I’ve spent time on various projects thinking about each of these three ideas, which is why they came to mind :)

You’re asking excellent questions about what ties them together so that they would be part of DeepChem. I’ll exclude retrosynthesis since it’s been a long-time feature request for DeepChem (and something we already partially support with dc.molnet.load_uspto).

For the other snippets (differentiable cell simulator, differentiable semiconductor modeler, lattice chromodynamics) my intuition is that all of these models could benefit from a shared infrastructure for differentiable simulation. This common differentiable simulation infrastructure could live naturally within DeepChem and different applications could become different libraries in a broader DeepChem ecosystem.

My apologies, though, that these ideas are clearly half-baked! I’ll do more serious thinking and return with better code once I’ve put it together for myself :)

I have not yet had a chance to read all the above comments in depth, but one suggestion I have is to look at systems like Nextflow (nextflow.io) and Cromwell/WDL from the Broad. The bioinformatics community has a bunch of good thoughts on pipelines, workflows, and software ergonomics (for the most part!). For distributing problems over large and scalable systems, I’ve had excellent luck with Nextflow combined with more conventional task management systems such as SLURM.

I’m pleased that you’re also connected into the Julia world, as I’ve found it to be a real pleasure to work with.

This should be relevant: https://data-apis.org/array-api/latest/. All the major packages that have array or tensor APIs (NumPy, TensorFlow, PyTorch, JAX, etc.) are working to unify their APIs so it will be possible to write a single piece of code that can operate on any of them.
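
As a rough illustration of what such library-agnostic code looks like, the sketch below asks the array for its namespace through the standard's `__array_namespace__` method and then uses only functions the standard names. The fallback to NumPy for arrays that do not expose that method is an assumption made here for older library versions, and the helper name rms is made up.

```python
import numpy as np


def rms(x):
    """Root-mean-square using only functions named in the array API standard."""
    # Arrays that implement the standard expose their namespace directly;
    # the NumPy fallback is an assumption for array types that do not.
    xp = x.__array_namespace__() if hasattr(x, "__array_namespace__") else np
    return xp.sqrt(xp.mean(x * x))


value = float(rms(np.asarray([3.0, 4.0])))
```

The same rms function should then accept arrays from any library that implements the standard, without importing that library by name.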

I think this discussion ties into the ongoing discussion at https://github.com/deepchem/deepchem/issues/2504. If we decide to create additional packages within the monorepo (deepscience, deepchem, deepmat, deepbio, etc.), that might also help us better communicate DeepChem’s broadened focus as a framework for AI-driven science.

Hi,
I’m relatively new to DeepChem and to machine learning in general, so I started off with the 6th notebook in the tutorials: 06_Introduction_to_Graph_Convolutions
I like to visualize or imagine how and where DeepChem fits into an ML framework, which is why I’m posting here; it might help one way or another (if you think this should be posted elsewhere, please advise)

  1. The predictor
  • The final goal is a predictor that can predict y (classification or regression…) given input X
  • The predictor is built from data and a model
    predictor - data - model
  2. DeepChem built-in datasets and models
  • DeepChem provides a framework for handling datasets, models, and the ‘connection’ between them, because a given model is good at certain tasks and datasets
    data-model
  3. Example with the tox21 dataset and the graph convolution model, both provided by DeepChem
  • I would explore a bit more what is in tox21
  • What GraphConvModel is, and what its inputs and outputs are
  4. The ‘connection’ provided by DeepChem
  • A featurizer (dc.feat) that turns the tox21 dataset into a new dataset that can be fed to GraphConvModel

That ‘thinking workflow’ is probably more for people who are new to ML, but I believe it can be applied to many other datasets and models, and it can partly answer the question “What is DeepChem?”.

Some follow-up thoughts are in the thread “Some thoughts on DeepChem Architecture”.