DeepChem Minutes 1/17/2020

Date: 1/17/2020
Attendees: Bharath, Peter, Vignesh
Discussion: We did a sync-up about feedback to the DeepChem survey. There have been 14 responses total so far with a range of asks and suggestions. Here are a few of the high-level summary points:

  • DeepChem has gotten large and complex. A number of respondents used to use DeepChem but stopped. The common reason was they switched to a custom system more suited to the problems they were focused on.
  • Respondents find value in the featurizers and chemically relevant models, but find the custom TensorGraph framework hard to understand. A number of respondents said they’d switched to using PyTorch (and in one case JAX) instead of TensorFlow.
  • When asked about pain points, respondents pointed to the difficulty of installation, lack of interoperability with other ML tools, the difficulty of loading CSV files, and the difficulty of hyperparameter optimization. There was a general desire for DeepChem to be simpler.

Bharath and Peter have been working on synthesizing these responses into a cogent plan for revamping DeepChem. This plan is still a work in progress, but a few common themes are emerging. Here’s a rough sketch of the changes being considered:

  • DeepChem could be split into 2, maybe 3 simpler packages. These packages would be independent, and users could choose to use their own tooling with just one of these new packages:
    • A three-way split would create a moleculenet package that provides just access to the datasets, a featurizer package that provides chemoinformatic featurizers, and a models package that provides implementations of interesting models.
    • An alternate two-way split could be into a “data” package and a “featurizers/models” package.
    • A third split would have a “data/featurizers” package and a “models” package (the first would be everything independent of ML framework and the second the parts that depend on TensorFlow).
  • TensorGraph has already been deprecated and should be removed outright.
  • Ideally, we can phase out the DeepChem Dataset object in favor of a more common standard. Pandas dataframes likely suffice for 90% of use cases, but it’s not clear what to do for the bigger datasets. Alternatively, if we don’t find any good replacement package, we can extend Dataset so that it can more easily interoperate with Pandas and other common ML frameworks.
  • The DeepChem test suite needs to be trimmed down so that tests run quickly and Travis CI stays green.
  • DeepChem needs better Mac/Windows support. CI tests need to be added for these other platforms and installation instructions need to be updated.
  • The tutorials and website need to be refreshed.

Bharath is working on a design document that lays out how to architect these changes to DeepChem. As a sneak peek, here’s the rough plan so far.

  • Release DeepChem 2.4, which includes a set of intermediate changes:
    • Simplify test suite so it runs more quickly and development is easier
    • Update to TensorFlow 2.X
    • Add Windows/Mac CI tests
    • Refresh tutorials/website.
    • Add and test new pip packages
  • Release DeepChem 3.0, which makes a set of major breaking changes:
    • Split out 2 or 3 independent packages from core DeepChem along the lines discussed above
    • Possibly remove DeepChem Dataset class in favor of more standard machine learning tools. Focus on interoperability with common ML packages.
    • Remove TensorGraph framework and sklearn wrappers
    • Create PyTorch DeepChem models package that provides simple PyTorch infrastructure as needed.
    • Revamp tutorials/website/book to use independent packages

Vignesh completed a pull request that added get_config for custom DeepChem layers. This allows them to be loaded using Keras’s load_model. Bharath merged in this change.
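For readers unfamiliar with why get_config matters, here is a minimal, framework-free sketch of the round trip it enables. It is not the code from the pull request: ScaleLayer and its arguments are hypothetical, and the class only imitates the get_config/from_config contract that Keras layers follow. In real Keras code the layer would subclass tf.keras.layers.Layer and the class would typically be passed to load_model via custom_objects.

```python
import json


class ScaleLayer:
    """Hypothetical stand-in for a custom layer; not a DeepChem or Keras class."""

    def __init__(self, scale=1.0, name="scale"):
        self.scale = scale
        self.name = name

    def get_config(self):
        # Return every constructor argument needed to rebuild the layer.
        return {"scale": self.scale, "name": self.name}

    @classmethod
    def from_config(cls, config):
        # Rebuild the layer from the dictionary produced by get_config.
        return cls(**config)

    def __call__(self, values):
        # Toy forward pass: multiply each input by the stored scale.
        return [self.scale * v for v in values]


original = ScaleLayer(scale=2.5, name="example_scale")

# A saved model stores the config as plain data (JSON here) ...
saved = json.dumps(original.get_config())

# ... and loading rebuilds the layer from that data alone.
restored = ScaleLayer.from_config(json.loads(saved))
```

Without get_config, the loader has no way to know which constructor arguments the layer was created with, which is why custom layers that omit it cannot be restored.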

This looks great, where would capabilities like scaffold and cluster-based cross validation live in the new package hierarchy? As I mentioned in the survey, I think it’s important to make DeepChem more familiar by allowing it to work like, and interoperate with, tools like Pandas and scikit-learn.

That looks like a good plan. I can get started on the transition to TF2. It would be very helpful if we could drop TensorGraph at the same time, since otherwise that’s a lot of code that will need to be converted to TF2 only to be deleted in the very next release.

Hopefully speeding up the test suite will take care of itself. As we remove deprecated models, and especially TensorGraph, that will eliminate a lot of tests. Also, there are a lot of tests that we run twice, once in graph mode and once in eager mode. Since graph mode doesn’t really exist in the same sense in the TF2 API, we’ll only need to run them once. After we’ve done all the other changes, we can see how long the test suite is taking and whether we need to do anything else to speed it up.

My vote is to focus on a reduction of complexity. The core DeepChem GraphConvs data loading / training / inference should be like 500 lines of code tops.

It would also be nice to see some packages removed. Do we really need pandas to read csv files? Can we dump rdkit?

RDKit is used in all the chemical featurizers. When it takes a SMILES string and produces a graph or a fingerprint or a list of physical properties, that’s done with RDKit. So we can’t get rid of it. But if we separated the featurizers into their own package, you wouldn’t need it if you weren’t doing chemistry.

Should we look to add typed annotations for DeepChem, now that we run only on Python 3.5 and 3.7?

@PatWalters Great question! I think the splitters and transformers should be packaged up alongside the dataset object. Strongly agree on the need to make DeepChem work with common tools like pandas and scikit-learn. I’m still working to figure out a good structure for interoperability but will report back soon!
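As a rough illustration of what a splitter that lives alongside the dataset could look like without depending on any ML framework, here is a sketch of a scaffold-style group split over plain Python lists. It is not DeepChem's splitter: group_split and frac_train are hypothetical names, and the scaffold keys are assumed to be precomputed (for example, Murcko scaffold SMILES from RDKit). The returned indices could be used to select rows from a Pandas dataframe or a NumPy array.

```python
from collections import defaultdict


def group_split(scaffolds, frac_train=0.8):
    """Split indices so that no scaffold group spans both sets (hypothetical helper)."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first, so big scaffold families land in the training set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    cutoff = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # Keep each group whole: it goes to train only if it fits under the cutoff.
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Assumed, precomputed scaffold keys for ten molecules.
scaffold_keys = ["A", "A", "A", "B", "B", "C", "A", "D", "B", "C"]
train_idx, test_idx = group_split(scaffold_keys, frac_train=0.8)
```

Because whole groups are assigned at once, the realized train fraction only approximates frac_train; the guarantee is that no scaffold appears on both sides of the split.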

@peastman I think it would be a great idea to drop TensorGraph as we get TF2 support! There’s no point converting code we’re going to get rid of. I like the idea of making the TensorGraph/TF2 changes and seeing where our test suite is at before we make more serious cuts.

@patrickhop Agree we should focus on reducing complexity! I’m looking at the DeepChem package structure and I think a lot of subpackages (deepchem.dock, deepchem.metalearning, deepchem.rl) could be removed or moved out into small standalone packages. This would let the core package remain very slender.

@Vignesh Typed annotations are useful, but I think that’s probably a separate effort from this refactoring. Adding annotations typically adds complexity, and I’d like us to focus on simplicity in the current push. This would be worth revisiting in the future though!