Attendees: Bharath, Peter, Vignesh
Discussion: We did a sync-up about feedback from the DeepChem survey. There have been 14 responses so far, with a range of asks and suggestions. Here are a few high-level summary points:
- DeepChem has gotten large and complex. A number of respondents used to use DeepChem but stopped. The common reason was that they switched to a custom system better suited to the problems they were focused on.
- Respondents find value in the featurizers and chemically relevant models, but find the custom TensorGraph framework hard to understand. A number of respondents said they’d switched to using PyTorch (and in one case JAX) instead of TensorFlow.
- When asked about pain points, respondents pointed to the difficulty of installation, the lack of interoperability with other ML tools, the difficulty of loading CSV files, and the difficulty of hyperparameter optimization. There was a general desire for DeepChem to be simpler.
Bharath and Peter have been working on synthesizing these responses into a cogent plan for revamping DeepChem. This plan is still a work in progress, but a few common themes are emerging. Here’s a rough sketch of the changes being considered:
- DeepChem could be split into 2, maybe 3 simpler packages. These packages would be independent, and users could choose to use their own tooling with just one of these new packages:
  - A three-way split would create a moleculenet package that provides just access to the datasets, a featurizer package that provides cheminformatics featurizers, and a models package that provides implementations of interesting models.
  - An alternate two-way split could be into a “data” package and a “featurizers/models” package.
  - A third option would have a “data/featurizers” package and a “models” package (the first would be everything independent of the ML framework and the second the parts that depend on TensorFlow).
- TensorGraph has already been deprecated and should be removed outright.
- Ideally, we can phase out the DeepChem Dataset object in favor of a more common standard. Pandas dataframes likely suffice for 90% of use cases, but it’s not clear what to do for the bigger datasets. Alternatively, if we don’t find a good replacement package, we can extend Dataset so that it interoperates more easily with Pandas and other common ML frameworks.
- The DeepChem test suite needs to be trimmed down so that tests run quickly and Travis CI stays green.
- DeepChem needs better Mac/Windows support. CI tests need to be added for these other platforms and installation instructions need to be updated.
- The tutorials and website need to be refreshed.
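As a rough illustration of the Pandas interoperability idea in the list above, here is a minimal sketch of a dataset container that converts to and from a dataframe. The ArrayDataset class, its field names, and the "X0", "X1", ... column naming are all hypothetical and chosen only for this example. This is not DeepChem's Dataset class or a proposed API.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class ArrayDataset:
    """Hypothetical stand-in for a dataset container (not DeepChem's Dataset)."""

    X: np.ndarray    # features, shape (n_samples, n_features)
    y: np.ndarray    # labels, shape (n_samples,)
    ids: np.ndarray  # identifiers such as SMILES strings, shape (n_samples,)

    def to_dataframe(self) -> pd.DataFrame:
        # One column per feature, plus a label column and an identifier column.
        columns = [f"X{i}" for i in range(self.X.shape[1])]
        df = pd.DataFrame(self.X, columns=columns)
        df["y"] = self.y
        df["ids"] = self.ids
        return df

    @classmethod
    def from_dataframe(cls, df: pd.DataFrame) -> "ArrayDataset":
        # Feature columns are recognized by the "X" prefix used in to_dataframe.
        feature_cols = [c for c in df.columns if c.startswith("X")]
        return cls(
            X=df[feature_cols].to_numpy(),
            y=df["y"].to_numpy(),
            ids=df["ids"].to_numpy(),
        )


dataset = ArrayDataset(
    X=np.array([[0.1, 0.2], [0.3, 0.4]]),
    y=np.array([1.0, 0.0]),
    ids=np.array(["CCO", "CCN"]),
)
df = dataset.to_dataframe()
round_trip = ArrayDataset.from_dataframe(df)
```

A conversion layer like this would work for datasets that fit in memory. It does not address the open question of what to do for the bigger datasets.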
Bharath is working on a design document that lays out how to architect these changes to DeepChem. As a sneak peek, here’s the rough plan so far.
- Release DeepChem 2.4, which makes a set of intermediate changes:
  - Simplify the test suite so it runs more quickly and development is easier.
  - Update to TensorFlow 2.X.
  - Add Windows/Mac CI tests.
  - Refresh tutorials/website.
  - Add new pip packages and test them.
- Release DeepChem 3.0, which introduces a set of major breaking changes:
  - Split out 2 or 3 independent packages from core DeepChem along the lines discussed above.
  - Possibly remove the DeepChem Dataset class in favor of more standard machine learning tools. Focus on interoperability with common ML packages.
  - Remove the TensorGraph framework and sklearn wrappers.
  - Create a PyTorch DeepChem models package that provides simple PyTorch infrastructure as needed.
  - Revamp tutorials/website/book to use the independent packages.
Vignesh completed a pull request that added get_config for custom DeepChem layers. This allows them to be loaded using Keras’s load_model. Bharath merged in this change.
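For context, here is a minimal sketch of the general Keras pattern behind that change. The ScaledDense layer and its units and scale arguments are hypothetical and exist only for this example. This is not code from the pull request or from any DeepChem layer.

```python
import tensorflow as tf


class ScaledDense(tf.keras.layers.Layer):
    """Hypothetical custom layer, used only for illustration."""

    def __init__(self, units, scale=1.0, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.scale = scale
        self.dense = tf.keras.layers.Dense(units)

    def call(self, inputs):
        return self.scale * self.dense(inputs)

    def get_config(self):
        # Start from the base layer config (name, dtype, ...) and add the
        # constructor arguments needed to rebuild this layer.
        config = super().get_config()
        config.update({"units": self.units, "scale": self.scale})
        return config


layer = ScaledDense(units=4, scale=0.5, name="scaled_dense")
config = layer.get_config()

# from_config calls the constructor with the saved arguments.
rebuilt = ScaledDense.from_config(config)
```

Without get_config, Keras has no record of the constructor arguments and cannot recreate the layer from a saved model. With it, a saved model containing the layer can typically be restored by passing custom_objects={"ScaledDense": ScaledDense} to tf.keras.models.load_model.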