DeepChem 2.4.0 Release Notes

DeepChem 2.4.0 constitutes a major update to DeepChem, incorporating over a year's development work. The package has undergone a major overhaul on the backend with much more robust infrastructure, along with a considerable expansion of our features and capabilities.

To give a brief overview, DeepChem 2.4.0 adds full support for TensorFlow 2 and PyTorch-based models, infrastructure for materials science datasets and models, normalizing flow infrastructure, interaction fingerprints for protein-ligand complexes, partial support for language models like ChemBERTa, major improvements to our dataset infrastructure, cleanup of our logging infrastructure, an overhaul of our tutorial series, a rewrite of the MoleculeNet backend, and more. DeepChem 2.4.0 is much more production-hardened than DeepChem 2.3.0 and should be able to serve as a stable library in corporate and research infrastructure.

From DeepChem 2.5.0 onwards we will be switching to a more accelerated release schedule and are tentatively aiming to release DeepChem 2.5.0 at the end of February.

Detailed Overview

To start, our Jupyter notebook tutorial series has been considerably expanded and updated to run on Google Colab. A number of newcomers to the community have worked through the tutorial series and given us valuable feedback. We anticipate growing the tutorial series with more material in the releases to come.

Dataset objects have improved interoperability with standard Python ecosystem objects and can now be converted to/from NumPy arrays, pandas DataFrames, TensorFlow datasets, and PyTorch datasets. Datasets can also now be nicely printed out in IPython, making them much friendlier to work with. Under the hood, dataset loading has been considerably optimized for larger datasets by switching to npy formats and adding a new in-memory cache for DiskDataset objects. We have also switched to a new metadata format that records shapes. Dataset objects now support complete shuffles for datasets too large to fit into RAM. We've also improved data loading capabilities, including support for loading JSON files and improved support for in-memory loading (from pandas DataFrames).

Our continuous integration support has been improved considerably, including Windows support. The continuous integration system has been converted from Travis CI to GitHub Actions, and we now have a nightly pip build. On the code internals, type annotations have been added to most of the DeepChem core API, making the internals of the codebase more robust. The testing suite has been converted from nose to pytest, and we've also started running flake8 on the codebase.

DeepChem now has improved support for docking with a major overhaul of the dc.dock module. We will have further improvements and tutorials in the upcoming 2.5.0 release.

Logging is now handled through Python's standard logging module rather than print statements. We also now support Weights & Biases integration for logging. A broad goal during this release has been to move DeepChem closer to production-readiness from being a pure research project. Although we still have a long way to go on some fronts, we believe that this version of DeepChem can start to be used more confidently in production settings. We anticipate that we will continue improving the production readiness of DeepChem over the next several minor releases.
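Because messages now flow through the standard logging module, verbosity can be tuned per-logger with only the standard library (a sketch; the logger name "deepchem" follows the usual convention of naming loggers after the package):

```python
import logging

# Configure a root handler so DeepChem's messages are actually emitted.
logging.basicConfig(level=logging.INFO)

# Quiet DeepChem specifically without touching other libraries' loggers.
logging.getLogger("deepchem").setLevel(logging.WARNING)
```

This is the main practical benefit over print statements: downstream applications decide what gets shown, not the library.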

We’ve improved the performance of a number of our splitters. RandomStratifiedSplitter has been fixed to match the API of the other splitters. ButinaSplitter was overhauled to have improved performance and stability. The new FingerprintSplitter provides a powerful alternative to ScaffoldSplitter that we believe will help molecular machine learners better estimate the performance of their models.

The MoleculeNet suite has been considerably expanded with new datasets for materials science. On the backend, we’ve overhauled the MoleculeNet codebase to be considerably more maintainable and extensible. We’re actively working on the next release of MoleculeNet (see repo) and will continue these improvements over the months to come.

Our documentation has been dramatically improved. Check out our release documentation for 2.4.0 here. We plan to continue expanding our documentation support for DeepChem over the next several releases.

We've also considerably expanded our collection of models. Check out our new model cheatsheet, which lists all of our models. Support for LightGBM has been added to our new GBDTModel (in addition to XGBoost for gradient boosted trees). The new TorchModel class allows wrapping of arbitrary PyTorch models into the DeepChem API. We've also partnered with DGL to introduce wrappers for a number of DGL models in DeepChem. In addition, we have a number of new models for materials scientists. We hope to make DeepChem a quality tool for materials science discovery just as it is for computational chemists.

Changes Breaking Backwards Compatibility

Exhaustive List of Pull Requests

DeepChem 2.4.0 features literally hundreds of merged pull requests. We’ve listed these pull requests below and highlighted a few of the major ones with descriptions. DeepChem is very actively maintained and developed!


There is another major change that breaks backward compatibility. In 2.3 we moved from TensorGraph to Keras as our recommended modelling API, but we still supported TensorGraph for backward compatibility. In 2.4 that is no longer true. TensorGraph is completely gone. The recommended ways of building models are Keras (with KerasModel) and PyTorch (with TorchModel).


One other breaking change to mention here that I missed:

  • dc.molnet.load_pdbbind_grid is now removed. Instead use dc.molnet.load_pdbbind directly and specify the grid featurization.

CC @ncfrey