DeepChem Minutes 7/10/2020

bharath · July 15, 2020, 9:21pm

Date: 7/10/2020

Attendees: Bharath, Daiki, Seyone, Nathan, Gabe, Peter

Summary: Gabe was a first-time attendee on the call, so we started with a brief round of introductions. Gabe is a machine learning engineer at Reverie Labs who’s worked on building out their deep learning stack.

We then did the usual roundtable updates. This week, Bharath’s been working on a number of DeepChem improvements. The biggest one was getting support for a deepchem-nightly build merged in over a series of PRs (PR1960, PR1963, PR1964, PR1968). Daiki did some critical fixes to the nightly build as well (PR1986). We now have support for the convenient new installation command pip install --pre deepchem, which allows for rapid installation of the nightly DeepChem build!

Bharath also put in some time this week improving the transformers. The BalancingTransformer had some issues with efficiency and multiclass data (PR1980), and the transformers needed documentation improvements (PR1976) and fixes for directing output (PR991). Bharath also swapped over more of DeepChem to using logging rather than print statements (PR1994) and put up a PR refactoring out a MolecularFeaturizer superclass (PR1992) for featurizers on molecules. This cleans up the featurizers and creates a template for featurizing different types of data in DeepChem.
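As a rough illustration of the template idea behind a featurizer superclass, here is a minimal sketch in which the base class owns the loop and error handling while subclasses implement only a per-molecule hook. The names BaseMolFeaturizer, _featurize, and CharCountFeaturizer are hypothetical, the example works on plain SMILES strings rather than parsed molecules, and none of it is DeepChem's actual API.

```python
import logging
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence

logger = logging.getLogger(__name__)


class BaseMolFeaturizer(ABC):
    """Hypothetical template: the base class owns the loop, subclasses own one molecule."""

    def featurize(self, smiles_list: Sequence[str]) -> List[Optional[List[float]]]:
        features: List[Optional[List[float]]] = []
        for smiles in smiles_list:
            try:
                features.append(self._featurize(smiles))
            except Exception:
                # Log instead of print, and keep a placeholder so indices stay aligned.
                logger.warning("Failed to featurize %r", smiles)
                features.append(None)
        return features

    @abstractmethod
    def _featurize(self, smiles: str) -> List[float]:
        """Return a feature vector for a single molecule."""


class CharCountFeaturizer(BaseMolFeaturizer):
    """Toy subclass: counts raw characters in the SMILES string (not real atom counts)."""

    SYMBOLS = ("C", "N", "O")

    def _featurize(self, smiles: str) -> List[float]:
        if not smiles:
            raise ValueError("empty SMILES string")
        return [float(smiles.count(symbol)) for symbol in self.SYMBOLS]


# The empty string fails inside the hook and is recorded as None.
features = CharCountFeaturizer().featurize(["CCO", "", "CN"])
```

With this shape, a featurizer for a different kind of input would only need its own base class with the same loop and a different per-item hook.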

Bharath has also been working on transitioning DeepChem infrastructure from Stanford AWS to our own AWS accounts. Unfortunately, the funding that was paying for the Stanford infrastructure has run out, but the silver lining is that since we’re on our own infrastructure now, we can provide IAM access to DeepChem’s S3 bucket for all our developers. Bharath took a first cut at getting IAM access set up and tested with Daiki’s help. It looks to be working! Bharath is working on moving material over from the Stanford bucket to the DeepChem bucket and will set up IAM access for DeepChem developers.

Transitioning to our own infrastructure means that we’ll need to apply for grants to fund our S3 usage. AWS looks to provide some open source project support (see Nathan’s issue) which we should prepare an application for.

Daiki wrote a blog post about his work so far on Jaxchem for Google Summer of Code. He did some benchmarking of various models and found that sparse graph convolutions in Jax seem to have some serious performance issues (Jaxchem issue). Daiki is looking into improving these performance issues. Daiki also added Jaxchem documentation and put up a readthedocs site (Jaxchem PR12, Jaxchem PR13, Jaxchem PR14). On the DeepChem side, Daiki worked on improving documentation (PR1983, PR1988, PR2001, PR2002).

Seyone put up a first PR (PR1970) that is working towards adding a ChemBerta PyTorch implementation. There was some good discussion on the PR about proper design for a PyTorch module. It looks like getting this merged in will take some iteration since this is our first PyTorch model to merge into DeepChem. Seyone is planning to keep iterating on this PR and is also working on scaling up ChemBerta training. Seyone is working to transition his training infrastructure to AWS Sagemaker so he can train larger models.

Nathan merged in the overhaul of the MoleculeNet template (PR1938) after an extensive review process. The new template considerably generalizes MoleculeNet to allow for user-specified featurizations of MoleculeNet datasets. We have an issue to track the conversion of existing MoleculeNet loaders to the generalized style. Nathan also put up a new PR (PR1998) that adds a new JsonLoader class for loading data from JSON files. These files are commonly used in materials science datasets. Once the MolecularFeaturizer PR is merged in, Nathan plans on putting up a PR for a new CrystalFeaturizer to featurize crystal datasets. Nathan is also planning on submitting a new PR that adds new crystal datasets to MoleculeNet.
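To make the JSON loading idea concrete, here is a minimal standard-library sketch that pulls one input field and one label field out of a list of JSON records. The function load_json_records, the field names, and the toy records are all hypothetical; this is not the interface of the JsonLoader in PR1998.

```python
import json
from typing import Any, Dict, List, Tuple


def load_json_records(text: str, feature_field: str,
                      label_field: str) -> Tuple[List[Any], List[float]]:
    """Parse a JSON array of records and split it into inputs and labels."""
    records: List[Dict[str, Any]] = json.loads(text)
    inputs: List[Any] = []
    labels: List[float] = []
    for record in records:
        # A missing field raises KeyError, which surfaces malformed records early.
        inputs.append(record[feature_field])
        labels.append(float(record[label_field]))
    return inputs, labels


# Made-up records, purely for illustration.
raw = '[{"structure": "toy-crystal-1", "target": 0.5}, {"structure": "toy-crystal-2", "target": 2}]'
structures, targets = load_json_records(raw, "structure", "target")
```

The raw inputs returned here would then be handed to a featurizer, which is where something like a crystal featurizer would come in.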

Peter has been working to add type annotations to DeepChem (PR1979). This is a really powerful new feature since it will help us bring the core DeepChem codebase under better control and check that our APIs are coherent. Peter has finished adding types to several of the core classes and is planning to write up guidelines and recommendations on how to properly annotate the code. It’s likely we’ll have to do a few iterations to add type annotations throughout the codebase, but this should help stabilize the core of DeepChem.
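For readers new to type annotations, the following small sketch shows what an annotated function looks like and the kind of mistake a static checker can catch before the code runs. The function normalize_weights is a hypothetical example and is not taken from DeepChem.

```python
from typing import List, Optional, Sequence


def normalize_weights(weights: Sequence[float],
                      total: Optional[float] = None) -> List[float]:
    """Scale weights so that they sum to `total` (1.0 when total is None)."""
    target = 1.0 if total is None else total
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("weights must not sum to zero")
    return [w * target / weight_sum for w in weights]


normalized = normalize_weights([1.0, 3.0])

# A static checker such as mypy would reject a call like normalize_weights(None),
# because None is not a Sequence[float]. Without annotations, the same mistake
# would only show up as a TypeError at runtime.
```

The annotations also document the API: a reader can see from the signature alone that `total` is optional and that a list of floats comes back.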

With the roundtable complete, we proceeded to open discussion of DeepChem issues.

Bharath asked Gabe if, from his experience at Reverie, he had any feedback on how to improve DeepChem. Gabe mentioned that he’d found in practice that splitting datasets to test generalization was a major challenge. While tools like dc.splits.ScaffoldSplitter help somewhat, it’s still hard to really measure the generalizability of a model. Gabe mentioned that he wasn’t yet familiar with the core DeepChem structure but might have more feedback once he had a chance to go through the library.
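The core idea behind scaffold-style splitting is that whole groups of related molecules go to one side of the split, so the test set contains chemistry the model has not seen. Here is a minimal sketch of that idea, assuming the group keys (for example, scaffold identifiers) have already been computed. The function group_split and the toy keys are hypothetical, and this is not a description of how dc.splits.ScaffoldSplitter is implemented.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_split(group_keys: Sequence[Hashable],
                frac_train: float = 0.8) -> Tuple[List[int], List[int]]:
    """Split indices so every group lands entirely in train or entirely in test."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)

    # Visit the largest groups first; ties are broken by first occurrence.
    ordered = sorted(groups.values(), key=lambda members: (-len(members), members[0]))

    train_cutoff = frac_train * len(group_keys)
    train: List[int] = []
    test: List[int] = []
    for members in ordered:
        # A group goes to train only if it fits under the cutoff as a whole.
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy group labels standing in for precomputed scaffold keys.
scaffold_keys = ["benzene", "benzene", "benzene", "pyridine", "pyridine", "indole"]
train_idx, test_idx = group_split(scaffold_keys, frac_train=0.8)
```

Even with a split like this, the test groups may still resemble the training groups, which is one reason measuring generalization remains hard.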

Seyone asked Gabe if from his experience he found that any techniques for interpretability and visualization were useful in practice. Gabe said he had a few ideas but he’d have to dig them up offline.

Nathan mentioned that he was working on designing a new normalizing flow API and wanted some feedback on the core design. Nathan’s been looking through the core DeepChem models design. Most DeepChem models live in dc.models, but we have a couple of outliers in dc.rl and dc.metalearning. The models in dc.models inherit from sklearn.base.BaseEstimator and are mostly classification/regression models, with a few generative and sequence models as well. There is also a large collection of Keras layers in dc.models.layers. Given this layout of different model classes, he was trying to think through a good design for normalizing flows.

Peter asked for a bit more clarification on the types of design Nathan was thinking about: an abstract general API, or something more specific like Keras layers?

Nathan mentioned that TensorFlow Probability has a large library of normalizing flows. Nathan has also been doing a literature review of interesting normalizing flow papers and found some normalizing flows for molecular graph generation. To add general normalizing flow support, we’d need some support for probability distributions and some implementation of normalizing flow layers. Given that we’re looking into adding PyTorch support as well, should we try to make these layers general, or should we stick to a TensorFlow implementation?

Bharath mentioned that he’d seen a few different structures for multiple framework support. DGL for example wraps TensorFlow/PyTorch tensors and codes its models in a framework-agnostic fashion. Keras used to do something similar for TensorFlow and Theano as well. The downside of this approach was that it tended to get messy and hard to maintain due to the extra abstraction.

Peter mentioned that we don’t currently have a notion of DeepChem layers. We previously had TensorGraph custom TensorFlow layers, but as TensorFlow has standardized on Keras layers, we decided to make the switch too. Peter mentioned that one possibility would be to create a custom DeepChem layer class which allowed for sharing code across frameworks.

Bharath asked Gabe whether Reverie used PyTorch or TensorFlow. Gabe mentioned that Reverie used a mix of both, but primarily used TensorFlow since TF Serving provides a nice framework for pushing models to production. Gabe asked whether DeepChem had support for pushing models to production. Bharath mentioned that until recently, DeepChem hadn’t been stable enough to really use in production, but now that DeepChem was stabilizing, this was something worth thinking more closely about. Gabe also mentioned that with TensorFlow 2, APIs across libraries had become pretty similar and that there weren’t large differences between PyTorch and TensorFlow.

Bharath asked if there was a standard PyTorch analog for Keras. Peter mentioned that from his experience, not really. PyTorch models tend to be similar to the TensorFlow subclassing style, but the Keras declarative style doesn’t seem to match how PyTorch models are structured.

Nathan mentioned that since PyTorch is more broadly adopted in the research community, its major advantage is that it makes it easier to translate models from academic research into DeepChem.

Bharath said adding support for DeepChem layers felt like a bit too much technical debt right now, but that it might make sense to standardize on APIs for different model types as suggested in this issue. Bharath asked what Peter felt would be a more sensible design choice. Peter mentioned that PyTorch support was very new to DeepChem and that we didn’t yet have an understanding of what shared PyTorch infrastructure (like a TorchModel, for example) we should build. Building abstract APIs for now, with concrete implementations in specific frameworks, seemed to make the most sense.

It looked like the discussion was settling around the creation of an abstract NormalizingFlow superclass that defined the common API for normalizing flows, but doing the implementation using TensorFlow Probability and TensorFlow. Nathan said this made sense and provided a good framework for next steps.
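To illustrate what an abstract superclass with a framework-specific implementation could look like, here is a minimal pure-Python sketch. The method names forward, inverse, and log_prob, and the AffineFlow toy class, are assumptions made for illustration. The design discussed on the call would put TensorFlow Probability behind the abstract API, whereas this sketch uses scalar floats and a standard normal base distribution so that it runs with no dependencies.

```python
import math
from abc import ABC, abstractmethod
from typing import Tuple


class NormalizingFlow(ABC):
    """Hypothetical abstract API: an invertible map with a tractable log-determinant."""

    @abstractmethod
    def forward(self, z: float) -> Tuple[float, float]:
        """Map a base sample z to x; return (x, log|dx/dz|)."""

    @abstractmethod
    def inverse(self, x: float) -> Tuple[float, float]:
        """Map x back to z; return (z, log|dz/dx|)."""

    def log_prob(self, x: float) -> float:
        """Change of variables: log p(x) = log p_base(z) + log|dz/dx|."""
        z, log_det = self.inverse(x)
        log_base = -0.5 * (z * z + math.log(2.0 * math.pi))  # standard normal base
        return log_base + log_det


class AffineFlow(NormalizingFlow):
    """Toy concrete flow x = scale * z + shift, standing in for a framework-backed one."""

    def __init__(self, scale: float, shift: float) -> None:
        if scale == 0:
            raise ValueError("scale must be non-zero for the map to be invertible")
        self.scale = scale
        self.shift = shift

    def forward(self, z: float) -> Tuple[float, float]:
        return self.scale * z + self.shift, math.log(abs(self.scale))

    def inverse(self, x: float) -> Tuple[float, float]:
        return (x - self.shift) / self.scale, -math.log(abs(self.scale))


flow = AffineFlow(scale=2.0, shift=1.0)
x, _ = flow.forward(0.5)   # push a base sample through the flow
z, _ = flow.inverse(x)     # and recover it again
```

The point of the split is that log_prob lives in the abstract class and only depends on inverse, so a TensorFlow or PyTorch subclass would need to supply just the invertible map and its log-determinant.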

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.