DeepChem Minutes 7/3/2020

Date: 7/3/2020
Attendees: Bharath, Peter, Daiki, Seyone, Steven
Summary: Steven is a first-time attendee on the call, so we started with a quick introduction. Steven is an ML engineer from Nigeria who did his undergrad in chemistry and is interested in getting more involved with DeepChem.

We then did the usual roundtable updates. Bharath finished work on the hyperparameter overhaul PR after a second round of careful comments from Daiki and Peter. He put up a first PR working towards setting up a deepchem-nightly package for developers working on the cutting edge, along with a PR that adds a model cheatsheet for developers. Bharath also merged in some namespace cleanup and is currently working on overhauling MoleculeNet.

Peter has been working on bugfix PRs this week. He merged in one PR that made fixes to KerasModel.fit_on_batch and Dataset.iterbatches and an optimization PR which speeds up the printing of large datasets in the Python REPL. Peter has also started working on adding mypy annotations to DeepChem. This has turned up a number of bugs. For example, there are API mismatches between KerasModel.fit() and Model.fit() that have come up when running mypy on the codebase even before adding annotations. This seems like something we’ll have to figure out, but mypy seems like a very useful tool that will add value to the codebase.
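To illustrate the kind of problem mypy can surface here, below is a minimal sketch of a subclass whose fit() signature is incompatible with its base class. The class names (BaseModel, MismatchedModel) and the nb_epoch parameter are hypothetical stand-ins chosen for illustration. They are not the actual DeepChem signatures, and the specific mismatch Peter found is not recorded in these notes.

```python
from typing import List


class BaseModel:
    """Hypothetical base class standing in for a generic model API."""

    def fit(self, dataset: List[float], nb_epoch: int = 10) -> float:
        # Toy "training": returns a number so the example is checkable.
        return sum(dataset) * nb_epoch


class MismatchedModel(BaseModel):
    """Hypothetical subclass that silently drops a base-class parameter."""

    # mypy reports this override as incompatible with the supertype,
    # because callers holding a BaseModel may pass nb_epoch.
    def fit(self, dataset: List[float]) -> float:
        return sum(dataset)


def train(model: BaseModel, dataset: List[float]) -> float:
    # Written against the base-class API only.
    return model.fit(dataset, nb_epoch=2)


def fails_at_runtime() -> bool:
    """Return True if the mismatched override breaks a base-class caller."""
    try:
        train(MismatchedModel(), [1.0, 2.0])
    except TypeError:
        return True
    return False
```

Without a type checker, the mismatch only shows up as a TypeError when train() happens to be called with the subclass. Running mypy over a file like this flags the override itself, before any code runs.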

This week Daiki has been working on putting up tutorials and issues summarizing his progress on Jaxchem. Daiki put up an issue summarizing some of the optimization issues he’s been having with Jax. Simply put, Jax is very slow on models that use Python for-loops to iterate and recommends using fori_loop (akin to tf.while_loop). However, fori_loop doesn’t work well with generators, so it’s been challenging to make an efficient Jax implementation as documented in the issue. Daiki also put up an intriguing suggestion that we should unify our graph convolutional data class (like PyTorch Geometric has done) so that we don’t have different classes like ConvMol and WeaveMol. This seems like an idea worth implementing as we continue cleaning up the library. Daiki also merged in a PR adding installation information for Jaxchem alongside a Tox21 tutorial.
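As a rough illustration of the for-loop point, here is a minimal sketch comparing a plain Python loop with jax.lax.fori_loop on a toy row-sum. The function names and the toy computation are assumptions made for this example and are not taken from Jaxchem or from Daiki's issue. Under jit, a Python for-loop is unrolled while tracing, so long loops produce large programs, whereas fori_loop keeps the loop as a single operation. The loop body must be a function of the index and the carried value, which is one reason a Python generator does not slot in directly.

```python
import jax.numpy as jnp
from jax import jit, lax


def sum_rows_python_loop(x):
    # Plain Python loop: one separate operation per iteration.
    total = jnp.zeros(x.shape[1])
    for i in range(x.shape[0]):
        total = total + x[i]
    return total


@jit
def sum_rows_fori_loop(x):
    # body(i, carry) -> new carry, as fori_loop requires.
    def body(i, total):
        return total + x[i]

    return lax.fori_loop(0, x.shape[0], body, jnp.zeros(x.shape[1]))


# Toy input: 4 rows of 3 features.
x = jnp.arange(12.0).reshape(4, 3)
loop_result = sum_rows_python_loop(x)
fori_result = sum_rows_fori_loop(x)
```

Both versions compute the same result. The difference is in how the loop is represented once the function is traced and compiled.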

Seyone has been continuing work on Chemberta this week. He’s been exploring different pre-training techniques such as Electra and has been looking into fine-tuning on different datasets. Bharath suggested that it might be useful to compare against the transfer learning results from Chemception from Vignesh’s blog post. Seyone is also continuing to work on the DeepChem Chemberta implementation.

Nathan couldn’t make this week’s call, but sent some updates over email. Nathan merged in an initial PR adding support for inorganic crystal featurizers and is working on a new PR adding a contribution template for MoleculeNet datasets. Daiki had some very helpful suggestions on how to generalize the loader to make it more useful. Nathan is also working on creating an API for normalizing flows, which he will post once ready. Nathan also debugged some upstream breakage caused by a cloudpickle API change.

Bharath asked Steven what parts of the library he was interested in getting involved with. Steven mentioned that MoleculeNet was very interesting, solubility prediction in particular. Bharath mentioned that he was overhauling the MoleculeNet loaders and would put up new documentation on how to replicate the MoleculeNet benchmarks. Steven asked if there were any good places to start getting involved with DeepChem development. Bharath suggested that adding examples to the documentation might be very helpful, and Peter suggested taking a look at GitHub and at the issues marked “Good First Contribution.”

Bharath asked if there were any topics of discussion for the week. Seyone asked whether it would be useful to extend DeepChem into protein engineering. Bharath said it would be very interesting, and suggested that perhaps we could add MoleculeNet wrappers around the TAPE benchmark suite (see also https://arxiv.org/pdf/1906.08230.pdf). Seyone said he’d look into it. Bharath mentioned that he felt that DeepChem wasn’t ready for the 2.4.0 release yet. Peter asked which features would need to make it in before we can make the cut. Bharath said the fixes to the metrics for multiclass classification, the atomic convolution fixes and the MoleculeNet fixes would all need to land first. Hopefully once the new deepchem-nightly package is up, we can start directing developers to build on there while they’re waiting for 2.4.0.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.