DeepChem Minutes 6/19/2020

Date: 6/19/2020
Attendees: Bharath, Vignesh, Sean, Seyone, Peter, Nathan, Nicolas, Daiki
Summary: Bharath has been working on the hyperparameter overhaul PR. This has turned into a larger effort than expected since our old hyperparameter module didn't have a clear API or a clear way to add new hyperparameter optimization methods. In addition, the Gaussian process and grid search methods had very different APIs. Bharath is continuing work on this module and hopes to have something ready for review in a couple of days.
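To make the discussion concrete, here is a minimal sketch of what a unified hyperparameter optimization API might look like, with each search strategy implementing the same interface. The class and method names below are illustrative assumptions, not the actual API in the PR under review.

```python
# Hypothetical sketch of a unified hyperparameter optimizer interface.
# Class and method names are illustrative, not the actual DeepChem API.
import itertools


class HyperparamOpt:
    """Base class: every search strategy exposes the same entry point."""

    def __init__(self, model_builder):
        # model_builder: callable mapping a dict of hyperparameters to a model
        self.model_builder = model_builder

    def hyperparam_search(self, params_dict, train_dataset, valid_dataset, score_fn):
        """Return (best_model, best_params, all_scores)."""
        raise NotImplementedError


class GridHyperparamOpt(HyperparamOpt):
    """Exhaustive search over the Cartesian product of parameter values."""

    def hyperparam_search(self, params_dict, train_dataset, valid_dataset, score_fn):
        best_score, best_model, best_params = None, None, None
        all_scores = {}
        keys = sorted(params_dict)
        for values in itertools.product(*(params_dict[k] for k in keys)):
            params = dict(zip(keys, values))
            model = self.model_builder(params)
            model.fit(train_dataset)
            score = score_fn(model, valid_dataset)  # assumes higher is better
            all_scores[values] = score
            if best_score is None or score > best_score:
                best_score, best_model, best_params = score, model, params
        return best_model, best_params, all_scores
```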

Vignesh worked this week on finishing the first cut of the transfer learning PR. Peter reviewed the PR and suggested that the early stopping portion should perhaps be separated from the full PR. The early stopping portion is ready to merge, but the full transfer learning portion might need a bit more work. Vignesh will put up the early stopping piece as a separate PR for merging, and has started an issue to discuss the structure of a general transfer learning system. There's been healthy discussion on the issue, and at present the design is trending towards the addition of a generic load_pretrained function (similar to the Hugging Face transformers library design) that will allow users to load pretrained models based on string names for the available models.
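As a rough illustration of the design being discussed, a string-keyed loader in the spirit of Hugging Face's from_pretrained might look something like the sketch below. The registry contents, function name, and restore() call are assumptions for illustration, not a settled DeepChem API.

```python
# Hypothetical sketch of a string-keyed pretrained-model loader, loosely
# following the Hugging Face ``from_pretrained`` pattern. The registry,
# function name, and restore() call are assumptions, not DeepChem API.

_PRETRAINED_REGISTRY = {
    # "model-name": (ModelClass, "url-or-path-to-pretrained-weights"),
}


def load_pretrained(model_name, **model_kwargs):
    """Instantiate a model by name and restore its pretrained weights."""
    if model_name not in _PRETRAINED_REGISTRY:
        known = ", ".join(sorted(_PRETRAINED_REGISTRY)) or "(none registered)"
        raise ValueError(f"Unknown pretrained model {model_name!r}; known: {known}")
    model_class, weights_location = _PRETRAINED_REGISTRY[model_name]
    model = model_class(**model_kwargs)
    model.restore(checkpoint=weights_location)  # assumes models expose restore()
    return model
```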

Bharath also put up an issue suggesting that we're now open to accepting PyTorch models into DeepChem under the new import guard approach. Peter asked how we'd think about code organization in this new design: whether we'd split code from different frameworks into different submodules of dc.models or blend the code together. Bharath mentioned that we already have the sklearn_models and xgboost_models submodules, so splitting into submodules makes good sense. We could add a new pytorch_models submodule to hold the PyTorch code, along with a tensorflow_models submodule into which the existing TensorFlow code would be migrated. Vignesh mentioned that he should be able to work on the PyTorch models a bit as well.
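For reference, the import guard pattern being referred to looks roughly like the following: framework-specific submodules are only imported when the corresponding framework is installed, so a missing dependency degrades gracefully instead of breaking all of DeepChem. The module paths below are assumptions based on the proposed layout, not a finalized structure.

```python
# Rough sketch of the import guard pattern for optional framework backends.
# The submodule path below assumes the proposed dc.models.pytorch_models
# layout rather than a finalized module structure.

import logging

logger = logging.getLogger(__name__)

try:
    import torch  # noqa: F401  (only checking that the backend is installed)
    from deepchem.models.pytorch_models import *  # noqa: F401,F403
except ImportError:
    logger.warning(
        "PyTorch is not installed; PyTorch-based models will be unavailable.")
```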

Sean has been continuing his work on the Julia implementation of DeepChem. In particular, he's been looking more closely at Flux. Flux is pretty flexible and has good integration with many external Julia packages. It should be possible to implement the needed TensorFlow-style functionality in Julia via Flux and achieve good performance. The Julia implementation is now 2x faster than the previous version, but still 5x slower than the Python implementation. Sean is continuing work on optimizing the implementation.

Seyone has been working this week on extending the ChemBERTa tutorial with attention weight visualizations. Having better visualizations of the attention weights should make the tutorial more informative for our readers. Seyone said the tutorial should hopefully be done soon. Seyone is also planning to work on a ChemBERTa implementation for DeepChem. This will live in the new pytorch_models submodule of DeepChem and will enable users to use ChemBERTa in their own applications. Bharath said this was really cool, since it will provide another transfer learning system for DeepChem and since BERT-style models have been making a lot of waves in the biological community recently.
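For anyone curious what attention-weight extraction looks like, here is a minimal sketch using the Hugging Face transformers library. The checkpoint name is a placeholder, and this is just one way to pull out the per-layer attention tensors for plotting, not necessarily how the tutorial does it.

```python
# Hedged sketch: extracting attention weights from a transformer for
# visualization. The checkpoint id is a placeholder, not a confirmed model.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "path/or/name-of-a-chemberta-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, output_attentions=True)

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an example input
inputs = tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One attention tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); these can be rendered as heatmaps.
attentions = outputs.attentions
```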

Peter has been working on getting the remaining marked test cases passing again. Almost everything is working again without too much work required. The Vina pose generator was originally written to run on Linux and didn't work on Mac or Windows; extending it was a straightforward fix. The scarier issue was that the atomic conv featurizer had never been converted to Keras, which means it had been broken for an entire release and apparently no one noticed. We're also getting errors from AtomicConvModel. Only one of the slow tests checked that the model made proper predictions, and it looks like that test has also been silently failing. Bharath mentioned that he could take a pass at this once he was done with his hyperparameter PR, and that we should get it fixed since AtomicConvModel is an important model for structural applications.

Peter mentioned that, in general, we haven't been good about running slow tests. We need a system in place to make sure that slow tests are run before releases, and ideally more regularly, so that changes don't introduce breakage. Bharath said he'd put up an issue suggesting that we create a build server for DeepChem. The downside to this step is that it's expensive. Bharath also mentioned that we have a similar problem with MoleculeNet: the full suite hasn't been run in quite a while. Bharath said he's planning to do a full MoleculeNet run soon so we can start tracking benchmark performance for our models.
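One lightweight way to handle this, assuming pytest as the test runner, is the marker convention sketched below: slow tests are tagged and excluded from the default run, then executed explicitly on a build server or before a release. This is a common pattern, not necessarily the exact setup DeepChem will adopt.

```python
# Sketch of a pytest marker convention for slow tests (assumes the marker is
# registered in pytest.ini/setup.cfg to silence unknown-marker warnings).
import pytest


@pytest.mark.slow
def test_model_end_to_end():
    # Long-running fit/predict check that should not run on every CI build.
    pass


# Fast default CI run:        pytest -m "not slow"
# Nightly / pre-release run:  pytest -m slow
```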

Nathan has been working on assembling some inorganic crystal data for integration with MoleculeNet. He's mostly been focused on datasets from the Materials Project. Nathan also mentioned Matbench, a new effort out of LBNL that is working to build a MoleculeNet-style benchmark for materials. Nathan has also been working on adding support for the Crystallography Open Database. He has a PR in progress that adds a first cut at featurizers and data loaders for these projects, and he hopes to have a WIP PR up for review next week. Bharath said that this was really cool. There's been a lot of progress in deep learning for materials science, including new models like crystal graph convolutions and lattice graph convolutions, which it would be great to support in DeepChem.

Nicolas gave a brief intro about himself since he’s new to the calls. Nicolas is working on some applications of cheminformatics and is interested to learn more about DeepChem and what it can do. Bharath said it would be great to get feedback on what DeepChem is good at and bad at as Nicolas starts using it.

Daiki has been working this week on refactoring jaxchem to use Haiku (PR). Haiku is a high-level framework for JAX that adds PyTorch-like layers. The code looks much cleaner and is easier to work with. Daiki has also done some prep work to implement sparse-pattern GCNs. Next week, Daiki plans to finish the implementation of sparse-pattern GCNs and start some benchmarking work comparing the jaxchem implementation to DGL, DeepChem, and PyTorch Geometric. Daiki's daily updates for this week are available here.
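For context on what this looks like, here is a minimal Haiku example: layers are declared inside a forward function much like PyTorch modules, and hk.transform turns that function into a pure (init, apply) pair that JAX can jit and differentiate. This is a generic illustration, not code from the jaxchem PR.

```python
# Minimal Haiku illustration (generic example, not from the jaxchem PR):
# declare layers inside a forward function, then transform it into a pure
# (init, apply) pair that plays well with jax.jit and jax.grad.
import haiku as hk
import jax
import jax.numpy as jnp


def forward(x):
    mlp = hk.Sequential([
        hk.Linear(128), jax.nn.relu,
        hk.Linear(1),
    ])
    return mlp(x)


model = hk.without_apply_rng(hk.transform(forward))
rng = jax.random.PRNGKey(42)
x = jnp.ones((8, 64))            # dummy batch of 8 feature vectors
params = model.init(rng, x)      # parameters live outside the module
preds = model.apply(params, x)   # pure function of (params, inputs)
```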

Bharath raised one last discussion item before concluding the call. A lot of DeepChem's infrastructure lives on AWS. This includes the MoleculeNet datasets and will eventually include the pretrained models. It's currently split between the Pande group's AWS account and Bharath's personal AWS account. This isn't ideal since Bharath is the only one with access to these systems, so if he goes on vacation and something breaks, we'll be in bad shape. Bharath asked if there were any ideas for how we could decentralize this a bit, so that other DeepChem maintainers could access AWS and spread the maintenance load. Nathan mentioned that the Materials Project was planning to move over to AWS, so he could ask how they plan to handle AWS access. Bharath said that would be great and mentioned that one possibility would be to set up @deepchem.io email addresses, which we had at some point in the past. AWS access could then be mediated through accounts tied to @deepchem.io addresses, which would be available to all active DeepChem maintainers.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.
