Attendees: Bharath, Peter, Seyone, Nathan, Daiki, David
Summary: We had a new attendee on the call, David, so we started with a quick intro.
David is a recent graduate from Harvard, where he majored in CS and biology. He did research applying machine learning to protein design in the Debora Marks lab and is interested in learning more about applications of ML to chemistry.
We then moved into the usual round table updates. Bharath worked on a number of improvements and bugfixes this week. He merged an improvement to the loss reporting for KerasModel that allows returning the full loss curve (PR), after a careful review by Peter that tightened up the API. Bharath also made a bugfix to DiskDataset.shuffle_each_shard (PR); a bad unit test hadn't been checking the method's correctness properly, so we hadn't caught a bug in the shuffling logic. Bharath put up another PR that adds shape metadata to DiskDataset. This improvement will make DiskDataset easier to work with, since we'll need to read metadata from disk less often to do useful things, and it will speed up printing disk-backed objects in the console. Bharath also put up a PR migrating MoleculeNet datasets to the new DeepChem S3 bucket. We're now in the middle of migrating DeepChem data to its own infrastructure.
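The shuffling bug class is easy to illustrate: a per-shard shuffle must apply one and the same permutation to every array in the shard (features, labels, weights, ids), or rows fall out of alignment. A minimal NumPy sketch of the property a unit test should check (hypothetical helper, not DeepChem's actual implementation):

```python
import numpy as np

def shuffle_shard(X, y, ids, rng):
    """Shuffle one shard, applying a single permutation to all arrays."""
    perm = rng.permutation(len(X))
    return X[perm], y[perm], ids[perm]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)          # row i is [2i, 2i+1]
y = np.array([0, 1, 2, 3, 4])            # label i indexes row i
ids = np.array(["a", "b", "c", "d", "e"])

Xs, ys, ids_s = shuffle_shard(X, y, ids, rng)

# Correctness check: each shuffled row must still pair with its original label.
for row, label in zip(Xs, ys):
    assert (row == X[label]).all()
```

A test that only checks "the output differs from the input" would pass even if X and y were permuted independently; checking row/label alignment catches exactly the kind of bug described above.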
Daiki this week updated the new GraphData data structure to be more useful and interoperable with PyTorch Geometric and DGL (PR). Our goal is to convert our existing graph featurizers over to using GraphData as the standard data class for DeepChem molecular graphs. This will allow for easy interoperability, since GraphData has native conversion methods to the PyTorch Geometric and DGL graph classes. Daiki has also implemented a featurizer for crystal graph convolutions and plans next to implement a crystal graph convolutional model in either PyTorch Geometric or DGL.
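For readers unfamiliar with the shape conventions, a graph data class along these lines typically stores per-node features plus an edge list in COO format; PyTorch Geometric uses the same (2, num_edges) edge-index layout, which is what makes conversion straightforward. A hedged sketch (field names chosen to mirror DeepChem's GraphData, but check the real class for the actual API):

```python
import numpy as np

class GraphData:
    """Minimal sketch of a molecular-graph container (illustrative only)."""

    def __init__(self, node_features, edge_index, edge_features=None):
        node_features = np.asarray(node_features)
        edge_index = np.asarray(edge_index)
        if edge_index.shape[0] != 2:
            raise ValueError("edge_index must have shape (2, num_edges)")
        if edge_index.size and edge_index.max() >= node_features.shape[0]:
            raise ValueError("edge_index refers to a nonexistent node")
        self.node_features = node_features   # (num_nodes, num_node_feats)
        self.edge_index = edge_index         # (2, num_edges) COO pairs
        self.edge_features = edge_features   # optional (num_edges, num_edge_feats)

    @property
    def num_nodes(self):
        return self.node_features.shape[0]

    @property
    def num_edges(self):
        return self.edge_index.shape[1]

# A triangle graph (3 atoms, 3 bonds) with 4-dim node features:
g = GraphData(
    node_features=np.ones((3, 4)),
    edge_index=np.array([[0, 1, 2], [1, 2, 0]]),
)
```

A conversion method then only needs to wrap these arrays in the target framework's tensors, which is why a single shared data class can feed both PyTorch Geometric and DGL.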
Peter is working on creating a new TorchModel class (issue). Peter's goal in designing TorchModel is to stay close to the existing KerasModel API, so users can carry over their understanding of the DeepChem API to models built in PyTorch. This should facilitate some of the work that Daiki and Seyone have been doing with PyTorch models.
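To make the design goal concrete: the KerasModel-style API boils down to a wrapper that owns a model plus a loss and exposes fit/predict over data. A toy NumPy stand-in of that surface (hypothetical names and signatures; the real TorchModel wraps a torch.nn.Module and DeepChem datasets):

```python
import numpy as np

class ToyModelAPI:
    """Illustrates the fit/predict surface only, not the real implementation."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = learning_rate

    def fit(self, X, y, nb_epoch=100):
        """Gradient descent on mean squared error; returns the final loss."""
        for _ in range(nb_epoch):
            pred = X @ self.w + self.b
            err = pred - y
            self.w -= self.lr * (X.T @ err) / len(y)
            self.b -= self.lr * err.mean()
        return float((err ** 2).mean())

    def predict(self, X):
        return X @ self.w + self.b

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # y = 2x + 1
model = ToyModelAPI(n_features=1)
loss = model.fit(X, y, nb_epoch=500)
```

The point of keeping this surface identical across backends is that user code calling fit and predict doesn't need to know whether Keras or PyTorch is underneath.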
Seyone this week was primarily working with the Aspuru-Guzik lab on their release of the 1.0 version of SELFIES. Now that this has been released, Seyone will have more bandwidth to continue the ChemBERTa implementation (PR). Seyone is working on scaling up pretraining to 1 million SMILES strings and has been finishing up the fit and predict methods for the PR on a local branch. Bharath mentioned that it might be useful to look at Peter's upcoming TorchModel implementation and use it as a framework for ChemBERTa, since core HuggingFace models like the one for RoBERTa use torch.nn.Module as their base class. This should allow HuggingFace models to fit cleanly into the TorchModel framework. Seyone also mentioned that he's been looking more closely at rxnfp (issue). This package has some cool support for reaction fingerprints and SMILES tokenizers, which might be useful for ChemBERTa and for reaction modeling in general.
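Atom-level SMILES tokenization is a good example of what such tokenizers do: a single regex splits a SMILES string into chemically meaningful tokens (bracket atoms, two-letter elements like Br and Cl, ring-closure digits, bond symbols). The sketch below uses the regex popularized by Schwaller et al.'s Molecular Transformer line of work, which is in the same family as the tokenizers rxnfp ships; it is an illustration, not rxnfp's actual code:

```python
import re

# Matches bracket atoms, two-character elements (Br, Cl), aromatic atoms,
# bonds, branches, and ring-closure digits, in priority order.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:"
    r"|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into atom/bond-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization lost characters"
    return tokens

tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

The round-trip assertion is a useful sanity check for any tokenizer feeding a language model: if joining the tokens doesn't reproduce the input, the vocabulary is silently dropping characters.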
Nathan has continued iterating on his PR adding MoleculeNet loaders for crystal datasets. While working with these loaders, he found some issues with the MoleculeNet contribution template and the unit testing we're starting to build up, and he may have some additional changes inbound in a new PR. Nathan also put up a first cut of a PR adding an API for normalizing flows. Peter had some good comments on the design, and Nathan is planning a second pass to iterate on making the API more closely match that for
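For context on what such an API needs to expose: a normalizing flow pairs an invertible transform with a base distribution, and the core methods are a forward transform, its inverse, and a log_prob that applies the change-of-variables correction. A one-dimensional affine-flow sketch in NumPy (hypothetical names; the actual PR's API may differ):

```python
import numpy as np

class AffineFlow1D:
    """x = scale * z + shift, with z drawn from a standard normal base."""

    def __init__(self, scale, shift):
        assert scale != 0, "transform must be invertible"
        self.scale = scale
        self.shift = shift

    def forward(self, z):
        return self.scale * z + self.shift

    def inverse(self, x):
        return (x - self.shift) / self.scale

    def log_prob(self, x):
        # Change of variables: log p_x(x) = log p_z(f^{-1}(x)) - log|det df/dz|
        z = self.inverse(x)
        base_logp = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
        return base_logp - np.log(abs(self.scale))

flow = AffineFlow1D(scale=2.0, shift=1.0)
lp = flow.log_prob(1.0)  # x = 1 maps to z = 0, the mode of the base density
```

Richer flows stack many such invertible layers, but the fit/sample/log_prob surface stays the same, which is why getting this API right up front matters.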
With the roundtable complete, we swapped over to discussing general topics. Bharath mentioned that there had been an exciting new paper from the Chodera lab demonstrating the use of machine-learned force fields to improve the accuracy of alchemical free energy calculations. Peter mentioned that he'd recently heard a talk about this paper and that it featured an interesting way to swap between force field energies and ANI-2 energies that improved the empirical accuracy of the alchemical method while keeping the speed of the classical force field. Bharath mentioned it might be interesting to get more support for these and similar techniques into DeepChem.
Bharath mentioned that he’d updated the list of critical blockers remaining before the DeepChem 2.4.0 release. There were still a few core components that needed to be updated before we could make the cut.
As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.