DeepChem Minutes 8/7/2020

Date: 8/7/2020
Attendees: Bharath, Peter, Daiki, Nathan, Seyone, Neel
Summary: We had a new attendee on the call, Neel, so we started off with a quick introduction. Neel is a software engineer at Google based in Switzerland. His background is in software engineering, and he is excited to learn more about DeepChem.

Bharath this week spent more time using DeepChem rather than developing it. He did make one fix that addressed the memory usage of SDFLoader (PR). Previously, SDFLoader was loading everything into memory which meant that it couldn’t work with large files, but with the fix in place, the memory usage is constant even for arbitrarily large SDF files.
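As a rough illustration of why streaming keeps memory bounded, here is a minimal sketch of reading an SDF file one record at a time and grouping the records into fixed-size shards. This is not the SDFLoader implementation from the PR; the function names `iter_sdf_records` and `iter_shards` are hypothetical, and the sketch relies only on the SDF convention that each record ends with a `$$$$` line.

```python
import io
from typing import Iterable, Iterator, List, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one SDF record at a time; each record ends with a '$$$$' line."""
    lines: List[str] = []
    for line in handle:
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    # Return a trailing record that lacks a terminator, if it is non-empty.
    if any(l.strip() for l in lines):
        yield "".join(lines)


def iter_shards(records: Iterable[str], shard_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most shard_size items (hypothetical helper)."""
    shard: List[str] = []
    for record in records:
        shard.append(record)
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:
        yield shard


if __name__ == "__main__":
    # Stand-in for a real file; actual records would hold full mol blocks.
    fake_sdf = io.StringIO("mol_a\n$$$$\nmol_b\n$$$$\nmol_c\n$$$$\n")
    for shard in iter_shards(iter_sdf_records(fake_sdf), shard_size=2):
        print(len(shard), "records in shard")
```

Because both functions are generators, only one shard of records is held in memory at a time, so peak memory depends on the shard size rather than on the size of the file.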

Peter put up an exciting new PR that adds support for PyTorch models. This PR introduces a new TorchModel class that makes it easy to wrap torch.nn.Module objects as DeepChem models. Daiki reviewed the PR and pointed out a few fixes for GPU support that need to be added before we can merge it in.

Daiki this week finished up his PR that added in the new GraphData class and a featurizer for crystal graph convolutions. Daiki also added a PR that updates the installation script for the DeepChem tutorials so that they can be run with DeepChem’s nightly build. Daiki noted that some of the tutorials are broken for DeepChem nightly and will need to be fixed before we can release DeepChem 2.4.0.

Nathan merged in his PR that added two inorganic crystal structure loaders to MoleculeNet and put up a new PR that adds two additional datasets from the Materials Project. Nathan also uploaded all 4 Materials Project datasets to the AWS bucket using the new IAM roles and is now continuing to work on the normalizing flow API PR. Nathan is still working to figure out a clean API for handling losses in normalizing flows, since they don’t fit cleanly into the existing DeepChem losses. Bharath asked if there was a way we could help brainstorm a clean design for the loss functions, and Peter suggested that it might be useful to add a brief write-up to the GitHub pull request. Nathan said he should be able to put a short write-up together to help clarify the issue.
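For readers unfamiliar with the loss in question: normalizing flows are usually trained by maximizing the log-likelihood of the data, which depends on the transformed samples and the log-determinant of the transformation's Jacobian rather than on a comparison between model outputs and labels. The minutes do not say that this is the specific difficulty Nathan ran into, so treat the connection as an assumption. The sketch below shows the shape of such a loss for a toy one-dimensional affine flow; it is not the API proposed in the PR, and the function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def affine_flow_nll(xs: Sequence[float], shift: float, log_scale: float) -> float:
    """Mean negative log-likelihood of xs under a toy 1-D affine flow.

    The flow maps data x to latent z = (x - shift) * exp(-log_scale), and z is
    assumed to follow a standard normal base density.
    """
    total_log_prob = 0.0
    for x in xs:
        z = (x - shift) * math.exp(-log_scale)
        # Log-density of z under the standard normal base distribution.
        log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
        # Change-of-variables term: log |dz/dx| for the affine map.
        log_det = -log_scale
        total_log_prob += log_base + log_det
    return -total_log_prob / len(xs)


if __name__ == "__main__":
    data = [4.9, 5.0, 5.1]
    # A shift near the data mean gives a lower loss than a shift far from it.
    print(affine_flow_nll(data, shift=5.0, log_scale=0.0))
    print(affine_flow_nll(data, shift=0.0, log_scale=0.0))
```

Note that the function takes only the data and the flow parameters; there is no label argument, which is one way a loss like this can differ from a typical output-versus-label loss signature.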

Seyone has been focused on the pretraining side of ChemBerta. In particular, he’s been working a lot with the PubChem dataset of SMILES strings and has finished pretraining jobs on a subset of 1 million SMILES strings. He’s also extended the pretraining of the old ChemBerta ZINC-250K model from 5 epochs (which is what the tutorial does at present) to 10 or more epochs. He was also able to get the reaction tokenizer working on a local branch. Seyone mentioned that he’d made progress on the ChemBerta implementation PR, but now that Peter’s new TorchModel infrastructure is almost ready, it might make sense to rework ChemBerta to fit into the new standard framework.

Bharath mentioned that he hoped to get to work on some of the critical blockers for the DeepChem 2.4.0 release next week and that he’d try to fix the DeepChem website certificate issues as soon as possible.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.