DeepChem Minutes 9/4/2020

Date: September 4th, 2020

Attendees: Bharath, Peter, Vijay, Nathan, Seyone

Summary: Vijay hasn’t been on the calls recently so we started with a quick intro for folks who haven’t met him before. Vijay was Bharath and Peter’s boss back at Stanford. DeepChem started life as a Pande group project back in the day. Vijay mentioned that he was interested to hear more about what the project had been up to recently.

With introductions done, we moved to the roundtable updates. Bharath has primarily been working this week on this PR implementing N2 weave models. Our original weave models implemented something called N-infinity, meaning that information from all atoms in a molecule was combined at each weave convolution step. This turns out to cause “destructive interference” and worse performance. The N2 implementation allows for more local convolutions, which seems to help stabilize learning. Bharath also put up a PR fixing a small bug in chirality handling for weave models and added some additional documentation to the new torch models PR.
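For readers who haven't used the weave models before, here's a minimal sketch of what training one looks like in DeepChem. The Delaney dataset and the default WeaveModel settings are purely illustrative assumptions; the PR discussed above changes internals of the weave convolution rather than this user-facing API.

```python
import deepchem as dc

# Load a small MoleculeNet dataset with the weave featurizer (Delaney is just an example)
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='Weave')
train, valid, test = datasets

# Train a weave model; default hyperparameters are used here purely for illustration
model = dc.models.WeaveModel(n_tasks=len(tasks), mode='regression')
model.fit(train, nb_epoch=10)
print(model.evaluate(test, [dc.metrics.Metric(dc.metrics.pearson_r2_score)]))
```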

This week, Peter has been working on the tutorials. He put up a PR overhauling the old tutorial 4 and another PR adding a new tutorial for TorchModel and KerasModel. Peter mentioned that he's also started work on a new open source library for high performance ANI-1 potentials. The TorchANI library offers a clean, open implementation but is unfortunately slow. There are proprietary implementations that are faster, but there isn't yet a good, clean, fast open implementation. Peter mentioned that we might want to consider using this library in the future. Bharath said it could be a really useful utility for downstream DeepChem applications.
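As a rough sketch of the pattern the new TorchModel tutorial covers, wrapping a plain PyTorch network for use with DeepChem datasets looks something like the following. The layer sizes and the Delaney/ECFP dataset are assumptions for illustration, not the tutorial's exact contents.

```python
import torch
import deepchem as dc

# A small PyTorch regression network; layer sizes are illustrative
net = torch.nn.Sequential(
    torch.nn.Linear(1024, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

# TorchModel wraps the network so it can be trained on DeepChem datasets
model = dc.models.TorchModel(net, loss=dc.models.losses.L2Loss())

# Assuming 1024-bit ECFP fingerprints from the MoleculeNet loader
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP')
train, valid, test = datasets
model.fit(train, nb_epoch=10)
```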

Nathan is continuing work on normalizing flow infrastructure. He has a tutorial in the works that walks through an initial example applying normalizing flows to QM9 molecules. He's also been looking at tools for applying normalizing flows to graph generation, in particular the MoFlow paper. Nathan got in contact with the MoFlow authors recently and is parsing through those papers to figure out how we might implement some of these techniques in DeepChem. Nathan and Bharath also chatted about incorporating Zinc15 into MoleculeNet. Zinc15 is a large set of openly available compounds (230 million) that could be a powerful resource for training normalizing flow and ChemBerta style models.
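For anyone unfamiliar with normalizing flows, the core idea is to learn an invertible transform from data to a simple base distribution and train by exact maximum likelihood via the change-of-variables formula. Below is a minimal, self-contained sketch of one affine coupling layer in plain PyTorch; it is not Nathan's DeepChem infrastructure, and the dimensions and random data stand in for real QM9 features.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """A RealNVP-style coupling layer: half the dimensions are transformed
    by a scale/shift computed from the other half, so the Jacobian is cheap."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                 # keep scales bounded for stability
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)            # log-determinant of the Jacobian
        return torch.cat([x1, z2], dim=1), log_det

# Train by maximizing the exact log-likelihood under a standard normal base
dim = 4
flow = AffineCoupling(dim)
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))
x = torch.randn(128, dim)                 # placeholder data standing in for QM9 features
z, log_det = flow(x)
log_prob = base.log_prob(z).sum(dim=1) + log_det
loss = -log_prob.mean()
```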

Seyone this week finished up the SmilesTokenizer PR. This PR adds infrastructure to tokenize SMILES strings for use in downstream applications. There was some really useful design discussion on the review for this PR suggesting that there might be a clean way to make SmilesTokenizer both a DeepChem Featurizer and a HuggingFace tokenizer. This would let us easily apply it to MoleculeNet datasets while retaining easy interoperability with HuggingFace's infrastructure. Seyone also put up a docs fix PR that repaired the docs build for tokenizers. Seyone is continuing work on ChemBerta and plans to put up future PRs later this month that continue building out the SmilesTokenizer infrastructure.
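As a quick illustration of what the new tokenizer enables, usage looks roughly like the snippet below. The vocabulary file path is a placeholder and the exact import path may differ across DeepChem versions; the tokenizer builds on HuggingFace's BertTokenizer, which is what makes the dual Featurizer/tokenizer design discussed above attractive.

```python
from deepchem.feat.smiles_tokenizer import SmilesTokenizer

# SmilesTokenizer builds on HuggingFace's BertTokenizer and needs a vocabulary file;
# "vocab.txt" is a placeholder for whatever vocabulary you have trained or downloaded.
tokenizer = SmilesTokenizer("vocab.txt")

# Encode a SMILES string (aspirin) into token IDs and back into tokens
ids = tokenizer.encode("CC(=O)Oc1ccccc1C(=O)O")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
```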

Bharath asked whether Vijay had seen any interesting trends and topics in the field recently. Vijay mentioned that he was really interested in GPT-3 and found some of the new capabilities of these models quite compelling. Vijay was curious to hear whether we thought ChemBerta style models could generalize broadly to chemical applications. Bharath mentioned that this was still an open question. Seyone's recent work hints that powerful generalization could be possible, but we'd likely need to train models on much larger datasets to see if we could get close to the GPT-3 effect. A good signal would be seeing pan-MoleculeNet improvements from pretrained models.

Bharath asked Seyone how many credits we were using for current models and how many we'd need to train on larger datasets like Zinc15 or Enamine REAL. Seyone said we were currently using about $500 worth of credits to train models. Training a model on Enamine REAL (which has 1.4 billion compounds) would likely take up to $10K in credits. Vijay mentioned it might be possible to use Stanford's GPU cluster, since DeepChem was originally founded at Stanford. Another possibility is to seek donations of cloud credits through grants or corporate sponsors to finish the research.

Bharath mentioned that there was a lot of similarly interesting work going on in the protein space. The Unirep project showed that NLP style techniques could work for protein engineering. Vijay mentioned that the follow-up paper showing low-N protein engineering with Unirep was really interesting.

Bharath mentioned he’d been working with Michael on adapting some of the AlphaFold techniques into DeepChem. However, this work had run into a critical roadblock: the hh-suite of tools for homology calculation is GPL licensed. Bharath said he thought this meant that we couldn’t use hh-suite for DeepChem featurizers. Peter said he thought this wasn’t the case and that it should be possible to release under both the GPL and MIT licenses. He recommended looking at the FAQ about GPL-compatible licenses. Bharath said this was really exciting and might open things up for us to support homology calculations and AlphaFold style models in DeepChem. Bharath said he’d follow up on the licensing question.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.