DeepChem Minutes 7/24/2020

Date: 7/24/2020
Attendees: Bharath, Peter, Seyone, Nathan, Shakti
Summary: Bharath has been making a number of smaller fixes to DeepChem this week. He fixed some issues in the WeaveModel (PR1, PR2), adding fully connected layers and allowing for control of the number of weave layers. Bharath also put up a PR to fix our featurizer logging issues and another PR to add a new duplicate balancing transformation. Bharath is also planning to complete the transition to our own AWS infrastructure over the weekend if possible and will send out IAM account information for developers once he’s done.

Peter has been working on speeding up dataset loading and creation, and put up a PR making improvements to data loading (PR). Unfortunately, most of our datasets consist of one large file and so can’t easily be split between multiple processes. It looks like most bottlenecks are directly in pandas/RDKit, but Peter’s still looking and there may still be low-hanging fruit. Peter’s PR parallelizing transformers was also merged in.

Daiki couldn’t make it to the call this week, but sent over some updates. Daiki improved type annotations (PR), improved the typing of utilities (PR), and added in the beginnings of flake8 support (PR). Daiki also put up a PR to add support for a crystal graph convolutional featurizer.

Seyone has continued work on the ChemBerta PR and now has a working model for training on a local branch. Seyone still needs to get predictions working with DeepChem datasets. Once he’s got prediction fixed, Seyone will update the PR for a next round of review. Seyone has also started looking into adding a SmilesTokenizer to DeepChem. It looks like the new rxnfp library has a good SMILES tokenizer already. Bharath put up an issue suggesting that we add rxnfp wrappers, which might be a good way to add SmilesTokenizer support into DeepChem. Seyone mentioned that if the rxnfp developers were amenable, he’d be happy to take this on.
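For readers unfamiliar with SMILES tokenization, here is a minimal regex-based sketch of the general idea. It is not the rxnfp tokenizer or any DeepChem API: the pattern and the function name `tokenize_smiles` are illustrative assumptions, and the pattern covers only a common subset of SMILES syntax.

```python
import re

# Hypothetical pattern for illustration only (not taken from rxnfp or DeepChem).
# Order matters: bracket atoms and two-letter halogens are tried before
# single-letter atoms so that "Cl" is not split into "C" and "l".
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens.

    Raises ValueError if some characters are not covered by the pattern,
    so that unsupported input is not silently dropped.
    """
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize SMILES: {smiles!r}")
    return tokens
```

Checking that the joined tokens reproduce the input is a cheap safeguard: a regex `findall` skips characters it cannot match, which would otherwise hide unsupported syntax.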

Nathan made improvements to clean up the materials featurizers (PR). Nathan is now working on a PR to add new materials science datasets to MoleculeNet (PR). This PR is also testing out a new strategy for unit testing MoleculeNet loaders that will help us improve our testing coverage. Nathan has addressed most open comments on the PR but still has a few issues with typing. Once those are resolved, the PR should be good to merge in.

Michael couldn’t make it to the call, but submitted a PR to improve MoleculeNet documentation.

Shakti couldn’t introduce himself last week, so he gave a quick introduction this week. Shakti is a recent graduate from UCLA who wants to get involved in an open source project before starting his graduate work. Shakti put up a first pull request improving our cosine distance function (PR). Shakti said he was now looking into improving some of the unit tests around our Keras layers. Over time, Shakti is willing to work on adding to the documentation, adding more unit tests, and adding more models/layers.
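As background on the cosine distance mentioned above, here is a small pure-Python sketch of the textbook definition, 1 - (u · v) / (|u| |v|). It is not DeepChem's implementation, whose signature and conventions may differ; the function name `cosine_distance` is an assumption made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length numeric sequences.

    Illustrative sketch only; DeepChem's own function may use a different
    convention (for example, operating on batches of vectors).
    """
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The cosine is undefined when either vector has zero length.
        raise ValueError("Cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)
```

Under this definition, parallel vectors have distance 0, orthogonal vectors have distance 1, and opposite vectors have distance 2.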

After we finished up the roundtable updates, we shifted to the discussion portion of the meeting. Nathan mentioned that there was a new preprint about the OpenChem library. Nathan asked if anyone had had a chance to look through the library and the preprint, and mentioned that OpenChem looked to be focused on PyTorch. Bharath said he noticed that in the preprint, the OpenChem authors pointed out DeepChem wasn’t very modular as a library and that it wasn’t easy to implement new models in DeepChem. Bharath said this had improved a bit recently, but there was still a lot of scope for improving our modularity. Peter suggested that we should add tutorials on how to implement simple models in DeepChem using Keras and using PyTorch so it’s easy for new users to create custom DeepChem models. Nathan said that if no one else had gotten to it, he’d be happy to take a pass in a couple of weeks once he finished his ongoing set of changes.

Bharath also mentioned a discussion on GitHub pointing out the new DeepQMC library. It came up in this discussion that models like FermiNet and other ab initio quantum methods are benchmarked differently than the usual ML datasets. Bharath asked if there was a good way we could support these applications better. Peter pointed out there are some good open source libraries for quantum machine learning like netket already, so we should check to see if there was something we could do that existing tools weren’t already doing.

Shakti mentioned that there was a cool new library, Graphein (paper), that supported geometric deep learning applications for proteins. Bharath took a look and mentioned that they seemed to use PyTorch Geometric and DGL as a substrate for their work. Given that these tools are increasingly mature, we should also look into using PyTorch Geometric and DGL more for applications ourselves.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.