DeepChem Minutes 8/21/2020

bharath · August 26, 2020, 10:04pm

Date: 8/21/2020
Attendees: Bharath, Peter, Sanjiv, Nathan, Seyone, Shubhendu, Neel
Summary: Since Shubhendu and Sanjiv were new attendees on the call, we started with a brief round of introductions. Sanjiv is a rising high school senior who’s worked on science fair projects in computational chemistry. Shubhendu is a postdoc in the Barzilay lab at MIT who works on deep learning for molecules.

With introductions done, we moved into the usual round table updates. This week, Bharath has been mainly wrestling with the problem of shuffling large datasets. Our original implementation of DiskDataset.complete_shuffle() ran entirely in memory, which meant that we couldn’t shuffle datasets too large to fit in memory. (Shuffling a large dataset often helps improve training of models and smooths numerical issues.) Bharath put up a first PR which implemented a first pass at an on-disk shuffle. This first version used too much disk space while testing, and was unnecessarily slow. It also turned up a bug in DiskDataset.select(): it returned its selections in sorted order rather than in the specified order. Bharath implemented a second PR which fixed both of these issues. In tests, the new shuffle does quite well. Bharath was able to shuffle a 50GB dataset within a few minutes.
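For readers curious what shuffling without loading everything into memory can look like, here is a minimal sketch of a generic two-pass disk shuffle. It is not the code from Bharath’s PRs and does not use the DeepChem API; the function name two_pass_shuffle, the .npy shard layout, and the n_buckets parameter are all assumptions made for illustration. The first pass scatters the rows of each input shard into random buckets on disk, and the second pass shuffles each bucket in memory, so only one shard or one bucket is held in memory at a time.

```python
import os

import numpy as np


def two_pass_shuffle(shard_paths, out_dir, n_buckets, seed=0):
    """Shuffle rows stored across .npy shards, one shard/bucket at a time.

    Hypothetical helper for illustration only; not part of DeepChem.
    Returns the paths of the shuffled output shards.
    """
    rng = np.random.default_rng(seed)
    pieces = [[] for _ in range(n_buckets)]

    # Pass 1: send every row of every shard to a randomly chosen bucket.
    for i, path in enumerate(shard_paths):
        rows = np.load(path)
        assignment = rng.integers(0, n_buckets, size=len(rows))
        for b in range(n_buckets):
            chunk = rows[assignment == b]
            if len(chunk) == 0:
                continue
            piece_path = os.path.join(out_dir, f"piece_{b}_{i}.npy")
            np.save(piece_path, chunk)
            pieces[b].append(piece_path)

    # Pass 2: load one bucket at a time, shuffle it, write it back out.
    out_paths = []
    for b in range(n_buckets):
        if not pieces[b]:
            continue
        bucket = np.concatenate([np.load(p) for p in pieces[b]])
        rng.shuffle(bucket)  # in-place shuffle along the first axis
        out_path = os.path.join(out_dir, f"shuffled_{b}.npy")
        np.save(out_path, bucket)
        out_paths.append(out_path)
        # Remove intermediate pieces to keep extra disk usage bounded.
        for p in pieces[b]:
            os.remove(p)
    return out_paths
```

Reading the output shards in order gives the shuffled dataset. A real implementation would also need to keep features, labels, weights, and ids aligned per row, which this sketch leaves out.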

Bharath mentioned that Daiki said he couldn’t make this week’s call and gave a brief update about Daiki’s work. Daiki merged in his PR adding crystal graph convolutional networks. This is the first major DeepChem model that uses DGL as its substrate. Daiki also merged in a series of PRs adding various small fixes and documentation/build improvements (PR1, PR2, PR3, PR4, PR5, PR6).

Peter has put up a few new changes this week. One PR improved the creation of PyTorch datasets and added support for returning batches. The other change was inspired by a post on the forums discussing different GAN variants. While Peter was experimenting with the code in the post, he found some unrelated bugs in GANs and put up a PR to fix them.

Nathan this week merged in his PR for additional large inorganic datasets. With this PR merged in, there’s now a pretty good set of crystal datasets in MoleculeNet. Nathan is also working on a DeepChem tutorial that trains a simple normalizing flow, and hopes to have the normalizing flow PR ready for a second round of review soon.

Seyone has been focusing on the SELFIES 1.0 library release this last week. The SELFIES team is trying to figure out whether they could integrate SELFIES into RDKit. Depending on how this scopes out, we can then figure out next steps for adding SELFIES integration into DeepChem. On the tokenizer side, Seyone has been working on integrating the SmilesTokenizer into the DeepChem library. He’s got a private branch with SmilesTokenizer implemented and is working on getting some unit tests in place before putting up a PR. He also trained a ChemBerta model on a 1M compound subset of PubChem this week as part of ongoing ChemBerta research work.

Neel has been continuing work on familiarizing himself with DeepChem. He put up a PR making a fix to tutorial 1, and is now working on fixing up tutorial 3.

Bharath asked Sanjiv and Shubhendu whether they thought there were any interesting overlaps between their past work and DeepChem. Sanjiv mentioned that he’d worked on running large screening datasets through computational models and might be able to work on similar performance improvements for DeepChem. Shubhendu mentioned that he was working on some projects involving pretraining which might have some overlap.

Shubhendu asked if there were any good starter projects that Sanjiv could work on (since he’s mentoring Sanjiv at present). Bharath mentioned that one good starter project might be to write a wrapper for Chemprop in DeepChem so we can expose Chemprop to our users.

Bharath also mentioned as a general topic of discussion that many of our tutorials had fallen out of date and that we needed to do a cleanup pass to get the tutorials fixed. He asked if anyone had bandwidth to help. Neel mentioned that he was working on tutorial 3, and Peter mentioned that he might be able to help as well. Bharath said he’d try to fix some of the tutorials as well.

Bharath also asked for a volunteer to take minutes for next week since he’d be out for personal reasons. Seyone mentioned that he’d be able to take minutes for next week’s call.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.