DeepChem Minutes 10/2/2020

Date: October 2nd, 2020
Attendees: Bharath, Peter, Sanjiv, Seyone, Michael, Hariharan, Tyler
Summary: We had no new attendees on the call this week so we dove into our usual roundtable updates.

This week, Bharath has been working on getting DeepChem to run on SageMaker. We have a tutorial describing how to launch DeepChem on SageMaker notebooks, but not how to use DeepChem with the SageMaker Python SDK. Bharath hopes to put together documentation for this over the next couple of weeks. This should help us featurize larger datasets and may be useful for creating AlphaFold infrastructure (see this issue which discusses our roadmap to improve our protein deep learning infrastructure).

This week, Peter has continued working on the tutorials PR. Peter has just finished overhauling most of the beginning tutorials. His latest changes round out the series of core tutorials, which should give readers the grounding needed to work through the remaining advanced tutorials. Some cleanup and modifications are still needed here. In particular, Bharath needs to finish writing tutorial 3. Bharath said he would try to get that written by next week.

In the process of doing this cleanup, Peter discovered that some of the splitters had issues. Peter put up a PR to refactor RandomStratifiedSplitter and has a new PR which fixes some of the issues with ButinaSplitter. Unfortunately, ButinaSplitter doesn’t work for large datasets and seems to crash Peter’s computer! Bharath mentioned that ButinaSplitter uses an O(n^2) algorithm and that RDKit might have some memory leaks. Peter mentioned that the algorithm may also use O(n^2) memory, in which case the machine could be running out of memory, but noted that this should only kill the process and not the entire computer. Peter said it would be good to find a cleaner clustering algorithm that scales to larger datasets.
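To give a rough sense of why O(n^2) memory becomes a problem, here is a minimal back-of-the-envelope sketch. It assumes the clustering step stores every pairwise distance once (a condensed matrix) at 8 bytes per value; both are assumptions for illustration and not a description of what ButinaSplitter or RDKit actually allocate. The function names are hypothetical.

```python
def condensed_pair_count(n_molecules):
    """Number of unique unordered pairs among n_molecules items."""
    return n_molecules * (n_molecules - 1) // 2


def distance_matrix_bytes(n_molecules, bytes_per_distance=8):
    """Bytes needed to hold one distance per unique pair.

    bytes_per_distance=8 assumes a 64-bit float per entry (an assumption).
    """
    return condensed_pair_count(n_molecules) * bytes_per_distance


# Print the estimate for a few dataset sizes.
for n in (1_000, 10_000, 100_000):
    pairs = condensed_pair_count(n)
    gib = distance_matrix_bytes(n) / 2**30
    print(f"{n:>7} molecules -> {pairs:>13,} pairs, ~{gib:.2f} GiB")
```

Under these assumptions the storage grows a hundredfold for every tenfold increase in dataset size, which is consistent with the splitter working on small datasets and failing on large ones.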

Sanjiv has been continuing to work through more tutorials. Bharath said it would be useful to check out Peter’s new tutorials which are much cleaner and easier to read than the older tutorials.

Seyone has been busy with ChemBERTa deadlines. The ChemBERTa work was just presented as a poster at the recent Chemical sciences workshop. Seyone is also working on a ChemBERTa paper submission for the NeurIPS Machine Learning for Molecules workshop. Once this submission is done, Seyone is planning to patch the existing ChemBERTa tutorial to include some upstream improvements from the ChemBERTa paper. Seyone is also thinking of making improvements to integrate tokenization more tightly into the DeepChem infrastructure.

Michael has been working on migrating examples into the .rst documentation files. He has a first PR up migrating SAMPL into the docs. As part of this PR, Michael discovered that we aren’t currently running doctest on our .rst documentation (see comment)! Michael has also been looking into how to parallelize builds on Travis CI so we can potentially have a non-blocking build. Bharath mentioned this would be very helpful and would pair well with Peter’s recently improved tutorials.
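For readers unfamiliar with running doctest on .rst files, here is a minimal sketch using only the Python standard library's doctest.testfile. The sample .rst text and the helper name run_rst_doctests are hypothetical, and this is not necessarily how the DeepChem docs build would wire it up.

```python
import doctest
import os
import tempfile

# Hypothetical .rst content for illustration; not taken from the DeepChem docs.
RST_SAMPLE = """\
Example page
============

A doctest embedded in reStructuredText:

>>> sorted([3, 1, 2])
[1, 2, 3]
>>> 2 + 2
4
"""


def run_rst_doctests(rst_text):
    """Write rst_text to a temporary .rst file and run the doctests in it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "example.rst")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(rst_text)
        # module_relative=False tells doctest to treat `path` as an OS path.
        return doctest.testfile(path, module_relative=False, encoding="utf-8")


results = run_rst_doctests(RST_SAMPLE)
print(f"attempted={results.attempted} failed={results.failed}")
```

A CI job could fail the build whenever the reported failure count is nonzero, so that examples in the docs stay in sync with the code.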

Hariharan has been continuing his research on orbital graph convolutions. He spoke to his advisor about potentially contributing an implementation to DeepChem and asked whether Bharath could chat about the work offline; Bharath said they could find time to do so.

Tyler has been continuing to work through the DeepChem tutorials. Bharath mentioned that Peter’s newly improved tutorials would probably be a great place to start.

With roundtable updates complete, we moved into general discussion. Bharath mentioned that, now that we’re doing more DeepChem research projects, it might be useful to start thinking about applying for more DeepChem grants to support ongoing and future research work. Bharath said he wanted to put it on everyone’s radar as something to think about. Bharath also mentioned that he’d send out a poll about possible timing for the America/Europe/Africa and Asia/Pacific DeepChem calls in the next week.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.