DeepChem Minutes 7/17/2020

Date: 7/17/2020
Attendees: Bharath, Peter, Daiki, Seyone, Nathan, Shakti
Summary: Bharath worked this week on various improvements to DeepChem infrastructure. He put up a change to improve DiskDataset logging (PR2015) and added the new InMemoryLoader (PR2010). This new loader class facilitates featurization of large collections of data that are already in memory (say, in a pandas dataframe). Bharath also merged in the MolecularFeaturizer (PR1992), which cleans up the hierarchy of featurization methods, and continued work on overhauling metrics (PR1928). The metrics overhaul turned out to be quite complex since scikit-learn’s metrics all have slightly differing APIs and DeepChem needs to support multi-task, multi-class metric computation.
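
For readers following along, here’s a minimal sketch of how the new loader might be used, assuming the InMemoryLoader API as of PR2010 (the exact arguments may shift as the class matures):

```python
import deepchem as dc

# Featurizer to apply to each in-memory datapoint.
featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)

# Featurize a list of SMILES strings already held in memory
# (say, pulled out of a pandas dataframe column).
smiles = ["CCO", "CCC", "c1ccccc1"]
dataset = loader.create_dataset(smiles, shard_size=2)
print(dataset.X.shape)  # (3, 1024)
```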

Peter this week merged in type annotation support (PR1979). This adds mypy type hint checking to DeepChem. The type hints have already helped clean up large portions of the codebase and have identified some API mismatches between various super/subclasses in DeepChem. Peter also put up changes to parallelize Transformers (PR2024). As Peter started trying to parallelize the module, it turned out to be a fairly involved change: Python threads can’t run in parallel because of the global interpreter lock, so we have to use multiprocessing instead, which requires data to be passed between processes via pickling. One issue with the current change is that it’s not clear it’s really improving performance, since the multiprocessing overhead may outweigh the gains. Bharath mentioned that he had a few large datasets he could test on. Peter asked if Bharath could put together some larger-scale dataset examples to test performance on. Bharath said he was working on cleanups to PDBBind which might serve well for these larger benchmarks.
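
To illustrate the tradeoff (this is a hypothetical sketch of the pattern, not the code from PR2024): multiprocessing sidesteps the GIL, but every shard of data must be pickled out to a worker process and the result pickled back, so for small shards the parallel version can actually be slower than the serial one.

```python
import numpy as np
from multiprocessing import Pool

def transform_shard(X):
    # Stand-in for a real DeepChem Transformer, e.g. a normalization.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)

if __name__ == "__main__":
    shards = [np.random.rand(10000, 128) for _ in range(8)]
    # Each shard is pickled to a worker and the result pickled back;
    # this overhead is what can eat the parallel speedup.
    with Pool(processes=4) as pool:
        transformed = pool.map(transform_shard, shards)
```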

Daiki worked this week on trying to speed up convolutions in JaxChem by using segment_sum (JaxChem PR15). Unfortunately this seems to run into the same issue as before (JaxChem Issue 8), since segment_sum uses a slow addition method under the hood. Daiki also worked this week on refactoring the featurizer class (PR2017) and on implementing a common MolecularGraphData structure (in review, PR2012), in addition to refactoring DeepChem’s CI script (PR2018).
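
For context, the segment_sum pattern aggregates neighbor features per node in a single call instead of a Python loop. The sketch below is a rough illustration with made-up shapes, not the JaxChem code itself:

```python
import jax
import jax.numpy as jnp

node_feats = jnp.ones((5, 8))        # 5 nodes, 8 features each
src = jnp.array([0, 1, 2, 3, 4, 0])  # edge source indices
dst = jnp.array([1, 2, 3, 4, 0, 2])  # edge destination indices

# Gather source-node features along each edge, then sum the messages
# into their destination nodes with one segment_sum call. Under the
# hood this lowers to a scatter-add, which is the slow addition
# method referenced above.
messages = node_feats[src]
aggregated = jax.ops.segment_sum(messages, dst,
                                 num_segments=node_feats.shape[0])
```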

Nathan has been working on a number of improvements to our crystal structure handling support. As of this week, all major infrastructure for inorganic crystal structure datasets is now merged into DeepChem. Changes from this week include merging in the JsonLoader class (PR1998) and adding the MaterialStructureFeaturizer and MaterialCompositionFeaturizer abstract superclasses (PR2008, PR2020). Nathan is now taking 5 inorganic crystal structure datasets and writing new MoleculeNet loading functions for them, and is also working on a first cut at a minimal normalizing flow API.
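
As a rough sketch of how these pieces fit together (the field names and featurizer choice here are illustrative, not taken from the PRs):

```python
import deepchem as dc

# A structure featurizer for inorganic crystals; SineCoulombMatrix is
# one example, and max_atoms here is an illustrative setting.
featurizer = dc.feat.SineCoulombMatrix(max_atoms=100)

# Load a JSON file whose "structure" field holds crystal structures
# and whose "formation_energy" field holds the labels (hypothetical
# field names for this example).
loader = dc.data.JsonLoader(
    tasks=["formation_energy"],
    feature_field="structure",
    label_field="formation_energy",
    featurizer=featurizer,
)
dataset = loader.create_dataset("structures.json")
```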

Bharath mentioned that with the new MoleculeNet template structure, it might be feasible to add unit tests for MoleculeNet using dummy featurizers. DeepChem is currently at something like 75% unit test coverage, and a large portion of the gap is due to MoleculeNet, which currently has no tests at all. Nathan mentioned it might be a little tricky to get right but should be feasible to figure out.
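
A sketch of the idea (the DummyFeaturizer class and the featurizer argument here are hypothetical stand-ins, not existing DeepChem APIs): a trivial featurizer lets a MoleculeNet loader be exercised end-to-end in CI without paying for real featurization.

```python
import numpy as np
import deepchem as dc

class DummyFeaturizer(dc.feat.MolecularFeaturizer):
    """Returns a fixed-size vector without computing real descriptors."""
    def _featurize(self, mol):
        return np.zeros(4)

def test_load_delaney_with_dummy_featurizer():
    tasks, datasets, transformers = dc.molnet.load_delaney(
        featurizer=DummyFeaturizer())
    train, valid, test = datasets
    # The loader ran end-to-end and produced featurized data.
    assert train.X.shape[1] == 4
```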

Seyone has been continuing work on PR1970, integrating ChemBerta with DeepChem. He’s made forward progress but is running into CUDA memory issues which he’s currently debugging. Seyone is also working on integrating proper SMILES-based tokenization with HuggingFace, which might yield some performance improvements. Once this is done, finishing the PR should be relatively straightforward.
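
For readers unfamiliar with the distinction, here’s one common approach to SMILES tokenization (the regex from Schwaller et al.’s Molecular Transformer work), shown only to illustrate what chemically-aware tokenization means versus generic subword tokenization; it is not necessarily the scheme used in PR1970:

```python
import re

# Splits a SMILES string into chemically meaningful tokens: bracketed
# atoms, two-letter elements like Br/Cl, bonds, ring closures, etc.
SMILES_PATTERN = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|"
    r"\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)
tokenizer = re.compile(SMILES_PATTERN)

print(tokenizer.findall("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', ...]
```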

We also had a new attendee on the call, Shakti. Unfortunately Shakti had some mic issues and couldn’t introduce himself this week.

With the roundtable done, we moved on to general discussion topics. Seyone asked if there was any ongoing work on adding FEP (free energy perturbation) integration. Nathan mentioned that his work on a normalizing flow API was a first step toward FEP support, but at present, deep-learning-based FEP is still very preliminary, cutting-edge research. Having a clean normalizing flow implementation of the DeepMind FEP algorithm would help researchers make more progress, since there doesn’t appear to be a good open source implementation.
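
To give a sense of what a minimal normalizing flow API involves (all names here are hypothetical, not Nathan’s design): a flow needs a forward transform, its inverse, and the log-determinant of the Jacobian, which is what the change-of-variables formula requires for the density computations used in free-energy estimation.

```python
import numpy as np

class AffineFlow:
    """y = exp(log_scale) * x + shift: the simplest invertible transform."""

    def __init__(self, log_scale, shift):
        self.log_scale = np.asarray(log_scale)
        self.shift = np.asarray(shift)

    def forward(self, x):
        y = np.exp(self.log_scale) * x + self.shift
        # log|det J| of an elementwise affine map is the sum of log scales.
        return y, np.sum(self.log_scale)

    def inverse(self, y):
        x = (y - self.shift) * np.exp(-self.log_scale)
        return x, -np.sum(self.log_scale)
```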

Bharath mentioned that some other interesting DeepMind papers on physics-based deep learning had come out, and that improving our support for these techniques would be very valuable. Bharath asked whether Yank had development plans to add a Python API, since it currently uses YAML to organize simulations. Unfortunately no one on the call knew.

Seyone mentioned that a Keras bug had been discovered that caused issues with custom layers and asked if we were affected. Bharath said he wasn’t sure, but he’d raised an issue to track it.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.
