Date: September 11th, 2020
Attendees: Bharath, Peter, Jatin, Sanjiv, Seyone, Hisham, Nathan
Summary: We had a couple of new attendees on the call, so we started with a round of introductions. Hisham is a second-year PhD student at Imperial College London who has been working on applying machine learning to chemistry. Jatin is a high school student who has been learning ML over the last year.
With introductions out of the way, we moved into our usual roundtable updates.
This past week, Bharath has been working on a PR that tests saving/reloading for DeepChem models. Unfortunately, it turns out that a number of DeepChem models don't save/reload correctly, including WeaveModel, GraphConvModel, MPNNModel, and more. This is a puzzling bug since the weights appear to load correctly, but the predictions somehow have errors. Bharath mentioned that there are two core saving methods in TensorFlow, one that uses checkpoints and another that uses Keras SavedModels. DeepChem's saving flow uses checkpoints. Bharath did some investigation of what it would take to swap to Keras SavedModels, but it appears that this would cause some new errors, especially for GraphConvModel, which uses the Keras subclassing style. Bharath asked if Peter had any recommendations for debugging the saving/reloading flow. Peter suggested looking at intermediate points in the graph computation and trying to pin down where the predictions go wrong. Peter said he suspected that some variables might be getting created which aren't getting registered with the Keras model. Bharath said he'd give this a try and see if he could pin down what was going on.
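For readers following along, the kind of round-trip check under discussion looks roughly like the sketch below. This is illustrative only, using Delaney as a small stand-in dataset, and is not the actual PR's test code:

```python
import numpy as np
import deepchem as dc

# Train briefly, checkpoint, reload into a fresh model, and compare
# predictions. For the buggy models, the final check fails even though
# the weights appear to restore without error.
tasks, (train, valid, test), _ = dc.molnet.load_delaney(
    featurizer='GraphConv')

model = dc.models.GraphConvModel(
    len(tasks), mode='regression', model_dir='/tmp/gc_ckpt')
model.fit(train, nb_epoch=1)
before = model.predict(valid)
model.save_checkpoint()

# Fresh model instance pointed at the same checkpoint directory.
reloaded = dc.models.GraphConvModel(
    len(tasks), mode='regression', model_dir='/tmp/gc_ckpt')
reloaded.restore()
after = reloaded.predict(valid)
print('predictions match:', np.allclose(before, after, atol=1e-4))
```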
Peter was busy with other work this week and didn’t have time to work much on DeepChem.
Seyone has been working on transitioning more of the ChemBERTa project into DeepChem's infrastructure. He's been working on a custom SMILES dataset class that tokenizes chunks of data on the fly to reduce memory usage, and suggested that this might be a nice addition to DeepChem's infrastructure. Peter asked if it would be possible to reuse DeepChem's existing infrastructure to achieve the same effect. Bharath said one way to do this might be to follow up on Daiki's suggestion: make the SMILES tokenizer into a featurizer and reuse DeepChem's DataLoader infrastructure. The idea here would be to create a new TxtLoader class for the txt files used in NLP pipelines; it would take in a txt file and a tokenizer and write out a new DiskDataset containing a tokenized version of the data (see the sketch below). Peter pointed out that this alternative design would have the advantage that tokenization would only have to be performed once up front rather than on every epoch. Seyone said he'd look into this proposed infrastructure and suggested that the newly released HuggingFace datasets library might be another good tool to look at for design ideas.
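A rough sketch of that proposal follows. TokenizerFeaturizer and load_txt are hypothetical names used only for illustration (a HuggingFace-style tokenizer is assumed); an eventual TxtLoader would presumably follow DeepChem's DataLoader conventions more closely:

```python
import numpy as np
import deepchem as dc

class TokenizerFeaturizer(dc.feat.Featurizer):
    """Hypothetical featurizer wrapping a HuggingFace-style tokenizer."""

    def __init__(self, tokenizer, max_length=128):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def _featurize(self, datapoint):
        # Turn one line of text into a fixed-length array of token ids.
        return np.array(
            self.tokenizer(datapoint, truncation=True,
                           padding='max_length',
                           max_length=self.max_length)['input_ids'])

def load_txt(path, featurizer, data_dir=None):
    # Tokenize the whole file once up front and persist the result,
    # so later epochs read token ids straight from the DiskDataset.
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    X = featurizer.featurize(lines)
    return dc.data.DiskDataset.from_numpy(
        X, ids=np.array(lines), data_dir=data_dir)
```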
Nathan has been continuing to work on normalizing flows. This week, his PR adding normalizing flow infrastructure to DeepChem was merged! This adds a major piece of new functionality to DeepChem, allowing for the training of sophisticated normalizing flow models. Nathan also put up a new PR that adds a tutorial for the normalizing flow API and trains a sample normalizing flow on QM9. Nathan mentioned Peter had given some good feedback on the design and asked if Seyone could take a look as well, since the tutorial introduces SELFIES usage and Seyone is a SELFIES developer.

Nathan mentioned that he was also working on making a MoleculeNet loader for ZINC15. He mentioned that this brought up a number of really interesting new challenges. The ZINC15 dataset is 180 GB raw (with only ids and basic 2D information), so it might be quite surprising for a MoleculeNet user to find that they're downloading an enormous dataset. Peter suggested that it might be useful to require the user to set an explicit flag authorizing the large download, to make sure they are aware of the disk space requirements. Bharath said he was really interested to see ZINC15 working through the DeepChem infrastructure. So far the biggest dataset we've tested DeepChem on has been 100 GB, and this would be a nice addition.
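For readers new to the topic, a normalizing flow transforms a simple base distribution through an invertible, learnable map, which gives both easy sampling and exact log-densities. Below is a minimal sketch built directly on TensorFlow Probability; it is illustrative only and is not the new DeepChem API:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

# Push a 2D Gaussian base distribution through a learnable
# masked autoregressive flow.
base = tfd.MultivariateNormalDiag(loc=tf.zeros(2))
flow = tfd.TransformedDistribution(
    distribution=base,
    bijector=tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
            params=2, hidden_units=[32, 32])))

samples = flow.sample(8)            # draw 8 samples through the flow
log_probs = flow.log_prob(samples)  # exact log-density of each sample
```

Peter's opt-in suggestion might look something like the following; load_zinc15 and the allow_large_download flag are hypothetical names here, not a released MoleculeNet API:

```python
def load_zinc15(data_dir=None, allow_large_download=False):
    # ZINC15 is ~180 GB raw; refuse to download unless the user has
    # explicitly acknowledged the disk space requirement.
    if not allow_large_download:
        raise ValueError(
            "The raw ZINC15 download is roughly 180 GB. Pass "
            "allow_large_download=True to confirm you have the space.")
    # ... proceed with the usual MoleculeNet download/featurize flow ...
```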
Nathan also mentioned that ZINC15 includes 3D information for many of the compounds and asked if this would be useful. Bharath said it might help facilitate large-scale docking, so it would probably be worth adding. Nathan also suggested that it might be useful to create a new SelfiesFeaturizer to facilitate the use of SELFIES on MoleculeNet datasets, which Bharath seconded as a good idea.
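A minimal sketch of the idea, assuming the selfies package; the SelfiesFeaturizer class itself is hypothetical and not part of DeepChem at the time of this call:

```python
import selfies as sf
import deepchem as dc

class SelfiesFeaturizer(dc.feat.Featurizer):
    """Hypothetical featurizer re-encoding SMILES strings as SELFIES."""

    def _featurize(self, datapoint):
        # selfies.encoder translates a SMILES string into SELFIES.
        return sf.encoder(datapoint)

# Example usage on a couple of SMILES strings.
feats = SelfiesFeaturizer().featurize(["CCO", "c1ccccc1"])
```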
Bharath pointed out that the tokenization infrastructure Seyone is building and the new ZINC15 dataset might make a powerful combination for anyone looking to explore NLP-based methods.
Sanjiv has been continuing to come up to speed on DeepChem's structure and has been working through the tutorials. Bharath mentioned that there were three high school students on this week's developer call! He suggested that a good place to start contributing might be to add a small example to the documentation. This could help get development environments figured out and build familiarity with the process of getting code past DeepChem reviews and the tests/yapf checks.
Hisham mentioned that he was primarily listening in for this week, but was interested to see if there were any places he could potentially contribute to DeepChem.
Daiki couldn't make the call this week, but contributed a number of major PRs to DeepChem. One PR refactored the splitter code, another refactored the utilities, and yet another refactored the featurizer structure. These PRs added type annotations and made these parts of the library much cleaner, which should help make them easier to maintain long term.
As a parting note for the week, Bharath mentioned that the saving/reloading issue is one of the critical blockers for releasing DeepChem 2.4.0. Once this issue is fixed, the next big blocker is fixing the atomic convolutions. With both of these fixed, we should be clear to cut the release.
As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.