DeepChem Minutes 9/18/2020

bharath · September 23, 2020, 9:10pm

Date: September 18th, 2020
Attendees: Bharath, Alana, Peter, Michael JR, Mick, Nathan, Hariharan, Neel, Seyone
Summary: We had a number of new attendees on the call today so we started with a round of introductions. Since there were a lot of new folks, we also did introductions for regulars on the calls.

Seyone is a high school student who’s been working on applying NLP-based methods to chemistry problems. Alana is a high school senior in Tokyo who’s working on a research project implementing GANs for inorganic material characterization in DeepChem. Peter is a software engineer at Stanford who has been a longtime contributor to DeepChem. Michael JR has been a software engineer for the last 8 years and is interested in learning more about machine learning for scientific applications through DeepChem. Mick is a 4th year grad student in Greg Bowman’s lab who has been working on applying autoencoders to MD data. He’s been exploring methods of featurizing protein structures and is interested in potentially contributing these new featurizers to DeepChem. Nathan is a grad student at UPenn who has been working on applying machine learning and multiscale modeling for materials. He’s been working on extending MoleculeNet and adding normalizing flows into DeepChem. Hariharan is a master’s student working with Amir Barati’s group, exploring the use of deep learning for organic materials characterization. Neel is a software engineer at Google interested in learning more about applying machine learning to

With introductions out of the way, we moved into roundtable updates.

Bharath didn’t have too much to report on development this week. He’s been using DeepChem for various projects, and in particular has been working a lot with Weave models. The newly improved Weave models have strong performance in practice, and folks might find it useful to check out the new max_pair_distance setting.

Peter has been working on extending the tutorial series this week (PR), and is working on taking them and turning them into a more coherent set of lessons for newcomers to the library. Bharath mentioned that the new tutorials looked really good.

Nathan this week pushed some updates to the normalizing flow tutorial PR. Nathan had a question for Seyone about the selfies usage in the tutorial, which currently takes smiles strings, converts them to selfies, one-hot encodes the selfies, models these one-hot vectors with a normalizing flow, then reverses this entire process to get back to selfies and then smiles. This set of steps achieves 40% valid generated molecules. Although this is a very impressive result, Nathan was curious what explained the 60% of failed selfies->smiles transformations. Seyone said that he wasn’t sure offhand, but suggested looking at the code for the variational autoencoder selfies example, and in particular the code that computes smiles validity rates, to see if it would be possible to adapt the same mechanism from the original papers to the tutorial. Seyone said he’d taken a look through the tutorial and hadn’t spotted any obvious errors but that it was worth a closer examination.
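
The one-hot step of the pipeline Nathan described can be sketched in plain Python. This is only an illustration, not the tutorial's code: the helper names (`split_tokens`, `build_vocab`, `one_hot_encode`, `one_hot_decode`) and the `[nop]` padding token are assumptions made here, and both the smiles/selfies conversion and the normalizing flow itself are left out.

```python
import re

PAD = "[nop]"  # assumed padding token used to reach a fixed length


def split_tokens(selfies_string):
    """Split a selfies-style string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def build_vocab(strings):
    """Map every token seen in the strings (plus padding) to an index."""
    tokens = {tok for s in strings for tok in split_tokens(s)}
    tokens.discard(PAD)  # padding always gets index 0
    return {tok: i for i, tok in enumerate([PAD] + sorted(tokens))}


def one_hot_encode(selfies_string, vocab, max_len):
    """Return one 0/1 row per token, padded with PAD up to max_len rows."""
    tokens = split_tokens(selfies_string)
    tokens = tokens + [PAD] * (max_len - len(tokens))
    return [[1 if vocab[tok] == j else 0 for j in range(len(vocab))]
            for tok in tokens]


def one_hot_decode(matrix, vocab):
    """Take the argmax of each row and drop padding tokens."""
    index_to_token = {i: tok for tok, i in vocab.items()}
    tokens = [index_to_token[max(range(len(row)), key=row.__getitem__)]
              for row in matrix]
    return "".join(tok for tok in tokens if tok != PAD)


data = ["[C][C][O]", "[C][=O]"]
vocab = build_vocab(data)
encoded = one_hot_encode(data[1], vocab, max_len=3)
roundtrip = one_hot_decode(encoded, vocab)
```

Because the decoder takes an argmax per row, it also accepts the continuous-valued rows a generative model would produce. Running a round-trip check like this on the training strings is one way to rule out the encoding step when tracking down invalid outputs.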

Peter asked whether all defined selfies are supposed to generate a valid smiles string. Nathan replied that his understanding was that as long as you follow the derivation steps and rules based on the selfies grammar, any random string should generate a valid smiles string, and Seyone confirmed. Nathan said he would take a closer look at the VAE code and see if he could figure out the difference.

Nathan mentioned that he has also been working on incorporating ZINC15 into MoleculeNet (PR). The full dataset has hundreds of millions of compounds and is used widely in high throughput virtual screening. Nathan has also selected subsets that have 250K and 10 million compounds. Bharath asked how big the full dataset was, and Nathan said about 23 GB uncompressed and 3 GB compressed into a .tar.gz format. Bharath mentioned that it might be useful to use the AWS cli tool to upload larger datasets and that the IAM role permissions should allow for this for access to the DeepChem AWS.

Seyone this week has started working on integrating text dataset handling into DeepChem datasets. He’s also working on making tokenizers inherit from MolecularFeaturizer as Daiki suggested so that we can more easily tokenize MoleculeNet and other datasets. Seyone mentioned that he was primarily focusing on the upcoming arXiv paper release of ChemBERTa, but should have more time to work on DeepChem integration once that’s released.

Neel has been continuing work on improving the tutorials this week, and put up a PR fixing the code in tutorial 10. Neel mentioned he was still seeing some performance issues in the model and asked if Bharath could take a look at it. Bharath said he would take a look at it shortly.

Daiki wasn’t able to attend the call this week, but added in a featurizer bugfix PR, notebook fix PR, clearance dataset moleculenet bugfix PR, and a splitter naming fix PR. Daiki also merged in a PR fixing GPU support for the PyTorch GATModel and partially fixing CGCNNModel. Daiki is continuing to work on fixing CGCNNModel.

After finishing the roundtable, since there were a lot of newcomers on the call, we did a round where all new attendees talked about their interest in DeepChem and potential starter projects that they might be interested in working on.

Alana mentioned that she had recently implemented a first-order Taylor expansion data augmentation technique (issue) for neural potentials and asked if this would be of interest to the DeepChem community. She also has been working on a new research project with DeepChem, trying to build GANs to generate new materials with interesting properties. Bharath said that improving DeepChem support for neural potentials would be very valuable and suggested that adding a new dataset to MoleculeNet could be a good place to start. Alana also asked about the newly added MoleculeNet perovskite dataset. Nathan mentioned that he’d just added in this dataset recently and that it contained data from the Materials Project.

Michael JR said that he was thinking about updating and enhancing the tutorials and examples. Bharath said that currently DeepChem had 3 main sources of examples: the tutorials, the readthedocs, and the examples folder. The tutorials were currently being overhauled thanks to Peter and Neel’s hard work, and the readthedocs had their correctness checked by doctest, but the examples folder had a lot of semi-supported code. Bharath thought that moving more of these examples into the docs might be a great project that makes the docs into a better resource. Michael JR mentioned that he was also interested in improving the Travis CI.

Peter mentioned that one possibility was to run another Travis CI instance dedicated to running only the slow tests. Right now, we run Linux/Windows/Mac Travis CI, but we could perhaps have a separate Linux instance focused on running slow tests. Bharath thought this was a great idea and might help solve some of our maintainability issues.

Mick mentioned that recently people have started to have real success doing deep learning on protein structures, and thought that a good starter project might be to add infrastructure for a protein structure featurizer. Bharath said this was really interesting, and suggested a good first place to start might be to add something like a OneHotFeaturizer for peptide sequences as a first simple example of a protein featurizer. Mick thought this could be a useful idea. Peter said he thought more structure for protein featurizers could be very useful and asked Mick what types of featurizations he was considering. Mick mentioned that one possibility could be a featurizer that drops a 3D grid over any part of a protein. Another could be a graph convolutional protein network that gets feature vectors for sidechains. Mick thought it could be possible to build this up from mdtraj. A further possibility is to implement the new geometric vector perceptron architecture (paper) from Ron Dror’s group.
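
A one-hot featurizer for peptide sequences, as Bharath suggested, could look roughly like the sketch below. It is plain Python and does not use DeepChem's featurizer classes. The function name `one_hot_peptide`, the ordering of the 20-letter alphabet, and the choice to encode unknown residues and padding as all-zero rows are assumptions made for illustration.

```python
# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_peptide(sequence, max_length=None):
    """One-hot encode a peptide sequence as a list of 0/1 rows.

    Each row has one entry per standard amino acid. Sequences longer than
    max_length are truncated; shorter ones are padded with all-zero rows.
    Residues outside the standard alphabet also become all-zero rows.
    """
    sequence = sequence.upper()
    if max_length is None:
        max_length = len(sequence)
    features = []
    for position in range(max_length):
        row = [0] * len(AMINO_ACIDS)
        if position < len(sequence):
            idx = INDEX.get(sequence[position])
            if idx is not None:
                row[idx] = 1
        features.append(row)
    return features


# "X" is not a standard residue, and the fifth row is padding.
features = one_hot_peptide("ACDX", max_length=5)
```

A fixed `max_length` keeps every peptide the same shape, which is convenient for batching. A version meant for DeepChem would presumably return arrays and follow the library's featurizer conventions instead.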

Hariharan mentioned that he was new to the area of molecular machine learning and has been exploring different models in the space. He’s been trying to implement RNN/LSTM models that operate on SMILES strings and has been working on different featurization methods for molecules. Hariharan had a question for Seyone about selfies. He asked whether there were utilities to directly perform operations on selfies strings. Seyone said that there were standard encode/decode functions converting selfies to/from smiles strings. Since this was straightforward, it was the recommended way to perform manipulations on selfies.

Bharath mentioned that in general a good resource for newcomers was Peter’s documentation on DeepChem’s coding conventions. Bharath recommended that a good first step could be to contribute a small pull request to DeepChem adding in a fix to the docs. Contributing to DeepChem can be challenging at first since you have to get comfortable with yapf, mypy, and Travis CI, so starting with a small contribution could help lower the barrier.

We then moved to general discussion topics. Mick asked if there was a Slack for the DeepChem developers. Bharath said not at present and that most discussion happened on the Gitter. A number of folks on the call said that they would be interested in setting up a Slack for discussion. Bharath mentioned that one issue with Slack was that the free tier ran out pretty quickly. Alana mentioned that Discord had nice integrations with other software and might have Gitter integration. Michael mentioned that the Dart community (which he’s a member of) used Discord to organize its discussions and that it was a nicer tool than Slack for them. Bharath suggested that for now, we could perhaps try using Gitter for more developer discussions since there’s low traffic on there. If this proves to have issues, we could revisit the question in a few weeks and consider whether it would be good to migrate to Discord.

Bharath mentioned that something he’d been thinking about was reviving the DeepChem meetup. A couple of years ago, we used to have regular DeepChem meetups in the Bay Area that drew a nice crowd. Now that the community has shifted more online and we have a lot of development and research projects ongoing, he thought it might be useful to revive the meetups. He suggested that we could have a “DeepChem online conference” later this year or early next year to highlight all the cool new development that’s been happening in DeepChem. Peter said he thought this was a good idea. Bharath said this was still in the very early phases, but he wanted to put it in people’s minds in case they were interested in presenting their work.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.