DeepChem Minutes 9/25/2020

Date: September 25th, 2020
Attendees: Bharath, Sanjiv, Hariharan, Daniel, Renuka, Mufei, Seyone, Alana, Michael, Tyler, Peter
Summary: We had a number of new attendees on the call today so we started with a round of introductions.

Mufei is a researcher at the AWS Shanghai lab, where he works on the DGL library. In particular, he’s the lead developer of DGL-Life Sciences, which aims to make DGL more useful for biological and chemical applications.

Renuka is an undergrad in bioengineering at IIT Madras who has just started learning about DeepChem. Renuka is interested in personalized medicine and other potential DeepChem applications.

Tyler was previously a PhD student in the genetics department at Stanford, and is now the founder of Trident Biosciences, which works on new predictive models of protein and small molecule behavior.

Daniel is doing his master’s of science at the University of Malta and is working on low data and drug discovery.

With introductions out of the way, we moved into roundtable updates. Since we had a larger group on the call today, we opted to do more compressed roundtable updates with a couple minutes per person.

Bharath has been using the DeepChem Docker instances in practice on AWS. He’s still experimenting with the infrastructure here and found some warning messages (issue) but no actual problems so far. Bharath said he hopes to find time to fix the serialization issues this coming week.

Hariharan has been working on orbital convolutions and extending these models to new applications. Bharath mentioned that adding orbital convolutions to DeepChem could be a really useful contribution. Hariharan said he’d chat with his advisor about the idea.

Seyone has been continuing to add more functionality to ChemBERTa. He’s in the middle of the push to get the first ChemBERTa paper released, and has been building infrastructure for choosing different tokenizers and benchmarking sets for the ChemBERTa models.

Alana has been continuing to work on infrastructure for ANN potentials in DeepChem. She’s looking at what it would take to get some benchmarking datasets for quaternary metals into DeepChem and is also starting to investigate some of the model building work needed for the subsequent steps.

Michael was busy with other work this week, but for next steps plans to work on turning examples from the examples/ folder into documented doctest examples. This would get these examples tested by the continuous integration and verify that they can run correctly. Michael also hopes to set up a separate CI instance for running the slower tests.
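As a rough illustration of the doctest idea (this is not code from the DeepChem repository; `normalize` and `run_doctests` are hypothetical names made up for the sketch), an example embedded in a docstring can be executed and checked with Python's standard `doctest` module, which is the kind of check a CI job could run:

```python
import doctest


def normalize(values):
    """Scale a list of numbers to the range [0, 1].

    >>> normalize([2.0, 4.0, 6.0])
    [0.0, 0.5, 1.0]
    >>> normalize([3.0, 3.0])
    [0.0, 0.0]
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid dividing by zero when all values are equal.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def run_doctests(func):
    """Run the examples in func's docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The example code only needs to see the function under test.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.summarize(verbose=False)


result = run_doctests(normalize)
print(f"failed={result.failed} attempted={result.attempted}")
```

If an example's printed output ever drifts from what the code returns, the failure count becomes nonzero, so documented examples stay in sync with the code.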

Peter has been continuing to work on the tutorial series this week. He merged in one PR that updated a number of the existing tutorials and has a new PR up that updates the featurizer tutorial. He still needs to finish updating the tutorials on splitters, creating datasets, advanced training, and hyperparameter optimization. Bharath mentioned he was really excited to see these make it in and that they’d be a powerful resource for newcomers.

Daiki couldn’t make the call this week, but worked on improving the speed of the MolGraphConvFeaturizer (PR) and on adding flake8 tests to the remaining parts of the DeepChem codebase (PR).

Nathan couldn’t make the call this week, but merged in a couple of major PRs. His first PR added ZINC15 to MoleculeNet, while his second PR merged in the new tutorial for normalizing flows.

Alana asked if these tutorials have been merged in. Peter said some of them have been but others are still works in progress.

With the roundtable updates out of the way, we moved to general topics discussion. Bharath asked folks what topics they wanted to see discussed in the general updates. The issues raised were:

  1. Timing adjustments given the broad range of time zones.

  2. Planning for better DGL integration.

  3. Discussion of low-data modeling support.

  4. Some questions about tutorials.

We started going through these topics in order. Bharath said that with the growing number of attendees from different time zones, we should consider adjusting the current meeting time. Our current slot is at 3pm Pacific time on Fridays, which is very convenient for people in the US and Canada, but pretty inconvenient for everyone else. Bharath suggested as one possibility that we could move the developer call up a couple of hours to 1pm Pacific time on Fridays, and add a second DeepChem “office hours” at a time more convenient for India/Asia/Pacific attendees. Peter suggested that we could consider a variant of this where we hold two developer calls, one for the Americas/Europe/Africa and another for India/Asia/Pacific, on alternating weeks. There was some concern about whether this would be too complicated, but after further discussion it became clear that no one meeting time could be convenient for everyone given the global range of attendees. The rough consensus settled on the need for two different developer meetings at times convenient for the respective time zones. Bharath said he’d chat with a few more people offline about timing constraints, and would then send out a poll to attendees.
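To make the time zone problem concrete, here is a small sketch that converts the current slot (3pm Pacific on Friday, September 25th, 2020) into local times for a few of the places attendees mentioned. The UTC offsets are hard-coded assumptions for that particular date and ignore daylight saving changes; a real script would more likely use a time zone database such as Python's `zoneinfo`.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets assumed to be in effect on 25 September 2020.
OFFSETS = {
    "Pacific (PDT)": timezone(timedelta(hours=-7)),
    "Malta (CEST)": timezone(timedelta(hours=2)),
    "India (IST)": timezone(timedelta(hours=5, minutes=30)),
    "Shanghai (CST)": timezone(timedelta(hours=8)),
}


def local_times(meeting):
    """Map each location name to the meeting's local weekday and time."""
    return {
        name: meeting.astimezone(tz).strftime("%a %H:%M")
        for name, tz in OFFSETS.items()
    }


# The current developer call slot: Friday 3pm Pacific.
current_slot = datetime(2020, 9, 25, 15, 0, tzinfo=OFFSETS["Pacific (PDT)"])

for name, when in local_times(current_slot).items():
    print(f"{name:15s} {when}")
```

Under these assumed offsets, the Friday afternoon slot lands at midnight or in the early hours of Saturday morning for Malta, India, and Shanghai.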

With the discussion about timings complete, we shifted into the discussion about improving our DGL integration. Mufei mentioned that there was an issue where Daiki laid out some of the core steps needed to improve integration and that he planned to follow some of this outlined work. Bharath mentioned that another useful step might be to contribute some of the DGL-Life Sciences MoleculeNet benchmarks to the MoleculeNet docs so that we could start setting up standard baselines. Mufei said this would be very useful and that it could help establish common guidelines on how to contribute a new baseline to MoleculeNet.

We next moved on to the discussion of low data learning. Bharath mentioned that the original technique came from this paper, and that it was originally implemented in the old DeepChem TensorGraph framework but hadn’t been ported over to Keras. Bharath mentioned that the base model wasn’t that complex and was analogous to something like a Siamese network, and that it would be very useful to port this into Keras so we could use it in DeepChem. Daniel mentioned that he was interested in taking on the model porting and that he’d coordinate with Bharath on Gitter.
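For readers unfamiliar with the Siamese analogy, the sketch below shows the general idea in plain Python: one shared embedding is applied to both a query and a small labeled support set, and the prediction is a similarity-weighted vote over the support labels. This is only an illustration of that general pattern, not the model from the paper or any DeepChem API; the weights, data, and function names are all hypothetical, and a real model would learn the embedding rather than fix it by hand.

```python
import math


def embed(x, weights):
    """Shared embedding: the same weights are applied to every input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm = math.sqrt(sum(ai * ai for ai in a)) * math.sqrt(sum(bi * bi for bi in b))
    return dot / norm if norm else 0.0


def predict(query, support, weights):
    """Similarity-weighted vote over a small labeled support set."""
    q = embed(query, weights)
    sims = [cosine(q, embed(x, weights)) for x, _ in support]
    # A softmax over the similarities gives weights that sum to 1.
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    return sum(e / total * label for e, (_, label) in zip(exps, support))


# Hypothetical, untrained embedding weights (3 input features -> 2 dims).
WEIGHTS = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
# One labeled example per class: the "low data" setting.
SUPPORT = [([1.0, 0.0, 0.0], 1), ([0.0, 1.0, 0.0], 0)]

score = predict([0.9, 0.1, 0.0], SUPPORT, WEIGHTS)
print(f"predicted probability of the positive class: {score:.3f}")
```

Because the query is closest to the positive support example, the score leans toward 1; a query near the negative example would lean toward 0.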

On the tutorials, Alana asked if the GAN tutorials had been modernized. Peter mentioned that he’d overhauled them a week or two ago. Bharath asked on this note whether we should consider switching the book notebooks to run on Colab so that they more naturally complemented the tutorial series. Peter said this might make sense, but that we’d have to do some investigation to see what changes that would require.

As a wrap-up discussion, Bharath mentioned that DeepChem was trying to live up to the PyTorch motto of “from research to production.” Bharath asked Tyler, as an attendee from industry, whether he had any comments on how DeepChem could be better suited for production use cases. Tyler mentioned that he was working to get his infrastructure set up on AWS, so more tutorials or information on how to integrate DeepChem with cloud infrastructure could be useful. Bharath said he’d try to add a tutorial or forum post in that vein. Tyler also mentioned that he was interested in improved support for proteins, and in particular support for things like post-translational modifications. Bharath mentioned that the graphein library had done some good work in this area. Mufei mentioned that graphein used DGL under the hood and was currently working on overhauling its infrastructure. Bharath said it might be interesting to improve DeepChem-graphein integration.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.
