Attendees: Bharath, Daiki, Seyone, Peter, Nathan, Pat, Sean, Karl
Summary: Bharath put up a first version of the hyperparameter overhaul for review and got a round of comments from Daiki and Peter. He’s working on revising the outstanding comments and hopes to have a version ready for further review early next week. Bharath also put up a WIP PR overhauling atomic convolutions. The goal of this PR is to clean up the atomic convolutions and get them into a state that’s ready for broader usage. Bharath also put up another WIP PR overhauling the metrics. Metrics have been one of DeepChem’s messiest bits of code for quite a while. This PR adds tests and attempts to clean up the state of the metric handling. Bharath hopes that these pull requests will be ready for review by the end of next week. Bharath also put up a small PR fixing the naming of a
Daiki has been continuing work this week on his JaxChem implementation. He’s run into a performance issue which is caused by the fact that Jax’s native dataset iterators can’t walk over generators. This makes training performant Jax models harder since using a python for-loop has bad performance. Daiki is investigating a few ways to potentially fix this issue. Daiki also did a good amount of code review this week looking at the hyperparameter overhaul and the inorganic crystal structure featurization code. Daiki also finished up a cleanup of the deepchem.io website removing the old documentation and put up a WIP PR that adds more documentation on using docker builds for DeepChem.
Seyone made some changes to the ChemBerta tutorial in this PR adding visualization improvements with the bertviz library. Seyone has also been working on the DeepChem implementation of the ChemBerta model. This will likely be DeepChem’s first PyTorch based model so it will be an interesting step forward for the library. Bharath said he was excited to see ChemBerta in DeepChem.
Bharath mentioned that ChemBerta style models could be broadly useful elsewhere. There’s a large community of folks using BERT-style pretraining for protein sequences (UniRep, etc). Adding in support for protein BERT models could be an interesting follow-up. There are also a number of papers that use protein-sequence and ligand smiles to make interaction predictions that don’t depend on the protein’s structure. It would be really useful to get some models of this type added into DeepChem. BERT-style models could perhaps do this by concatenating protein sequence (as a string) to smiles to get a uniform input. We could also perhaps have two towers that process protein-sequence and ligand separately, then combine before some final dense layers.
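As a rough illustration of the concatenation idea above, here is a minimal sketch in plain Python. The function names, the "[SEP]" separator token, and the character-level tokenization are all assumptions made for illustration; none of them are part of DeepChem or ChemBerta, and a real model would use a learned tokenizer.

```python
def build_joint_input(protein_seq, smiles, sep="[SEP]"):
    """Join a protein sequence and a ligand SMILES into one string.

    The separator token is a hypothetical choice, not an existing API.
    """
    return f"{protein_seq}{sep}{smiles}"


def tokenize(joint, sep="[SEP]"):
    """Split the joint string back into tokens.

    Character-level splitting is a simplification: it would break
    multi-character SMILES atoms such as "Cl" or "Br".
    """
    protein, ligand = joint.split(sep, 1)
    return list(protein) + [sep] + list(ligand)


# Toy inputs: a three-residue peptide and ethanol.
tokens = tokenize(build_joint_input("MKT", "CCO"))
print(tokens)
```

The two-tower alternative would instead keep the protein and ligand inputs separate, encode each with its own network, and only merge the two embeddings before the final dense layers.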
Peter has been working on squashing some bugs in DeepChem this week. The latest fix PR fixes almost all the issues except for the atomic convolution. Bharath is working on overhauling the atomic convolutions and hopes to have a fixed version ready by the end of next week if possible. Peter also found some serious inconsistencies in the usage of the epochs argument in the various dataset classes.
Nathan has been working on a first PR adding support for inorganic crystal structure featurizers. There was a lot of good discussion and review on the PR, and the new features look to be in a state to merge. Bharath mentioned that he planned to merge it in on Monday barring any more requests for changes. Nathan has also been working on improvements to the MoleculeNet documentation, adding documentation on how to add a new dataset and a template pull-request (see issue). Bharath mentioned this would be very valuable as we continue to grow MoleculeNet. Nathan also talked to a friend who handles the AWS infrastructure for the Materials Project and got some input on best practices for handling cloud resources for an open source project. Nathan said he’d start up an issue to discuss.
Pat mentioned that MoleculeNet has become something of a standard in the field, but there are a number of places where the current benchmarks could be improved. For one, the benchmark suite doesn’t have good regression metrics: it evaluates only RMS error and not R^2. This could provide an inaccurate picture on some tasks. Pat wondered if there would be interest in improving these benchmarks. Bharath said that this would make a lot of sense and mentioned a similar issue on the classification side, where auPRC is often more useful than ROC-AUC for molecular classification challenges. Karl mentioned that similarly, MoleculeNet’s focus on multitask metrics is out of date with the current state of the field.
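To make the point about regression metrics concrete, here is a minimal sketch in plain Python. It is not DeepChem's or MoleculeNet's metric code, and the labels are made up for illustration. It shows how a model that only ever predicts the mean can report a small RMS error on a narrow-range task while its R^2 is zero.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up labels with a narrow dynamic range.
y_true = [5.0, 5.1, 4.9, 5.2, 4.8]
# A "model" that always predicts the label mean.
y_mean = [5.0] * len(y_true)

print("RMSE:", rmse(y_true, y_mean))  # small in absolute terms
print("R^2: ", r2(y_true, y_mean))    # essentially zero: no variance explained
```

Reporting both numbers side by side makes this kind of failure visible, since the RMS error alone depends on the scale and spread of the labels.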
Pat brought up another point that a number of the MoleculeNet datasets have obvious false positives. The HIV dataset for example has a number of known cytotoxic compounds in it that would be good to filter out. Bharath asked if there were any good resources to do this type of screening, and Pat said he’d written a blog post about it and could help with some of the cleanup effort. Bharath said he’d raise some issues on the MoleculeNet repo (here and here) to track the discussion.
Pat is also interested in using attention mechanisms to create better understanding of how models work. Seyone mentioned that he’d been working on attention mechanisms for learning chemical features and Bharath said he’d connect them offline to kick off a conversation.
Sean has been continuing work on the Julia implementation of DeepChem over the last week. He defined a custom adjoint function to speed up the Julia Weave backward pass, and training is now only 2-3 times slower than the Python version. Sean also rewrote the Julia Weave featurizers and fixed some accuracy issues he was seeing. Sean’s next task is to convert the dc.models.GraphConvModel implementation to Julia.
Bharath mentioned as a note to Peter that he’d been running some larger scale dc.models.GraphConvModel training runs. The GPU utilization is now quite high, holding at roughly 50% on average, which is a nice improvement over where it was previously.
Karl mentioned that he hadn’t been able to make it for the past couple of weeks, but in that time was able to put up a PR swapping tests to pytest. Karl also mentioned that he was interested in taking a look at any build issues that came up. Bharath said that once the hyperparameter overhaul PR was merged in, he’d start working on the build/release for 2.4.0. Karl said he’d be able to take a look at the build/release process once it’s up for review.
Bharath also put up an issue proposing the addition of a multiple objective optimization module dc.moo to DeepChem. Often DeepChem’s users want to combine the predictions of multiple models (for say solubility, potency, toxicity, etc.) together to get one list of datapoints that are optimal with respect to these quantities. The multiple objective optimization module will help these users create useful inferences from their models. Karl and Pat both mentioned that a multiple objective optimizer would be very useful in practice and a valuable addition for the community. Karl suggested that it might be easier to start from the discrete case rather than the continuous case, since finding the Pareto frontier can be much more challenging in the continuous case. Bharath said he’d start working on the multiple objective optimization module after finishing his currently open PRs and would tag folks for feedback on the initial solver design once it’s ready to take a look at.
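For the discrete case Karl suggested, the Pareto frontier of a finite candidate list can be found by a simple pairwise dominance check. The sketch below is plain Python and only illustrates the idea; it is not the proposed dc.moo API, the function names are hypothetical, and the scores are made up. All objectives are treated as "higher is better", so toxicity is negated.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective
    and strictly better on at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Return indices of candidates not dominated by any other candidate.

    Brute-force O(n^2) comparison, fine for small candidate lists.
    """
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Made-up model predictions per molecule: (solubility, potency, -toxicity).
candidates = [
    (0.9, 0.2, -0.1),
    (0.5, 0.8, -0.3),
    (0.4, 0.7, -0.4),  # worse than row 1 on every objective
    (0.9, 0.2, -0.5),  # ties row 0 on two objectives, more toxic
    (0.1, 0.9, -0.2),
]

front = pareto_front(candidates)
print(front)  # [0, 1, 4]
```

In the continuous case there is no finite list to enumerate, so the frontier has to be approximated by an optimizer, which is part of why starting with the discrete case is simpler.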
As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.