DeepChem Minutes 10/16/2020

bharath · October 21, 2020, 5:13pm

Date: October 16th, 2020
Attendees: Bharath, Peter, Alana, Michael, Seyone, Hari
Summary: Bharath started off with an administrative discussion about timings for the new developer calls. Based on responses from regular attendees, it looks like the new developer calls will be at 10:30am PST on Fridays for the Americas/Europe/Africa/Middle East call and 8pm PST on Thursdays for the India/Asia/Pacific call. These times work for most people, but there are still some folks who can’t make them. For these folks, Bharath suggested that we keep the current 3pm PST time as DeepChem “office hours.” Bharath asked if these changes made sense. Alana asked if we could move the Thursday call to 7:30pm PST so she could make it, which Bharath said should work fine.

With the administrative update out of the way, we moved on to our usual roundtable updates.

Bharath this week has been working on a number of DeepChem TODOs. In particular, Bharath set up this issue to track the saving/reloading issues for models. Bharath has merged in a number of fixes (PR1, PR2, PR3, PR4). Bharath this week also worked on adding a new tutorial for MoleculeNet (PR) and fixed a bug in RobustMultitaskRegressor for singletask datasets (PR). Bharath also put up a PR which adds documentation trying to address potential concerns of scientist contributors about whether their contributions would help or hurt their careers.

Peter has been working on the MoleculeNet loader overhaul and has put up a PR which lays out the new design. This PR tries to create a new clean API for MoleculeNet loaders which allows users to specify their choice of featurizer, splitter, and transformer. The PR is getting close to a stable state. Once everyone is on board with the design, Peter plans to swap all the other MoleculeNet loaders to the new API. This should help minimize the amount of untested code we have in MoleculeNet. The next task will be to add unit tests for MoleculeNet which will take a major effort as well. Bharath said this was really exciting since it would make MoleculeNet considerably more useful for the community and help reduce the amount of unmaintained code.
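To make the design idea concrete for readers who weren't on the call, here is a minimal, self-contained sketch of a loader that takes the featurizer, splitter, and transformers as arguments. Everything in it is hypothetical and for illustration only: the names `load_toy_dataset`, `count_featurizer`, `index_splitter`, and `shift_label`, the toy SMILES strings, and the made-up labels are assumptions, not the API in Peter's PR.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[List[float], float]  # (features, label)

# Made-up toy data: (SMILES, label). The labels carry no chemical meaning.
RAW = [("CCO", 1.0), ("CCN", 2.0), ("CC", 0.5), ("CO", 1.5), ("CN", 2.5),
       ("CCC", 0.2), ("CCCO", 1.1), ("CCCN", 2.1), ("OCCO", 3.0), ("NCCN", 4.0)]


def count_featurizer(smiles: str) -> List[float]:
    """Toy featurizer: counts of the characters C, N, and O in the string."""
    return [float(smiles.count(sym)) for sym in ("C", "N", "O")]


def index_splitter(rows: List[Row]) -> Tuple[List[Row], List[Row], List[Row]]:
    """Toy splitter: first 80% train, next 10% valid, remainder test."""
    n = len(rows)
    a = int(0.8 * n)
    b = a + int(0.1 * n)
    return rows[:a], rows[a:b], rows[b:]


def shift_label(row: Row) -> Row:
    """Toy stateless transformer: subtract 1.0 from the label."""
    features, label = row
    return features, label - 1.0


def load_toy_dataset(
    featurizer: Callable[[str], List[float]],
    splitter: Callable[[List[Row]], Tuple[List[Row], List[Row], List[Row]]],
    transformers: Sequence[Callable[[Row], Row]],
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Featurize, transform, then split; each step is chosen by the caller."""
    rows = [(featurizer(smiles), label) for smiles, label in RAW]
    for transform in transformers:
        rows = [transform(row) for row in rows]
    return splitter(rows)


train, valid, test = load_toy_dataset(count_featurizer, index_splitter, [shift_label])
print(len(train), len(valid), len(test))
```

The point of the sketch is only that the loader itself contains no featurization or splitting logic, so each piece can be swapped and tested independently.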

Alana has been working on getting some data from the MARIA vaccine paper into MoleculeNet. This effort has been running into some issues with data formats and loaders, and she has raised an issue discussing them. Bharath mentioned that he took a quick look and that it might be possible that we just need to add support for a new file format for spectroscopy data, but that he wasn’t yet sure.

Alana asked if Peter had a possible ETA on the API redesign. Peter said the next iteration would hopefully be done by the end of the day. It should then be possible to use the example loader as a template for the new MoleculeNet loaders.

Michael mentioned that he hadn’t been able to work much on DeepChem this week, but had gotten in his PR to add infrastructure to doctest examples. Bharath mentioned that the new example test infrastructure was proving very useful and that he’d been able to get the ChEMBL example running in a PR (this example had previously been spitting out NaNs, but this seemed to be caused by issues with computing Pearson-R^2 on small datasets). Michael mentioned that he was concerned that the ChEMBL example could still be broken after the fix. Bharath mentioned that the loss curve looked stable, and that it was possible that the original example added 2 years ago had the same issue with NaNs. Peter mentioned that the codebase a couple of years ago didn’t have strong testing infrastructure, so it was very possible that the code had never been tested at all.
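For anyone wondering how a Pearson-R^2 metric can produce NaNs on small datasets, here is a minimal pure-Python sketch of one common way it happens. This is an illustration under assumptions, not a diagnosis of the ChEMBL example itself: the function name `pearson_r2` is hypothetical, and the minutes don't record the exact cause. The squared correlation divides by the variance of both the labels and the predictions, so it is undefined when there are fewer than two samples or when either side is constant, which becomes much more likely as the dataset shrinks.

```python
import math


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation; returns NaN where it is undefined."""
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("inputs must have the same length")
    if n < 2:
        # A single point has no variance, so the correlation is undefined.
        return float("nan")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0.0 or var_p == 0.0:
        # Constant labels or constant predictions give 0/0.
        return float("nan")
    return cov * cov / (var_t * var_p)


# Well-defined case: predictions are a linear function of the labels.
print(pearson_r2([1, 2, 3], [2, 4, 6]))
# Undefined cases: one sample, or labels that are all identical.
print(math.isnan(pearson_r2([1.0], [0.9])))
print(math.isnan(pearson_r2([0.5, 0.5, 0.5], [0.1, 0.2, 0.3])))
```

A check like the early returns above makes the undefined case explicit, so a test can assert on it instead of silently propagating NaNs into reported scores.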

Bharath asked if there was a way we could start running the slow tests on doctest as well to improve robustness of the codebase. Peter had pointed out in a comment that the slow graph conv tests had broken. Having automated Travis CI infrastructure to catch this could make our lives much easier. Michael suggested that we could perhaps add the slow doctests onto the new documentation Travis CI runner since we were well below the timeout limit on that test runner. Bharath asked if Michael would have bandwidth to take a crack at it, and Michael said he’d be able to give it a try.

Seyone has been working on improving the ChemBERTa tutorials. He’s starting by fixing up the current ChemBERTa tutorial to solve the full Tox21-SR-p53 task dataset, using a custom MolNet dataloader, rather than the filtered version previously used. He’s also switching to using the PubChem BPE encoder model pretrained on a dataset of 10 million SMILES from MoleculeNet. Seyone asked if there would be interest in adding more ChemBERTa tutorials explaining the tokenization and sequence/graph-based attention visualization processes. Bharath said this would be very useful and suggested following Daiki’s earlier suggestion to make the SmilesTokenizer inherit from MolecularFeaturizer. Automated MolNet benchmarking script.

Hari this week has been working on a bandgap prediction problem using graph convs and ECFP fingerprints. He mentioned that he was surprised that ECFP had performance comparable to the graph convolutions. Bharath mentioned that ECFP was a strong baseline and that this happened often on learning tasks. Bharath also suggested contributing the bandgap dataset to MoleculeNet if that would be of interest to Hari’s research group.

With the roundtable updates done, we moved to general discussion. Bharath asked if there were any general discussion points.

Alana asked if it would be possible for us to have more regular builds and releases, since researchers are currently required to use the nightly build. Bharath mentioned that this was absolutely something we should get on top of. Once the save/reload tests are complete and, hopefully, Peter’s refactored MoleculeNet API is in, we should be in a good place to make the next release. If we document the process carefully, it should become possible for us to reach a steady release cadence.

Bharath mentioned that we should also consider writing a DeepChem paper for the next major release to acknowledge the hard work of everyone who’s contributed to the latest release. Peter mentioned that he thought the current release could reasonably be called DeepChem 3.0 with all the changes made.

Alana also asked about PGP signing, and whether it would make sense for us to start signing our releases. Peter mentioned that since we were releasing primarily through conda/pip, it wasn’t as critical to sign our releases since we could depend on the conda/pip infrastructure. Bharath mentioned that it might be useful longer term if we start distributing other code in future releases, so it might be good to get that infrastructure in place.

Bharath mentioned that he would send out a draft of the call for papers for the DeepChem conference by next week, and mentioned that he would set up the invites for the new call timings.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.