DeepChem Minutes 12/10/2020

bharath · December 16, 2020, 10:28pm

DeepChem Minutes

India/Asia/Pacific Call

Date: December 10th, 2020

Attendees: Bharath, Mufei, Vignesh, Seyone, Michael

Summary:

AlphaFold2 had already been discussed on a separate call, so we started by discussing how to replicate it in DeepChem. Michael has a work-in-progress PR up that he plans to keep working on.

Blast computation, etc.

Michael gave a brief intro about himself. He was a student in the Pande lab at Stanford, applying deep learning to molecular machine learning. He then worked a bit on protein structure prediction, and more recently he’s been working on bioinformatics, mostly with cell biology data, to see whether supervised/unsupervised methods can be applied.

Bharath mentioned that we’ve been working on modernizing MoleculeNet for the last several months. Peter has been working on modernizing the software infrastructure and Mufei has been working on new benchmarks.

Mufei mentioned that he’s the developer of DGL Life Science. Mufei mentioned that we want to use scaffold split or cross validation since random splits can yield high variance. Mufei has been setting up some rigorous baselines (PR 14) for MoleculeNet.
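
As a rough illustration of the scaffold-split idea Mufei raised, here is a minimal sketch in plain Python. It assumes each molecule already has a precomputed scaffold string (for example a Bemis-Murcko scaffold SMILES), and the function name `scaffold_split` and the toy data are hypothetical. This is not the DeepChem or DGL Life Science implementation, just the general grouping idea: molecules that share a scaffold always land in the same subset, so the test set holds scaffolds that were not seen during training.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold is shared between subsets.

    `scaffolds` is a list of precomputed scaffold strings, one per molecule
    (an assumption of this sketch; computing them is out of scope here).
    """
    # Group molecule indices by their scaffold string.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        # Greedily fill train, then valid; whatever does not fit goes to test.
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy data: ten molecules spread over four hypothetical scaffolds.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train, valid, test = scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1)
```

Because whole scaffold groups are assigned together, the resulting fractions only approximate the requested ones. The split is also deterministic, which avoids the run-to-run variance of random splits.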

Bharath asked if Michael has any thoughts about MoleculeNet since the original paper. Michael agreed with Mufei on the variance/splitting issue. He also noted that the set of models has moved on: graph attention models, for example, have been around for a while, along with numerous newer models, so it would be important to add the most representative ones. There are also a lot of other graph benchmarks that have started to come up.

Peter suggested looking at https://qcarchive.molssi.org/ for some quantum datasets.

Michael asked about whether there’s a plan to test graph/molecular generative modelling. Mufei mentioned these models can be tricky to work with, but a lot of people are interested.

Michael also mentioned that graph pre-training might be useful to add to the benchmarks. Bowen, for example, has put up a paper on graph pre-training. Seyone mentioned that he’s been working on pretraining techniques with ChemBERTa and these techniques seem to work pretty well. Mufei mentioned that there’s one SNAP model in DGL, and feedback so far has been mixed on whether it helps or not. It’s possible, though, that using richer descriptors could help. Peter noted that the paper says you could use either local or global pretraining, and that it might be necessary to use both.

Bharath asked if it might be possible to add a wrapper of the SNAP implementation in DGL to DeepChem, seconding Michael’s idea that it would be good to add to MoleculeNet. Mufei agreed this would be a good addition, and Seyone mentioned that we are also working on expanding ChemBERTa transfer learning tests to pan-MoleculeNet.

Vignesh has been continuing the task of implementing the LCNN (PR). Nathan has just reviewed this PR and given some comments about it. Vignesh has been able to implement the LCNN using DGL and is seeing some performance improvements. He’s also experimenting with some new architectural ideas to see if he’s able to improve performance.

Peter has been continuing on updating tutorials and has got in a PR.

Seyone this week has been working on updating the tutorial based on Peter’s feedback to add more pedagogical material about transformers, attention, Huggingface, masked-language modelling, etc. Seyone said he’d update the PR after the call.

Bharath mentioned the ML4Molecules workshop was coming up.

Bharath asked about Michael’s recent work. Michael mentioned that he’d been working on RNASeq data from cells at different developmental stages, in collaboration with another group at Stanford.

Bharath asked if there were RNASeq tools in open source. Michael mentioned that a lot of the work was running UMAP or t-SNE, but there’s not a general toolkit you can use.

Bharath mentioned that Travis CI was failing and asked Seyone for a quick update about the transformers failures. Seyone said he was looking into it and should hopefully be able to get it fixed. Seyone mentioned he could also help out with the GitHub Actions setup.

Americas/Europe/Africa/Middle East Call

Date: December 11th, 2020

Attendees: Bharath, Seyone, Vaijeyanthi, Hosein, Nathan, Hari

Summary:

Bharath gave the same update as he did yesterday.

Vaijeyanthi is working through the tutorials.

Seyone gave the same update as he did yesterday. Most of his time has been spent on the NeurIPS ML for Molecules poster.

Hosein read a number of papers about generative models to learn more about the tricks used to evaluate them. Hosein has also been working through more of the tutorials, including the normalizing flow tutorial. Hosein noted in an issue that methods like .get_tf_dataset() don’t work on ConvMol datasets. There are also some minor mismatches with the docs. Hosein is also looking at metrics for generative models.

Nathan mentioned that, like Seyone, he has put most of this week into NeurIPS work. Nathan mentioned that Alana also uncovered some issues with the latest version of the tutorials. These are related to updates to the SELFIES library and should be fixed soon. Nathan is also working on a PR for a new PDBLoader (PR) and a refactor of ComplexFeaturizer, which should be up soon.

Hari this week has been working on wrapping up his semester.

Joining the DeepChem Developer Calls

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending either or both of the calls, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.