DeepChem Minutes 12/17/2020

bharath · December 24, 2020, 1:13am

DeepChem Minutes

India/Asia/Pacific Call

Date: December 17th, 2020

Attendees: Bharath, Alana, Peter, Michael

Summary:

Bharath mainly reviewed a few PRs this week.

Alana put up a draft PR for fixing a tutorial.

Peter this week has been working on developing a new splitter that splits based on fingerprints (issue, PR). This splitter seems to work really well and provides a large separation between training and validation. Peter asked if Michael had any feedback on the methodology. Michael asked how the splitter worked under the hood. Peter said that at each step, it finds the molecule that’s least similar to one group and adds it to the other group. It’s essentially a max-cut problem.
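The greedy procedure Peter described can be sketched roughly as follows. This is a minimal illustration, not the code from the PR. It makes several assumptions that are not in the minutes: fingerprints are plain Python sets of on-bit indices, similarity is Tanimoto, the split is seeded with the first molecule, and the group that is behind its target fraction gets filled next. The names `tanimoto`, `greedy_fingerprint_split`, and `frac_train` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_fingerprint_split(fps, frac_train=0.8):
    """Split indices into (train, valid) so the two groups are dissimilar.

    Hypothetical sketch: at each step, the remaining molecule that is least
    similar to one group is added to the other group.
    """
    n = len(fps)
    if n < 2:
        raise ValueError("need at least two molecules to split")

    # Seed (assumption): the first molecule goes to train, and the molecule
    # least similar to it goes to valid.
    train = [0]
    seed = min(range(1, n), key=lambda i: tanimoto(fps[0], fps[i]))
    valid = [seed]
    remaining = [i for i in range(1, n) if i != seed]

    # Highest similarity of each remaining molecule to any member of a group.
    max_sim_train = {i: tanimoto(fps[0], fps[i]) for i in remaining}
    max_sim_valid = {i: tanimoto(fps[seed], fps[i]) for i in remaining}

    while remaining:
        # Fill whichever group is behind its target share.
        fill_train = len(train) / (len(train) + len(valid)) < frac_train
        other = max_sim_valid if fill_train else max_sim_train
        own = max_sim_train if fill_train else max_sim_valid

        # Pick the molecule least similar to the *other* group.
        # min() keeps the first minimum, so ties follow dataset order.
        pick = min(remaining, key=lambda i: other[i])
        (train if fill_train else valid).append(pick)
        remaining.remove(pick)

        # The picked molecule now belongs to its own group; update similarities.
        for i in remaining:
            own[i] = max(own[i], tanimoto(fps[pick], fps[i]))

    return train, valid
```

Because ties are broken by position, the result is deterministic for a fixed input order but can change if the dataset is reordered.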

Michael asked if Peter had checked visually whether the split groups were very different. He suggested one possibility was that the training space might be too small. Peter said he hadn’t, but he had looked at the individual bit distributions in training, and all bits seem to be set in training.
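A bit-coverage check of the kind Peter mentioned could look something like the sketch below. It assumes fingerprints are stored as sets of on-bit indices; the function name `bits_missing_from_training` is hypothetical and not part of DeepChem.

```python
def bits_missing_from_training(train_fps, valid_fps):
    """Return fingerprint bits that appear in validation but never in training.

    Each fingerprint is assumed to be a set of on-bit indices.
    """
    train_bits = set().union(*train_fps)
    valid_bits = set().union(*valid_fps)
    return valid_bits - train_bits
```

An empty result means every bit seen in validation is also set by at least one training molecule.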

Bharath suggested it might be useful to compute scaffolds for the training and test sets to see if there’s a strict overlap, and suggested this might make a nice JCIM paper.

Peter noted that in a paper from the discussion, temporal splits were strictly harder than scaffold splits, and it might be useful to see whether the new fingerprint splitter comes close to temporal splits.

Michael asked if the method was deterministic, and Peter said yes, but it depended on the initial ordering of the dataset pre-split. Michael asked what the average Tanimoto similarity was between the splits, and Peter said the Tanimoto similarity was quite low.
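One way to put a number on Michael's question is sketched below: it computes the mean pairwise Tanimoto similarity between the two splits, plus the mean similarity of each validation molecule to its nearest training neighbor. This is an illustrative sketch that assumes fingerprints are sets of on-bit indices; `cross_split_similarity` is a hypothetical helper, not a DeepChem function.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cross_split_similarity(train_fps, valid_fps):
    """Return (mean pairwise, mean nearest-neighbor) similarity from valid to train."""
    if not train_fps or not valid_fps:
        raise ValueError("both splits must be non-empty")

    pair_sum = 0.0  # sum over every (valid, train) pair
    nn_sum = 0.0    # sum of each valid molecule's best match in train
    for v in valid_fps:
        sims = [tanimoto(v, t) for t in train_fps]
        pair_sum += sum(sims)
        nn_sum += max(sims)

    n_pairs = len(train_fps) * len(valid_fps)
    return pair_sum / n_pairs, nn_sum / len(valid_fps)
```

The nearest-neighbor figure is often the more telling of the two, since a low average can still hide near-duplicates across the split.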

Alana asked whether, if the goal was to minimize validation score, that couldn’t just be posed as a separate optimization problem. Peter answered that the goal was to find a split that’s “representative” of the true behavior in production, and that a lower score might actually be worse sometimes. As an example, Peter said an MNIST split that put all 4s in validation and all 8s in test wouldn’t be useful.

Bharath suggested that https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00403 might also be worth looking at.

Michael has been working on running more BLAST jobs on Sherlock. Hopefully by next week, Michael should be able to upload both the full dataset and a simple example dataset to MoleculeNet.

Michael has recently been looking at graph generation tools. Oftentimes, these are coupled with simple properties like those computed with RDKit. Michael wondered if docking could be used in the generative process. On a related note, Bharath mentioned he came across a cool blog post discussing AlphaFold’s structure.

We moved to a general discussion.

Alana asked if there were tools for loading mass spec files in Python. Michael suggested checking out pyopenms and pyteomics.

Bharath asked if we were good to make the release for 2.4.0 even though we have a couple of minor issues. Peter asked if these were new issues or issues present in 2.3.0. Bharath said they were present in 2.3.0, so we should be good to make the release. After discussion, we made a call to punt the completion of Issue 2217 to the next release but said we’d try to look at Issue 2312 before the release if possible.

Americas/Europe/Africa/Middle East

Date: December 18th, 2020

Attendees: Bharath, Peter, Tess, Nathan, Seyone

Summary:

Tess is a first-time attendee, so we started with a brief intro. Tess is a postdoc at LBNL, where she works on designing neural networks for first principles on 3D data that arises from geometry. She is primarily working on Euclidean neural networks, which require specification of a neural

Bharath gave the same update as he did yesterday.

Peter gave the same update he did yesterday.

Nathan this week has mostly been working on infrastructure for the protein-ligand interaction tutorial. He refactored the ComplexFeaturizer to inherit from the base Featurizer class (PR) and revamped the PDBBind MoleculeNet loader to use the API.

Nathan also noticed that the PyPI install hasn’t been updated in a while and asked if there were any issues with the pip pipeline. He also noted “Representation vectors of protein-ligand interactions and clustering with DeepChem GraphConv”. Bharath said perhaps this was a Travis CI issue.

Nathan also noted that as we were introducing better protein support into DeepChem, it might be good to add UniProt as a dataset. Nathan asked if there were any other datasets we should add. Peter suggested it might be good to brainstorm use cases.

Tess suggested looking at the ATOM3D datasets. Seyone suggested that it might be useful to add UniParc.

Seyone similarly doesn’t have much of an update due to the holidays. Seyone mentioned there was good feedback from ML4Molecules. Seyone also mentioned that the ChemBERTa model had now been invoked 200K times by the API.

We moved to general discussion.

Bharath asked Tess if there were any good places to collaborate with e3nn. Tess mentioned that e3nn is going through a big refactor to make the framework more general and that it might be useful to start using e3nn as a dependency once the refactor is complete.

Joining the DeepChem Developer Calls

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending either or both of the calls, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.