DeepChem Minutes 8/14/2020

Date: 8/14/2020
Attendees: Bharath, Peter, Daiki, Seyone, Aneesh
Summary: Aneesh is a current master’s student in machine learning at University College London. Aneesh was an early DeepChem contributor as an undergrad working with the Pande group and is excited to see what DeepChem has been up to recently.

Bharath this week completed the changes needed to add shape metadata to DeepChem (PR). These changes removed the option to write the old metadata format, since the new format is superior, and saved a couple of datasets in the old format to test backwards compatibility. Bharath plans to continue working on some improvements to DiskDataset and to start working on some of the release blockers for DeepChem 2.4.0.

Peter’s PR adding support for PyTorch models to DeepChem has been merged in! With this merged, we now have the capability to wrap arbitrary torch.nn.Module models as DeepChem models with a little bit of wrapper code (see the sketch below). This should make it much easier to develop PyTorch models with DeepChem. Peter has also been working on some fixes to test cases (PR) and is currently working on improvements to creating PyTorch datasets from DeepChem datasets.
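For readers following along, here is a minimal sketch of what that wrapper code might look like: an ordinary torch.nn.Module handed to dc.models.TorchModel so it picks up the familiar fit/predict interface. The specific network and loss below are placeholder assumptions for illustration, not code from Peter’s PR.

```python
import torch
import deepchem as dc

# An ordinary PyTorch module: a small regression network over 1024-bit fingerprints.
pytorch_model = torch.nn.Sequential(
    torch.nn.Linear(1024, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

# Wrapping it as a DeepChem model gives it the usual fit()/predict()/evaluate() API.
model = dc.models.TorchModel(pytorch_model, loss=dc.models.losses.L2Loss())

# Usage on any DeepChem dataset:
# model.fit(train_dataset, nb_epoch=10)
# predictions = model.predict(test_dataset)
```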

Daiki this week made a fix to the Windows CI (PR) fixing make_pytorch_dataset, which converts DeepChem datasets to PyTorch datasets (a short usage sketch follows below). Daiki also made improvements to the loaders for materials science datasets (PR), a fix to our AdaGrad implementation (PR), and some improvements to our type annotations for RDKit (PR). Daiki is currently working on an implementation of crystal graph convolutional networks (PR), which is currently in review.
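As a quick illustration of the make_pytorch_dataset conversion mentioned above, the sketch below loads a MoleculeNet dataset and iterates over it as a PyTorch dataset. The keyword arguments (epochs, deterministic) are assumptions about the current signature and may differ slightly in the released API.

```python
import deepchem as dc

# Load a small MoleculeNet dataset with ECFP fingerprints.
tasks, (train, valid, test), transformers = dc.molnet.load_delaney(featurizer='ECFP')

# Convert the DeepChem DiskDataset into an iterable PyTorch dataset
# yielding (X, y, w, ids) samples.
pytorch_dataset = train.make_pytorch_dataset(epochs=1, deterministic=True)

for X, y, w, ids in pytorch_dataset:
    print(X.shape, y.shape)
    break
```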

Seyone spent most of his week wrapping up pre-training on the PubChem 1M dataset and got the SMILES tokenizer working with the pipeline. He also merged in a PR with improvements to the ChemBerta tutorial, upgrading its masked-language modelling training set from 100K to 250K ZINC compounds and extending the training time to 10 epochs.

Bharath asked how the larger pretrained models are doing on benchmarks. Seyone said that he hadn’t had a chance to run the full benchmarks yet, but noticed that using a SMILES tokenizer yielded more meaningful attention visualizations, since the tokens were semantically meaningful for SMILES strings.

Bharath asked if Aneesh would mind giving a quick overview of some of his ongoing research work. Aneesh mentioned that he is currently working on his master’s thesis, focused on applying gradient-based meta-learning techniques like MAML to molecular machine learning. Aneesh has been using chemprop as the underlying model for his system. Bharath mentioned that in his past experience, working with meta-learning techniques was quite finicky and asked if Aneesh saw similar behavior. Aneesh said that stability was very much an issue. Recent papers like “How to Train Your MAML” have made some advances in stability, but it’s broadly an open question.

Bharath mentioned that there were some cool uses of low-data learning in the NLP world using GPT-3 that seemed fairly robust. Peter mentioned that UniRep had achieved some interesting low-data results using NLP models on protein sequences. It stands to reason that low-data NLP methods might yield results on small molecules as well. Peter mentioned that one challenge would be to find suitably large datasets for pretraining. Bharath mentioned that Enamine REAL had a 1 billion compound commercial library and that GDB-17 had a 166 billion compound (synthetic) library, which might be good grist for pretraining.

On that note, Bharath mentioned that making predictions on large datasets is quite challenging with DeepChem. KerasModel builds up predictions in memory, which means that out-of-memory errors are common when making predictions on large molecular datasets. Peter mentioned that he’d noticed this too, said it might be good to create a standard API for this, and suggested putting up an issue to discuss. Bharath said he’d do this right after the call (issue).
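To make the memory issue concrete, one possible workaround (pending whatever standard API the issue settles on) is to generate predictions batch by batch and write each batch to disk rather than accumulating everything in a single in-memory array. The helper below is only an illustrative sketch built on existing Dataset.iterbatches and predict_on_batch calls; it is not the proposed API.

```python
import numpy as np

def predict_to_disk(model, dataset, out_prefix, batch_size=1000):
    """Write predictions for a large dataset to disk, one file per batch,
    so the full prediction array never has to live in memory at once."""
    for i, (X, y, w, ids) in enumerate(
            dataset.iterbatches(batch_size=batch_size, deterministic=True)):
        batch_preds = model.predict_on_batch(X)
        np.save(f"{out_prefix}_batch_{i}.npy", batch_preds)
```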

Seyone mentioned that the selfies library was nearing a stable 1.0 release and that there was some interest in integrating selfies with DeepChem and RDKit. Bharath mentioned that RDKit integration would be ideal since RDKit is the de facto standard cheminformatics library, but it might be technically complex since most of RDKit is written in C++. If RDKit integration doesn’t make sense, DeepChem-selfies integration might make good sense since DeepChem is written entirely in Python. Bharath asked what some downstream selfies applications were. Seyone mentioned that there was a selfies variational autoencoder and a generative genetic algorithm toolchain (github). Bharath said it might be interesting to expose these capabilities in DeepChem.
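For context on what selfies provides, the core of the library is a round-trip between SMILES and SELFIES string representations. A minimal example of that API (as of versions approaching the 1.0 release; exact function names could shift between versions) looks roughly like this:

```python
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
encoded = sf.encoder(smiles)       # SMILES -> SELFIES
decoded = sf.decoder(encoded)      # SELFIES -> SMILES

print(encoded)
print(decoded)
```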

Seyone also mentioned that he had raised an issue about the design of a good tokenizer. Seyone said it would probably make sense to follow HuggingFace’s lead in establishing a tokenizer API, since they’re now the de facto community standard. Bharath agreed and asked if Seyone could suggest an API design on the issue, which Seyone said sounded like a good next step.
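As a rough illustration of what a HuggingFace-style SMILES tokenizer could look like, the sketch below splits SMILES strings with the regex pattern commonly used for SMILES tokenization (from Schwaller et al.) and exposes a tokenize() method in the spirit of the HuggingFace API. The class and method names here are hypothetical and are not the design that will be proposed on the issue.

```python
import re

# Regex pattern commonly used to split SMILES into atom/bond/branch tokens.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

class BasicSmilesTokenizer:
    """Minimal tokenizer exposing a HuggingFace-style tokenize() method."""

    def tokenize(self, smiles):
        return SMILES_REGEX.findall(smiles)

tokenizer = BasicSmilesTokenizer()
print(tokenizer.tokenize("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']
```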

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.
