DeepChem Minutes 5/15/2020

bharath · May 20, 2020, 8:03pm

Date: 5/15/2020
Attendees: Bharath, Seyone, Vignesh, Peter
Summary: As a procedural note, Bharath asked if we could move the DeepChem weekly calls to a later time. Our new Google Summer of Code student is based in Japan, so the current 12:45 PST slot is very tricky for him to make. Bharath suggested 3pm PST, which everyone seems to be able to make, so we’ll shift the calls to that time from next week onwards. Bharath noted that Daiki has posted a project introduction on the forums and a project roadmap on GitHub.

Bharath has continued work on the refactoring PR. He merged in the pretty print for Datasets PR that was split out and factored out another small PR of coordinate box utils. He’s been working on cleaning up the design of the docking module and has most of it refactored now. He’s finishing that cleanup and hopes to send in PRs for it over the next week. Bharath noted that the earlier docking module was poorly designed and didn’t have a clean API, so reaching a sensible API will involve a major breaking change. Seyone mentioned he’s interested in working with docking, so he’d like to play with the new API once it’s designed.

Peter continued working on getting the slow tests passing again in this PR. Almost all the slow tests are now functional, with the exception of those from the docking submodule. Once the docking changes are upstreamed, we should be able to merge in these changes. Peter also took some time to try to work more seriously on speeding up the graph convolutions. It looks like a lot of our overhead was from handling the ConvMol objects and other pre-TensorFlow data munging. He hopes to have a PR up soon that improves some of these issues.

Vignesh is currently working towards a NeurIPS deadline so will be busy till June, but hopes to have more time to work on DeepChem development afterwards.

Seyone has been working on getting his ChemBerta implementation working on multitask models. He hopes to benchmark it soon on Tox21 to see whether there are transfer learning boosts. He’s also working on pretraining on a larger subset of Zinc to see if that results in improved predictive power for the models.

Bharath mentioned as a parting note that it looks like Amazon made a SageMaker tutorial for using DeepChem. However, it looks like there are a couple of issues with the tutorial, as noted in this issue. He asked if anyone had used SageMaker before. Unfortunately it looks like no one has experience with it, but Seyone said he’d be interested in giving it a try to run some experiments. Bharath and Seyone will coordinate offline about experimenting with SageMaker.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.

aced125 · May 24, 2020, 10:26pm

Hey! I imagine ChemBerta is a Transformer pretrained on Zinc with masked language modeling, then finetuned on downstream tasks? If so, would be interested to hear the results.

I’m also doing something similar (but with a different pretraining method: https://openreview.net/forum?id=r1xMH1BtvB). The pretraining is looking good, will let you know if the downstream task performance is good too.

bharath · May 24, 2020, 11:37pm

Welcome to the forums @aced125!

Tagging @seyonec who might be interested in checking it out

seyonec · May 25, 2020, 1:56am

Hey @aced125, let me know how it goes! Looks like an interesting methodology!

aced125 · May 25, 2020, 11:51am

Hey @seyonec, the pretraining is going well. I’m training for about 600k steps now with a batch size of 4096, and the MLM is getting 90% accuracy (with 25% noising probability) and the discriminator is getting an MCC of 0.64.
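For readers unfamiliar with the MCC (Matthews correlation coefficient) quoted above, here is a minimal sketch of how it is computed from confusion-matrix counts. It is a common choice for a replaced-token discriminator because the classes are imbalanced when only a fraction of tokens is corrupted. The function name and the counts are made up for illustration; they are not taken from the training run described in this post.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, for illustration only (not from the post)
example_mcc = matthews_corrcoef(tp=20, tn=70, fp=5, fn=5)
print(round(example_mcc, 3))
```

An MCC of 1 means perfect agreement, 0 means no better than chance, and -1 means total disagreement, so it is harder to inflate than raw accuracy by always predicting the majority class.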

Will let it train for a bit more then use it on downstream tasks.

seyonec · May 30, 2020, 6:23pm

Sounds good!!

Downstream tasks seem to be the hardest part. Are you using HuggingFace’s sequence classification pipeline, or an additional library such as simple-transformers?

aced125 · May 30, 2020, 8:35pm

Hi @seyonec,

For the pretraining, I’m using my own implementation (the base transformer is from HuggingFace). However, I raised and merged a PR in simple-transformers so they now have a proper Electra pretraining routine as well.

I have been busy with other stuff over the past week so have not yet got around to writing finetuning scripts. Hopefully this week!

aced125 · June 9, 2020, 9:44pm

@seyonec @bharath

I finally had some time to finetune Electra on downstream tasks.

The results look pretty promising. With no hyperparam tuning, out of the box settings, it’s getting 0.9 R2 on Delaney in 200 gradient steps (batch size 32, so about 6-7 epochs). This is in line with the findings in the recent Stanford pretraining paper which showed significantly faster convergence compared to random initialization.
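As a quick sanity check on the "200 gradient steps is about 6-7 epochs" figure, the arithmetic can be sketched as below. The Delaney (ESOL) dataset has 1128 compounds; the 80% training fraction is an assumption, since the post does not state the split sizes.

```python
import math

n_compounds = 1128     # size of the Delaney (ESOL) dataset
train_fraction = 0.8   # assumption: 80/10/10 split, not stated in the post
batch_size = 32        # from the post
gradient_steps = 200   # from the post

n_train = int(n_compounds * train_fraction)
steps_per_epoch = math.ceil(n_train / batch_size)
epochs = gradient_steps / steps_per_epoch
print(n_train, steps_per_epoch, round(epochs, 1))
```

Under that assumption one epoch is 29 steps, which puts 200 steps just under 7 epochs and is consistent with the estimate quoted above.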

Will keep you all updated as I run it on more tasks.

(The 200 gradient steps took a few minutes on CPU)

Update:
I tried training from random initialization and, indeed, it cannot learn (training loss won’t go down)

Update #2:
Fine-tuned on SAMPL dataset. No hyperparam tuning, just out of the box first time, 0.95 R2 validation, 0.94 R2 test in about 500ish steps.

Update #3:
Fine-tuned on some downstream classification tasks. I made sure to avoid the downstream tasks where some of the weights were 0 (e.g. ToxCast) as I have not implemented ignoring these indices yet.

0.83 (index) / 0.82 (scaffold) AUC on SIDER. The Stanford pretraining paper reports 0.63 (scaffold), and MoleculeNet leaderboards report high 0.6s.
0.996 AUC on BBBP, scaffold split. For some reason the Stanford paper reports really low scores.
0.994 AUC on ClinTox, scaffold split (in 76 gradient steps!). Again, the Stanford paper reports really low scores for some reason.
0.83 AUC on BACE, scaffold split. The Stanford paper reports best at 84.
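On the point about tasks where some weights are 0: in multitask datasets a zero weight typically marks a missing label, and the usual fix is to drop those entries from the loss. The following is a minimal pure-Python sketch of a weight-masked binary cross-entropy. The function name and the example values are hypothetical, and this is not the implementation used for the results in this post.

```python
import math

def masked_bce(probs, labels, weights):
    """Weighted mean binary cross-entropy, skipping entries whose weight is 0."""
    total, weight_sum = 0.0, 0.0
    for p, y, w in zip(probs, labels, weights):
        if w == 0:
            continue  # missing label: contributes nothing to the loss
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0

# Hypothetical predictions; the middle entry has no label (weight 0)
loss_with_missing = masked_bce([0.9, 0.5, 0.2], [1, 0, 0], [1, 0, 1])
loss_without = masked_bce([0.9, 0.2], [1, 0], [1, 1])
print(loss_with_missing == loss_without)
```

Dividing by the sum of the weights rather than the number of entries keeps the loss scale comparable between tasks with many and few missing labels.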

I just checked the Stanford code, looks like they’re making the splits themselves. Ideally everyone would use the deepchem pre-made splits so there is no debate over reproducibility / validity.
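To illustrate why split construction matters for comparing numbers, here is a rough sketch of a greedy scaffold-style split: molecules are grouped by scaffold, groups are taken largest first, and a whole group always goes into a single subset. It is modeled on my understanding of how scaffold splitters such as DeepChem's behave, not copied from any library. The scaffold keys below are hypothetical placeholders; computing real scaffolds (e.g. Bemis-Murcko via RDKit) is left out.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split indices so that no scaffold group spans two subsets."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first index, so the result is deterministic
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Hypothetical scaffold keys for 11 molecules
scaffolds = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)
```

Because the assignment depends on group sizes, ordering, and cutoffs, two independently written scaffold splitters can easily produce different test sets, which is one reason scores from self-made splits are hard to compare.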