DeepChem Minutes 5/15/2020

Date: 5/15/2020
Attendees: Bharath, Seyone, Vignesh, Peter
Summary: As a procedural note, Bharath asked if we could move the DeepChem weekly calls to 3pm PST. Our new Google Summer of Code student is based in Japan, so our current 12:45pm PST slot is very tricky for him to make. Bharath suggested a new time of 3pm PST, which everyone seems able to make, so we'll shift the calls to that time from next week onwards. Bharath noted that Daiki has posted a project introduction on the forums and a project roadmap on GitHub.

Bharath has continued work on the refactoring PR. He merged in the pretty-print for Datasets PR that was split out, and factored out another small PR of coordinate box utilities. He's been cleaning up the design of the docking module and has most of it refactored now; he hopes to send in PRs for it over the next week. Bharath noted that the earlier docking module was poorly designed and didn't have a clean API, so this will involve a major breaking change to reach a sensible API. Seyone mentioned he's interested in working with docking, so he'd like to play with the new API once it's designed.

Peter continued working on getting the slow tests passing again in this PR. Almost all the slow tests are now functional, with the exception of those from the docking submodule. Once the docking changes are upstreamed, we should be able to merge in these changes. Peter also took some time to work more seriously on speeding up the graph convolutions. It looks like a lot of our overhead comes from handling the ConvMol objects and other pre-TensorFlow data munging. He hopes to have a PR up soon that improves some of these issues.

Vignesh is currently working toward a NeurIPS deadline, so he will be busy until June, but he hopes to have more time for DeepChem development afterwards.

Seyone has been working on getting his ChemBERTa implementation working with multitask models. He hopes to benchmark it soon on Tox21 to see whether there are transfer learning boosts. He's also working on pretraining on a larger subset of ZINC to see if that results in improved predictive power for the models.

Bharath mentioned as a parting note that it looks like Amazon made a SageMaker tutorial for using DeepChem. However, there appear to be a couple of issues with the tutorial, as noted in this issue. He asked if anyone had used SageMaker before. Unfortunately, it looks like no one has experience with it, but Seyone said he'd be interested in giving it a try to run some experiments. Bharath and Seyone will coordinate offline about experimenting with SageMaker.

As a quick reminder to anyone reading along, the DeepChem developer calls are open to the public! If you’re interested in attending, please send an email to X.Y@gmail.com, where X=bharath, Y=ramsundar.

Hey! I imagine ChemBERTa is a Transformer pretrained on ZINC with masked language modeling, then finetuned on downstream tasks? If so, I'd be interested to hear the results.

I’m also doing something similar (but with a different pretraining method: https://openreview.net/forum?id=r1xMH1BtvB). The pretraining is looking good; I’ll let you know if the downstream task performance is good too.


Welcome to the forums @aced125!

Tagging @seyonec who might be interested in checking it out 🙂


Hey @aced125, let me know how it goes! Looks like an interesting methodology!

Hey @seyonec, the pretraining is going well. I’m training for about 600k steps now with a batch size of 4096; the MLM head is reaching 90% accuracy (with a 25% noising probability) and the discriminator is getting an MCC of 0.64.

Will let it train for a bit more then use it on downstream tasks.
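For readers unfamiliar with the noising step mentioned above, here is a minimal illustrative sketch of MLM-style token masking in plain Python. The function name `noise_tokens` and the character-level SMILES tokenization are my own choices for illustration, not code from either project, and real BERT/ELECTRA noising is more involved (a fraction of selected positions get a random token or are left unchanged rather than masked):

```python
import random

MASK = "[MASK]"

def noise_tokens(tokens, noise_prob=0.25, seed=0):
    # Replace each token with [MASK] with probability noise_prob,
    # returning the noised sequence plus the masked positions that
    # an MLM head would be trained to predict. The 25% default
    # mirrors the noising probability quoted above; this is a
    # simplification of the full BERT masking scheme.
    rng = random.Random(seed)
    noised, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < noise_prob:
            noised.append(MASK)
            targets.append(i)
        else:
            noised.append(tok)
    return noised, targets

# Character-level tokenization of a toy SMILES string:
tokens = list("CCOCCO")
noised, targets = noise_tokens(tokens, noise_prob=0.25, seed=0)
print(noised, targets)
```

The pretraining objective is then to recover the original token at each masked position; ELECTRA additionally trains a discriminator to tell original tokens from replaced ones.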

Sounds good!!

Downstream tasks seem to be the hardest part. Are you using HuggingFace’s sequence classification pipeline, or an additional library such as simple-transformers?

Hi @seyonec,

For the pretraining, I’m using my own implementation (the base transformer is from HuggingFace). However, I raised and merged a PR in simple-transformers so they now have a proper Electra pretraining routine as well.

I have been busy with other stuff over the past week, so I have not yet gotten around to writing the finetuning scripts. Hopefully this week!
