ChemBERTa latest?

Hi all, I’m Aron. I recently started working at the Netherlands eScience Center on a project that involves learning properties of molecules, hence my interest in DeepChem.

Specifically, I’m looking into fine-tuning ChemBERTa, and I was wondering what the latest developments are since the paper and the tutorial. More concretely, I noticed that DeepChem has a profile on HuggingFace where several ChemBERTa models were uploaded three weeks ago, one of which is called ChemBERTa-77M-MLM. That name suggests it was trained on the full 77M-molecule curated PubChem dataset. Is that correct, and if so, which tokenizer does it use? Can I use it as a drop-in replacement in the tutorial linked above? Does it actually outperform the older version?
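
For context, by “drop-in replacement” I mean loading it through the standard transformers Auto* classes. Here’s an untested sketch of what I have in mind; I’m assuming the repo id is DeepChem/ChemBERTa-77M-MLM and that the checkpoint ships its own tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: this is the repo id on the DeepChem HuggingFace profile.
model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tokenize a SMILES string (aspirin) to inspect which tokenizer the
# checkpoint actually ships with.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenizer.tokenize(smiles))

# Forward pass to check the MLM head loads cleanly.
inputs = tokenizer(smiles, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```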

Thanks!


The latest paper is from the ELLIS ML for Molecules workshop (https://moleculediscovery.github.io/workshop2021/, https://cloud.ml.jku.at/s/dZ7CwqBkHX97C6S). The new weights are from that workshop paper. We will have a new arXiv paper up soon!


Any update on this?
Also, where can I find the list of 200 properties that the MTR model was trained on?

I believe the descriptors are all listed at https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors. @seyonec Do you know if we used any other descriptors?
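
If it helps, you can also enumerate what your local RDKit build exposes directly. This is just a sketch using the public Descriptors.descList API; the exact count and set of descriptors vary with the RDKit version, and I can’t confirm it matches the MTR training targets exactly:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Descriptors.descList is a list of (name, function) pairs covering the
# descriptors documented on the "list of available descriptors" page.
print(len(Descriptors.descList))  # roughly 200, depending on RDKit version

# Compute all of them for one molecule (aspirin) as a sanity check.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
values = {name: fn(mol) for name, fn in Descriptors.descList}
print(values["MolWt"], values["TPSA"])
```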

Sorry for the delay. We’ve been trying to add a set of experiments to the paper before the arXiv post, and it’s been going slower than expected.