What would be the best modeling choices for a mixture with an oxidation reaction?

Hello everyone,

Thanks for this exciting library!
My name is Theo. I am more of a machine learner than an expert in computational chemistry (I only had a few courses in college). I am discovering the potential of DeepChem and would love to get some advice from the experts around here.

I am trying to predict properties of a mixture of compounds. I know there is an oxidation reaction between a few of the compounds, but I have no idea which ones or to what extent.

So far, I have tried to:

  • Compute molecular embeddings for each compound using ChemBERTa (inspired by tutorial 22)
  • Aggregate the embedding vectors for my mixture by taking their average, weighted by the number of moles of each compound in the mixture (similar to the very simple method used in NLP for Doc2Vec approaches).
  • Use simple ML models (SVM, RF) with “mixture embeddings” as inputs to predict my properties.
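For concreteness, the aggregation step in the list above can be sketched as follows. This is a minimal illustration and not code from the original pipeline: the function name `mixture_embedding` is hypothetical, and the 3-dimensional vectors are made-up stand-ins for real ChemBERTa embeddings.

```python
from typing import List, Sequence


def mixture_embedding(
    embeddings: Sequence[Sequence[float]], moles: Sequence[float]
) -> List[float]:
    """Mole-weighted average of per-compound embedding vectors."""
    if not embeddings or len(embeddings) != len(moles):
        raise ValueError("need one mole count per compound embedding")
    total = sum(moles)
    if total <= 0:
        raise ValueError("total number of moles must be positive")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # Each output coordinate is sum_i(n_i * e_i[k]) / sum_i(n_i).
    return [
        sum(n * e[k] for e, n in zip(embeddings, moles)) / total
        for k in range(dim)
    ]


# Toy vectors standing in for ChemBERTa embeddings (made-up values).
compound_a = [1.0, 0.0, 2.0]
compound_b = [0.0, 4.0, 2.0]
mix = mixture_embedding([compound_a, compound_b], moles=[3.0, 1.0])
print(mix)  # [0.75, 1.0, 2.0]
```

With NumPy arrays the same operation is `np.average(embeddings, axis=0, weights=moles)`. Note that the result is linear in the component embeddings, so any interaction between compounds has to be learned by the downstream model.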

It works quite well, i.e. well enough to be interesting, but not well enough to feel like I have cracked my problem. I think the oxidation reaction in the mixture is too complex to be modeled with a simple weighted average.

What do you think are the best modeling choices?

  • Should I use the ChemBERTa technique to vectorize my SMILES? Or would another vectorizer be more suitable for mixture modeling?
  • How could I better model my mixture? I’ve seen that there is potential in using the USPTO dataset, as shown in this forum post, but I am really not sure of myself there.

Thanks a lot for your help!

You might check out my old write-up “Mixture Descriptors toward the Development of Quantitative Structure–Property Relationship Models for the Flash Points of Organic Mixtures”, which discusses a paper that also works on mixture modeling.

I’d love for us to get better mixture support into DeepChem, but we don’t have much out of the box right now. Your ideas all seem quite solid fwiw. Perhaps you could try something like a multi-instance graph conv model, where you have multiple parallel graph convs, one for each mixture component? This would be tricky to implement and isn’t available in DeepChem right now, though.
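To make the data flow of that idea concrete, here is a rough pure-Python sketch. It is not a DeepChem API, and every name in it (`message_pass`, `encode_component`, `encode_mixture`) is hypothetical. It assumes that the same toy encoder is applied to every component, and that the per-component encodings are pooled by mole fraction. The fixed scalars `w_self` and `w_nbr` stand in for weight matrices that a real model would learn, and a trainable readout would normally follow the pooled vector.

```python
from typing import Dict, List, Tuple

# A molecular graph: (per-atom feature vectors, adjacency list by atom index).
Graph = Tuple[List[List[float]], Dict[int, List[int]]]


def message_pass(graph: Graph, w_self: float = 1.0, w_nbr: float = 0.5) -> List[List[float]]:
    """One toy step: h_v <- relu(w_self * h_v + w_nbr * mean of neighbour features)."""
    feats, adj = graph
    dim = len(feats[0])
    updated = []
    for v, h_v in enumerate(feats):
        nbrs = adj.get(v, [])
        mean_nbr = [
            sum(feats[u][k] for u in nbrs) / len(nbrs) if nbrs else 0.0
            for k in range(dim)
        ]
        updated.append(
            [max(0.0, w_self * h_v[k] + w_nbr * mean_nbr[k]) for k in range(dim)]
        )
    return updated


def encode_component(graph: Graph) -> List[float]:
    """Shared encoder: one message-passing step, then a sum readout over atoms."""
    updated = message_pass(graph)
    dim = len(updated[0])
    return [sum(h[k] for h in updated) for k in range(dim)]


def encode_mixture(components: List[Tuple[Graph, float]]) -> List[float]:
    """Encode each (graph, moles) pair with the shared encoder, pool by mole fraction."""
    total = sum(n for _, n in components)
    if total <= 0:
        raise ValueError("total number of moles must be positive")
    encoded = [(encode_component(g), n / total) for g, n in components]
    dim = len(encoded[0][0])
    return [sum(x * e[k] for e, x in encoded) for k in range(dim)]


# Toy heavy-atom graphs with made-up features [is_carbon, is_oxygen].
ethanol: Graph = ([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], {0: [1], 1: [0, 2], 2: [1]})
water: Graph = ([[0.0, 1.0]], {0: []})
print(encode_mixture([(ethanol, 1.0), (water, 3.0)]))
```

Because the pooling is a weighted sum, the result does not depend on the order in which the components are listed, which is usually what you want for a mixture.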