Hello everyone,
Thanks for this exciting library!
My name is Theo. I am more of a machine learning practitioner than an expert in computational chemistry (I only took a few courses in college). I am discovering the potential of DeepChem and would love to get some advice from the experts here.
I am trying to predict properties of a mixture of compounds. I know there is an oxidation reaction between a few of the compounds, but I have no idea which ones or to what extent.
So far, I have tried to:
- Compute molecular embeddings for each compound using ChemBERTa (inspired by tutorial 22)
- Aggregate the embedding vectors for a mixture by taking their average, weighted by the number of moles of each compound in the mixture (similar to the very simple averaging method used in NLP for Doc2Vec-style approaches).
- Use simple ML models (SVM, RF) with “mixture embeddings” as inputs to predict my properties.
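To make the aggregation step concrete, here is a minimal sketch of the mole-weighted average. The per-compound vectors below are random placeholders standing in for real ChemBERTa embeddings, and `mixture_embedding` is just an illustrative helper name, not a DeepChem function.

```python
import numpy as np

def mixture_embedding(embeddings, moles):
    """Return the mole-weighted average of per-compound embedding vectors."""
    embeddings = np.asarray(embeddings, dtype=float)  # shape (n_compounds, dim)
    moles = np.asarray(moles, dtype=float)            # shape (n_compounds,)
    if embeddings.shape[0] != moles.shape[0]:
        raise ValueError("need exactly one mole count per compound")
    total = moles.sum()
    if total <= 0:
        raise ValueError("total number of moles must be positive")
    weights = moles / total                           # mole fractions, sum to 1
    return weights @ embeddings                       # shape (dim,)

# Placeholder data: 3 compounds with 8-dimensional embeddings (assumed sizes).
rng = np.random.default_rng(0)
compound_vectors = rng.normal(size=(3, 8))
moles = [2.0, 1.0, 1.0]

mix_vec = mixture_embedding(compound_vectors, moles)
print(mix_vec.shape)
```

Each mixture then becomes one fixed-length vector, which is what I pass to the SVM or RF as a feature row.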
It works quite well, i.e. well enough to be interesting, but not well enough to feel like I have cracked my problem. I think the oxidation reaction in the mixture is too complex to be modeled by a simple weighted average, since an average cannot capture interactions between specific compounds.
What do you think are the best modeling choices?
- Should I use ChemBERTa to vectorize my SMILES? Or would another featurizer be more suitable for mixture modeling?
- How could I better model my mixture? I've seen that it may be possible to use the USPTO dataset, as shown in this forum post, but I am really not sure how to apply it here.
Thanks a lot for your help!