Molecule WGAN in DeepChem

@peastman @bharath I have created a new topic rather than updating the other one.
I am currently working on converting ConvMol to an RDKit Molecule, but there is a problem: ConvMol lacks bond information. Since the conversion needs this information, we have to decide how to approach it. Either we create a subclass of WeaveMol/ConvMol that works with GraphModel, or we modify ConvMol to contain that data without using it (I prefer this option because the code is cleaner; the downside is the memory requirement).
What would be your preference?
I am looking at retro-calculating bonds to avoid this issue, but it is complicated (I can determine the bond from hybridization); it is the typical memory vs. speed dilemma.

We’re actually in the middle of replacing ConvMol with a new, more general class. Take a look at https://github.com/deepchem/deepchem/issues/1942, and the other issues that are linked on that page.

+1 to @peastman’s comment. We should make the Molecule WGAN use the new generic class. We’re in the process of swapping over GraphConvModel and WeaveModel to use this new format.

I agree with @bharath. It will simplify model building and will be less confusing.
Furthermore, as I said before, calculating bond types from data we already have, e.g. hybridization and isAromatic, is doable (I have already created a buggy alpha version), but it is not simple and definitely will not be more performant than just storing the bonds directly, especially since even for large chemical databases we are talking about a difference of a few MB.
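
To illustrate why inferring bonds this way is fragile, here is a naive sketch of the kind of rule involved. It is not the alpha version mentioned above: the function name `guess_bond_type`, the string labels, and the rules themselves are assumptions made for illustration only, and the heuristic is known to be wrong for many cases (conjugated systems, the single bond joining two aromatic rings, and so on).

```python
def guess_bond_type(hyb_a, hyb_b, aromatic_a=False, aromatic_b=False):
    """Guess a bond type between two bonded atoms from per-atom features.

    Hypothetical heuristic for illustration only. Hybridization is passed
    as a string ("SP", "SP2", "SP3"); the aromatic flags are booleans.
    """
    # Two aromatic neighbours: assume an aromatic bond
    # (wrong for e.g. the ring-ring bond in biphenyl).
    if aromatic_a and aromatic_b:
        return "AROMATIC"
    # Two sp atoms: assume a triple bond
    # (wrong for the central single bond in a diyne).
    if hyb_a == "SP" and hyb_b == "SP":
        return "TRIPLE"
    # Any remaining sp/sp2 pair: assume a double bond
    # (wrong for the single bonds in conjugated chains).
    if hyb_a in ("SP", "SP2") and hyb_b in ("SP", "SP2"):
        return "DOUBLE"
    # Anything involving an sp3 atom is treated as a single bond.
    return "SINGLE"


if __name__ == "__main__":
    print(guess_bond_type("SP3", "SP3"))
    print(guess_bond_type("SP2", "SP2"))
    print(guess_bond_type("SP2", "SP2", aromatic_a=True, aromatic_b=True))
```

Every branch has a counterexample, which is the argument for storing the bond information directly rather than reconstructing it.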

@peastman Do you know how long it will take you to finish the transition?
As soon as I have the class in hand I will start working on the WGAN. In the meantime, I will play around with what I have.

@MiloszGrabski
Hi, I’m working on the graph class transition.
I want to finish the transition by the end of next month.

Here is a rough ConvMol conversion to RDKit.
It is very limited and does not work for conjugated systems or complex molecules.
I have included rdkit.Chem.SanitizeMol() at the end; if the molecule is broken it will throw an error (the call needs wrapping in try/except).

@nd-02110114, @peastman
I was thinking about the current ConvMol: are you planning to keep the current feature matrix and adjacency matrix as they are?

I am mainly talking about the list of atoms that are encoded by default. I think it might be beneficial to have two versions, default and extended, and let the user specify which one they want.
I might be wrong, but the majority of uses will be in medicinal chemistry, and during my years as a medicinal chemist there were very few molecules containing atoms other than C, N, O, P, S, Cl and F. Given that the code is flexible enough to accommodate different feature lengths, I can see only one drawback: the user would have to be aware of what is included in each list. On the other hand, a smaller feature matrix should speed up training and reduce memory requirements. Therefore, I propose using C, N, O, P, S, Cl and F as the default and the full list as the extended version.

Please check the new featurizer.

The new featurizer uses the one-hot vector encoded by ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"].
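
As a rough illustration of what such an encoding looks like, here is a minimal sketch in plain Python. Only the atom list comes from the post above; the function name `one_hot_atom` and the choice to map unlisted atoms to an all-zero vector are assumptions for this example and may differ from how the actual featurizer treats unknown atoms.

```python
# Atom symbols taken from the list quoted above.
ATOM_TYPES = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]


def one_hot_atom(symbol, allowed=ATOM_TYPES):
    """Return a one-hot list for an atom symbol.

    Hypothetical helper: atoms outside `allowed` get an all-zero vector,
    which is an assumption made for this sketch.
    """
    vec = [0] * len(allowed)
    if symbol in allowed:
        vec[allowed.index(symbol)] = 1
    return vec


if __name__ == "__main__":
    for sym in ("C", "Cl", "Si"):
        print(sym, one_hot_atom(sym))
```

A shorter `allowed` list gives a shorter vector per atom, which is the memory and speed trade-off discussed earlier in the thread.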

Are you also working on the conversion back to an RDKit molecule?

No, I’m not working on that.

Thanks for the update. I will pick it up once I have some time on my hands.