Molecule WGAN in DeepChem

MiloszGrabski · August 27, 2020, 12:03pm

@peastman @bharath I have created a new topic rather than updating the other one.
I am currently working on converting ConvMol to an RDKit molecule. But there is a problem: ConvMol carries no bond information. Since the conversion needs this information, we have to decide how to approach it. Either we create a sub-class of WeaveMol/ConvMol that works with GraphModel, or we modify ConvMol to contain that data but not use it (I prefer this option because the code is cleaner; the downside is higher memory requirements).
What would be your preference?
I am also looking at retro-calculating bonds to avoid this issue, but it is complicated (I can determine the bond type from hybridization); it is the typical memory vs. speed dilemma.

peastman · August 27, 2020, 5:20pm

We’re actually in the middle of replacing ConvMol with a new, more general class. Take a look at https://github.com/deepchem/deepchem/issues/1942, and the other issues that are linked on that page.

bharath · August 28, 2020, 3:35am

+1 to @peastman’s comment. We should make the Molecule WGAN use the new generic class. We’re in the process of swapping over GraphConvModel and WeaveModel to use this new format.

MiloszGrabski · August 28, 2020, 7:20am

I agree with @bharath. It will simplify model building and will be less confusing.
Furthermore, like I said before, calculating bond types is doable from the data we have, e.g. hybridization, isAromatic (I have already created a buggy alpha version), but it is not simple and will definitely not be more performant than having the bond data directly, especially since even for large chemical databases we are talking about a difference of a few MB.

@peastman Do you know how long it will take you to finish the transition?
As soon as I have the class in hand I will start working on the WGAN. In the meantime, I will play around with what I have.

nd-02110114 · August 28, 2020, 9:56am

@MiloszGrabski
Hi, I’m working on the graph class transition.
I want to finish the transition by the end of next month.

MiloszGrabski · August 28, 2020, 3:00pm

Here is a rough ConvMol conversion to RDKit.
It is very limited and does not work for conjugated systems or complex molecules.
I have included rdkit.Chem.SanitizeMol() at the end; if the molecule is broken it will throw an error (needs wrapping in try/except).

MiloszGrabski · September 1, 2020, 7:44am

@nd-02110114, @peastman
I was thinking about the current ConvMol: are you planning to keep the current feature matrix and adjacency matrix as is?

I am mainly talking about the list of atoms that are encoded by default. I think it might be beneficial to have two versions, default and extended, where the user can specify which version they are interested in.
I might be wrong, but the majority of uses will be in the field of med. chem., and during my years as a medicinal chemist there were very few cases of molecules containing atoms other than C, N, O, P, S, Cl, F. Given that the code is flexible enough to accommodate different feature lengths, I can see only one drawback: the user would have to be aware of what is included in each list. On the other hand, a smaller feature matrix should speed up training and reduce memory requirements. Therefore, I propose that C, N, O, P, S, Cl, F be used as the default and the full list as the extended version.

nd-02110114 · September 2, 2020, 3:01am

Please check the new featurizer.

https://github.com/deepchem/deepchem/blob/3d257a0c9ce32284422c7dc0ec7f5cab34a8ebda/deepchem/utils/molecule_feature_utils.py#L20-L30

DEFAULT_ATOM_TYPE_SET = [
    "C",
    "N",
    "O",
    "F",
    "P",
    "S",
    "Cl",
    "Br",
    "I",
]

The new featurizer uses the one-hot vector encoded by ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"].
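
To make the encoding concrete, here is a minimal sketch of a one-hot vector over that atom list. The helper name `one_hot_atom` and the optional trailing "unknown" slot are assumptions made for illustration; this is not the DeepChem implementation.

```python
from typing import List

# Copied from the snippet quoted above.
DEFAULT_ATOM_TYPE_SET = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]


def one_hot_atom(symbol: str,
                 allowed: List[str] = DEFAULT_ATOM_TYPE_SET,
                 include_unknown: bool = True) -> List[float]:
    """Hypothetical helper: one-hot encode an element symbol.

    With include_unknown=True an extra final slot marks any symbol that is
    not in `allowed` (an assumption of this sketch, not a statement about
    how DeepChem behaves).
    """
    size = len(allowed) + (1 if include_unknown else 0)
    vec = [0.0] * size
    if symbol in allowed:
        vec[allowed.index(symbol)] = 1.0
    elif include_unknown:
        vec[-1] = 1.0
    else:
        raise ValueError(f"unsupported atom type: {symbol!r}")
    return vec


print(one_hot_atom("N"))   # the slot for "N" is set
print(one_hot_atom("Si"))  # only the final "unknown" slot is set
```

A shorter `allowed` list gives a shorter vector, which is the memory and speed trade-off discussed earlier in the thread.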

MiloszGrabski · September 2, 2020, 7:11am

Are you also working on the conversion back to an RDKit molecule?

nd-02110114 · September 4, 2020, 8:27am

No, I’m not working on that.

MiloszGrabski · September 4, 2020, 11:21am

Thanks for the update. I will pick it up once I have some time on my hands.

MiloszGrabski · November 4, 2020, 12:42pm

Hi @bharath, @peastman
I have been quite busy, so it took longer than expected, but I have finally converted MolGAN to TensorFlow 2+. This is a simplified version and does not include the VAE presented in their manuscript.
The code works fine on the QM9 dataset, but fails on bigger molecules. When I tried training on 200 000 random compounds with MW 300-400, it produced NaN losses.
Before trying to convert it into DeepChem I would like to solve this issue (if possible); any input is welcome.

Furthermore, I could use some pointers on converting it to the DeepChem standard. I am wondering what would be the best way to implement it here, i.e. should we try to rewrite it using the DeepChem GCN or just use it as is (only modifying it to adhere to DeepChem standards), what to do about the datasets themselves, etc.

bharath · November 5, 2020, 8:21pm

Great to hear about your progress! I took a quick look through the code. Here are some first thoughts on conversion into DeepChem:

It looks like you’re already using Keras, which is good. You can follow this tutorial to convert a Keras model into a DeepChem model (https://github.com/deepchem/deepchem/blob/master/examples/tutorials/05_Creating_Models_with_TensorFlow_and_PyTorch.ipynb).
You have a lot of custom layers here. Adding a new layer to DeepChem requires lots of additional testing. I’d suggest trying to minimize the number of new layers and re-using either existing DeepChem layers or standard Keras layers to the degree possible.
It would be good to make MolGAN fit into DeepChem’s existing GAN framework (see https://github.com/deepchem/deepchem/blob/master/examples/tutorials/15_Training_a_Generative_Adversarial_Network_on_MNIST.ipynb for a tutorial example).
We need to come up with good unit tests for MolGAN. See https://deepchem.readthedocs.io/en/latest/coding.html#testing-machine-learning-models.

Hope that’s helpful!

MiloszGrabski · November 9, 2020, 11:06am

I will look into that once I have a moment, but first I need to figure out why the losses return NaN for bigger molecules.
There is still the matter of inputs: the current DC graph convolution input is not compatible with the model structure. So we would have to add this input format to DeepChem.
Most of the layers are there for simplification and to make the code more flexible. The core of the code is the graph convolution layer and the graph aggregation layer. They are required unless we can use DeepChem graph layers; this is something I have been wondering about. I have not looked in depth into the structure of the DC GCN. Are they the all-at-once or the node-by-node type? If we could obtain the same results with the DC GCN, it would also solve the input incompatibility issue.
That would be ideal, but we would have to slightly change how the loss function is calculated, as generated samples are not simply upscaled random noise. That said, I am wondering whether I should remove all the flexibility (e.g. adding Gumbel, flexible number of layers) and move just a rigid base to DeepChem. I have tried to mimic the original code as much as possible. But maybe we could just create something similar that is based on the principles, not the code itself.
Scoring using just the loss is difficult, as even with a “high” loss the generative model can produce valid molecules. I think in this case it would be best to use what is mentioned in the manuscript: a validity test and a novelty test. For the dataset I would use the QM9 test set, as it is well established and was used in the original code.

Another question is: should it also contain the reinforcement learning part that guides learning toward a specific goal, or just the generative part?
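
As a rough illustration of the validity and novelty checks mentioned above, here is a minimal sketch. It assumes molecules are compared as canonical SMILES strings and that a validity check is supplied by the caller (in practice probably an RDKit-based one); `generation_metrics` and `toy_valid` are hypothetical names, not part of DeepChem or the original MolGAN code.

```python
from typing import Callable, Dict, Iterable, List


def generation_metrics(generated: List[str],
                       training_set: Iterable[str],
                       is_valid: Callable[[str], bool]) -> Dict[str, float]:
    """Validity: valid samples / all generated samples.
    Novelty: valid samples not in the training set / valid samples.
    """
    valid = [s for s in generated if is_valid(s)]
    known = set(training_set)
    novel = [s for s in valid if s not in known]
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "novelty": len(novel) / len(valid) if valid else 0.0,
    }


def toy_valid(smiles: str) -> bool:
    # Stand-in check for this sketch only: treat the empty string as invalid.
    return smiles != ""


metrics = generation_metrics(["CCO", "", "CCN", "CCO"], ["CCO"], toy_valid)
print(metrics)
```

Scores like these could be asserted in a unit test with a loose threshold, which avoids relying on the raw loss value.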

bharath · November 11, 2020, 9:33pm

Some more thoughts inline

What’s the input format for the code? DeepChem supports a fairly broad range of input formats (including DGL/PyTorch Geometric style) so I think we could figure something out.
I think node-by-node but I’m not entirely sure about the terminology here
It might be easiest to make something similar based on the principles, as you point out! Taking a standard Keras/PyTorch model into DeepChem is pretty easy, so that might be the best way to go.
I think it could be potentially interesting to support both the RL and the generative aspects, but maybe the generative aspect might be a good first place to start!

MiloszGrabski · November 17, 2020, 4:01pm

The dataset format created by the authors contains a number of things, but the key data are two matrices: one that encodes atom types and another that encodes bond types (the adjacency matrix). The size of both depends on the whole dataset, as it uses the max number of atoms in the dataset (HAC of the biggest molecule in the training set). This allows whole molecules to be processed at once (as opposed to node-by-node); they argue it is faster (I have not checked). Those matrices (atoms and bonds) are then converted into one-hot encoding using TensorFlow methods.
Node-by-node means that you can have molecules of various sizes, as it processes each node individually and then aggregates the results (similar to RNNs, which do not care about input size). The authors’ algorithm processes molecules as a whole.
I think that this is the best solution, I just struggle to find free time. Given the recent change in format I need to familiarize myself with DC again. I have not even looked into custom models based on layers yet. Given that the generator is a simple DNN, it should matter little which GCN is used as the critic, but reality is never that simple.
I agree, I will start with the simple solution and move toward adding RL.
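
A minimal sketch of the dense, padded representation described above, using plain Python lists in place of the TensorFlow one-hot ops. The vocabularies and the names `encode_molecule` and `one_hot` are assumptions for illustration; index 0 stands for padding (atoms) or "no bond" (bonds).

```python
from typing import List, Tuple

# Hypothetical vocabularies; index 0 is the padding / "no bond" class.
ATOM_TYPES = ["<pad>", "C", "N", "O", "F"]
BOND_TYPES = ["none", "single", "double", "triple", "aromatic"]


def encode_molecule(atoms: List[str],
                    bonds: List[Tuple[int, int, str]],
                    max_atoms: int) -> Tuple[List[int], List[List[int]]]:
    """Return (atom_ids, bond_ids), both padded to max_atoms."""
    if len(atoms) > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    atom_ids = [ATOM_TYPES.index(a) for a in atoms]
    atom_ids += [0] * (max_atoms - len(atoms))
    bond_ids = [[0] * max_atoms for _ in range(max_atoms)]
    for i, j, kind in bonds:
        # Bonds are undirected, so the matrix is kept symmetric.
        bond_ids[i][j] = bond_ids[j][i] = BOND_TYPES.index(kind)
    return atom_ids, bond_ids


def one_hot(index: int, depth: int) -> List[int]:
    return [1 if k == index else 0 for k in range(depth)]


# Heavy atoms of ethanol (C-C-O), padded to a dataset-wide maximum of 5 atoms.
atom_ids, bond_ids = encode_molecule(
    ["C", "C", "O"], [(0, 1, "single"), (1, 2, "single")], max_atoms=5)
atom_one_hot = [one_hot(a, len(ATOM_TYPES)) for a in atom_ids]
bond_one_hot = [[one_hot(b, len(BOND_TYPES)) for b in row] for row in bond_ids]
```

Because every molecule is padded to the same `max_atoms`, a whole batch has one fixed shape, which is what allows all-at-once processing but also ties the matrix size to the largest molecule in the dataset.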

bharath · November 20, 2020, 2:33am

I think this can be fitted into DeepChem’s data classes (https://deepchem.readthedocs.io/en/latest/api_reference/dataclasses.html).
I think DeepChem would be node-by-node, since we can handle arbitrary-size molecules.

MiloszGrabski · November 24, 2020, 8:47am

I believe so.
I will try to find some time this weekend to finally sit down and have a proper look.
The first thing to do is to create the conversion from the matrix representation into an RDKit molecule, but it should not be too hard. The biggest problem I found is how RDKit handles errors: I could not find a way to stop it from printing errors when a molecule was invalid. Wrapping it in try/except does not help. I found a workaround in Jupyter by using “with io.capture_output() as …”.

MiloszGrabski · December 7, 2020, 10:09am

Hi,
I have looked through DC and could not find matching infrastructure. Therefore, I will have to use some of the layers I have created; I will try to use the DC WGAN at the same time. The biggest problem is that models like Weave would require a flexible generator, i.e. one using an RNN. I think it would be quite interesting to see, so I might look into that once I am done with this.

MiloszGrabski · December 11, 2020, 2:36pm

@bharath
Hi, I have created an alpha version of the code, you can check it here: https://github.com/MiloszGrabski/DeepChem_MolGAN
I was not able to use the DC convolution layers, so I have used my own.
I might be able to reduce the number of custom layers in the future; first I wanted to be sure that the algorithm works.
I am glad to see that it works nicely with WGAN, although I had to use a few hacks to make it work, i.e. I had to increase the number of outputs from the generator, as there is no way to intercept data between generator and discriminator.

A few comments:

Training is more unpredictable than what I saw with my old code; I am not sure what the reason is. Sometimes it trains well, sometimes it gives no results, and sometimes just a few good compounds. Training is quite random, but I have not once noticed NaN issues.
10 is a good number of epochs; fewer or more resulted in degraded training.
It would be good to expand GAN/WGAN functionality by allowing data modification between generator and discriminator (required for this algorithm to work). Currently, this can be worked around by adding additional outputs from the generator, one modified and one unmodified.
It would be nice to be able to influence the loss without redoing the whole GAN/WGAN infrastructure, i.e. by enabling a custom lambda function that works on top of the loss, so lambda loss: loss +/* etc. This would allow the introduction of additional scoring functions, e.g. similarity, validity and so on.
I had to create a custom featurizer, as this model only works on a small number of atom and bond types; it currently accepts C, N, O, F, dummy atoms and single, double, triple, aromatic, dummy bonds.
*Aromatics can probably be removed, as structures are kekulized before graph generation. This is due to buggy molecule generation from the graph if hydrogens are not included in the graph, e.g. pyrrole.
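
To illustrate the "lambda on top of the loss" idea from the comments above, here is a minimal plain-Python sketch. It is not the DeepChem GAN/WGAN API; `with_loss_modifier`, `toy_critic_loss` and `validity_penalty` are hypothetical names, and the toy loss works on plain floats instead of tensors.

```python
from typing import Callable

LossFn = Callable[..., float]


def with_loss_modifier(base_loss: LossFn,
                       modifier: Callable[[float], float]) -> LossFn:
    """Wrap a loss function so that `modifier` is applied to its result."""
    def wrapped(*args, **kwargs) -> float:
        return modifier(base_loss(*args, **kwargs))
    return wrapped


def toy_critic_loss(real_score: float, fake_score: float) -> float:
    # Stand-in for a WGAN-style critic loss: mean fake score minus mean real score.
    return fake_score - real_score


validity_penalty = 0.2  # hypothetical extra scoring term
loss_fn = with_loss_modifier(toy_critic_loss,
                             lambda loss: loss + validity_penalty)

print(loss_fn(1.0, 0.5))
```

The base loss stays untouched, so extra scoring terms (similarity, validity and so on) could be added or removed without changing the surrounding training code.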