Advice on DeepChem's Ligand Based virtual screening

mk28 · April 25, 2021, 12:43pm

Hello all,

I’m quite new to DeepChem and am trying to apply some of its frameworks to my problem. I’m developing a machine learning model to classify whether compounds can bind to an allosteric site of a particular sodium channel I’m interested in, so I’m trying, to some extent, to follow the framework in Chapter 11 of Deep Learning for the Life Sciences. So far I have taken a subset of the ZINC database and performed a ligand-based screen using Vina on about 10,000 compounds, which gave me binding affinity values for all of them. Then, at somewhat arbitrary binding affinity cutoffs, I separated the compounds into actives (those that bind well) and decoys (those that bind with much lower affinity). I then examined the molecular properties as outlined in the workflow, but the active compounds have a much higher molecular weight than the decoys (logP and charge are very similar between the two sets).

My real questions are: 1. Is this a somewhat valid approach to building a rudimentary model, or am I doing something very wrong? 2. Given that the active and decoy sets have quite different average molecular weights, how should I go about fixing this? Should I use a different starting library in Vina, a different approach, or a different pre-processing step? I’m happy to clarify anything in more detail, since this is a pretty loose outline without many specifics. Any help would be much appreciated!!
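
For readers following along, the labelling and property check described in this post can be sketched roughly as below. This is a minimal plain-Python illustration, not the poster's actual code. It assumes each compound is a record holding a Vina score in kcal/mol (more negative means stronger predicted binding) and a precomputed molecular weight. The field names, the toy records, and the cutoffs of -9.0 and -6.0 are all hypothetical placeholders.

```python
from statistics import mean

# Hypothetical cutoffs in kcal/mol; the thread does not state the real ones.
ACTIVE_CUTOFF = -9.0   # at or below this score -> labelled active
DECOY_CUTOFF = -6.0    # at or above this score -> labelled decoy


def label_by_score(compounds, active_cutoff=ACTIVE_CUTOFF, decoy_cutoff=DECOY_CUTOFF):
    """Split compounds into actives and decoys; scores in between stay unlabelled."""
    actives = [c for c in compounds if c["vina_score"] <= active_cutoff]
    decoys = [c for c in compounds if c["vina_score"] >= decoy_cutoff]
    return actives, decoys


def mean_property(compounds, key):
    """Average of one numeric property over a list of compound records."""
    return mean(c[key] for c in compounds)


# Made-up records for illustration only.
compounds = [
    {"id": "cpd1", "vina_score": -10.1, "mol_wt": 480.0},
    {"id": "cpd2", "vina_score": -9.4, "mol_wt": 455.0},
    {"id": "cpd3", "vina_score": -7.5, "mol_wt": 390.0},
    {"id": "cpd4", "vina_score": -5.8, "mol_wt": 250.0},
    {"id": "cpd5", "vina_score": -4.9, "mol_wt": 230.0},
]

actives, decoys = label_by_score(compounds)
mw_gap = mean_property(actives, "mol_wt") - mean_property(decoys, "mol_wt")
print(f"{len(actives)} actives, {len(decoys)} decoys, mean MW gap = {mw_gap:.1f}")
```

A large gap like the one in this toy data is the kind of imbalance the post describes: a classifier could separate the two sets on molecular weight alone.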

bharath · April 29, 2021, 7:31pm

One challenge with your workflow is that you’re using molecular docking (through Vina) to get ground truth data for your model. In effect, this means the learned model can never be more accurate than Vina itself! A better strategy is to get experimental data for the assay if possible, since otherwise you’re basically just building a fast Vina approximation algorithm.

Is there a way you can get access to some experimental data?

mk28 · May 3, 2021, 2:21pm

Hi bharath and thanks for your reply.

Unfortunately, I do not have access to any experimental binding data. Your answer makes a lot of sense, so I’ve changed my approach a bit: I am now taking the top results of my Vina screen on the 10,000 molecules and trying to retrieve chemically similar compounds from another, much larger dataset. In essence, I’m trying to streamline the process so that I don’t have to run a very large screen and can instead run Vina on a few compounds from this other dataset.

I’ve featurized the top SMILES from my list using DeepChem’s ConvMolFeaturizer, but I was wondering whether there is a simple way to compare molecular fingerprints against each other. Any help would be much appreciated!

edit: I’d also like to mention that I don’t have a specific chemical subgroup in mind, so SMARTS strings probably won’t help me much. Properties like molecular weight and length also make a difference, so it would be good if any molecular fingerprint comparison took these into account to some extent.

bharath · May 4, 2021, 11:27pm

One thought is to use a Tanimoto score (for ECFP-featurized molecules). Unfortunately, there’s no easy way to compare different kinds of molecular fingerprints to one another! (Graph conv fingerprints often have a molecular graph, which can’t obviously be compared to an ECFP vector, for example.)
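
As a rough illustration of the Tanimoto idea, here is a minimal plain-Python sketch. It assumes each molecule has already been reduced to the set of "on" bit indices of a binary fingerprint such as ECFP; the bit sets and compound names below are made up, and `rank_by_similarity` is a hypothetical helper, not a DeepChem function.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treated as dissimilar by convention here
    return len(a & b) / union


def rank_by_similarity(query_bits, library):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up on-bit sets standing in for real fingerprints.
query = {1, 5, 9, 12}
library = {
    "cpd_a": {1, 5, 9, 12},    # identical to the query
    "cpd_b": {1, 5, 20, 33},   # shares two bits with the query
    "cpd_c": {40, 41},         # shares nothing with the query
}

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same ranking could be used to pull the closest matches to a top Vina hit out of a larger catalog, provided both sides use the same fingerprint type and settings.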

For your purposes, it may actually make sense to train a fast Vina approximation algorithm (using Vina as ground truth) on your initial Vina screen and then run it against the larger catalog. This would effectively serve as a nonlinear “fingerprint” comparison method, if that makes sense.
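
To make the surrogate idea concrete, here is a deliberately simple stand-in: a similarity-weighted k-nearest-neighbour predictor in plain Python. This is not the kind of learned model suggested above (in practice one would more likely train a DeepChem or scikit-learn model on fingerprint features); it only shows the overall shape of "fit on the initial Vina screen, then score a larger catalog cheaply". The fingerprints, scores, and the `predict_score` helper are all hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_score(query_bits, training_set, k=3):
    """Similarity-weighted mean Vina score of the k most similar training compounds."""
    neighbours = sorted(
        ((tanimoto(query_bits, bits), score) for bits, score in training_set),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_weight = sum(sim for sim, _ in neighbours)
    if total_weight == 0:
        # No overlap with any neighbour: fall back to the plain training mean.
        return sum(score for _, score in training_set) / len(training_set)
    return sum(sim * score for sim, score in neighbours) / total_weight


# Made-up (fingerprint, Vina score in kcal/mol) pairs from an "initial screen".
training_set = [
    (frozenset({1, 2, 3, 4}), -9.5),
    (frozenset({1, 2, 3, 5}), -9.0),
    (frozenset({10, 11, 12}), -5.0),
    (frozenset({10, 11, 13}), -5.5),
]

# Score unseen "catalog" compounds without docking them.
catalog = {"new_1": {1, 2, 3, 6}, "new_2": {10, 11, 14}}
predictions = {name: predict_score(bits, training_set, k=2) for name, bits in catalog.items()}
print(predictions)
```

Only the compounds with the best predicted scores would then be passed to Vina itself. As noted earlier in the thread, such a surrogate can be no more accurate than the docking scores it was fitted to.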

mk28 · May 7, 2021, 5:22pm

Hi bharath and thanks for your reply! I have a quick question before I try to build my classification model: what data should I feed into it? More specifically, if I were to run a relatively small library of compounds through Vina, do you have any recommendations on which library would be appropriate? Also, would you know roughly how many compounds I would need to train this model to get reasonably decent accuracy?

bharath · May 9, 2021, 10:10pm

A good bet would be to go for some structural diversity in your initial screening set. Perhaps try to select compounds that are Tanimoto-dissimilar from one another. It’s hard to say how many compounds you would need, but as a rule of thumb, starting with at least 500 would likely be helpful!
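
One common way to pick Tanimoto-dissimilar compounds is greedy MaxMin selection: start from one compound, then repeatedly add the candidate whose nearest already-picked neighbour is farthest away. The plain-Python sketch below is only an illustration under assumptions: fingerprints are given as sets of on-bit indices, the data is made up, and `pick_diverse` is a hypothetical helper rather than a DeepChem API.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pick_diverse(fingerprints, n_pick):
    """Greedy MaxMin selection of n_pick names, seeded with the first compound."""
    names = list(fingerprints)
    if not names or n_pick <= 0:
        return []
    picked = [names[0]]
    # Distance (1 - similarity) from each candidate to its nearest picked compound.
    min_dist = {
        name: 1.0 - tanimoto(fingerprints[name], fingerprints[picked[0]])
        for name in names[1:]
    }
    while len(picked) < n_pick and min_dist:
        best = max(min_dist, key=min_dist.get)  # farthest from the picked set
        picked.append(best)
        del min_dist[best]
        for name in min_dist:
            dist = 1.0 - tanimoto(fingerprints[name], fingerprints[best])
            if dist < min_dist[name]:
                min_dist[name] = dist
    return picked


# Made-up fingerprints forming three loose clusters.
fingerprints = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},    # close to "a"
    "c": {10, 11, 12},
    "d": {10, 11, 13},    # close to "c"
    "e": {20, 21},
}

subset = pick_diverse(fingerprints, 3)
print(subset)
```

This version compares every candidate to each newly picked compound, which is fine for a few thousand molecules but would need a more efficient implementation for very large catalogs.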