I'm quite new to DeepChem and am trying to apply some of its workflows to my own problem. I want to build a machine learning model that classifies whether compounds can bind to an allosteric site of a particular sodium channel, so I am loosely following the workflow in Chapter 11 of Deep Learning for the Life Sciences.

So far I have taken a subset of the ZINC database and run a docking screen with AutoDock Vina on about 10,000 compounds, which gave me a predicted binding affinity for every compound. I then picked somewhat arbitrary affinity cutoffs to split the compounds into actives (predicted to bind well) and decoys (predicted to bind with much lower affinity). When I examined the molecular properties as outlined in the workflow, the actives turned out to have a much higher molecular weight than the decoys, while logP and charge were very similar between the two sets.
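To make the labelling step concrete, here is a minimal sketch of what I mean rather than my actual script. The two cutoffs, the record fields (`affinity`, `mol_wt`, `logp`), and the five example compounds are all made up for illustration; the only real convention is that Vina scores are in kcal/mol and more negative means stronger predicted binding.

```python
from statistics import mean

# Hypothetical cutoffs in kcal/mol (assumed values, not the ones from my screen).
# Vina scores are negative; more negative = stronger predicted binding.
ACTIVE_CUTOFF = -9.0
DECOY_CUTOFF = -6.0


def label_compound(affinity, active_cutoff=ACTIVE_CUTOFF, decoy_cutoff=DECOY_CUTOFF):
    """Return 'active', 'decoy', or None for compounds between the two cutoffs."""
    if affinity <= active_cutoff:
        return "active"
    if affinity >= decoy_cutoff:
        return "decoy"
    return None  # ambiguous middle range, left out of both sets


def mean_property_by_label(records, prop):
    """Average one molecular property separately for actives and decoys."""
    groups = {"active": [], "decoy": []}
    for rec in records:
        label = label_compound(rec["affinity"])
        if label is not None:
            groups[label].append(rec[prop])
    return {label: mean(vals) for label, vals in groups.items() if vals}


# Made-up example data standing in for the docking results.
records = [
    {"id": "cpd1", "affinity": -10.2, "mol_wt": 480.0, "logp": 3.1},
    {"id": "cpd2", "affinity": -9.4, "mol_wt": 450.0, "logp": 2.9},
    {"id": "cpd3", "affinity": -7.5, "mol_wt": 350.0, "logp": 3.0},
    {"id": "cpd4", "affinity": -5.1, "mol_wt": 250.0, "logp": 3.0},
    {"id": "cpd5", "affinity": -4.8, "mol_wt": 270.0, "logp": 3.2},
]

print(mean_property_by_label(records, "mol_wt"))
print(mean_property_by_label(records, "logp"))
```

Even in this toy data the pattern I am seeing shows up: the mean molecular weight differs sharply between the two groups, while logP barely moves.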
My real questions are:

1. Is this a reasonably valid approach for building a rudimentary model, or am I doing something very wrong?
2. Given that the active and decoy sets have quite different average molecular weights, how should I go about fixing this? Should I use a different starting library for Vina, a different approach, or a different pre-processing step?

This is a pretty loose outline without many specifics, so I'm happy to clarify anything in more detail if needed. Any help would be much appreciated!