Featurizing a PDBBind dataset saved on Google Drive

I have downloaded a PDBBind subset (protein-nucleic acid complexes) onto my Google Drive directly from the PDBBind website. Is there a way I can retrieve and featurize this dataset using DeepChem?
Thank you DeepChem team for all you do!

Hi,

I believe you may find the information you want at this link: https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Modeling_Protein_Ligand_Interactions.ipynb.

I hope it helps.

Best regards,
Woo-Jae

Thank you, Woo-Jae, for taking the time to respond. I have watched the suggested tutorial and many more on YouTube. Unfortunately, all of them assume using the load_pdbbind function, which loads the dataset from the PDBBind website. That works fine when loading the small “core” set. However, I am trying to load the “refined” set, and the Colab connection keeps breaking before I get the complete dataset. To get around the problem, I have downloaded the dataset onto my local hard drive, but now I don’t know how to featurize it and use it in my script. Any more information will be greatly appreciated.

This is tricky to do at scale on Colab. My recommendation would be to set up a small dedicated EC2 instance for the calculation, or run it on a good desktop if you have one handy.

You are welcome. I do not know whether the load_pdbbind() function has an argument for featurizing an already-downloaded dataset. Although I am not an expert in this area, you may need to adapt the open-source load_pdbbind() function to do it.

If you follow the featurization used in the tutorial example, it is possible to featurize the downloaded data yourself after removing some poor-quality entries. If I were you, I would start by getting the code working on a small, good-quality dataset first, for example along the lines of the sketch below.
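Roughly, the steps load_pdbbind performs can be reproduced by hand: collect the complex files from your downloaded folder, read the binding-affinity labels from the index file, featurize the (ligand, protein) file pairs, and write the result to a DiskDataset. Below is an untested sketch; the folder paths, index-file name, and file-naming pattern (*_ligand.sdf, *_pocket.pdb) are only placeholders borrowed from the protein-ligand sets, so adjust them to whatever your downloaded subset actually contains.

import os
import numpy as np
import deepchem as dc
from deepchem.feat import AtomicConvFeaturizer

# Placeholder paths -- point these at wherever your download lives on Drive.
data_folder = '/content/drive/My Drive/Colab Notebooks/Datasets/PDBbind_PN'
index_file = os.path.join(data_folder, 'index', 'INDEX_general_PN.2020')

# Read the binding affinities (-logKd/Ki is the fourth column of each
# non-comment line in the PDBBind index files).
labels = {}
with open(index_file) as f:
    for line in f:
        if line.startswith('#'):
            continue
        fields = line.split()
        labels[fields[0]] = float(fields[3])

acf = AtomicConvFeaturizer(frag1_num_atoms=100,
                           frag2_num_atoms=1000,
                           complex_num_atoms=1100,
                           max_num_neighbors=12,
                           neighbor_cutoff=4)

# Collect the (ligand, protein) file pairs that actually exist on disk.
pdb_codes, ligand_files, protein_files, y = [], [], [], []
for code in sorted(labels):
    ligand = os.path.join(data_folder, code, f'{code}_ligand.sdf')
    protein = os.path.join(data_folder, code, f'{code}_pocket.pdb')
    if os.path.exists(ligand) and os.path.exists(protein):
        pdb_codes.append(code)
        ligand_files.append(ligand)
        protein_files.append(protein)
        y.append(labels[code])

# ComplexFeaturizer.featurize takes (ligand_file, protein_file) pairs.
# Complexes that fail to featurize come back as empty arrays; drop them.
features = acf.featurize(list(zip(ligand_files, protein_files)))
keep = [i for i, f in enumerate(features) if getattr(f, 'size', 1) != 0]

dataset = dc.data.DiskDataset.from_numpy(
    X=np.array([features[i] for i in keep], dtype=object),
    y=np.array([y[i] for i in keep]).reshape(-1, 1),
    ids=np.array([pdb_codes[i] for i in keep]),
    data_dir=os.path.join(data_folder, 'featurized'))

Note that this skips the normalization transformer and the train/valid/test split that load_pdbbind normally applies; if you need those, dc.trans.NormalizationTransformer and dc.splits.RandomSplitter can be added on top of the saved dataset.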

I hope it helps.

Best regards,
Woo-Jae

I managed to download and featurize the “refined” PDBBind dataset onto my Google Drive using Colab. Here is how I did it, in case someone else faces the same issues in the future.

First, the featurization uses a lot of RAM, so I had to pay for Colab Pro and use the High-RAM session type.

Below is the code, based on @bharath’s post “Featurize and Save MoleculeNet Datasets to Google Drive with Colab”.

!pip install condacolab

import condacolab
condacolab.install()

!conda install -c conda-forge pycosat mdtraj pdbfixer openmm -y -q

!pip install --pre deepchem

from google.colab import drive
drive.mount('/content/drive')

import os
os.listdir('/content/drive/My Drive/Colab Notebooks/Datasets')

import os
# Point DeepChem's data directory at Google Drive so the featurized dataset
# persists between Colab sessions.
os.environ['DEEPCHEM_DATA_DIR'] = '/content/drive/My Drive/Colab Notebooks/Datasets/'

import deepchem as dc

from deepchem.molnet import load_pdbbind
from deepchem.models import AtomicConvModel
from deepchem.feat import AtomicConvFeaturizer

f1_num_atoms = 100  # maximum number of atoms to consider in the ligand
f2_num_atoms = 1000  # maximum number of atoms to consider in the protein
max_num_neighbors = 12  # maximum number of spatial neighbors for an atom

acf = AtomicConvFeaturizer(frag1_num_atoms=f1_num_atoms,
                           frag2_num_atoms=f2_num_atoms,
                           complex_num_atoms=f1_num_atoms + f2_num_atoms,
                           max_num_neighbors=max_num_neighbors,
                           neighbor_cutoff=4)

tasks, datasets, transformers = load_pdbbind(featurizer=acf,
                                             reload=False,
                                             set_name='refined')

This code featurizes the dataset and saves it to the specified directory. To retrieve the featurized dataset in a later session, I ran the same code as above with reload=True (keeping set_name='refined' so the loader finds the saved refined set):

tasks, datasets, transformers = load_pdbbind(featurizer=acf,
                                             reload=True,
                                             set_name='refined')

train, val, test = datasets
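From here, the featurized splits can go straight into the AtomicConvModel imported above. A rough sketch only; the hyperparameters are illustrative (close to the DeepChem ACNN tutorial defaults), and the fragment sizes must match the featurizer settings:

# Fragment sizes must match the AtomicConvFeaturizer used for featurization.
acm = AtomicConvModel(n_tasks=1,
                      frag1_num_atoms=f1_num_atoms,
                      frag2_num_atoms=f2_num_atoms,
                      complex_num_atoms=f1_num_atoms + f2_num_atoms,
                      max_num_neighbors=max_num_neighbors,
                      batch_size=12,
                      layer_sizes=[32, 32, 16],
                      learning_rate=0.003)

acm.fit(train, nb_epoch=10)  # epoch count chosen arbitrarily for this sketch

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(acm.evaluate(test, metrics=[metric], transformers=transformers))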