Featurize and Save MoleculeNet Datasets to Google Drive with Colab

In previous posts, I’ve explained how to get DeepChem’s stable releases running in Colab (Getting DeepChem running in Colab) and DeepChem HEAD running in Colab (Running DeepChem HEAD in Colab). In this post, I’m going to show you how you can featurize and save a MoleculeNet dataset to Google Drive with Colab.

Why is this useful? Well, once you featurize a MoleculeNet dataset and save it to Google Drive, you can use it in future Colab sessions without needing to re-featurize. This is especially handy if you're on the free tier of Colab and have shorter-lived sessions. Let's get cracking! To start, add a cell that installs DeepChem as usual

# DeepChem 2.3 needs TensorFlow 1.x
%tensorflow_version 1.x
# Install Miniconda into /usr/local, then install DeepChem 2.3.0 with conda
!wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!conda install -y -c deepchem -c rdkit -c conda-forge -c omnia deepchem-gpu=2.3.0
# Make the conda-installed packages importable from Colab's Python
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

This will take about 4 minutes to run on your Colab instance. Once this is done, you need to mount your Google Drive onto your Colab instance. You can do this by running a cell with the following command

from google.colab import drive
drive.mount('/content/drive')

When you run this cell, you'll get a prompt asking you to authorize Colab to access Google Drive. It'll give you a link you can click to get an authorization code, which you then paste back into the cell.

Your Google Drive root is at /content/drive/My Drive/. I recommend making a folder to store your MoleculeNet datasets; I use /content/drive/My Drive/Colab Notebooks/Datasets. You can take a look at what's in your folder by running a cell as follows (replace my location with yours)

import os
os.listdir('/content/drive/My Drive/Colab Notebooks/Datasets')
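If the folder doesn't exist yet, you can create it straight from Colab as well. Here's a minimal sketch using the standard library (swap in your own path)

import os

# Create the Drive folder for featurized datasets if it isn't there yet
os.makedirs('/content/drive/My Drive/Colab Notebooks/Datasets', exist_ok=True)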

The next step is to set DEEPCHEM_DATA_DIR to point to this directory. We can't use the usual Bash export to do this, since each ! shell command in Colab runs in its own subprocess and the variable wouldn't survive into our Python session. Instead, we set it from Python by running a cell with the following

import os
os.environ['DEEPCHEM_DATA_DIR'] = '/content/drive/My Drive/Colab Notebooks/Datasets/'

And now we import DeepChem. Note that we can't import DeepChem before setting DEEPCHEM_DATA_DIR! DeepChem reads environment variables when it's first imported, so the variable needs to be set before we load DeepChem.

import deepchem as dc

And now let’s check that we set DEEPCHEM_DATA_DIR correctly

dc.utils.get_data_dir()

You should see your chosen data directory printed back. Now we're set to load a MoleculeNet dataset

tasks, datasets, transformers = dc.molnet.load_tox21()

When you run this, you should see log messages noting that the dataset is being saved to the data directory you've set. Once the featurization is done, you can check that everything was saved to your data directory by running

os.listdir('/content/drive/My Drive/Colab Notebooks/Datasets')
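While we're at it, note that the datasets value returned above is a tuple of the train/validation/test splits. Here's a minimal sketch of unpacking and inspecting them (the variable names are just a common convention)

# datasets is a (train, valid, test) tuple of DeepChem Dataset objects
train_dataset, valid_dataset, test_dataset = datasets

print(train_dataset.X.shape)  # featurized feature matrix for the training split
print(tasks)                  # the Tox21 task names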

And you’re all set! Now run

tasks, datasets, transformers = dc.molnet.load_tox21()

and note how it reloads from your Google Drive folder rather than re-featurizing. If you're a heavy DeepChem user, I'd recommend taking some time to featurize the MoleculeNet datasets onto Drive. Then when you're doing benchmarking or other work, you'll have a full suite of datasets you can load with no featurization overhead.
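For instance, a one-time cell along these lines would featurize several datasets to Drive in one go. This is a rough sketch; I'm assuming each of these molnet loaders is available in your DeepChem version

import deepchem as dc

# One-time featurization pass: each loader writes its featurized dataset
# into DEEPCHEM_DATA_DIR (your Drive folder) on the first call and
# reloads it from there on every call after that.
loaders = [
    dc.molnet.load_tox21,
    dc.molnet.load_delaney,
    dc.molnet.load_bbbp,
]
for load_fn in loaders:
    tasks, datasets, transformers = load_fn()
    print(load_fn.__name__, 'done:', len(tasks), 'task(s)')

Happy hacking!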

Thanks for posting this. I found this relevant, but in my case I have a huge dataset that I want to pre-featurize, preferably with multiprocessing, and save to disk for future use. How do I do this? Thanks!

This is a good question! At a high level, you can featurize your molecules and store them to disk in a DiskDataset. This can be done using a DataLoader class (if your data is in CSV/SDF files) or with a bit of custom code. I'll try to put up a guide for how to process your own dataset in the next few weeks :slight_smile:
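In the meantime, here's a rough sketch of the CSV route. The column names, file name, and output path are placeholders you'd adapt to your own data, and this assumes the DeepChem 2.3 DataLoader API used above

import deepchem as dc

# Featurize a large CSV of SMILES into a DiskDataset shard by shard,
# so the whole dataset never has to sit in memory at once.
featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.CSVLoader(
    tasks=['my_task'],       # hypothetical label column(s) in your CSV
    smiles_field='smiles',   # hypothetical SMILES column name
    featurizer=featurizer)
dataset = loader.featurize(
    'my_huge_dataset.csv',   # hypothetical input file
    data_dir='/content/drive/My Drive/Colab Notebooks/Datasets/my_dataset',
    shard_size=8192)         # molecules featurized per shard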


Sounds great! Thanks!