How to get ScaffoldSplitter output as a CSV file?

Hello,

I am a newbie to python/deepchem. I need to do a scaffold split on my own dataset (to evaluate ROCS scaffold hopping).

I tried running the example and I am able to get a split, but I don’t know how to write the output to a CSV file. Any help is appreciated.

This is what I did so far with the example dataset (example.csv) provided in deepchem.

import os

import deepchem as dc
import pandas as pd

# Resolve example.csv relative to this script.
current_dir = os.path.dirname(os.path.realpath(__file__))
input_data = os.path.join(current_dir, "example.csv")

tasks = ["log-solubility"]
featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.CSVLoader(tasks=tasks, smiles_field="smiles", featurizer=featurizer)
dataset = loader.featurize(input_data)

# ScaffoldSplitter is constructed without the input file; it splits the featurized dataset.
splitter = dc.splits.ScaffoldSplitter()
train_dataset, test_dataset = splitter.train_test_split(dataset)
len(train_dataset), len(test_dataset)

The output I am trying to get is two CSV files, one for train_dataset and one for test_dataset, ideally with compound IDs.

Thanks
Mohamed

This is unfortunately not as straightforward as it should be. There’s an open GitHub issue discussing how to make this easier. Here’s how you can do this:

train_df = pd.DataFrame(train_dataset.ids, columns=["smiles"])
test_df = pd.DataFrame(test_dataset.ids, columns=["smiles"])
train_df.to_csv("/tmp/train.csv")
test_df.to_csv("/tmp/test.csv")

Your code example didn’t load a compound ID. If you have an ID for each compound and use unique isomeric SMILES, you can keep a dictionary mapping SMILES to IDs and recover the IDs from the SMILES after the split.
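
A minimal sketch of that dictionary lookup, using only the standard library. The helper name write_split_csv, the column name compound_id, the example IDs, and the output file name are all made up for illustration; the plain list stands in for train_dataset.ids, which holds the SMILES strings after the split.

```python
import csv
import os
import tempfile


def write_split_csv(smiles_ids, smiles_to_id, path):
    """Write one row per compound: the ID looked up by SMILES, then the SMILES."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["compound_id", "smiles"])
        for smiles in smiles_ids:
            # Leave the ID empty if this SMILES string is not in the lookup table.
            writer.writerow([smiles_to_id.get(smiles, ""), smiles])


# Hypothetical lookup table; in practice, build it from the ID and SMILES
# columns of your input CSV before featurizing.
smiles_to_id = {"CCO": "CMPD-001", "c1ccccc1": "CMPD-002"}

# Stand-in for train_dataset.ids.
train_ids = ["c1ccccc1", "CCO"]

out_path = os.path.join(tempfile.gettempdir(), "train_with_ids.csv")
write_split_csv(train_ids, smiles_to_id, out_path)
```

This only works if each SMILES string maps to exactly one compound, so duplicates in the input would need to be resolved first. The same call with test_dataset.ids would produce the test file.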


Thanks a lot, Bharath! It’s very helpful.
