Scaffold splitting and loading dataset from dataframe

syedzayyan · July 7, 2023, 9:29am

Hello All,

I am trying to load a dataset from a pandas dataframe. Loading works fine, but when I run a scaffold split on the resulting dataset I get these errors:

[10:24:13] SMILES Parse Error: syntax error while parsing: CHEMBL132806
[10:24:13] SMILES Parse Error: Failed parsing SMILES 'CHEMBL132806' for input: 'CHEMBL132806'

This is the code I am using:

dc_dataset = dc.data.DiskDataset.from_dataframe(df_adn_simple, 
                                                X="canonical_smiles",
                                                y="pKI",
                                                ids="molecule_chembl_id")

splitter_scaffold = dc.splits.ScaffoldSplitter()

scaffold_train_dataset_1, scaffold_test_dataset_1 = splitter_scaffold.train_test_split(dc_dataset)

If I swap the X and ids columns, things work, but then the column names are obviously wonky.

Any help would be appreciated!

arunppsg · July 10, 2023, 7:24am

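Yeah, the column names are wonky. For the scaffold split to work, the ids attribute has to contain SMILES strings: ScaffoldSplitter parses each entry of dataset.ids as a SMILES string to compute its scaffold, which is why it fails on 'CHEMBL132806'.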
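
A minimal sketch of that workaround, assuming the column names from the post above (canonical_smiles, pKI, molecule_chembl_id). The helper names scaffold_split_with_smiles_ids and chembl_ids_for are made up for illustration and are not part of DeepChem. The SMILES column is passed as ids so the splitter can parse it, and the ChEMBL IDs are recovered afterwards from the original dataframe:

```python
import pandas as pd


def scaffold_split_with_smiles_ids(df, smiles_col="canonical_smiles",
                                   target_col="pKI", frac_train=0.8):
    """Build a dataset whose ids are SMILES strings, then scaffold-split it."""
    # Imported here because DeepChem (and RDKit) are heavy dependencies.
    import deepchem as dc

    dataset = dc.data.DiskDataset.from_dataframe(
        df,
        X=smiles_col,    # same X as in the original post
        y=target_col,
        ids=smiles_col,  # ScaffoldSplitter reads SMILES from dataset.ids
    )
    splitter = dc.splits.ScaffoldSplitter()
    return splitter.train_test_split(dataset, frac_train=frac_train)


def chembl_ids_for(dataset_ids, df, smiles_col="canonical_smiles",
                   id_col="molecule_chembl_id"):
    """Map SMILES ids back to ChEMBL IDs using the original dataframe.

    Assumption: if several rows share a SMILES string, the first
    ChEMBL ID encountered is the one returned.
    """
    lookup = df.drop_duplicates(smiles_col).set_index(smiles_col)[id_col]
    return [lookup.loc[smiles] for smiles in dataset_ids]
```

With the dataframe from the question, usage would look roughly like train, test = scaffold_split_with_smiles_ids(df_adn_simple), followed by chembl_ids_for(train.ids, df_adn_simple) to get the ChEMBL IDs of the training molecules.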

syedzayyan · July 10, 2023, 8:59am

Yeah, I figured as much. I think that’s a reason to do a PR. Thanks for the reply.