Saving and loading custom featurized dataset

DocLee · June 13, 2023, 5:58am

I have a very large dataset of ~320,000 labeled molecules that I’ve featurized with MolGraphConvFeaturizer. Featurizing takes around 20 minutes on my local machine, and I need to transfer the result to a GPU server for training. Is there a simple, efficient way to do this? I’ve already created the saved directories with dc.utils.save_dataset_to_disk, but the documentation for the rest of the process is poor and there are no tutorials. A task this critical should be documented thoroughly if DeepChem is meant for real-world applications rather than just for improving training on benchmark datasets.

Ons_Masmoudi · June 13, 2023, 10:12am

Hello, you can save it like this:
loader = dc.data.CSVLoader(tasks=tasks, feature_field="SMILES", featurizer=featurizer)
dataset = loader.create_dataset(data, data_dir="file_name")
and then you can load it with this command:
dataset = dc.data.DiskDataset("file_name")
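
For the transfer step itself, one option is to bundle the saved directory into a single archive, copy that file to the server, and unpack it there. The sketch below uses only the Python standard library and assumes the directory written via data_dir is self-contained, so it can be moved as a unit. The helper names pack_dataset_dir and unpack_dataset_dir are made up for this illustration and are not part of DeepChem.

```python
import shutil
import tarfile
from pathlib import Path


def pack_dataset_dir(data_dir: str, archive_base: str) -> str:
    """Bundle a saved dataset directory into <archive_base>.tar.gz.

    Hypothetical helper (not a DeepChem function). Returns the archive path.
    """
    src = Path(data_dir).resolve()
    if not src.is_dir():
        raise FileNotFoundError(f"{data_dir} is not a directory")
    # root_dir/base_dir make the archive contain only the directory's own
    # name, not the full local path leading up to it.
    return shutil.make_archive(
        archive_base, "gztar", root_dir=str(src.parent), base_dir=src.name
    )


def unpack_dataset_dir(archive_path: str, dest_parent: str) -> Path:
    """Extract the archive under dest_parent and return the restored directory.

    Hypothetical helper (not a DeepChem function).
    """
    dest = Path(dest_parent)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # The first member is the top-level dataset directory itself.
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(dest)
    return dest / top_level
```

On the local machine you would call pack_dataset_dir("file_name", "featurized_bundle"), copy the resulting featurized_bundle.tar.gz to the server with whatever tool you normally use (scp, rsync, etc.), and call unpack_dataset_dir there. The returned path can then be passed to dc.data.DiskDataset as in the load command above, so the 20-minute featurization does not have to be repeated on the server.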