One of our open questions as we work through the DeepChem revamp for the upcoming 3.0 release is what we should do about the deepchem.data.datasets module. This module contains two classes, NumpyDataset and DiskDataset, that provide various convenient methods for handling datasets. Here’s an example of NumpyDataset usage:
In [7]: dataset = NumpyDataset(np.ones((2,2)))
In [8]: dataset
Out[8]: <deepchem.data.datasets.NumpyDataset at 0x7f4e1fecc470>
In [9]: dataset.X
Out[9]:
array([[1., 1.],
       [1., 1.]])
NumpyDataset is a small convenience class that stores data in memory. DiskDataset, on the other hand, stores data on disk by default. Here’s an example of DiskDataset usage:
In [15]: dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)), data_dir="/tmp/datadir_test", verbose=True)
TIMING: dataset construction took 0.010 s
Loading dataset from disk.
In [16]: dataset.X
Out[16]:
array([[1., 1.],
       [1., 1.]])
In particular, DiskDataset has the useful property that datasets are written to disk automatically and can be accessed from memory through a set of accessors (such as dataset.X). This makes it easy for the moleculenet.ai benchmarks to use a common API regardless of the size of the underlying dataset. This code is pretty well optimized and can probably handle datasets up to 100 GB without much effort (although I haven’t rigorously tested this). Under the hood, DiskDataset breaks data up into “shards”, each of which is stored on disk. There’s nothing too exotic here, just some clean engineering.
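To make the sharding idea concrete, here is a minimal sketch of the general pattern. This is not DiskDataset’s actual code, and the helper names (write_shards, iter_shards) and the .npz shard format are just placeholders for illustration:

import os
import numpy as np

def write_shards(X, y, data_dir, shard_size=1000):
    """Split (X, y) into fixed-size shards and save each shard as a .npz file."""
    os.makedirs(data_dir, exist_ok=True)
    paths = []
    for shard_num, start in enumerate(range(0, len(X), shard_size)):
        path = os.path.join(data_dir, "shard-%d.npz" % shard_num)
        np.savez(path, X=X[start:start + shard_size], y=y[start:start + shard_size])
        paths.append(path)
    return paths

def iter_shards(paths):
    """Lazily load shards one at a time so only a single shard is ever in memory."""
    for path in paths:
        with np.load(path) as shard:
            yield shard["X"], shard["y"]

The point is simply that the full dataset never has to fit in memory: writers append one shard at a time, and readers stream shards back lazily.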
I’m trying to see if there’s a more standard machine learning dataset abstraction or package that we could use instead. My current working hypothesis is that we should adopt the PyTorch dataset classes (https://pytorch.org/docs/stable/data.html). These classes are widely used in the PyTorch ecosystem, and they provide fairly clean abstractions such as torch.utils.data.IterableDataset and torch.utils.data.DataLoader.
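As a rough sketch of how on-disk shards might map onto those abstractions (assuming shards saved as .npz files with X and y arrays, as in the sketch above; the class name here is a placeholder, not a proposed API), something along these lines could work:

import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

class ShardedDataset(IterableDataset):
    """Streams samples shard by shard so only one shard is in memory at a time."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        for path in self.shard_paths:
            # Each shard is assumed to be an .npz file holding X and y arrays.
            with np.load(path) as shard:
                X, y = shard["X"], shard["y"]
            for i in range(len(X)):
                yield torch.from_numpy(X[i]), torch.from_numpy(y[i])

# DataLoader then handles batching on top of the stream, e.g.:
# loader = DataLoader(ShardedDataset(paths), batch_size=32)

That would cover streaming and batching, but it isn’t obvious how the rest of the current API (dataset.X style accessors, transforms, splitters) would fit on top, which is part of what I’d like to figure out.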
I’d like to open up the discussion to the community to help figure out a couple of things:
- Is the swap to the PyTorch dataset classes a good idea? Is there a better package we should consider?
- If we make the swap, what additional infrastructure would we need to add on top of the core classes from PyTorch?
CC @peastman, @Vignesh, @patrickhop