One of our open questions as we work through the DeepChem revamp for the upcoming 3.0 release is what we should do about the deepchem.data.datasets module. This module contains two classes, NumpyDataset and DiskDataset, which provide various convenient methods for handling datasets. Here’s an example of NumpyDataset usage:
In [1]: import numpy as np
In [2]: from deepchem.data.datasets import NumpyDataset, DiskDataset
In [7]: dataset = NumpyDataset(np.ones((2,2)))
In [8]: dataset
Out[8]: <deepchem.data.datasets.NumpyDataset at 0x7f4e1fecc470>
In [9]: dataset.X
Out[9]:
array([[1., 1.],
       [1., 1.]])
NumpyDataset is a small convenience class that stores data in memory. DiskDataset, on the other hand, stores data on disk by default. Here’s an example of DiskDataset usage:
In [15]: dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)), data_dir="/tmp/datadir_test", verbose=True)
TIMING: dataset construction took 0.010 s
Loading dataset from disk.
In [16]: dataset.X
Out[16]:
array([[1., 1.],
       [1., 1.]])
In particular, DiskDataset has the useful property that datasets are written to disk automatically and can be accessed from memory through a set of accessors (such as dataset.X). This makes it easy for the moleculenet.ai benchmarks to use a common API regardless of the size of the underlying dataset. This code is pretty well optimized and can probably handle datasets up to 100 GB without much effort (although I haven’t rigorously tested this). Under the hood, DiskDataset breaks the data up into “shards”, each of which is stored on disk. There’s nothing too exotic here, just some clean engineering.
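To make that concrete, here’s a rough sketch of working with shards directly. It assumes the itershards() accessor and that it yields (X, y, w, ids) numpy arrays one shard at a time, which is how I remember the current API; treat it as illustrative rather than exact:

```python
import numpy as np
from deepchem.data.datasets import DiskDataset

# Same toy dataset as above, written to disk in shards.
dataset = DiskDataset.from_numpy(np.ones((2, 2)), np.ones((2, 1)),
                                 data_dir="/tmp/datadir_test")

# Shards are loaded from disk lazily, so only one shard's worth of data
# needs to be in memory at any given time.
for X, y, w, ids in dataset.itershards():
    print(X.shape, y.shape)
```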
I’m trying to see if there’s a more standard machine learning dataset abstraction or package that I can use instead. My current working hypothesis is that we should use the PyTorch dataset classes (https://pytorch.org/docs/stable/data.html). These classes are widely used in the PyTorch ecosystem, and the docs linked above introduce some pretty clean abstractions such as torch.utils.data.IterableDataset and torch.utils.data.DataLoader.
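To make the proposal a little more concrete, here’s a very rough sketch (the class name and the itershards()-based wiring are just illustrative, not a design) of how a sharded DiskDataset could be exposed through torch.utils.data.IterableDataset, with DataLoader handling batching:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class ShardedTorchDataset(IterableDataset):
    """Illustrative wrapper that streams samples out of a sharded,
    on-disk dataset one shard at a time."""

    def __init__(self, disk_dataset):
        self.disk_dataset = disk_dataset

    def __iter__(self):
        # Assumes an itershards()-style accessor so memory use stays
        # bounded by the shard size rather than the dataset size.
        for X, y, w, ids in self.disk_dataset.itershards():
            for i in range(X.shape[0]):
                yield torch.from_numpy(X[i]), torch.from_numpy(y[i])

# DataLoader then provides batching (and optional multi-process loading):
# loader = DataLoader(ShardedTorchDataset(dataset), batch_size=32)
```

One wrinkle worth noting: with IterableDataset, shuffling and splitting work across multiple DataLoader workers is our responsibility rather than something DataLoader handles for us, so some of the shard bookkeeping would still live on our side.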
I’d like to open up the discussion to the community to help figure out a couple of things:
- Is the swap to the PyTorch dataset classes a good idea? Is there a better package we should consider?
- If we make the swap, what additional infrastructure would we need to add on top of the core classes from PyTorch?
CC @peastman, @Vignesh, @patrickhop