One of our open questions as we work through the DeepChem revamp for the upcoming 3.0 release is what we should do about the deepchem.data.datasets module. This module contains two classes, NumpyDataset and DiskDataset, that provide various convenient methods for handling datasets. Here’s an example of NumpyDataset usage:
In [7]: dataset = NumpyDataset(np.ones((2,2)))
In [8]: dataset
Out[8]: <deepchem.data.datasets.NumpyDataset at 0x7f4e1fecc470>
In [9]: dataset.X
Out[9]:
array([[1., 1.],
       [1., 1.]])
NumpyDataset is a small convenience class that stores data in memory. DiskDataset, on the other hand, stores data on disk by default. Here’s an example of DiskDataset usage:
In [15]: dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)), data_dir="/tmp/datadir_test", verbose=True)
TIMING: dataset construction took 0.010 s
Loading dataset from disk.
In [16]: dataset.X
Out[16]:
array([[1., 1.],
       [1., 1.]])
In particular, DiskDataset has the useful property that datasets are written to disk automatically and can be accessed from memory through a set of accessors (such as dataset.X). This makes it easy for the moleculenet.ai benchmarks to use a common API regardless of the size of the underlying dataset. This code is pretty well optimized and can probably handle datasets up to 100 GB without much effort (although I haven’t rigorously tested this). Under the hood, DiskDataset breaks data up into “shards”, each of which is stored on disk. There’s nothing too exotic here, just some clean engineering.
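To make the sharding idea concrete, here is a minimal sketch of the general pattern. This is not DiskDataset’s actual code, and the helper names (write_shards, iter_shards) and the .npz shard format are just placeholders for illustration:

import os
import numpy as np

def write_shards(X, y, data_dir, shard_size=1000):
    """Split (X, y) into fixed-size shards and save each shard as a .npz file."""
    os.makedirs(data_dir, exist_ok=True)
    paths = []
    for shard_num, start in enumerate(range(0, len(X), shard_size)):
        path = os.path.join(data_dir, "shard-%d.npz" % shard_num)
        np.savez(path, X=X[start:start + shard_size], y=y[start:start + shard_size])
        paths.append(path)
    return paths

def iter_shards(paths):
    """Lazily load shards one at a time so only a single shard is ever in memory."""
    for path in paths:
        with np.load(path) as shard:
            yield shard["X"], shard["y"]

The point is simply that the full dataset never has to fit in memory: writers append one shard at a time, and readers stream shards back lazily.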
I’m trying to see if there’s a more standard machine learning dataset abstraction or package that we could use instead. My current working hypothesis is that we should adopt the PyTorch dataset classes (https://pytorch.org/docs/stable/data.html). These classes are widely used in the PyTorch ecosystem, and they provide fairly clean abstractions such as torch.utils.data.IterableDataset and torch.utils.data.DataLoader.
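As a rough sketch of how on-disk shards might map onto those abstractions (assuming shards saved as .npz files with X and y arrays, as in the sketch above; the class name here is a placeholder, not a proposed API), something along these lines could work:

import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

class ShardedDataset(IterableDataset):
    """Streams samples shard by shard so only one shard is in memory at a time."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        for path in self.shard_paths:
            # Each shard is assumed to be an .npz file holding X and y arrays.
            with np.load(path) as shard:
                X, y = shard["X"], shard["y"]
            for i in range(len(X)):
                yield torch.from_numpy(X[i]), torch.from_numpy(y[i])

# DataLoader then handles batching on top of the stream, e.g.:
# loader = DataLoader(ShardedDataset(paths), batch_size=32)

That would cover streaming and batching, but it isn’t obvious how the rest of the current API (dataset.X style accessors, transforms, splitters) would fit on top, which is part of what I’d like to figure out.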
I’d like to open up the discussion to the community to help figure out a couple of things:
- Is the swap to the PyTorch dataset classes a good idea? Is there a better package we should consider?
- If we make the swap, what additional infrastructure would we need to add on top of the core classes from PyTorch?
CC @peastman, @Vignesh, @patrickhop