Using different transformations on different input columns

What is the canonical way to apply different transformations to different columns of a dataset? From the docs:

import numpy as np
import deepchem as dc

n_samples = 10
n_features = 3
n_tasks = 1
ids = np.arange(n_samples)
X = np.random.rand(n_samples, n_features)
y = np.random.rand(n_samples, n_tasks)
w = np.ones((n_samples, n_tasks))
dataset = dc.data.NumpyDataset(X, y, w, ids)
transformer = dc.trans.NormalizationTransformer(transform_X=True, dataset=dataset)
dataset = transformer.transform(dataset)

dataset.X has three columns, and the code above transforms all of dataset.X. But what if one needs to apply a different transformation to each column of the input?

Good question; I don’t know if we have a good API for this at present. If I were doing it, I might split the dataset into three single-column datasets, apply the transformations, and then rejoin them. It’s a bit kludgy but should do what you’re looking for.
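
Roughly, a sketch of that split-transform-rejoin idea. The per-column transformer choices are just illustrative, and the splitting/stacking code is not an existing DeepChem API, only plain NumPy on top of NumpyDataset:

import numpy as np
import deepchem as dc

n_samples, n_features, n_tasks = 10, 3, 1
X = np.random.rand(n_samples, n_features)
y = np.random.rand(n_samples, n_tasks)
w = np.ones((n_samples, n_tasks))
ids = np.arange(n_samples)
dataset = dc.data.NumpyDataset(X, y, w, ids)

# One transformer class per input column (illustrative choices).
column_transformers = [
    dc.trans.NormalizationTransformer,
    dc.trans.MinMaxTransformer,
    dc.trans.LogTransformer,
]

transformed_cols = []
for col, trans_cls in enumerate(column_transformers):
    # Single-column dataset; y, w and ids are carried along unchanged
    # so the pieces stay aligned by position.
    col_dataset = dc.data.NumpyDataset(
        dataset.X[:, col:col + 1], dataset.y, dataset.w, dataset.ids)
    transformer = trans_cls(transform_X=True, dataset=col_dataset)
    transformed_cols.append(transformer.transform(col_dataset).X)

# Rejoin: stack the transformed columns and rebuild the dataset.
X_new = np.hstack(transformed_cols)
dataset = dc.data.NumpyDataset(X_new, dataset.y, dataset.w, dataset.ids)

Note that the rejoin is purely positional (np.hstack), so this only works if nothing reorders the rows in between.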

Yes, I was thinking about a combination of a dictionary and the merge method for datasets, but the source code has the axis hard-coded as 0 (i.e. the concatenation works on samples).

Maybe adding an axis argument to merge would be worth considering.

This might be useful! The tricky thing is that you would have to handle the merges carefully (say, if two datasets have the same datapoints but in different orders, would the method be responsible for merging them intelligently?).
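
For the easy case at least, here is a minimal sketch of what a column-wise merge could look like; merge_columns is a hypothetical helper, not an existing dataset method, and it simply refuses to guess when the ids don’t line up:

import numpy as np
import deepchem as dc

def merge_columns(datasets):
    # Hypothetical column-wise merge (roughly what an axis=1 option might do).
    # Only handles the easy case: identical ids in identical order.
    first = datasets[0]
    for d in datasets[1:]:
        if not np.array_equal(d.ids, first.ids):
            raise ValueError("ids differ or are ordered differently; "
                             "intelligent re-alignment is not attempted here")
    X = np.hstack([d.X for d in datasets])
    return dc.data.NumpyDataset(X, first.y, first.w, first.ids)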

Good point. One issue here is that we’re handling NumPy arrays. If we had pandas DataFrames, we could always use the index as the basis for all kinds of joins.
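
For example, an index-based join takes care of row alignment even when the pieces are stored in different orders; a minimal pandas illustration with made-up column names:

import numpy as np
import pandas as pd

# Two column-wise pieces of the same samples, stored in different row orders.
ids = [f"mol{i}" for i in range(5)]
left = pd.DataFrame({"feat_a": np.random.rand(5)}, index=ids)
right = pd.DataFrame({"feat_b": np.random.rand(5)}, index=ids).sample(frac=1)

# join aligns on the index, so the differing row order is handled for us.
joined = left.join(right)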

Perhaps a decorator-based implementation could help perform the join when the input comes as pandas DataFrames, but I have a feeling that would require changes all along the pipeline.
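
Purely as a sketch of the shape I have in mind (join_dataframe_inputs and featurize are made-up names, not anything in DeepChem):

import functools
import pandas as pd

def join_dataframe_inputs(func):
    # Made-up decorator: if the wrapped step receives several DataFrames,
    # join them on their index before handing a single frame downstream.
    @functools.wraps(func)
    def wrapper(*frames, **kwargs):
        if len(frames) > 1 and all(isinstance(f, pd.DataFrame) for f in frames):
            return func(frames[0].join(list(frames[1:])), **kwargs)
        return func(*frames, **kwargs)
    return wrapper

@join_dataframe_inputs
def featurize(df):
    # Stand-in for whatever downstream step consumes the joined frame.
    return df.to_numpy()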