Using different transformations on different input columns

What is the canonical way to apply different transformations to different columns of a dataset? From the docs:

import numpy as np
import deepchem as dc

n_samples = 10
n_features = 3
n_tasks = 1
ids = np.arange(n_samples)
X = np.random.rand(n_samples, n_features)
y = np.random.rand(n_samples, n_tasks)
w = np.ones((n_samples, n_tasks))
dataset = dc.data.NumpyDataset(X, y, w, ids)
transformer = dc.trans.NormalizationTransformer(transform_X=True, dataset=dataset)
dataset = transformer.transform(dataset)

dataset.X has three columns, and the code above transforms all of dataset.X. But what if one needs to apply a different transformation to each column of the input?

Good question; I don’t know if we have a good API for this at present. If I were doing it, I might split the dataset into three single-column datasets, apply the transformations, and then rejoin them. It’s a bit kludgy but should do what you’re looking for.
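
Roughly, a sketch of that split-transform-rejoin idea. The per-column transformer choices are just illustrative, and the splitting/stacking code is not an existing DeepChem API, only plain NumPy on top of NumpyDataset:

import numpy as np
import deepchem as dc

n_samples, n_features, n_tasks = 10, 3, 1
X = np.random.rand(n_samples, n_features)
y = np.random.rand(n_samples, n_tasks)
w = np.ones((n_samples, n_tasks))
ids = np.arange(n_samples)
dataset = dc.data.NumpyDataset(X, y, w, ids)

# One transformer class per input column (illustrative choices).
column_transformers = [
    dc.trans.NormalizationTransformer,
    dc.trans.MinMaxTransformer,
    dc.trans.LogTransformer,
]

transformed_cols = []
for col, trans_cls in enumerate(column_transformers):
    # Single-column dataset; y, w and ids are carried along unchanged
    # so the pieces stay aligned by position.
    col_dataset = dc.data.NumpyDataset(
        dataset.X[:, col:col + 1], dataset.y, dataset.w, dataset.ids)
    transformer = trans_cls(transform_X=True, dataset=col_dataset)
    transformed_cols.append(transformer.transform(col_dataset).X)

# Rejoin: stack the transformed columns and rebuild the dataset.
X_new = np.hstack(transformed_cols)
dataset = dc.data.NumpyDataset(X_new, dataset.y, dataset.w, dataset.ids)

Note that the rejoin is purely positional (np.hstack), so this only works if nothing reorders the rows in between.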

Yes, I was thinking about a combination of a dictionary and the merge method for datasets, but the source code has the axis hard-coded as 0 (i.e. the concatenation works on samples).

Maybe adding an axis argument to merge would be worth considering.

This might be useful! The tricky thing is that you would have to handle the merges carefully (say, if two datasets have the same datapoints but in different orders, would the method be responsible for merging them intelligently?).
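
For the easy case at least, here is a minimal sketch of what a column-wise merge could look like; merge_columns is a hypothetical helper, not an existing dataset method, and it simply refuses to guess when the ids don’t line up:

import numpy as np
import deepchem as dc

def merge_columns(datasets):
    # Hypothetical column-wise merge (roughly what an axis=1 option might do).
    # Only handles the easy case: identical ids in identical order.
    first = datasets[0]
    for d in datasets[1:]:
        if not np.array_equal(d.ids, first.ids):
            raise ValueError("ids differ or are ordered differently; "
                             "intelligent re-alignment is not attempted here")
    X = np.hstack([d.X for d in datasets])
    return dc.data.NumpyDataset(X, first.y, first.w, first.ids)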

Good point. One issue here is that we’re handling NumPy arrays. If we had pandas DataFrames, we could always use the index as the basis for all kinds of joins.
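
For example, an index-based join takes care of row alignment even when the pieces are stored in different orders; a minimal pandas illustration with made-up column names:

import numpy as np
import pandas as pd

# Two column-wise pieces of the same samples, stored in different row orders.
ids = [f"mol{i}" for i in range(5)]
left = pd.DataFrame({"feat_a": np.random.rand(5)}, index=ids)
right = pd.DataFrame({"feat_b": np.random.rand(5)}, index=ids).sample(frac=1)

# join aligns on the index, so the differing row order is handled for us.
joined = left.join(right)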

Perhaps a decorator-based implementation could help perform the join when the input comes as pandas DataFrames, but I have a feeling that would require changes all along the pipeline.
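
Purely as a sketch of the shape I have in mind (join_dataframe_inputs and featurize are made-up names, not anything in DeepChem):

import functools
import pandas as pd

def join_dataframe_inputs(func):
    # Made-up decorator: if the wrapped step receives several DataFrames,
    # join them on their index before handing a single frame downstream.
    @functools.wraps(func)
    def wrapper(*frames, **kwargs):
        if len(frames) > 1 and all(isinstance(f, pd.DataFrame) for f in frames):
            return func(frames[0].join(list(frames[1:])), **kwargs)
        return func(*frames, **kwargs)
    return wrapper

@join_dataframe_inputs
def featurize(df):
    # Stand-in for whatever downstream step consumes the joined frame.
    return df.to_numpy()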