With the normalization transformer, one can choose whether or not to pass the dataset at initialization:
transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
The dataset is then used to compute X_means and X_stds, or y_means and y_stds, depending on which quantity is being transformed:
if dataset is not None and transform_X:
    X_means, X_stds = dataset.get_statistics(X_stats=True, y_stats=False)
    self.X_means = X_means
    self.X_stds = X_stds
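For reference, the statistics in question are just per-feature means and standard deviations. A pure-NumPy sketch (not the actual DeepChem implementation) of what these values look like:

```python
import numpy as np

# Toy feature matrix: 4 samples, 3 features.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])

# Per-column statistics, analogous to what get_statistics returns.
X_means = X.mean(axis=0)
X_stds = X.std(axis=0)

print(X_means)  # [2.5 5.  7.5]
```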
From the source code, however, it seems that while these statistics are not computed when no dataset is passed to __init__(), the transform_array method still looks for them:
if self.transform_X:
    if not hasattr(self, 'move_mean') or self.move_mean:
        X = np.nan_to_num((X - self.X_means) / self.X_stds)
This seems to be a bug: if no dataset was ever passed, self.X_means and self.X_stds do not exist and the transform fails. A possible workaround is to set these attributes manually, but that is cumbersome.
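To illustrate the manual workaround, here is a minimal sketch in plain NumPy. The stand-in class below is hypothetical and only mimics the transform_array logic quoted above; it is not the real NormalizationTransformer:

```python
import numpy as np

class StandInTransformer:
    """Hypothetical stand-in mimicking the relevant part of
    NormalizationTransformer, for illustration only."""

    def __init__(self, transform_X=True):
        self.transform_X = transform_X
        # X_means / X_stds are deliberately NOT set here when no dataset
        # is given, mirroring the behavior described above.

    def transform_array(self, X):
        if self.transform_X:
            # Raises AttributeError if X_means / X_stds were never set.
            return np.nan_to_num((X - self.X_means) / self.X_stds)
        return X

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
t = StandInTransformer()

# Workaround: set the statistics manually before transforming.
t.X_means = X.mean(axis=0)
t.X_stds = X.std(axis=0)
Xt = t.transform_array(X)
print(Xt)  # columns now have zero mean and unit variance
```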