Hyperparameter optimization: random split vs scaffold split?

Hi Guys,

I was wondering whether I need different hyperparameters for my model depending on whether I use a random split or a scaffold split.

So let's say I want to compare the performance of my model with a random split and then with a scaffold split. Would I optimize the hyperparameters for one of the two and then use the same ones for the other model, or would I optimize them separately?

Is there a consensus, or are there papers that deal with a similar issue?

I’m not an expert on this, but I can’t think of a good reason for using different hyperparameters. Your training set will be very similar whichever splitting method you use. After all, it contains most of your data, so all splits will produce training sets that have the majority of their samples in common.
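To put a rough number on that overlap: any two training sets that each hold 80% of the same dataset must share at least 60% of the dataset (so at least 75% of each training set), whatever the splitting method. Here is a minimal sketch that checks this for two random splits. The dataset size, the 80% fraction, and the seeds are assumptions picked for illustration.

```python
import random

n = 1000                 # hypothetical dataset size
frac_train = 0.8         # assumed training fraction
n_train = round(frac_train * n)

def random_train_set(seed):
    """Return the set of training indices for one shuffled split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return set(idx[:n_train])

train_a = random_train_set(seed=0)
train_b = random_train_set(seed=1)

shared = len(train_a & train_b)
# Inclusion-exclusion: |A & B| >= |A| + |B| - n
lower_bound = 2 * n_train - n

print(f"shared training samples: {shared} (guaranteed minimum: {lower_bound})")
```

The bound holds for any pair of splits with these fractions, including a random split paired with a scaffold split, because it only depends on the set sizes.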

The real difference is in the test and validation sets. Do they have samples that are overly similar to the training data, or have you carefully chosen test samples to be distinct from the training ones?

Seconding @peastman’s opinion here. It’s typically better to use a common set of hyperparameters. Usually the goal is to find a robust set of hyperparameters that generalizes to unseen examples from elsewhere (when the model is used in the wild).

Yes, I understand. But I think a random split is an easier task for the model, because it does not need to predict for scaffolds it has not seen before.

However, with scaffold splitting you ask the model to make out-of-distribution predictions (for molecules whose scaffolds it has never seen during training), so maybe different hyperparameters favor one approach or the other.
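One way to see the difference between the two tasks is to measure how many test molecules have a scaffold that also appears in the training set. The sketch below assumes the scaffolds are already available as strings (for example Murcko scaffolds computed with RDKit); the scaffold labels, the dataset size, and the helper name `seen_scaffold_fraction` are all made up for illustration.

```python
import random

def seen_scaffold_fraction(train_idx, test_idx, scaffolds):
    """Fraction of test samples whose scaffold also occurs in the training set."""
    train_scaffolds = {scaffolds[i] for i in train_idx}
    seen = sum(scaffolds[i] in train_scaffolds for i in test_idx)
    return seen / len(test_idx)

rng = random.Random(0)
# Hypothetical data: 500 molecules spread over up to 50 scaffolds.
scaffolds = [f"scaffold_{rng.randrange(50)}" for _ in range(500)]

# Random split: shuffle the molecules and cut at 80%.
idx = list(range(len(scaffolds)))
rng.shuffle(idx)
random_train, random_test = idx[:400], idx[400:]

# Scaffold-grouped split: hold out whole scaffolds (about 20% of them).
unique = sorted(set(scaffolds))
held_out = set(rng.sample(unique, k=len(unique) // 5))
group_train = [i for i, s in enumerate(scaffolds) if s not in held_out]
group_test = [i for i, s in enumerate(scaffolds) if s in held_out]

random_seen = seen_scaffold_fraction(random_train, random_test, scaffolds)
group_seen = seen_scaffold_fraction(group_train, group_test, scaffolds)

print(f"random split:   {random_seen:.0%} of test molecules have a seen scaffold")
print(f"scaffold split: {group_seen:.0%} of test molecules have a seen scaffold")
```

By construction the grouped split gives zero overlap, while a random split usually leaves most test scaffolds represented in the training data.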


In general, I’d recommend choosing the hardest split possible when choosing model parameters. Random is definitely an easier task than scaffold. Scaffold has some issues as well. Time splits are likely better if you’re lucky enough to have timestamps on data (and I’ve seen some time-scaffold splits floating around).
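For reference, a time split can be sketched in a few lines: sort by timestamp and hold out the most recent samples. The dates below are hypothetical placeholders and `time_split` is an illustrative helper, not a library function.

```python
from datetime import date

def time_split(timestamps, frac_train=0.8):
    """Train on the oldest samples, test on the most recent ones."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = round(frac_train * len(order))
    return order[:cut], order[cut:]

# Hypothetical measurement dates for 40 compounds.
timestamps = [date(2015 + i % 8, 1 + i % 12, 1) for i in range(40)]

train_idx, test_idx = time_split(timestamps)
latest_train = max(timestamps[i] for i in train_idx)
earliest_test = min(timestamps[i] for i in test_idx)

print(f"train up to {latest_train}, test from {earliest_test}")
```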

It might help us go deeper if we understood your application. Are you benchmarking for a research paper? Are you trying to create a model to put into production?

Remember, your goal is to create a model that works well on production data. If the test set is overly similar to the training set, that doesn’t make the problem any easier. It just means your test set isn’t useful for determining whether you’ve solved it!


I am trying to benchmark for a research paper. I want to evaluate different models on a classification task. In some benchmarking papers I have seen that the analysis is often carried out once for a random split and then for a scaffold split (and also for a time split if available). But they often do not discuss the hyperparameter optimization in much detail.

If I did only one hyperparameter optimization per model, I would probably use the scaffold split to optimize the model and then also evaluate its performance on the random split. But I could also optimize the two independently of each other, with the argument that out-of-distribution prediction is a different task and needs different hyperparameters.


Ah I see, that clarifies a bit, thanks!

I think best practice would be to optimize using the scaffold split first and then evaluate on random, as you suggest. This is a little closer to “real world” challenges. What do you think, @peastman?

If you did that, you’d have some of the same samples in both the training and test sets. That would be a totally invalid comparison.

If you want to compare the effects of different splits, I think the best thing to do is to fit and test separately for each one, using the same hyperparameters in each case.

@bharath Thanks for your patience. I think I will do it the way you suggested.

@peastman Maybe my explanation was confusing. I would like to compare the model performance given a scaffold split and a random split. But I would not use the model trained on the scaffold split to predict the test set produced by the random split.

So this would be my design:

  1. Split by Scaffold

  2. Hyperparameter optimization

  3. Train & Evaluate Model (using the scaffold hyper-parameters & data based on scaffold split)

  4. Now split the data randomly

  5. Train & Evaluate Model (using scaffold hyper-parameters & data based on random split)
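The five steps above could be sketched roughly as follows. This is only a toy illustration under several assumptions: the data are synthetic, a 1-D k-nearest-neighbour classifier stands in for the real model, `k` stands in for the real hyperparameters, and the largest-groups-first scaffold split is one common heuristic rather than any particular library's implementation.

```python
import random

def random_split(n, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices and cut them into train/valid/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a = round(frac_train * n)
    b = a + round(frac_valid * n)
    return idx[:a], idx[a:b], idx[b:]

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups, largest first, so no scaffold spans two sets."""
    groups = {}
    for i, s in enumerate(scaffolds):
        groups.setdefault(s, []).append(i)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= frac_train * n:
            train += members
        elif len(valid) + len(members) <= frac_valid * n:
            valid += members
        else:
            test += members
    return train, valid, test

def knn_predict(train_xy, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_xy, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train_idx, eval_idx, X, y, k):
    """Fit on train_idx (kNN just stores the points) and score on eval_idx."""
    train_xy = [(X[i], y[i]) for i in train_idx]
    hits = sum(knn_predict(train_xy, X[i], k) == y[i] for i in eval_idx)
    return hits / len(eval_idx)

def run_design(X, y, scaffolds, k_grid=(1, 3, 5, 7)):
    # Steps 1-2: scaffold split, then tune k on the scaffold validation set.
    s_train, s_valid, s_test = scaffold_split(scaffolds)
    best_k = max(k_grid, key=lambda k: accuracy(s_train, s_valid, X, y, k))
    # Step 3: evaluate on the scaffold test set with the tuned k.
    scaffold_acc = accuracy(s_train, s_test, X, y, best_k)
    # Steps 4-5: new random split, retrain and evaluate with the same k.
    r_train, _, r_test = random_split(len(X))
    random_acc = accuracy(r_train, r_test, X, y, best_k)
    return best_k, scaffold_acc, random_acc

# Synthetic data: 200 samples, 20 scaffolds, one noisy feature per sample.
rng = random.Random(0)
scaffolds = [f"scaffold_{i % 20}" for i in range(200)]
X = [(i % 20) + rng.gauss(0, 0.3) for i in range(200)]
y = [1 if x + rng.gauss(0, 1) > 10 else 0 for x in X]

best_k, scaffold_acc, random_acc = run_design(X, y, scaffolds)
print(f"k={best_k}  scaffold-split acc={scaffold_acc:.2f}  random-split acc={random_acc:.2f}")
```

Note that the model is refit from scratch on each split's own training set, so no test sample from either split is ever seen during the corresponding training run.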


Yes, that makes sense.