
The monograph Cross Validation contains a section on nested cross-validation for hyper-parameter optimisation (page 6). The author refers to this paper for the reason why it is better to decouple hyper-parameter search from model selection, but I did not find an intuitive, understandable answer there. In short, my question is:

Why use nested validation for hyper-parameter search and model selection, if an ML algorithm with two different hyper-parameter values can be seen as two different ML algorithms (and it seems fine to use the flat validation approach for selecting the best ML algorithm)?

To make the question precise, I describe below exactly what I mean by nested and flat validation. For simplicity, I omit the "cross" part and do not divide the data into folds -- the core of the question remains the same, and I believe the reasons stay the same as well.

HP-tuning and model search using flat validation

This procedure divides the data set into two parts, a training set A and a validation set V:

[Figure: data split in flat validation]

best_model_family = none
best_hp = none
best_model = none
best_score = -infinity                        // so the first comparison succeeds
for each model_family:
    for each hyper-parameter value hp of model_family:
        model = model_family.train(hp, A)     // train on A
        score = evaluate(model, V)            // evaluate on V
        if score > best_score:                // larger is better
            best_model_family = model_family
            best_hp = hp
            best_model = model
            best_score = score

After completing the procedure, best_hp contains the hyper-parameter value that yields the highest-scoring model, and best_score is taken as the performance estimate for the production model, which is trained on the whole data set: model = best_model_family.train(best_hp, A \cup V).
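
For concreteness, here is a minimal runnable sketch of this flat procedure in Python with scikit-learn; the model families, hyper-parameter grids, and synthetic data are purely illustrative stand-ins, not part of the original setup.

    # Flat validation: one split A/V, used both for hyper-parameter search and
    # for picking the best model family. (Illustrative families and grids.)
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)
    X_A, X_V, y_A, y_V = train_test_split(X, y, test_size=0.3, random_state=0)  # A and V

    search_space = {
        KNeighborsClassifier: [{"n_neighbors": k} for k in (1, 3, 5, 7)],
        SVC: [{"C": c} for c in (0.1, 1.0, 10.0)],
    }

    best = {"family": None, "hp": None, "score": -float("inf")}
    for family, grid in search_space.items():
        for hp in grid:
            model = family(**hp).fit(X_A, y_A)   # train on A
            score = model.score(X_V, y_V)        # evaluate on V
            if score > best["score"]:            # larger is better
                best = {"family": family, "hp": hp, "score": score}

    # Production model: retrain the winning (family, hp) on all of the data (A ∪ V).
    production_model = best["family"](**best["hp"]).fit(X, y)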

The author says that this approach is prone to over-fitting, because the best model and the best hyper-parameters are picked using the same data. I do not understand why the mere fact of using the same data set for both searches leads to over-fitting. To me, using different hyper-parameter values is akin to using different model families. Consider, for instance, the Nearest Neighbours algorithm, and let its hyper-parameter be the number of neighbours. To me, NN(3) describes a family of models different from NN(4). It is considered OK to pick the best model family using flat validation; however, once we view the number of neighbours as a hyper-parameter, it is no longer OK to use flat validation. What do I miss here? For reference, I now describe the nested-validation approach to hp-tuning and model selection.

HP-tuning in a nested loop (nested validation)

The nested validation approach divides the data set into three parts: the set A from before is further divided into two parts, A = A' \cup B. Hyper-parameter tuning is performed by training on A' and evaluating on B (this inner search is why the approach is called "nested"), whereas model selection is performed as before, by training on A = A' \cup B and evaluating on V, using the previously found best hyper-parameters:

[Figure: data split in nested validation]

for each model_family:
    best_score = -infinity
    for each hyper-parameter value hp of model_family:
        model = model_family.train(hp, A')     // train on A'
        score = evaluate(model, B)             // evaluate on B
        if score > best_score:                 // larger is better
            best_hp[model_family] = hp
            best_score = score

best_model_family = none
best_score = -infinity
for each model_family:
    model = model_family.train(best_hp[model_family], A' \cup B)
    score = evaluate(model, V)
    if score > best_score:                     // larger is better
        best_model_family = model_family
        best_score = score

After executing the procedure, best_model_family is the ML algorithm that we want to use in production, and we train it on the whole data set using the hyper-parameter value best_hp[best_model_family].
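
And a matching sketch of the nested split (same illustrative families and grids as in the flat sketch above): hyper-parameters are chosen on the inner split A'/B, and V is used only once per family, to compare the tuned families.

    # Nested validation: tune each family on A'/B, compare tuned families on V.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)
    X_A, X_V, y_A, y_V = train_test_split(X, y, test_size=0.3, random_state=0)        # A, V
    X_Ap, X_B, y_Ap, y_B = train_test_split(X_A, y_A, test_size=0.3, random_state=0)  # A', B

    search_space = {
        KNeighborsClassifier: [{"n_neighbors": k} for k in (1, 3, 5, 7)],
        SVC: [{"C": c} for c in (0.1, 1.0, 10.0)],
    }

    # Inner loop: pick each family's hyper-parameters using A' and B only.
    best_hp = {}
    for family, grid in search_space.items():
        best_score = -float("inf")
        for hp in grid:
            score = family(**hp).fit(X_Ap, y_Ap).score(X_B, y_B)
            if score > best_score:
                best_hp[family], best_score = hp, score

    # Outer loop: compare the tuned families, training on A = A' ∪ B, evaluating on V.
    best_family, best_score = None, -float("inf")
    for family in search_space:
        score = family(**best_hp[family]).fit(X_A, y_A).score(X_V, y_V)
        if score > best_score:
            best_family, best_score = family, score

    production_model = best_family(**best_hp[best_family]).fit(X, y)  # retrain on all data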

  • "It is considered OK to pick the best model family using flat validation". It is not okay in general. See here. I'm not too familiar with nested CV, but it looks like your pseudocode will result in the same problem if there are many families to select among. It's a multiple-comparisons problem. In general, I don't think model selection cares to distinguish between families and HPs. A model in this context is basically a tuple (family, HPs) which specifies how training will happen. – Commented Nov 26, 2024 at 8:11
  • @chicxulub thanks, I think that is the key to understanding, and a bit unexpected. (I am now watching the lectures linked in one of your answers; they seem to contain the theory needed to properly reason about the problem, and I will comment afterwards.) – Commented Nov 26, 2024 at 14:37
  • "I don't think model selection cares to distinguish between families and HPs." Indeed - feature selection is also an element of model selection, as is architecture selection for deep NNs. In ML, "model selection" essentially means any model choice that is not determined by the training algorithm. Note that the distinction between parameters and hyper-parameters is often computational rather than statistical or theoretical - see stats.stackexchange.com/questions/149098/… – Commented Nov 27, 2024 at 16:28

2 Answers


FWIW, the paper cited in the article was written by Mrs Marsupial and me - I think there may be some subtleties that have been missed.

Firstly, the purposes of the inner and outer cross-validation are very different. The purpose of the inner cross-validation is for hyper-parameter tuning. The purpose of the outer cross-validation is unbiased performance estimation. We need to perform hyper-parameter tuning for most methods because using default values is bad practice and can perform very badly. We don't always need an unbiased performance estimate though, sometimes we just want to choose the best model, in which case flat cross-validation is fine - just ignore the cross-validation performance estimate (or at least don't take it too seriously).

So why does flat cross-validation give an optimistically biased performance estimate? The performance estimate is a random variable. If we repeated the same experiment using a different sample of data and/or a different partitioning of the data into folds, we would get a different value. This means when we tune the hyper-parameters by optimising the cross-validation performance, some tuning genuinely does improve generalisation, but some is just fitting the random variation in the estimator, i.e. we are over-fitting the model tuning criterion.

This is easier to understand in a feature selection setting. This is equivalent to hyper-parameter tuning - we are still making choices about the model by optimising a performance estimate based on a finite sample of data (think of it as having one binary hyper-parameter for each feature). I have an example of that in my answer to a related question on this SE, but to summarise: Consider trying to predict the outcome of a sequence of flips of a fair coin, where the input features are the sequences of flips from N other fair coins. Obviously no classifier can have any genuine skill here as the target is entirely random. We use cross-validation to decide which single feature best predicts the target coin. As we increase N, there will be a point where we are virtually guaranteed to have a coin whose sequence of heads and tails is the same as the sequence of the target coin, so its cross-validation error will be zero. So have we picked the best feature? No, it is no better a predictor than any of the other features. Is the cross-validation estimate optimistically biased? Yes - it is predicting that the chosen coin will predict the target one in operation with 100% accuracy, when it actually only gets 50% right. We can't trust the flat cross-validation estimate if we have made too many model choices by optimising it.
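
Here is a quick simulation of that coin example (the numbers of coins and flips are arbitrary): picking the coin that best matches the target on the "validation" flips gives a near-perfect apparent accuracy, while its accuracy on fresh flips is still around 50%.

    # Pick the "best" of many fair coins by agreement with a fair target coin,
    # then check the chosen coin on fresh flips.
    import numpy as np

    rng = np.random.default_rng(0)
    n_flips, n_coins, n_fresh = 10, 10_000, 100_000

    target = rng.integers(0, 2, n_flips)
    coins = rng.integers(0, 2, (n_coins, n_flips))

    agreement = (coins == target).mean(axis=1)                   # "validation accuracy" per coin
    print("apparent accuracy of best coin:", agreement.max())    # almost surely 1.0

    # Operationally, future flips of the chosen fair coin are just fresh random flips.
    fresh_target = rng.integers(0, 2, n_fresh)
    fresh_chosen = rng.integers(0, 2, n_fresh)
    print("actual accuracy:", (fresh_chosen == fresh_target).mean())   # ≈ 0.5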

If we need an unbiased performance estimate, we need some data that has not been used to make any choices about the model parameters, structure or hyper-parameters, and we can do that with nested cross-validation (or e.g. an additional test set).
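
In scikit-learn terms, this is the usual pattern of wrapping a tuning search inside an outer cross-validation, so that no model choice is made using the outer estimate (the estimator and grid below are just placeholders):

    # Nested cross-validation: the outer CV scores the whole tuning procedure.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)

    inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}, cv=5)
    outer_scores = cross_val_score(inner, X, y, cv=5)   # each outer fold re-runs the inner search
    print(outer_scores.mean())                          # (approximately) unbiased performance estimate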

Note that overfitting the model selection criterion is not the same as over-fitting the training data. See Figures 5 and 6 of our paper. Figure 5 shows the cross-validation performance for the same (synthetic) task, but with different samples of data in each case. As the datasets are fairly small, we can get very different sets of "optimal" hyper-parameters (yellow crosses) in each case, due to the cross-validation estimate being a random variable based on a finite sample:

[Figure 5: cross-validation performance surfaces for different samples of the same synthetic task; the selected "optimal" hyper-parameters are marked with yellow crosses]

If we plot the corresponding models, we can see that some over-fit the data, but some under-fit (and some are good fits):

[Figure 6: the corresponding fitted models, ranging from under-fitting to over-fitting]

There are two separate criteria that can be over-fit - the training data, and the cross-validation estimate.

Nested cross-validation doesn't help you avoid overfitting in tuning the hyper-parameters, it just helps you to detect it, by having an unbiased performance estimate.

IF you have a small number of hyper-parameters, you have plenty of data (so that the cross-validation estimator has a low variance and over-fitting the training partition is unlikely to be very bad), AND you don't need an unbiased performance estimate but just need to pick the best model, THEN flat cross-validation is probably fine. See Wainer and Cawley (2021).

If you have a lot of models to choose from, they have a lot of hyper-parameters that need tuning, and the dataset is fairly small, then you may need to choose the best model using nested cross-validation. This is because the optimistic bias of flat cross-validation will be greater for models with lots of hyper-parameters, or models that are very sensitive to the values of those hyper-parameters (i.e. they are particularly prone to over-fitting in model selection). That would make flat cross-validation unreliable for model choice.

HTH, happy to try and answer any further questions.

  • The cited paper of Mrs Marsupial says that "default parameters of some ML algorithms perform very badly on some datasets", whereas the following paper says that "default parameters of often-used algorithms often perform well on often-used datasets": mdpi.com/1999-4893/15/9/315/pdf (I found it funny to see two papers with seemingly different claims, but there is no contradiction here.) (Unfortunately, that paper does not compare with your paper.) Many thanks for the summary of when to use flat validation vs nested validation. – Commented Dec 2, 2024 at 12:39
  • @Ayrat yes, as you say there is no contradiction. Sometimes the default parameters of a particular implementation of a learning algorithm are set to hyper-parameter values that give the best average performance on a set of benchmark datasets (typically the ones in the paper); sometimes they are just "sensible" but arbitrary values. However, that does not mean they don't perform very badly on some of those datasets, or that hyper-parameters that work well on benchmark datasets (typically rather small) will work well in practical applications. – Commented Dec 2, 2024 at 12:53
  • An important point though is that for some algorithms, if you don't perform model selection you are not using the model correctly. The SVM is an example of this, as it is based on the idea of Structural Risk Minimisation. However, SRM is implemented by the SVM via careful selection of the regularisation/slack penalty parameter(s), so if you don't tune them, you are ignoring most of the theory that underpins them. (BTW the default-parameter paper was by Mr Marsupial, i.e. me; I wrote the other one with Mrs Marsupial.) Glad you found the answer useful! – Commented Dec 2, 2024 at 12:56

I will post a slightly differently worded answer, which I arrived at after watching a lecture on generalization, reading a short chapter on generalization error in the book [1], and reading @Dikran Marsupial's answer.

Intuitions

Each model learned on the training data has a small probability of fitting the validation data far too well by accident. The chance is "usually" small if there is only one model. If there are two models, and we pick the better of them using their performance on the validation data, the probability of over-fitting the validation data increases, but it is still small. If we have a very large number of models, this probability of accidental over-fitting on the validation data cannot be ignored anymore.

Now let us see the difference between model selection and hyper-parameter tuning. "Usually", we do not have a huge number of different model classes (for instance, we may pick among linear, quadratic, and cubic regressors -- just three of them). In this case, we can still use a single validation procedure. However, hyper-parameter tuning can be seen as selecting among very many slightly different model classes. Because of the sheer number of such "model classes", the probability of fitting the validation data by accident is substantial. Hence, we cannot use the validation performance as a faithful estimate of model performance. To overcome this, yet another data set is used to select a model (and estimate its performance). This is the purpose of the outer validation.
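
A small simulation makes this concrete (the true accuracy and validation size are arbitrary, and the candidates are treated as independent for simplicity): all candidate models below are equally good, yet the best validation score among K candidates keeps climbing as K grows.

    # K candidate models, all with true accuracy 0.70, each scored on a
    # validation set of 100 points; report the best observed score.
    import numpy as np

    rng = np.random.default_rng(0)
    n_val, true_acc = 100, 0.70

    for k in (1, 10, 100, 1000):
        scores = rng.binomial(n_val, true_acc, size=k) / n_val   # validation accuracies
        print(k, scores.max())   # the maximum drifts upward as k grows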

A different view

Based on Hoeffding's inequality, we can see why a small number of models results in more faithful performance values on the validation data. Consider a hypothesis space $H$. For a single hypothesis $h$, let $R(h)$ be the actual error of the hypothesis and $R(h,D)$ its error on a given validation set $D$. Let the validation set $D$ of size $N$ be drawn from a given fixed distribution $p^*$. Then the probability that the error $R(h,D)$ of some hypothesis $h \in H$ differs from its actual error $R(h)$ by more than $\epsilon$ is upper-bounded by $$ P(|R(h) - R(h,D)| > \epsilon \text{ for some } h\in H) < 2 |H| e^{-2N\epsilon^2}. $$ (Here, we do not specify how we pick $h$.) When the size $N$ of the validation set is large, the probability of over-fitting is small. It is still small for small $|H|$. However, when $|H|$ becomes very large (as happens when doing hyper-parameter tuning), this inequality can no longer guarantee the absence of over-fitting. (This is not a proof that over-fitting occurs, only that we cannot guarantee its absence.)
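
Plugging numbers into the bound shows how quickly it degrades: for a fixed validation size N and tolerance ε, the right-hand side grows linearly in |H| and soon exceeds 1, at which point it guarantees nothing (the values of N and ε below are arbitrary).

    # Evaluate the union-bound right-hand side 2|H|exp(-2Nε²) for growing |H|.
    import math

    N, eps = 1000, 0.05
    for H in (1, 10, 1000, 100_000):
        bound = 2 * H * math.exp(-2 * N * eps ** 2)
        print(H, min(bound, 1.0))   # once capped at 1, the bound is vacuous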

[1]: Chapter 5.4.4.1 Bounding the generalization error, page 196, in Probabilistic Machine Learning: An Introduction, by Kevin P. Murphy
