Stability of tuning parameters in repeated elastic-net and variable retrieval

Question

I am working with data that has $p$ close to $n$, and a high degree of collinearity between predictors. The excellent Introduction to Statistical Learning and Elements of Statistical Learning led me to try elastic-net regularization. Following the glmnet package documentation and other questions here on CV I have the following workflow in R:

A grid of alpha (mixing) values according to: seq(0.1, 0.9, by = 0.1).
A function which establishes manual folds for cross-validation and fits a cv.glmnet object using consistent folds to tune alpha and lambda simultaneously (as suggested in the documentation / vignette).
To overcome the fixed fold effects, the function shuffles the folds and repeats the tuning process for 100 repeats.
The best alpha, lambda, and MSE determined by cross-validation within each repeat are returned.
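The workflow above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the original function: the data, the helper name `tune_once`, and the reduced number of repeats are assumptions made for the example. The key point is that one `foldid` vector is shared by every alpha within a repeat, so the cross-validated errors are comparable across the alpha grid.

```r
library(glmnet)

set.seed(1)                                   # simulated stand-in data, not the asker's
n <- 60; p <- 50
x <- matrix(rnorm(n * p), n, p)
x[, 2] <- x[, 1] + rnorm(n, sd = 0.1)         # induce some collinearity
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

alphas    <- seq(0.1, 0.9, by = 0.1)
n_folds   <- 10
n_repeats <- 5                                # the question uses 100

# Hypothetical helper: one repeat of the tuning procedure
tune_once <- function(x, y, alphas, n_folds) {
  # One fold assignment shared by every alpha in this repeat
  foldid <- sample(rep(seq_len(n_folds), length.out = nrow(x)))
  per_alpha <- lapply(alphas, function(a) {
    cvfit <- cv.glmnet(x, y, alpha = a, foldid = foldid)
    data.frame(best_alpha  = a,
               best_lambda = cvfit$lambda.min,
               lowest_mse  = min(cvfit$cvm))
  })
  res <- do.call(rbind, per_alpha)
  res[which.min(res$lowest_mse), ]            # best (alpha, lambda) for this repeat
}

# Reshuffle the folds and repeat the whole tuning step
results <- do.call(rbind, lapply(seq_len(n_repeats), function(r) {
  cbind(rep = r, tune_once(x, y, alphas, n_folds))
}))
print(results)
```

Each row of `results` corresponds to one repeat, in the same layout as the output table shown in the question.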

Here are a few lines of output to help visualize the data and the fluctuation in alpha/lambda across repeats:

repeat best_alpha   best_lambda lowest_mse  num_folds
1      0.1          0.17        0.5364155   10
2      0.2          0.25        0.5400845   10
3      0.1          0.4         0.5256955   10
4      0.1          0.38        0.5355564   10
5      0.1          0.41        0.5629918   10
6      0.1          0.42        0.5568862   10
7      0.6          0.08        0.5432515   10

This leaves me with two problems:

How best to aggregate the results from the repetitions. Looking at how the caret package handles this, it appears that the mean values of alpha and lambda are used. However, I have noticed that for some repetitions the 'best' alpha and lambda values fluctuate quite considerably. This becomes a problem when I wish to re-fit the elastic-net model in order to extract the selected variables and their corresponding coefficients.

What is the best way of dealing with this situation? I am leaning towards a weighted mean based on the frequency of alpha.
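As an illustration of a frequency-based aggregation, here is a small base-R sketch using the seven rows of output shown earlier. It is one possible approach, not a recommendation from the glmnet or caret documentation. It takes the most frequently selected alpha and then summarises lambda only among the repeats that chose that alpha, on the assumption that lambda values obtained under different alphas are not directly comparable. The variable names are made up for the example.

```r
# The seven repeats shown in the question
res <- data.frame(
  best_alpha  = c(0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.6),
  best_lambda = c(0.17, 0.25, 0.40, 0.38, 0.41, 0.42, 0.08),
  lowest_mse  = c(0.5364155, 0.5400845, 0.5256955, 0.5355564,
                  0.5629918, 0.5568862, 0.5432515)
)

# How often each alpha wins across repeats
alpha_freq <- table(res$best_alpha)
print(alpha_freq)

# Most frequently selected alpha
alpha_mode <- as.numeric(names(alpha_freq)[which.max(alpha_freq)])

# Summarise lambda only among the repeats that selected that alpha
lambda_at_mode <- median(res$best_lambda[res$best_alpha == alpha_mode])

cat("alpha:", alpha_mode, " lambda:", lambda_at_mode, "\n")
```

On these seven rows the sketch picks alpha = 0.1 (selected in five repeats) and lambda = 0.4, the median of the five lambdas paired with it.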
The second question relates to the finer points on fitting and extracting using glmnet where the documentation doesn't quite cover the elastic-net use case. My code is as follows:
```
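# Refit at the chosen alpha over the user-supplied lambda sequence
enetfit <- cv.glmnet(x = x, y = y, alpha = 0.1, lambda = lambdas)
# Predictions and coefficients at lambda = 0.4; note the object is enetfit, not fit
yhat    <- predict(enetfit, newx = x, s = 0.4, exact = TRUE)
coefs   <- predict(enetfit, newx = x, s = 0.4, exact = TRUE, type = "coefficients")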
```
Does this syntax appear correct? My goal is to re-fit the model using the same x,y and lambda sequence, and then to extract variables and coefficients at something close to the 'best' model as determined by 100 repeats. I use cv.glmnet to fit, as it allows you to pass a sequence of lambdas and this appears advantageous for 'warm starts' and exact prediction at a value contained in the sequence.
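For the refit-and-extract step, a sketch along the following lines may help. It uses `glmnet()` directly, which also accepts a `lambda` sequence, so `cv.glmnet()` is not strictly required once alpha and lambda have been chosen. The data, the lambda grid, and the column names are simulated assumptions for the example; alpha = 0.1 and lambda = 0.4 are taken from the question's code.

```r
library(glmnet)

set.seed(1)                                   # simulated stand-in data
n <- 60; p <- 50
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("V", seq_len(p))))
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

# Hypothetical lambda grid; 0.4 is added so that it lies exactly on the path
lambdas <- exp(seq(log(2), log(0.01), length.out = 50))
lambdas <- sort(c(lambdas, 0.4), decreasing = TRUE)

fit <- glmnet(x, y, alpha = 0.1, lambda = lambdas)

# Coefficients at lambda = 0.4: a (p + 1) x 1 matrix with the intercept first
beta     <- as.matrix(coef(fit, s = 0.4))
selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")

print(selected)                               # names of the selected variables
print(beta[selected, 1])                      # their coefficients
```

Because 0.4 is part of the supplied sequence, the coefficients come straight from the fitted path and no interpolation between neighbouring lambda values is involved.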

You often find $\alpha=0.1$, which is the lower bound of your grid. I suggest you include $\alpha=0$ in your grid, too. Since ridge regression is suited for highly collinear variables, perhaps you need no LASSO penalty and pure ridge (corresponding to $\alpha=0$) is enough.
– Richard Hardy, Commented Mar 15, 2016 at 10:06
When I was doing a similar exercise, I would fix $\alpha$ and then select an "optimal" $\lambda$ based on cross validation across the folds. I would then measure the out-of-sample performance for the fixed $\alpha$ and the "optimal" $\lambda$ (this would be an optimistic measure, but we are not generalizing it at this stage, so no problem). I would then change $\alpha$ and repeat the same procedure. Once the grid of $\alpha$s is exhausted, I would select the $\alpha$ that delivers the best performance, and the "optimal" $\lambda$ corresponding to that $\alpha$. Would that make sense?
– Richard Hardy, Commented Mar 15, 2016 at 10:13
Thanks for the comments, Richard. For this data you are right, ridge does well. However, in other data alpha settles down to around 0.5. Thanks for sharing your experience. My current function does fix alpha and tune lambda (using constant fold IDs), then finishes the alpha grid, shuffles the folds and repeats. Very helpful to know I am on the right track. Did you use many repeats, and if so, did you stick with the absolute best (i.e. lowest MSE) combination of alpha/lambda for prediction? Thanks.
– Grant, Commented Mar 15, 2016 at 20:25
I was doing my exercise in a time series setting, so it was slightly more complicated when it comes to cross validation. But the idea was to pick the $\alpha$-$\lambda$ combination that produces the lowest MSE, if I remember correctly.
– Richard Hardy, Commented Mar 15, 2016 at 20:54

Mark van de Wiel · Accepted Answer · 2022-09-06 19:34:55Z

A very late reply, but if you're interested in tuning the elastic net: my blog post explains why it is actually very hard to tune alpha and lambda together, which seems to be your problem.

Here's the argument in a nutshell.

First, the empirical side. When one plots the cross-validated likelihood (CVL) against alpha and lambda, one will notice a ridge in the landscape along which the CVL is very flat and close to the maximum. CVL is strongly linked to marginal likelihood, which is another criterion for tuning hyperparameters (= empirical Bayes). For the elastic net one can prove that the marginal likelihood is approximately a Gaussian likelihood with only one variance parameter v. An infinite number of alpha-lambda combinations can yield the same v, implying non-identifiability.
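The non-identifiability step can be written schematically as follows. The functions $g$ and $h$ are placeholders introduced here for illustration; the answer does not give their explicit form. The only assumption is that the (approximate) marginal likelihood depends on the hyperparameters solely through $v$:

$$
\mathrm{ML}(\alpha, \lambda) \approx g(v), \qquad v = h(\alpha, \lambda)
\quad\Longrightarrow\quad
\mathrm{ML}(\alpha, \lambda) \approx \mathrm{ML}(\alpha', \lambda')
\ \text{ whenever } \ h(\alpha, \lambda) = h(\alpha', \lambda').
$$

Under that assumption, every pair on the level set $\{(\alpha, \lambda) : h(\alpha, \lambda) = v^*\}$ attains roughly the same criterion value as the optimum $v^*$, which is consistent with the flat ridge seen in the CVL landscape and with the 'best' pair moving around between repeats.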
