I am working with data that has $p$ close to $n$, and a high degree of collinearity between predictors. The excellent Introduction to Statistical Learning and Elements of Statistical Learning led me to try elastic-net regularization. Following the glmnet package documentation and other questions here on CV I have the following workflow in R:
- A grid of alpha (mixing) values according to:
seq(0.1, 0.9, by = 0.1). - A function which established manual folds for cross-validation and fits a
cv.glmnetobject using consistent folds to simultaneously tune both alpha and lambda (as suggested in the documentation / vignette). - To overcome the fixed fold effects, the function shuffles the folds and repeats the tuning process for 100 repeats.
- The best alpha, lambda, and MSE determined by cross-validation within each repeat are returned.
Here are a few lines of output to help visualize the data and the fluctuation in alpha /lambda across repeats:
repeat best_alpha best_lambda lowest_mse num_folds
1 0.1 0.17 0.5364155 10
2 0.2 0.25 0.5400845 10
3 0.1 0.4 0.5256955 10
4 0.1 0.38 0.5355564 10
5 0.1 0.41 0.5629918 10
6 0.1 0.42 0.5568862 10
7 0.6 0.08 0.5432515 10
This leaves me with two problems:
How best to aggregate the results from the repetitions. Looking at how the Caret package handles this, it appears the mean value of alpha and lambda are used. However, I have noticed that occasionally for some repetitions the 'best' alpha and lambda values fluctuate quite considerable - this becomes a problem when I wish to re-fit the elastic-net model in order to extract the variables selected, and their corresponding coefficients.
What is the best way of dealing with this situation? I am leaning towards a weighted mean based on the frequency of alpha.
The second question relates to the finer points on fitting and extracting using glmnet where the documentation doesn't quite cover the elastic-net use case. My code is as follows:
enetfit <- cv.glmnet(x=x, y=y, alpha=0.1, lambda=lambdas) yhat <- predict(fit, newx=x, s=0.4, exact=TRUE) coefs <- predict(fit, newx=x, s=0.4, exact=TRUE, type='coef')Does this syntax appear correct? My goal is to re-fit the model using the same x,y and lambda sequence, and then to extract variables and coefficients at something close to the 'best' model as determined by 100 repeats. I use cv.glmnet to fit, as it allows you to pass a sequence of lambdas and this appears advantageous for 'warm starts' and exact prediction at a value contained in the sequence.