
Questions tagged [cross-validation]

Repeatedly withholding subsets of the data during model fitting in order to quantify the model performance on the withheld data subsets.

6 votes
1 answer
71 views

I’m building a Python forecasting pipeline that tries several models: Holt‑Winters (tuned with Optuna), ARIMA (via pmdarima.auto_arima), XGBoost (tuned with Optuna) ...
CSe's user avatar
  • 161
2 votes
0 answers
61 views

In cross-validation, $k$-folds are a common way to train, compare and validate models. Often we want to find an optimal set of hyperparameters for our models. There are many ways to probe the ...
Markus Klyver's user avatar
2 votes
1 answer
58 views

I'm conducting some experiments using TCGA-LUAD clinical and RNA-Seq count data. I'm building machine learning models for survival prediction (Random Survival Forests, Survival Support Vector Machines,...
Yordany Paz's user avatar
2 votes
0 answers
59 views

I am currently developing a project that deals with multiple targets which can have different numbers of cardinalities. The idea is to use different ML models (e.g. Random Forest, SVM, AdaBoost) and ...
Le Roi des Aulnes's user avatar
0 votes
0 answers
25 views

I'm comparing, pairwise, the results of Linear Regression models with transformations applied to one numerical feature and the target. I'm using K folds cross validation scoring with R-squared. The ...
Morgan P's user avatar
1 vote
0 answers
56 views

I am in the position of having a time series data set that I can model well using either an Autoregressive Fractionally Integrated Moving Average (ARFIMA) or an ARIMA model. I'm asking for ways to ...
David White's user avatar
4 votes
1 answer
519 views

I have a question about normalization when merging training and validation sets for cross-validation. Normally, I normalize using re-scaling (Min-Max Normalization) calculated from the training set ...
Suebpong Pruttipattanapong's user avatar
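For the normalization question above, a minimal sketch (assuming scikit-learn and synthetic data) of keeping Min-Max scaling inside the cross-validation loop, so the scaler is fit only on each training split and merely applied to the held-out fold:

    # Scaling lives inside the Pipeline, so cross_val_score refits it per training fold.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # stand-in data
    pipe = make_pipeline(MinMaxScaler(), Ridge())
    scores = cross_val_score(pipe, X, y, cv=5)
    print(scores.mean())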
1 vote
2 answers
273 views

What is the proper algorithm for k-fold CV in case of class balancing (under/over-sampling)? Variant 1: split the data into train and test sets, balance classes in the train set, run k-fold CV. Variant 2: ...
Jakub Małecki's user avatar
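For the balancing question above, a sketch in the spirit of "Variant 2" (hypothetical oversample helper, scikit-learn, synthetic data): classes are balanced inside each training fold only, so every held-out fold keeps the original class distribution.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import StratifiedKFold

    def oversample(X, y, rng):
        """Duplicate rows of each class until all classes match the largest one (illustrative)."""
        classes, counts = np.unique(y, return_counts=True)
        idx = np.concatenate([rng.choice(np.where(y == c)[0], counts.max(), replace=True)
                              for c in classes])
        return X[idx], y[idx]

    X, y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=0)
    rng = np.random.default_rng(0)
    scores = []
    for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        X_bal, y_bal = oversample(X[tr], y[tr], rng)   # balance the training fold only
        model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
        scores.append(balanced_accuracy_score(y[te], model.predict(X[te])))
    print(np.mean(scores))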
4 votes
1 answer
133 views

Conceptually, I understand that models should be built totally blind to the test set in order to most faithfully estimate performance on future data. However, I'm struggling to understand the extent ...
Evan's user avatar
  • 329
0 votes
0 answers
55 views

I want to simulate data with missing values and use them to compare the predictive performance of several machine learning algorithms, including LASSO. All analyses will be performed in R, using the ...
Benykō-Zamurai's user avatar
4 votes
1 answer
89 views

I am using nested cross validation in mlr3 to tune my model's hyperparameters and gauge its out-of-sample performance. Previously, when I was performing regular k-fold CV, my understanding was that ...
Adverse Effect's user avatar
1 vote
1 answer
122 views

I know my next steps involve using a GLM and selecting the type of GLM based on my response variables (possibly gamma or Poisson regression?). I also need to standardise explanatory variables to be ...
SMM's user avatar
  • 41
0 votes
1 answer
147 views

I have two binary classifiers and would like to check whether there is a statistically significant difference between the area under the ROC curve (AUROC). I have reason to opt for AUROC as my ...
IsaacNuketon's user avatar
2 votes
0 answers
31 views

It is often recommended that one use k-fold cross-validation to estimate the generalisation ability of a machine learning model. Most resources I've found, however, do not address what one should do after ...
Digitallis's user avatar
0 votes
0 answers
64 views

This topic has been discussed before but I couldn't find a specific answer. Here's my approach to forecast QoQ values: run the usual LASSO K-fold CV on time series data and generate a one-step-ahead ...
bebgejo's user avatar
0 votes
1 answer
60 views

My project has the following steps: Use the elbow method to determine the features and number of clusters for kmeans. Run kmeans on the data (with the determined features and n clusters), which gives the ...
Xin Niu's user avatar
  • 103
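For the kmeans question above, a small sketch (scikit-learn, synthetic blobs) of the inertia values the elbow is usually read from:

    # Within-cluster sum of squares (inertia) for a range of k; the "elbow" is
    # the k after which inertia stops dropping sharply.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in range(2, 9)}
    print(inertias)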
2 votes
1 answer
111 views

I’m trying to understand structural equation modeling (SEM) for hypotheses model and have questions about when to apply SEM. I have three models in mind: • Model 1: IV → M → DV • Model 2: IV → M1 →...
chen Crush's user avatar
2 votes
1 answer
160 views

I read in the mlr3 book about nested resampling that: Nested resampling is a method to compare models and to estimate the generalization performance of a tuned model, however, this is the performance ...
ChickenTartR's user avatar
1 vote
1 answer
122 views

I'm building a model for a binary classification task. Because my dataset is pretty small (~86 samples with 68 class 0 and 18 class 1), I'm using a nested k-fold cross validation (5-inner loops and 5-...
Shortytot's user avatar
1 vote
0 answers
55 views

Is there a statistical way to compare two kappa statistics from the same group of raters, rating the same subjects, but under two different conditions (low vs. high field strength MRIs)? We can't ...
ACHD's user avatar
  • 13
3 votes
1 answer
96 views

We have a small dataset of n=130. The current step is exploring the data, looking for anything interesting. Our primary aim is to compare whether using an additional variable helps improve model ...
Leon Yao's user avatar
4 votes
1 answer
274 views

I am trying to make a linear regression predictive model between a continuous dependent variable and a set of continuous predictors. I have a large number (~5000) of these predictor variables (...
user7831861's user avatar
1 vote
0 answers
42 views

If I am using GridSearchCV to find hyperparameters on a training set, and I were to run a CalibratedClassifierCV to tune my probabilities, would it suffice to fit the CalibratedClassifierCV with ...
user54565's user avatar
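For the calibration question above, a hedged sketch (scikit-learn, synthetic data) of one common ordering: tune with GridSearchCV on the training set, then refit the best estimator inside CalibratedClassifierCV (which runs its own internal CV on the same training data) before touching the test set.

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Hyperparameter search on the training set only.
    search = GridSearchCV(LinearSVC(max_iter=5000), {"C": [0.1, 1, 10]}, cv=5).fit(X_tr, y_tr)
    # Calibrate the chosen configuration with internal CV on the same training data.
    calibrated = CalibratedClassifierCV(search.best_estimator_, cv=5).fit(X_tr, y_tr)
    print(calibrated.predict_proba(X_te)[:5])   # calibrated probabilities on the test set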
0 votes
0 answers
51 views

For an ML model, the standard deviation (SD) of the root mean squared error (RMSE) can be calculated using time series splits by fitting the model on different training sets and evaluating it ...
Geek_Tech's user avatar
  • 329
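For the question above, a minimal sketch (scikit-learn, synthetic data) of fold-wise RMSEs from TimeSeriesSplit and their mean and standard deviation:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    X, y = make_regression(n_samples=300, n_features=8, random_state=0)
    neg_rmse = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5),
                               scoring="neg_root_mean_squared_error")
    rmse = -neg_rmse                       # one RMSE per time-ordered split
    print(rmse.mean(), rmse.std(ddof=1))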
1 vote
1 answer
87 views

For cross validation of hyperparameters, I have a question about which approach is generally considered better in the context of running regularized regression (specifically elastic net l1, l2 ...
qwer's user avatar
  • 111
6 votes
1 answer
133 views

The thread Evaluating a classifier with small samples considers the problem in its title. Specifically, the question is about splitting off the test set from the rest of the data many times instead of ...
Richard Hardy's user avatar
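For the question above, a sketch (scikit-learn, a deliberately small synthetic sample) of drawing many stratified train/test splits and collecting the test-set score from each repetition:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedShuffleSplit

    X, y = make_classification(n_samples=60, random_state=0)   # small sample on purpose
    splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
    scores = []
    for tr, te in splitter.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        scores.append(accuracy_score(y[te], model.predict(X[te])))
    print(np.mean(scores), np.std(scores, ddof=1))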
6 votes
2 answers
203 views

I'm trying to evaluate two classifiers, splitting the sample into training and test samples with a 50-50 split. The classifiers are fitted and tuned with K-fold CV on the training sample. The ...
Lionville's user avatar
  • 487
0 votes
0 answers
34 views

How do I obtain a reasonable parameter estimate (regression beta) for the single predictor of interest in a multiple regression model and appropriate standard errors for this estimate using holdout ...
jf1's user avatar
  • 312
1 vote
0 answers
57 views

I've been thinking about the use of cross-validation and hold-out sets and I don't really see the use of a randomly selected hold-out test set. I have to say, though, that when the hold-out is not ...
adriavc00's user avatar
1 vote
1 answer
71 views

Let's say I pick any of the winning surrogate models in my nested CV (in theory, if you do k outer folds you could have k surrogate models). To simplify things, let's say I pick the first model and just ...
iYOA's user avatar
  • 185
0 votes
0 answers
78 views

In nested cross validation, I'm seeing an interesting scenario that I'd like to understand better: Using 4-fold outer CV, my model selection process chose Model A overall (it performed best on average ...
iYOA's user avatar
  • 185
0 votes
0 answers
88 views

Consider a factor analysis model $X = \mu + L \cdot f + u$, where $X$ and $\mu$ are $p\times 1$ and $L$ is $p\times k$ ...
user avatar
0 votes
0 answers
58 views

In this paper, two deep learning models were proposed: Hybrid-AttUnet++ and EH-AttUnet++. The first model, Hybrid-AttUnet++, is simply a modified U-net model, and the second model is an ensemble ...
AAA_11's user avatar
  • 1
4 votes
1 answer
146 views

Consider a regression model $$ Y= X\beta+ u. \tag{$\star$} $$ $Y$ is a column vector with length $n$ containing $n$ observations. $X$ is a $n\times p$ matrix with each row corresponding to a ...
user avatar
0 votes
0 answers
36 views

I'm using the LongituRF package in R to fit a MERT (Mixed effects regression trees) model to my data. While I have no issues ...
Linus's user avatar
  • 399
0 votes
0 answers
54 views

I have spectroscopy data measured from 10 different pigs. The goal is to analyse three different tissue types. However, not all tissues were measured from each pig. The total numbers are Fat: 3,...
masto12's user avatar
  • 119
2 votes
1 answer
187 views

I am currently working with a dataset that includes sociodemographic information about each student in a class (X variables) and information about whom each student votes for as class speaker (Y ...
Elena O.'s user avatar
0 votes
0 answers
43 views

I'm performing gradient boosting machine modeling on a large dataset (700k+ records) with several hundred variables on a work laptop with limited memory. I'm coding in R v2022.02.2. I've found running ...
RobertF's user avatar
  • 6,644
1 vote
0 answers
34 views

What I'm doing: I am writing an undergraduate thesis about audio classification using SVM. My goal is to identify if adding Feature X to the feature matrix could improve the performance of the ...
ASTRAL's user avatar
  • 11
0 votes
0 answers
26 views

Let's assume that we retrain the model every year in production and have accumulated 50 years of data. If we use time series CV (e.g. TimeSeriesSplit in sklearn) for hyperparameter recalibration at ...
Kreol's user avatar
  • 121
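For the retraining question above, a tiny sketch of how sklearn's TimeSeriesSplit lays out expanding-window folds, with each validation block strictly later in time than its training block:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    y = np.arange(50)   # stand-in for 50 yearly observations
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(y):
        print(f"train 0–{train_idx[-1]}, validate {test_idx[0]}–{test_idx[-1]}")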
1 vote
2 answers
284 views

I have fitted a relatively complex/large generalized additive model for prediction purposes but would like to assess its predictive power/cross-validate it. Due to variability in observed data and the ...
Paul Julian's user avatar
3 votes
1 answer
372 views

I'm working on a classification problem with ~90k data rows and 12 features. I'm trying to tune the hyperparameters of an XGBoost model to minimize overfitting. I use ROC_AUC as the metric to ...
WatermelonBunny's user avatar
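For the XGBoost question above, a hedged sketch (xgboost's sklearn wrapper, synthetic data) of a grid search scored by ROC AUC, with shallow trees and a lower learning rate as typical first levers against overfitting:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
    grid = {"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1]}
    search = GridSearchCV(XGBClassifier(n_estimators=200, random_state=0), grid,
                          scoring="roc_auc",
                          cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
    search.fit(X, y)
    print(search.best_params_, search.best_score_)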
0 votes
0 answers
76 views

I have data for each day, with a date/time, event, and when a secondary event gets triggered. ...
erotavlas's user avatar
  • 101
1 vote
1 answer
104 views

Apologies for cross-posting. I am starting to use Lasso and cross-validation for model selection to explain a dependent variable using linear models, but I cannot understand why all p-values ...
Rodrigo Badilla's user avatar
5 votes
3 answers
277 views

I am trying to estimate nonparametrically the first-order derivative of a function $g(x)$. I am estimating $g(x)$ using a local polynomial (quadratic) procedure. I know how to compute the leave-one-out ...
G. Ander's user avatar
  • 239
0 votes
0 answers
45 views

(CONTEXT) I'm currently doing a report project at my university to build a classifier model that classifies a comment as spam or ham (non-spam) using this data set, and then submit a prediction csv ...
KitanaKatana's user avatar
0 votes
1 answer
225 views

I’m comparing the performance of 10 ML models across 15-fold cross-validation, using metrics like MSE. Each model’s performance is ranked per fold, and I want to determine if there are significant ...
PascalIv's user avatar
  • 921
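For the model-comparison question above, one common choice is the Friedman test on the per-fold results; a sketch with made-up scores (scipy):

    import numpy as np
    from scipy.stats import friedmanchisquare

    rng = np.random.default_rng(0)
    scores = rng.normal(size=(10, 15))      # 10 models x 15 folds of (say) MSE
    stat, p = friedmanchisquare(*scores)    # unpacked: one argument per model
    print(stat, p)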
1 vote
0 answers
201 views

I want to assess the predictive power of a zero-inflated negative binomial model in Python. My steps are listed below. Regarding 5-fold cross-validation: Fit multiple Zero-Inflated Negative Binomial (...
Student coding's user avatar
16 votes
2 answers
841 views

I understand AIC is asymptotically equivalent to leave-one-out cross-validation and that BIC has a similar asymptotic equivalence to leave-k-out cross-validation. My question is, other than ...
Louis F-H's user avatar
  • 271
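For reference under the question above, the usual definitions, with $\hat{L}$ the maximized likelihood, $k$ the number of estimated parameters and $n$ the sample size:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}$$

so BIC's per-parameter penalty grows with $n$ while AIC's does not, which is one source of their differing behaviour in model selection.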
0 votes
1 answer
71 views

I’m working on a survival analysis model with a small internal dataset (n=140). An outside researcher suggests splitting the dataset into train/val and setting aside a separate test set (e.g., ~10%, ...
mel's user avatar
  • 1
