I am currently making my way through Harrell's Regression Modeling Strategies and Van Buuren's Flexible Imputation of Missing Data so that I can apply rigorous imputation methods in our workflows. On p. 95 of Regression Modeling Strategies, Professor Harrell recommends imputing the data in the context of prediction if the degree of missingness is not small. Currently, I am using MICE with FCS to impute the data.
My question is twofold:
- At what stage do I 'expose' the model to the now-complete imputed data so that I do not bias or inflate parameter estimates?
- How should I validate my model?
As an example, suppose I have the following setup:
- I have a collection of 15 predictors (a mix of continuous, binary, and unordered categorical variables) and a binary response.
- I have a model (perhaps logistic ridge/lasso).
- I impute using FCS, specifying linear, logistic, and multinomial regression for the corresponding predictor types, with predictive mean matching (see the sketch after this list).
- I have m = 10 multiple imputation data sets
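To make the setup concrete, here is a minimal sketch of the imputation step using the R mice package. The data frame `df` and the variable names (`age`, `bmi`, `smoker`, `region`) are hypothetical placeholders, not from an actual data set:

```r
library(mice)

## Start from mice's defaults, then set the method per variable type:
## "pmm"     - predictive mean matching for continuous predictors
## "logreg"  - logistic regression for binary predictors
## "polyreg" - polytomous (multinomial) regression for unordered factors
meth <- make.method(df)
meth[c("age", "bmi")] <- "pmm"      # hypothetical continuous predictors
meth["smoker"]        <- "logreg"   # hypothetical binary predictor
meth["region"]        <- "polyreg"  # hypothetical unordered factor

imp <- mice(df, m = 10, method = meth, seed = 2024, printFlag = FALSE)
```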
For fitting the model, is my understanding correct that I should perform FCS and then fit my model within each of the 10 data sets? I should then use Rubin's rules to pool the probability that an observation belongs to the 'positive' class; if the pooled probability for an observation is > .5, I would label it a 'positive' case.
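Here is roughly what I have in mind, as a sketch: a lasso-penalized logistic fit (cv.glmnet) in each completed data set, then the m predicted probabilities per observation are averaged (the Rubin's-rules point estimate of a probability is just the mean over imputations) and thresholded at 0.5. The outcome is assumed to be a column `y` in the hypothetical `df` above:

```r
library(mice)
library(glmnet)

## One lasso-penalized logistic fit per completed data set
fit_one <- function(d) {
  x <- model.matrix(y ~ ., data = d)[, -1]  # dummy-code factors, drop intercept
  cv.glmnet(x, d$y, family = "binomial", alpha = 1)  # alpha = 1 -> lasso
}
fits <- lapply(1:10, function(i) fit_one(complete(imp, i)))

## Predicted probability for every observation, in every imputation
probs <- sapply(1:10, function(i) {
  d <- complete(imp, i)
  x <- model.matrix(y ~ ., data = d)[, -1]
  as.numeric(predict(fits[[i]], newx = x, s = "lambda.min", type = "response"))
})

pooled_prob <- rowMeans(probs)               # average over the m = 10 fits
pred_class  <- as.integer(pooled_prob > 0.5) # label as 'positive' if > .5
```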
For validating the model, this is where I'm a little confused. I'd like to use the bootstrap validation technique consistent with Professor Harrell's recommendations; however, this is a bit fuzzy to me. Wouldn't I have to bootstrap within each of the m = 10 data sets? Any insight into this would be amazing.
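For concreteness, the version I could imagine coding is a hand-rolled optimism-corrected bootstrap (in the spirit of Harrell's validate) run separately within each completed data set and then averaged over the m imputations, with the c-statistic as the index. This is only a sketch of one possible layering (bootstrap inside imputation), and I'm not sure it's the right one:

```r
library(mice)
library(glmnet)
library(pROC)

fit_one <- function(d)   # same helper as in the fitting sketch above
  cv.glmnet(model.matrix(y ~ ., d)[, -1], d$y, family = "binomial", alpha = 1)

auc_of <- function(fit, d) {
  x <- model.matrix(y ~ ., data = d)[, -1]
  p <- as.numeric(predict(fit, newx = x, s = "lambda.min", type = "response"))
  as.numeric(auc(d$y, p, quiet = TRUE))
}

## Optimism-corrected c-statistic within one completed data set
validate_one <- function(d, B = 100) {
  apparent <- auc_of(fit_one(d), d)
  optimism <- replicate(B, {
    b <- d[sample(nrow(d), replace = TRUE), ]  # bootstrap resample
    f <- fit_one(b)
    auc_of(f, b) - auc_of(f, d)                # bootstrap AUC minus original-sample AUC
  })
  apparent - mean(optimism)
}

corrected <- sapply(1:10, function(i) validate_one(complete(imp, i)))
mean(corrected)   # averaged over the m imputations
```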
In scenarios where we have sufficient data, is a hold-out method appropriate, where for each of the 10 imputed data sets I have a corresponding imputed test set and then pool the estimates of accuracy according to Rubin's rules?
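Something like the following is what I have in mind, assuming a recent version of mice where the `ignore` argument lets the test rows be imputed without informing the imputation models, `y` coded 0/1, and an arbitrary 75/25 split; the pooled accuracy is just the average over the m imputations:

```r
library(mice)
library(glmnet)

fit_one <- function(d)   # same helper as in the fitting sketch above
  cv.glmnet(model.matrix(y ~ ., d)[, -1], d$y, family = "binomial", alpha = 1)

n       <- nrow(df)
is_test <- seq_len(n) %in% sample(n, size = round(0.25 * n))  # arbitrary 75/25 split

## ignore = is_test: test rows are imputed but do not inform the
## imputation models (mice >= 3.12)
imp2 <- mice(df, m = 10, ignore = is_test, seed = 2024, printFlag = FALSE)

accs <- sapply(1:10, function(i) {
  d     <- complete(imp2, i)
  train <- d[!is_test, ]
  test  <- d[is_test, ]
  fit   <- fit_one(train)
  x_te  <- model.matrix(y ~ ., data = test)[, -1]
  p     <- as.numeric(predict(fit, newx = x_te, s = "lambda.min", type = "response"))
  mean((p > 0.5) == (test$y == 1))   # assumes y coded 0/1
})
mean(accs)   # pooled accuracy across the m = 10 imputations
```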
Thank you