
I am currently making my way through Harrell's Regression Modeling Strategies and van Buuren's Flexible Imputation of Missing Data so that I can apply rigorous imputation methods in our workflows. On p. 95 of Regression Modeling Strategies, Professor Harrell recommends multiple imputation in the context of prediction if the degree of missingness is not small. Currently, I am using MICE with fully conditional specification (FCS) to impute the data.

My question is twofold:

  1. At what stage do I fit my model to the now-complete imputed data so that I do not bias or inflate parameter estimates?
  2. How should I validate my model?

As an example suppose I have the following set up:

  1. I have a collection of 15 predictors and a binary response, with a mix of continuous, binary, and unordered categorical predictors.
  2. I have a model (perhaps Logistic Ridge/Lasso)
  3. I impute using FCS, specifying linear, logistic, and multinomial regression models for the corresponding predictor types, with predictive mean matching.
  4. I have m = 10 multiple imputation data sets

For fitting the model, my understanding is that I should perform FCS and then fit my model within each of the m = 10 data sets. I should then use Rubin's rules to pool the predicted probability that an observation belongs to the 'positive' class. If the pooled probability for an observation is > 0.5, I would label it a 'positive' case.
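If it helps to make that concrete, Rubin's rule for a point estimate is just the average over the m fits, so pooling predicted probabilities reduces to a column mean. A minimal Python sketch (the probability matrix and m = 3 fits below are hypothetical, and `pool_predictions` is an illustrative helper, not part of mice or rms):

```python
import numpy as np

def pool_predictions(prob_matrix, threshold=0.5):
    """Pool predicted probabilities across models fit to m completed data sets.

    prob_matrix: shape (m, n) -- predicted P(y = 1) for each of n observations
    from the model fit on each of the m imputed data sets. Rubin's rule for a
    point estimate is simply the average across the m fits.
    """
    pooled = prob_matrix.mean(axis=0)          # average over the m fits
    labels = (pooled > threshold).astype(int)  # classify at the chosen cutoff
    return pooled, labels

# Hypothetical probabilities from m = 3 fits for 4 observations
probs = np.array([[0.2, 0.6, 0.9, 0.45],
                  [0.3, 0.7, 0.8, 0.55],
                  [0.1, 0.8, 0.7, 0.50]])
pooled, labels = pool_predictions(probs)
# pooled -> [0.2, 0.7, 0.8, 0.5]; labels -> [0, 1, 1, 0]
```

Note that an observation exactly at the cutoff (0.5 here) is labeled negative under a strict `>` comparison; where to place that boundary is a modeling choice.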


For validating the model, this is where I'm a little confused. I'd like to use the bootstrap validation technique consistent with Professor Harrell's recommendations; however, this is a bit fuzzy to me. Wouldn't I have to bootstrap within each of the m = 10 data sets? Any insight into this would be amazing.

In scenarios where we have sufficient data, is a hold-out method appropriate, where for each of the 10 imputed data sets I have a corresponding imputed test set, and I then pool the estimates of accuracy according to Rubin's rules?
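For what it's worth, pooling a hold-out performance estimate under Rubin's rule for point estimates is again just an average over the m imputations; the spread across imputations is also worth inspecting. A minimal sketch (the accuracy values and `pool_holdout_accuracy` helper are hypothetical):

```python
import numpy as np

def pool_holdout_accuracy(acc_per_imputation):
    """Average a hold-out performance estimate over the m imputations
    (Rubin's rule for a point estimate). The between-imputation variance
    indicates how much the imputations themselves move the estimate."""
    acc = np.asarray(acc_per_imputation, dtype=float)
    return acc.mean(), acc.var(ddof=1)

# Hypothetical hold-out accuracies from m = 5 imputed train/test pairs
mean_acc, between_var = pool_holdout_accuracy([0.80, 0.82, 0.78, 0.81, 0.79])
# mean_acc -> 0.80; between_var -> 0.00025
```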

Thank you


1 Answer


Harrell's online Regression Modeling Strategies discusses this matter in Section 5.3.6, a section evidently added since the second edition of the print version (Springer, 2015).

In general, he recommends starting with separate bootstrap validation of the m models developed on the m multiply imputed data sets. That gives you m estimates of the set of model-performance measures. Then average the model-performance measures over the m models.
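The per-data-set step is the optimism bootstrap that Harrell describes. A rough Python sketch of the idea, using scikit-learn logistic regression and AUC as the performance measure (the function and variable names are illustrative assumptions, not rms code; rms's `validate()` reports several indexes, not just one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def optimism_corrected_auc(X, y, n_boot=50):
    """Optimism bootstrap for one completed (imputed) data set:
    apparent AUC minus the average optimism over bootstrap refits."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])
    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                # bootstrap resample
        if len(np.unique(y[idx])) < 2:             # need both classes to fit
            continue
        boot = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
        auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)       # training-set optimism
    return apparent - np.mean(optimism)

# Then average the corrected estimate over the m completed data sets, e.g.:
# corrected = np.mean([optimism_corrected_auc(X_i, y_i)
#                      for X_i, y_i in imputed_sets])  # imputed_sets is hypothetical
```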

In versions of the rms package starting with 6.6-0 (released a few months ago), there is a processMI() function to simplify this. The online link above contains an example of an ordinary least squares model, working through the imputation, the fits to the multiple data sets, and the bootstrap validation.

  • Oh, wonderful! I'll order a second edition right away and start looking over the notes now. I'll accept your answer and post any follow-up questions here (the online notes look robust!). Thank you! Commented Jul 7, 2023 at 21:23
  • @user7351362 That extra information is only in the (free) online version that I linked. It's not in any print edition, so far as I know. Commented Jul 7, 2023 at 23:57
  • Thanks! I noticed that he addresses the double bootstrap in the online text, but doesn't seem to in the print one. After some googling, I found that Prof. Harrell mentioned that approach 2 in the following algorithm is valid for the double bootstrap: discourse.datamethods.org/t/…. I think I can modify this to accept MI if I change Step 1. For each model i = 1…I: (1) fit to the M mice-completed data sets; (2) average the coefficients of the M models to get M_avg; (3) calculate the apparent performance of M_avg. Commented Jul 10, 2023 at 18:43
  • Ran out of characters there; apologies for the formatting. Let me know how I can fix it if needed. Commented Jul 10, 2023 at 18:49
  • @user7351362 If I understand correctly, the processMI() function that recently became available in Harrell's rms package, illustrated in the web link in the first paragraph of the answer, will do that for you. The link in your comment dates to before the release of that function; I haven't used it myself yet, however. The 2nd edition of the printed text is now 8 years old, and the online reference has added a fair amount of new material. Commented Jul 10, 2023 at 19:16
