Multiple Imputation for Predictors Only, Excluding Missing Outcome Data

Question

I am working with a dataset containing ~300 predictors and ~3000 observations and building a predictive model using elastic net (and hoping to generalize to an external validation set). While the majority of observations are complete cases, there are some observations with missing values in either some of the predictors, the outcome, or both. My current approach is to remove all observations with missing outcome data from the analysis, and use mice (in R) to perform multiple imputation for missing values of the predictors. To me, this was the approach that made sense intuitively, as I was concerned about reporting performance metrics on observations that did not have observed values of the outcome.

However, I have seen that it may be valid to include observations with missing outcome data in the dataset, and let the outcome values also be handled through multiple imputation. I was curious about the conditions in which one method would be preferred over the other, if any. My suspicion is that it may be better to re-do these analyses while also imputing missing outcome data and missing predictor data, rather than using this "complete outcome" approach. Any insight is appreciated, and I'm happy to provide more information if needed!

EdM · Accepted Answer · 2022-03-08 19:57:10Z

From your description, you might be better off doing imputation on all your observations. There is no need to remove cases with missing outcome values, as analysis of properly performed multiply imputed data sets will incorporate the uncertainty from imputing the outcome values. Stef van Buuren's Flexible Imputation of Missing Data (FIMD) book certainly advocates imputing missing outcomes.

How much of a difference that will make depends on details of your data, whether missingness depends on outcome values, and whether your complete-data model is correct.
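To make "incorporate the uncertainty" concrete: results from the $m$ imputed data sets are normally combined with Rubin's rules, which add the between-imputation variance to the average within-imputation variance (in mice this is what `pool()` does). Below is a minimal sketch in plain Python for a single coefficient. The function name `pool_rubin` and the example numbers are made up for illustration and are not taken from any real analysis.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed data sets (Rubin's rules).

    estimates: the m point estimates (e.g. one regression coefficient)
    variances: the m squared standard errors of those estimates
    """
    m = len(estimates)
    pooled = mean(estimates)             # pooled point estimate
    within = mean(variances)             # average within-imputation variance
    between = variance(estimates)        # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical coefficient estimates from m = 3 imputed data sets
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

The more the imputations disagree with each other, the larger the between-imputation term and hence the pooled variance, which is how uncertainty about imputed values (including imputed outcomes) reaches the final standard errors.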

Depending on your situation, even complete-case analysis might be OK. van Buuren outlines some special cases where it is, in particular (FIMD, Section 2.7):

> The first special case occurs if the probability to be missing does not depend on [outcome] $Y$. Under the assumption that the complete-data model is correct, the regression coefficients [of complete-case analysis] are free of bias. This holds for any type of regression analysis, and for missing data in both $Y$ and $X$. Since the missing data rate may depend on $X$, complete-case analysis will in fact work in a relevant class of MNAR models.

That result requires both that missingness does not depend on $Y$ and that the complete-data model is correct. Nevertheless,

> Multiple imputation gains an advantage over complete-case analysis if additional predictors for $Y$ are available that are not part of [the complete-case model predictors] $X$.

That would seem to be your situation, as the predictor selection in elastic net means that there are potential predictors of $Y$ that will not be in the final set of predictors in the ultimate model.
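The first special case quoted above can be checked with a small simulation. This is only a sketch under stated assumptions: a correctly specified linear model with one predictor, a true slope of 2, and a made-up missingness rule (90% missing when the driving variable is above 0, 10% otherwise). It uses only the Python standard library, and all names are hypothetical.

```python
import random

def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

def complete_case_slopes(n=20000, true_slope=2.0, seed=1):
    rng = random.Random(seed)
    # Simulate y = true_slope * x + noise, so the linear model is correct.
    full = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        full.append((x, true_slope * x + rng.gauss(0.0, 1.0)))

    def observed(value):
        # Assumed missingness rule: 90% missing above 0, 10% missing otherwise.
        p_missing = 0.9 if value > 0 else 0.1
        return rng.random() >= p_missing

    cc_x = [(x, y) for x, y in full if observed(x)]  # missingness driven by X
    cc_y = [(x, y) for x, y in full if observed(y)]  # missingness driven by Y
    return ols_slope(full), ols_slope(cc_x), ols_slope(cc_y)

full_slope, slope_x_dep, slope_y_dep = complete_case_slopes()
print(f"all data:                   {full_slope:.3f}")
print(f"complete cases, X-driven:   {slope_x_dep:.3f}")
print(f"complete cases, Y-driven:   {slope_y_dep:.3f}")
```

Under these assumptions the complete-case slope stays close to 2 when missingness depends only on $X$, and is pulled toward zero when missingness depends on $Y$. That second situation is where dropping incomplete cases becomes a problem.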

Thank you! Helpful point about the imputation including predictors that don't end up in the final model. Does this have implications for the generalisability of the model? My initial thought was that information about variables that are NOT in the model is influencing the results through the imputation, but I think this may be a misconception on my end about the way the imputation process works. Perhaps this is balanced through the "multiple" aspect of multiple imputation: we are modeling the uncertainty around missing values and not influencing the results?
– NB3, Commented Mar 10, 2022 at 15:02
@NB3 yes, ideally "we are modeling the uncertainty around missing values and not influencing the results" inappropriately. You are trying to "influenc[e] the results" in terms of removing bias that might come from omitting cases whose values are missing. That requires careful attention to how values are imputed for each of the variables. The mice package provides defaults for imputations of variable types that are good places to start, but the defaults aren't necessarily appropriate in all circumstances. van Buuren's book provides much guidance.
– EdM, Commented Mar 10, 2022 at 15:12
