
I am working with a dataset for which I have generated three hypotheses, i.e. I built three substantive models (two logistic regressions and one Cox regression, each with a different dependent variable). The dataset contains some missing values, and I decided to fill the gaps with multiple imputation.

Should I generate a separate imputation model for each substantive model or would it be better practice to create a single imputation model to be used with every substantive model? Or does it not even matter?

I work in medicine and have no formal education in statistics, so I might be missing some fundamental understanding - always happy to learn!


2 Answers


Should I generate a separate imputation model for each substantive model or would it be better practice to create a single imputation model to be used with every substantive model?

The first question to consider is whether you should be proceeding with 3 separate models (for what presumably are 3 separate outcomes) or if you should be trying a single combined model that accounts for correlations among the outcomes. A single multiple-outcome model would clearly mean that only a single set of imputations would be called for.

If 3 separate models are appropriate, then think back to what multiple imputation is trying to do. As Stef van Buuren explains, for a quantity $Q$ (potentially a vector) of interest in a population:

We can only calculate $Q$ if the population data are fully known, but this is almost never the case. The goal of multiple imputation is to find an estimate $\hat Q$ that is unbiased and confidence valid.

The goal of imputation is to estimate properties of the underlying population. There is thus no reason to tailor the imputation model to the particular outcome being evaluated: you want imputed data sets that fairly represent the underlying population regardless of which outcome you happen to be evaluating at any time. If the outcome values are correlated and some are also being imputed, this is even more important, as you want to try to capture the correlation structure that exists in the population.
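In practice, the workflow is then one imputation model, spanning all outcomes and all covariates used by any of the three analyses, whose completed data sets are reused for every substantive model. Here is a minimal sketch in Python, using scikit-learn's IterativeImputer as a simple stand-in for dedicated multiple-imputation software such as the mice package in R; the DataFrame `df` and its columns are hypothetical, and all variables are assumed numeric (categorical variables would need encoding first):

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# df: hypothetical DataFrame holding *all* analysis variables:
# the three outcomes plus every covariate used by any substantive model.
m = 20  # number of imputations

completed = []
for i in range(m):
    # A single imputation model over the full set of variables; drawing from
    # the posterior (sample_posterior=True) with a different seed per run
    # yields m distinct completed data sets.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=i)
    values = imputer.fit_transform(df)
    completed.append(pd.DataFrame(values, columns=df.columns, index=df.index))

# The same list `completed` is then used to fit each substantive model
# (the two logistic regressions and the Cox model) and pool the results.
```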

It's helpful to examine the goals of imputation in a bit more detail. For unbiasedness:

Unbiasedness means that the average $\hat Q$ over all possible samples $Y$ from the population is equal to $Q$.

If appropriate conditions are met, taking the mean value of $\hat Q$ over the imputed data sets, $\bar Q$, provides such an unbiased estimate.

If $U$ is the variance-covariance matrix of an estimate $\hat Q$, then

This estimate is confidence valid if the average of $U$ over all possible samples is equal to or larger than the variance of $\hat Q$.

This requires a bit more thought, as the variance-covariance matrix includes both within-imputation and between-imputation contributions. The within-imputation pooled estimate is just the average of the individual variance-covariance matrices $U$ among the imputations, $\bar U$.

The between-imputation variance has two contributions. One is the empirical variance-covariance matrix, the variance of individual estimates around the observed mean $\bar Q$, called $B$. The second is a correction for the fact that you have only used a finite number of imputations, say $m$. Then the estimate of the total variance-covariance matrix $T$ is (Equation 2.20 from van Buuren):

$$ T = \bar U + B + B/m = \bar U + \left(1+ \frac{1}{m} \right) B.$$

The point is that the overall variance shrinks as you increase the number of imputations. Although the returns diminish once the number of imputations is large, the general strategy is to use many of them. If you have already done 3 separate sets of imputations, pool them all together to get triple the number of imputations and do your modeling of all outcomes on the larger pool.
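To make the pooling concrete, here is a short sketch that applies these formulas to one of the logistic regressions, continuing the hypothetical setup above (the column names are again made up, and the binary outcome `event` is assumed fully observed, with only covariates imputed):

```python
import numpy as np
import statsmodels.api as sm

# Fit the same substantive model on every completed data set from above.
estimates, covariances = [], []
for data in completed:
    X = sm.add_constant(data[["age", "biomarker"]])
    res = sm.Logit(data["event"], X).fit(disp=0)
    estimates.append(res.params.to_numpy())
    covariances.append(res.cov_params().to_numpy())

m = len(completed)
Q_bar = np.mean(estimates, axis=0)                     # pooled estimate
U_bar = np.mean(covariances, axis=0)                   # within-imputation variance
B = np.cov(np.array(estimates), rowvar=False, ddof=1)  # between-imputation variance
T = U_bar + (1 + 1 / m) * B                            # total variance, as in the equation above

pooled_se = np.sqrt(np.diag(T))                        # standard errors for Q_bar
```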

If you haven't done this before, the temptation is just to use the default imputation settings provided by the software. That might end up working, but the best strategy is to read carefully about the principles of how imputations are best done, and combine those principles with your understanding of the subject matter to find an approach suitable for your data.


First, it’s best to determine whether data for the variables in question are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Many people leap to imputation without considering this important question. If the data are MNAR, imputation will not help your cause; in fact, it will amplify the existing bias by producing one or more new, biased data sets with a larger sample size than your original complete-cases set.
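Neither MAR nor MNAR can be verified from the observed data alone, but a simple check of whether missingness is associated with observed variables can at least argue against MCAR and inform the imputation model. A minimal sketch in Python, with hypothetical column names (`age` and `sex` assumed fully observed and numerically coded):

```python
import statsmodels.api as sm

# Does missingness in 'biomarker' depend on observed variables?
# A clear association argues against MCAR; it cannot, however, distinguish
# MAR from MNAR, which hinges on the unobserved values themselves.
miss = df["biomarker"].isna().astype(int)
X = sm.add_constant(df[["age", "sex"]])
print(sm.Logit(miss, X).fit(disp=0).summary())
```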

  • I did consider that and for the data in question, MAR seems reasonable. – Commented Jul 13, 2021 at 16:08
  • It is possible to handle MNAR to some extent with multiple imputation, if care is taken in imputation and sensitivity analysis is used to evaluate. See the section of Stef van Buuren's multiple-imputation book on modeling choices. – Commented Jul 13, 2021 at 17:00
  • @EdM Thank you for the info ~ – Commented Jul 16, 2021 at 18:27
