
I've got a dataset on agricultural trials. My response variable is a response ratio: log(treatment/control). I'm interested in what mediates the difference, so I'm running random-effects meta-regressions (unweighted, because it seems pretty clear that effect size is uncorrelated with the variance of the estimates).

Each study reports grain yield, biomass yield, or both. I can't impute grain yield from studies that report biomass yield alone, because not all of the plants studied were useful for grain (sugar cane is included, for instance). But each plant that produced grain also had biomass.

For missing covariates, I've been using iterative regression imputation (following Andrew Gelman's textbook chapter). It seems to give reasonable results, and the whole process is generally intuitive. Basically, I predict each variable's missing values from the others, use those filled-in values in turn when predicting the next variable's missing values, and loop through the variables until each one approximately converges (in distribution).
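For concreteness, here is a stripped-down sketch in R of one sweep of that loop (sweep_impute and the variable setup are made up for illustration; the real code cycles over all covariates until the imputed distributions stabilize):

    # One sweep of iterative regression imputation (simplified sketch).
    # dat holds the covariates; miss is a logical matrix (same column names)
    # flagging the original NAs, which have already been filled with random
    # draws from the observed values.
    sweep_impute <- function(dat, miss) {
      for (v in names(dat)) {
        others <- setdiff(names(dat), v)
        fit  <- lm(reformulate(others, response = v), data = dat)
        pred <- predict(fit, newdata = dat, se.fit = TRUE)
        # draw from an approximate predictive distribution rather than
        # plugging in the conditional mean
        draws <- rnorm(nrow(dat), mean = pred$fit, sd = pred$se.fit)
        dat[miss[, v], v] <- draws[miss[, v]]
      }
      dat
    }

Repeating the sweep until the imputed values stop changing in distribution gives one completed dataset; starting over from a different random fill gives the other imputations.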

Is there any reason why I can't use the same process to impute missing outcome data? I can probably form a relatively informative imputation model for the biomass response ratio given the grain response ratio, crop type, and the other covariates that I have. I'd then average the coefficients and variance-covariance matrices (VCVs), and add the MI correction as per standard practice.
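(By the MI correction I mean Rubin's rules. A minimal sketch of that pooling step in R; pool_rubin and fits are made-up names, with fits a list holding the same model fit to each of the m imputed datasets:)

    # Pool coefficients and variances across m imputed-data fits (Rubin's rules).
    pool_rubin <- function(fits) {
      m     <- length(fits)
      coefs <- sapply(fits, coef)                  # p x m matrix of estimates
      qbar  <- rowMeans(coefs)                     # pooled point estimates
      wbar  <- Reduce(`+`, lapply(fits, function(f) diag(vcov(f)))) / m  # within-imputation variance
      b     <- apply(coefs, 1, var)                # between-imputation variance
      total <- wbar + (1 + 1/m) * b                # Rubin's total variance
      data.frame(estimate = qbar, se = sqrt(total))
    }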

But what do these coefficients measure when the outcomes themselves are imputed? Is the interpretation of the coefficients any different from standard MI for covariates? Thinking about it, I can't convince myself that this doesn't work, but I'm not really sure. Thoughts and suggestions for reading material are welcome.

  • I haven't got the answer, but a note and a question: 1) The log of a ratio is, of course, the difference of logs, so your DV is equivalent to log(treatment) − log(control). 2) Which textbook of Gelman's were you looking at? Commented Dec 19, 2012 at 10:39
  • Yes, the DV is equivalent to log(treatment) − log(control). I'm basing the iterative regression imputation on the (nontechnical) chapter on missing data that Gelman has posted online: stat.columbia.edu/~gelman/arm/missing.pdf Commented Dec 19, 2012 at 17:41
  • I have been told that imputing the outcome leads to Monte Carlo error; I'll try to find a link later. Don't forget to include the outcome in the imputation models for the covariates. Commented Dec 30, 2012 at 22:38

3 Answers


As you suspected, it is valid to use multiple imputation for the outcome measure. There are cases where this is useful, but it can also be risky. I consider the situation where all covariates are complete, and the outcome is incomplete.

If the imputation model is correct, we will obtain valid inferences on the parameter estimates from the imputed data. The inferences obtained from just the complete cases may actually be wrong if the missingness is related to the outcome after conditioning on the predictors, i.e. under MNAR (missing not at random). So imputation is useful if we know (or suspect) that the data are MNAR.

Under MAR (missing at random), there are generally no benefits to imputing the outcome, and for a low number of imputations the results may even be somewhat more variable because of simulation error. There is an important exception to this. If we have access to an auxiliary complete variable that is not part of the model and that is highly correlated with the outcome, imputation can be considerably more efficient than complete-case analysis, resulting in more precise estimates and shorter confidence intervals. A common scenario where this occurs is when we have a cheap outcome measure for everyone and an expensive measure for a subset.
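A minimal sketch of this cheap/expensive scenario with the mice package in R (simulated data; the variable names are made up):

    library(mice)

    set.seed(1)
    n <- 200
    x         <- rnorm(n)
    expensive <- x + rnorm(n)                    # outcome of interest
    cheap     <- expensive + rnorm(n, sd = 0.3)  # auxiliary measure, highly correlated
    expensive[sample(n, 120)] <- NA              # expensive measure only observed on a subset

    dat <- data.frame(x, cheap, expensive)
    imp <- mice(dat, m = 20, printFlag = FALSE)  # cheap enters the imputation model by default
    fit <- with(imp, lm(expensive ~ x))          # the analysis model omits the auxiliary
    summary(pool(fit))                           # pooled by Rubin's rules

The auxiliary variable does its work in the imputation model only; the analysis model stays as before.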

In many data sets, missing data also occur in the independent variables. In these cases, we need to impute the outcome variable since its imputed version is needed to impute the independent variables.

  • Thanks, this is consistent with my intuition, but could you perhaps share a link to a well-done published study that imputes dependent variables? One of the main reasons I want to impute the outcome measures is to increase sample size (from about 250 to about 450), in order to facilitate semi-parametric tensor product interaction terms in GAMs that have very high df requirements (before they get penalized, lowering edf). MAR is reasonable in my case. Commented Jan 13, 2013 at 6:37
  • It has been widely practiced for ANOVA to get balanced designs. See the introduction of R. J. A. Little, "Regression with Missing X's", JASA 1992. I suppose you know that increasing the sample size in this way does not help you get more precise estimates. For the case of auxiliary variables, read the section on super-efficiency in D. B. Rubin, "Multiple Imputation after 18+ Years", JASA 1996. Commented Jan 13, 2013 at 11:59
  • Regarding "Under MAR, there are generally no benefits to imputing the outcome": I have seen this mentioned before, but I don't have a reference for it. Can you provide one, please? Commented Jan 14, 2013 at 13:33
  • I think you can cite Little 1992 (tandfonline.com/doi/abs/10.1080/01621459.1992.10476282) for that, but please note the exceptions. Commented Jan 17, 2013 at 17:04
  • @StefvanBuuren: a helpful answer for the most part, but my understanding is that if we know (or suspect) that the data are MNAR, then imputation cannot solve our problems any more than complete-case analysis can. This seems to fall into the "no free lunch" category. Commented May 26, 2014 at 22:21

Late to the party, so no idea whether anyone will read this, but...

I think it is useful to keep in mind that imputation always makes up pseudo-information rather than real information, and as such imputation is a bad thing. Don't do it unless you have good reasons!

When it comes to imputing covariates, however, there are very good reasons for doing it. Think about how much information you lose, and what biases you may incur, when you don't impute and instead analyse the complete cases only.

What imputation does in such a situation is make real information available to the analysis that wouldn't be available otherwise. So even though the imputation as such is bad, it pays off if the overall effect is positive: the analysis involves much more information (the non-missing covariate values of observations that have a missing value somewhere else) than it would without imputation.

We should never think of an imputed value as "correct", though. An imputed value is a problematic device for achieving something good. Imputation itself adds uncertainty, which is why multiple imputation is recommended: it explores, over a range of seemingly "realistic" imputed values, how much uncertainty comes from the imputation. (We should also keep in mind that the real uncertainty is even larger, because the imputation model itself is uncertain.)

The problem with imputing the response is that, as long as our aim is only to predict the response from the covariates, observations without a response carry no direct information about this, and imputation does not change that. Response imputation may lead to underestimation of uncertainty, because it treats a response as known that actually isn't, and not even the real uncertainty of that response is known, despite the use of multiple imputation.

But one can still say that to some extent real information is made available by imputing the response. This concerns the distribution of the covariates, which may have an impact on the model.

In the first case discussed in @Stef van Buuren's answer, all covariates present but the response missing, a complete-case analysis implicitly uses the covariate distribution of the complete cases, and imputing the response will make the full observed covariate distribution available. To what extent this changes something for the good will depend on what exactly you are doing; it may very well not help, or may even do harm if the imputation model is wrong.

Also, if we're running some kind of regression, the regression model predicts the response anyway, and the imputation model needs to somehow improve on the regression model for imputing the response to make sense, in which case we may wonder whether our specific regression model was good in the first place.

A major issue in missing value imputation is that whether (and how exactly) a model assuming MCAR, MAR, or MNAR is correct is strictly not observable, because it critically depends on the values that are actually missing. Unless there is strong background information about the missingness process, we can never rule out with any confidence that the situation is MNAR, and potentially even "evil MNAR" in the sense that any imputation model we try may be quite off.

By imputation we make stuff up, and there is no guarantee whatsoever that we do it well. Van Buuren's point is that if we do it well, it may help, which is fair enough, but not only is there no guarantee, there is an essential barrier to the information that could tell us whether we did it well, at least in the sense of comparing the imputed values with the true ones. What may be possible is that, comparing various prediction models on test data, we find that a model that imputes the response on the training data predicts the test data better than competing models that don't impute responses. That would be a valid justification, but I expect it to happen rather rarely; see the sketch below.
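A rough sketch in R of that kind of check (a hypothetical setup with simulated data: hold out some complete cases as a test set, and compare a complete-case fit against a fit to data with imputed responses):

    library(mice)

    set.seed(2)
    n <- 300
    x <- rnorm(n); z <- rnorm(n)
    dat <- data.frame(y = x + 0.5 * z + rnorm(n), x = x, z = z)
    dat$y[sample(n, 100)] <- NA                  # some responses missing

    test_idx <- sample(which(!is.na(dat$y)), 50) # held-out complete cases
    train <- dat[-test_idx, ]
    test  <- dat[test_idx, ]

    m_cc  <- lm(y ~ x + z, data = train, subset = !is.na(y))  # complete cases only
    imp   <- mice(train, m = 1, printFlag = FALSE)            # impute missing responses
    m_imp <- lm(y ~ x + z, data = complete(imp, 1))           # fit to the completed data

    # held-out mean squared prediction error of the two models
    c(cc      = mean((test$y - predict(m_cc,  test))^2),
      imputed = mean((test$y - predict(m_imp, test))^2))

In a toy example like this the imputed-response fit will rarely win, which is exactly the point.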

Van Buuren is also right that an imputed response may help to impute covariates where they are missing. However, we also need to be very careful about this, because, once more, our aim is to estimate the relation between covariates and response, and if we impute covariates with the help of the response, we're basically using response information twice, whereas the final analysis will treat the data as if the response had only been used as a response. Once more this may induce an underestimation of uncertainty, but also once more there may be situations in which the risk that this goes somewhat wrong looks acceptable given what we gain by doing it, in terms of making more real information available for the analysis. (Personally I hardly ever impute responses, and to do it I'd want to see that very little imputation brings a very clear improvement in terms of making real information available.)

So the bottom line is that imputation never creates real information, and is in itself never good. It becomes good only to the extent that it allows us to bring other real information into the analysis that wouldn't otherwise be available (or would be available only in a much weaker way). This may include the use of an imputation model that draws on background information that otherwise wouldn't be used. In most situations the positive case for imputation is much clearer for covariates than for the response. Furthermore, imputation should be multiple, because single imputation can't be trusted and we'd like to assess the uncertainty in the imputation process; but keep in mind that there is additional uncertainty from untestable assumptions.


Imputing outcome data is very common and leads to correct inference when the random error is properly accounted for.

It sounds like what you're doing is single imputation, filling in the missing values with a conditional mean under a complete-case analysis. What you should be doing is multiple imputation, which, for continuous covariates, accounts for the random error you would have observed had you retroactively measured these missing values. The EM algorithm works in a similar way, by averaging over a range of possible observed outcomes.

Single imputation gives correct estimation of model parameters when there is no mean-variance relationship, but it gives standard error estimates which are biased toward zero, inflating type I error rates. This is because you've been "optimistic" about the extent of error you would have observed had you measured these factors.

Multiple imputation is a process of iteratively generating additive error for conditional-mean imputation, so that through 7 or 8 simulated imputations you can combine models and their errors to get correct estimates of model parameters and their standard errors. If you have jointly missing covariates and outcomes, then there is software in SAS, Stata, and R for multiple imputation via chained equations, in which "completed" datasets (datasets with imputed values that are treated as fixed and non-random) are generated, model parameters are estimated from each completed dataset, and the parameter estimates and standard errors are combined using a correct mathematical formulation (details in the Van Buuren paper).
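For reference, the combining rules are Rubin's: with $m$ imputations, per-imputation estimates $\hat{Q}_j$ and within-imputation variances $W_j$,

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad \bar{W} = \frac{1}{m}\sum_{j=1}^{m}W_j, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2, \qquad T = \bar{W} + \left(1 + \frac{1}{m}\right)B,$$

where $T$ is the total variance behind the pooled standard errors.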

The slight difference between the process in MI and the process you described is that you haven't accounted for the fact that estimating the conditional distribution of the outcome from imputed data depends on the order in which you impute the various factors. In MI you should estimate the conditional distribution of the missing covariates conditioning on the outcome; otherwise you'll get biased parameter estimates.

  • Thanks. First off, I'm programming everything from scratch in R, not using MICE or MI. Second, I am imputing with draws from a (modeled) predictive distribution, not just conditional expectations. Is that what you are talking about in the second paragraph? If not, I'd appreciate clarification. Also, which Royston paper are you referring to? As for your last point: are you saying anything more complicated than "you should put your dependent variable in the imputation model"? If so, I'd greatly appreciate clarification. Commented Jan 13, 2013 at 18:08
  • Lastly, I'm not doing single imputation. I'm fitting 30 models with filled-in data and using the $V_b = \bar{W} + (1 + 1/m)B$ formula from Rubin. Commented Jan 13, 2013 at 18:15
  • The Royston paper was hyperlinked. I actually meant to link the Van Buuren one, who implemented the program in R and includes computational details: doc.utwente.nl/78938 MICE/MI is a process. If you're imputing based on home-grown code, you ought to better elaborate on the details. Conditional means = predicted values if the model is correct (or approximately so, a necessary assumption). It is more complicated than "add the outcome": you're imputing over several missingness patterns (at least 3: missing covariate / outcome / jointly missing). Commented Jan 13, 2013 at 18:28
  • If you're singly imputing the predicted value 30 times, you should be getting the same results 30 times. How are you estimating the error? Commented Jan 13, 2013 at 18:29
  • It's a pretty simple algorithm: say I observe a, b, c, and d, each with some missingness. I fill in all four with random draws (with replacement) from the observed values. Then I fit imp = lm(a ~ b* + c* + d*), where * indicates filled-in values, take x = predict(imp, se.fit=TRUE), and draw y = rnorm(N, x$fit, x$se.fit). I then set a* = y, fit imp = lm(b ~ a* + c* + d*), predict the same way, and so on. I loop through the whole set of variables 50 times. This is all from the Andrew Gelman textbook chapter that I linked above, and it's also why I don't get the same result each time. Commented Jan 13, 2013 at 18:41
