What is the compatibility of imputer and analyst models?

Question

In a paper by Atem et al. (2018; DOI: 10.1002/bimj.201800275), the authors claim the following in Section 3 regarding the so-called "imputer/imputation model" (i.e., the model used to impute missing covariate values) and the "analysis model" (the model on which the primary inference is based):

An imputation model for the covariate 𝑋 can be obtained by specifying the conditional distribution 𝑓(𝑉,𝐷|𝐙). However, if such an imputation model is not compatible with the substantive model, the imputation procedure may lead to specious results. As suggested by Bartlett et al. (2015), such an incompatibility can be avoided if there is a joint model for the outcome and the covariate of interest from which we deduce an imputation model or algorithm. Our imputation model is similar to the method proposed by Rubin (2004), Schafer (1999), and Meng (1994). In order to eliminate inconsistency, they proposed that the assumptions in both models (imputer and analyst model) should be similar and the imputer model should not make more assumptions than the analyst model. The conditional distribution of such a joint model, given the available covariates, would correspond to the given (correctly) specified substantive model.

What specifically is meant by "compatibility" vs. "incompatibility" between these models, and precisely what inconsistency arises when they are incompatible?

For a simple example, suppose I have a dataset comprising three variables, $[X, M, Y]$, and my primary inference is for the response of $Y$ conditional on $X$. $M$ has been causally determined to be a mediator and is excluded from the analysis. Further, say $X$ has missing values, so we wish to multiply impute these and perform inference using Rubin's Rules. Are Atem et al. suggesting I should omit $M$ from the imputer model because including it "makes more assumptions" than the analyst model?

Alexis · Accepted Answer · 2022-10-08 16:18:08Z

Multiple imputation using chained equations (MICE) is an approach to imputing missing data which accommodates data missing in multiple variables.

The basic approach is:

1. Start with probabilistic initial values for all the variables with missing data that you wish to impute.
2. Model the first variable you wish to impute as a function of other variables in your data set, and estimate its missing values. This typically looks like a regression model appropriate to the data type of the target variable (e.g., multiple logistic regression, multiple ordered logit regression, etc.). Note: these models can be sophisticated, for example by using constraints, specifying deterministic relationships, or using different estimators.
3. Proceed to the next variable with missing values, and create a model for it based on some set of variables in your data set (possibly including the first imputed variable, in which case its newly imputed values are used).
4. Do this for all variables you wish to impute.
5. Repeat steps 1–4, but now using the newly imputed values wherever those variables appear on the right-hand side of the imputation models. Do this many times, until the imputation algorithm has converged by some measure.
6. You now have a single imputed data set. Repeat steps 1–5 until you have some target number of imputed data sets (per Rubin and Schafer this is on the order of 5 to 10).
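The steps above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions that are not part of the original description: every variable is continuous and imputed by ordinary least squares on all other columns, a fixed number of cycles stands in for a convergence check, and the regression coefficients are not redrawn from their posterior (so this is not a fully "proper" imputation). The function names `draw_imputations`, `mice_once` and `mice` are hypothetical.

```python
import numpy as np

def draw_imputations(y, X, miss, rng):
    """Fit OLS of y on X using rows where y is observed, then draw the
    missing y values from the fitted mean plus Gaussian residual noise."""
    obs = ~miss
    A = np.column_stack([np.ones(obs.sum()), X[obs]])
    beta, *_ = np.linalg.lstsq(A, y[obs], rcond=None)
    resid = y[obs] - A @ beta
    sigma = np.sqrt(resid @ resid / max(obs.sum() - A.shape[1], 1))
    A_mis = np.column_stack([np.ones(miss.sum()), X[miss]])
    return A_mis @ beta + rng.normal(0.0, sigma, size=miss.sum())

def mice_once(data, n_iter, rng):
    """Produce one imputed copy of `data` (NaN marks a missing value)."""
    data = data.copy()
    mask = np.isnan(data)
    # Step 1: random initial values, sampled from each column's observed values.
    for j in range(data.shape[1]):
        if mask[:, j].any():
            data[mask[:, j], j] = rng.choice(data[~mask[:, j], j],
                                             size=mask[:, j].sum())
    # Step 5: cycle repeatedly (a fixed count replaces a convergence check).
    for _ in range(n_iter):
        # Steps 2-4: re-impute each incomplete variable from all the others,
        # using the current imputed values of those other variables.
        for j in range(data.shape[1]):
            if not mask[:, j].any():
                continue
            others = np.delete(data, j, axis=1)
            data[mask[:, j], j] = draw_imputations(data[:, j], others,
                                                   mask[:, j], rng)
    return data

def mice(data, m=5, n_iter=10, seed=0):
    """Step 6: repeat the whole procedure to obtain m imputed data sets."""
    rng = np.random.default_rng(seed)
    return [mice_once(data, n_iter, rng) for _ in range(m)]

# Toy usage: two correlated variables, about 30% of the first set to missing.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)
data = np.column_stack([x, y])
data[rng.random(200) < 0.3, 0] = np.nan
imputed = mice(data, m=5)
```

Each element of `imputed` is a completed copy of `data`; the analysis model would be fitted to each copy and the results pooled with Rubin's Rules.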

Atem et al. are concerned with imputing observations for a Cox proportional hazards model and data set, where some data are incomplete (i.e., subject to some form of censoring). They describe a case where the censoring process producing the missingness is not completely at random and has a particular kind of dependency structure. From Bartlett et al.'s 2015 article: "Two conditional models are said to be incompatible if there exists no joint model for which the conditionals (for the relevant variables) equal these conditional models."
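In symbols, one simplified way to write the quoted definition is the following. The notation is an assumption made here for illustration and is not taken from either paper: $f_{\text{imp}}$ is the imputation model for a covariate $X$, $f_{\text{sub}}$ is the substantive (analysis) model for the outcome $Y$, and $Z$ collects the fully observed covariates. The two conditional models are compatible if

$$
\exists\, g(x, y \mid z) \ \text{such that} \quad g(x \mid y, z) = f_{\text{imp}}(x \mid y, z) \quad \text{and} \quad g(y \mid x, z) = f_{\text{sub}}(y \mid x, z),
$$

and incompatible if no such joint model $g$ exists.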

To me this suggests that, for example, the baseline hazard function in the Cox proportional hazards model assumes a particular kind of dependency structure for right censoring (proportionality, survival until time $t$, etc.). One could imagine that the models highlighted in my outline of MICE above could build in assumptions dissimilar to those of the Cox model, resulting in the kind of incompatibility that Bartlett and Atem & Co. warn about. For example, one might model a covariate with respect to a different hazard function (including no hazard function).
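As a rough sketch of what a compatible imputation model would have to look like for a Cox analysis model, assume noninformative censoring and fully observed $Z$. The symbols are illustrative and not the notation of Atem et al.: $T$ is the observed time, $\delta$ the event indicator, $h_0$ and $H_0$ the baseline and cumulative baseline hazards, and $\beta_X$, $\beta_Z$ the regression coefficients. By Bayes' rule, the imputation density for $X$ is then tied to the Cox likelihood:

$$
f(x \mid t, \delta, z) \;\propto\; f(t, \delta \mid x, z)\, f(x \mid z), \qquad
f(t, \delta \mid x, z) \;\propto\; \left[ h_0(t)\, e^{\beta_X x + \beta_Z^\top z} \right]^{\delta} \exp\!\left\{ -H_0(t)\, e^{\beta_X x + \beta_Z^\top z} \right\}.
$$

An off-the-shelf imputation model, such as a linear regression of $X$ on $(T, \delta, Z)$, generally does not have this form, which is one way the incompatibility can arise.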

References

Atem, F. D., Matsouaka, R. A., & Zimmern, V. E. (2019). Cox regression model with randomly censored covariates. Biometrical Journal, 1–13. DOI: 10.1002/bimj.201800275

Bartlett, J. W., Seaman, S. R., White, I. R., Carpenter, J. R., & Alzheimer's Disease Neuroimaging Initiative. (2015). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Statistical Methods in Medical Research, 24(4), 462–487. DOI: 10.1177/0962280214521348

Neither of these links (which seem to be the same link?) works.
– dipetkov, Commented Oct 8, 2022 at 14:58
Thank you. I didn't know about Sci Hub but the second paper is open access: doi.org/10.1177/0962280214521348
– dipetkov, Commented Oct 8, 2022 at 16:17
