0
$\begingroup$

I am analyzing a dataset with variables such as Age, Sex, and Education, where some variables have missing values. One of the variables (Education) has over 60% missing data. For my analyses, I am considering the following approach:

  1. Complete Case Analysis: Exclude the variable with substantial missingness (Education), and drop rows with missing values to retain as many complete cases as possible.
  2. Multiple Imputation Analysis: Include all variables, including Education, in the imputation process to recover and utilize the potential information from the missing values.

My reasoning is that including Education in complete case analysis would result in a significant loss of data, but it could still provide meaningful insights after imputation.

Is this approach valid? Are there any established references or best practices that support this strategy? What potential issues should I consider when applying this method?

$\endgroup$

4 Answers

4
$\begingroup$

Complete case analysis is statistically valid under the missing completely at random (MCAR) assumption. Multiple imputation, on the other hand, gives valid results under the less stringent missing at random (MAR) assumption. It is also specifically designed to propagate the uncertainty due to incomplete data appropriately, so the standard errors for Education after imputation will account for the information missing in this variable.

On the practical side, I would suggest using many more imputed datasets (e.g., 80-100) than the traditional five, a norm that dates from an era when computers were much slower than they are today.
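
As a rough illustration of how uncertainty is propagated across imputed datasets, here is a minimal sketch of Rubin's rules for pooling one coefficient. The function name `pool_rubin` and the numbers in the example are hypothetical. In practice you would pass one estimate and one squared standard error per imputed dataset (e.g., 80-100 of each), typically produced by your imputation software.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total, sqrt(total)


# Hypothetical values for illustration only (three imputations).
estimate, total_var, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what inflates the standard error when the imputations disagree, which is how the missing information in Education shows up in the final inference.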

$\endgroup$
  • $\begingroup$ +1, but note that complete case analysis will have much lower power, even under MCAR, because so much data is missing. Also, there has been some work showing that MI is somewhat helpful even for MNAR. $\endgroup$ Commented Jan 13 at 14:44
  • $\begingroup$ @PeterFlom, indeed, complete case analysis will be valid under MCAR but less efficient. $\endgroup$ Commented Jan 13 at 14:55
  • $\begingroup$ @DimitrisRizopoulos The MCAR assumption is violated, so I performed univariate analyses to assess the plausibility of the MAR assumption, which makes sense in my case. $\endgroup$ Commented Jan 13 at 15:27
  • $\begingroup$ @eshuns I do not see how a univariate analysis assesses the plausibility of MAR. $\endgroup$ Commented Jan 13 at 16:19
  • $\begingroup$ @DimitrisRizopoulos I did univariate comparisons of subjects with complete and incomplete data using the chi-squared test to examine differences across observed variables with no missing values (they were all categorical) and assess patterns of missingness. I adapted this approach from Austin et al., 2021 $\endgroup$ Commented Jan 13 at 16:39
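
The univariate comparison described in the last comment can be sketched as a chi-squared test on a 2x2 table that crosses a fully observed categorical variable with an "Education observed / missing" flag. This is a minimal sketch with hypothetical counts, not the procedure from the cited paper. It uses no continuity correction, and the p-value shortcut holds only for 1 degree of freedom. A small p-value is evidence against MCAR; it cannot confirm MAR, which is not testable from the observed data alone.

```python
from math import erfc, sqrt


def chi2_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    table: [[a, b], [c, d]] of counts, e.g. rows = Sex categories,
    columns = (Education observed, Education missing).
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-squared variable with 1 df: P(Z^2 > stat).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = two Sex categories, columns = observed / missing.
stat, p = chi2_2x2([[30, 70], [50, 50]])
```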
2
$\begingroup$

I don't see what complete case analysis offers and, unless you have some strong reason for doing it, I would not.

  1. If education is missing completely at random (MCAR) then the complete case parameter estimates won't be biased, but power would be reduced by a lot. And MCAR is unlikely.
  2. If education is missing at random (MAR, i.e. related to the DV but in ways that can be accounted for by variables that are present) then CCA's estimates will be biased. MI will not.
  3. If the data are missing not at random (NMAR, also known as nonignorable nonresponse) then both CCA and MI results will be biased, but there is some work showing that MI will be less biased. I haven't kept up with this literature, but there were some papers by Schafer on this. Maybe someone else can give more recent results.
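
A toy simulation can make the contrast between the first two cases concrete. Everything here is an assumption for illustration: a standardized numeric `age`, a numeric `edu` that depends on it, about 40% of values observed under MCAR, and an observation probability that rises with `age` in the second scenario. It compares only complete-case means of `edu`, not a full regression or an MI analysis.

```python
import random
from math import exp
from statistics import mean


def simulate(n=20000, seed=0):
    """Return (full-data mean, complete-case mean under MCAR,
    complete-case mean when missingness depends on observed age)."""
    rng = random.Random(seed)
    full, cc_mcar, cc_mar = [], [], []
    for _ in range(n):
        age = rng.gauss(0, 1)        # standardized, always observed
        edu = age + rng.gauss(0, 1)  # education depends on age
        full.append(edu)
        # MCAR: each value is observed with probability 0.4, regardless of anything.
        if rng.random() < 0.4:
            cc_mcar.append(edu)
        # Missingness driven by observed age: older subjects are more likely observed.
        if rng.random() < 1 / (1 + exp(-2 * age)):
            cc_mar.append(edu)
    return mean(full), mean(cc_mcar), mean(cc_mar)


full_mean, mcar_mean, mar_mean = simulate()
```

Under these assumptions the MCAR complete-case mean stays close to the full-data mean (with a smaller sample), while the complete-case mean in the second scenario is shifted upward because the retained subjects are systematically older.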
$\endgroup$
  • $\begingroup$ The MCAR assumption is violated, so I performed univariate analyses to assess the plausibility of the MAR assumption, which makes sense in my case. $\endgroup$ Commented Jan 13 at 15:13
1
$\begingroup$

60% missing values is a lot; too much, some would say. Complete case analysis would be wrong since you are throwing away too much information, and the missingness will probably be informative (for example, values may be missing because of little or no education, which is very likely to distort your results). Multiple imputation is also too optimistic in this case since you will be practically guessing.

The safest bet is to either 1) exclude this variable from the model and explain that you don't have enough information about education (because you don't), or 2) fill the missing values with "unknown" or something similar and treat it as any other level during interpretation, whilst acknowledging that it may contain anything and everything.

$\endgroup$
  • $\begingroup$ Using an indicator variable for "unknown" isn't typically reliable; see Section 1.3.7 of Flexible Imputation of Missing Data. Multiple imputation, if properly done on MAR data, might provide imprecise estimates of coefficients associated with education here, but it has the advantage of using all of the available data. Frank Harrell says: "Extreme amount of missing data does not prevent one from using multiple imputation, because alternatives are worse." $\endgroup$ Commented Jan 13 at 14:33
  • $\begingroup$ @EdM 1) The indicator approach discussed in the article is for continuous variables, but in this case we have a categorical variable, so how would a separate category reduce the reliability of the model? 2) The MAR assumption is quite optimistic. 3) If MI provides imprecise estimates even with the right approach and assumptions, why do it at all? Why not drop the variable altogether and retain all data? 4) Frank, as usual, does not provide a lot of arguments for his statements, so I cannot comment on his "because alternatives are worse". $\endgroup$ Commented Jan 13 at 14:45
  • $\begingroup$ @PeterFlom I didn't suggest doing CCA; I suggested dropping the variable so that the OP would not have to do that (or imputing with an "unknown" category). $\endgroup$ Commented Jan 13 at 15:07
  • $\begingroup$ Can I use the MI analysis for my main results, and check these suggestions in a sensitivity analysis? 1) Removing Education and doing CCA. 2) The indicator approach. 3) Removing Education from the MI analysis. $\endgroup$ Commented Jan 13 at 15:12
  • $\begingroup$ Frank Harrell's quote from his text is followed by 2 references supporting it, which I omitted above due to space constraints in a comment: "The proportion of missing data should not be used to guide decisions on multiple imputation" and "Missing covariate data in medical research: to impute is better than to ignore." $\endgroup$ Commented Jan 13 at 15:33
0
$\begingroup$

I suggest looking into hierarchical models for this problem. As a rough idea of what this means, consider a multiple linear regression model to which you've added another variable that indicates whether education is present in the data set (1) or missing (0). We'll call this variable education_missing (note that, despite its name, it equals 1 when education is observed). The education variable should only show up in the model as an interaction with education_missing. That is

$$y=\beta_{0}+\beta_{age}\cdot age+\cdots+\beta_{education}\cdot education\_missing\cdot education+\cdots$$

The information contained in the education variable will be used in your output when it is present. Prediction for when the education variable is missing will be captured in $\beta_{0}$. This same idea can be applied to many other models if you want something more complex than MLR.
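
A minimal sketch of how the design row for this model could be built, assuming a numeric education value (e.g., years of schooling) and using `None` for a missing value. The helper names `design_row` and `predict`, and the coefficients in the example, are hypothetical. Note the caveat raised in the comments below that this indicator method can give biased estimates in observational data.

```python
def design_row(age, education):
    """Build [intercept, age, education_missing * education] for one subject.

    education is None when missing; education_missing is 1 when the value
    is present and 0 when it is missing, following the definition above.
    """
    education_missing = 0 if education is None else 1
    edu_term = float(education) if education_missing else 0.0
    return [1.0, float(age), edu_term]


def predict(beta, age, education):
    """Linear predictor beta_0 + beta_age * age + beta_education * interaction."""
    return sum(b * x for b, x in zip(beta, design_row(age, education)))


# Hypothetical coefficients (beta_0, beta_age, beta_education) for illustration.
beta = [1.0, 0.5, 0.25]
with_education = predict(beta, 40, 12)
without_education = predict(beta, 40, None)
```

When education is missing, the interaction term is zero, so the prediction falls back on the intercept and the remaining covariates.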

$\endgroup$
  • $\begingroup$ This seems like the "indicator method" discussed by Stef van Buuren in Section 1.3.7 of Flexible Imputation of Missing Data. It's a good choice when the "missing" data can't even exist (e.g., a loan amount when there is no loan), and can work in some circumstances like randomized trials with missing baseline covariates, but it "generally fails in observational data... the method can yield severely biased regression estimates, even under MCAR and for low amounts of missing data." $\endgroup$ Commented Jan 13 at 14:21
  • $\begingroup$ Agreed. This is equivalent to the "indicator method" in your link. While it isn't a general solution to the missing data problem, it may still be worth the OP looking into it and assessing the model's performance, given the large amount of missing values and the fact that the data may not be MCAR. $\endgroup$ Commented Jan 13 at 17:08
