Variable selection in multiply imputed data

Asked 1 year, 11 months ago

Modified 1 year, 11 months ago

Viewed 135 times

I have a dataset with approximately 1800 observations (250 cases, 1550 controls) and I'm trying to fit a multivariable logistic regression model. There are 19 covariates (a mix of continuous, ordinal and categorical) with P < 0.2 on univariate regression, and I am planning to include them in an initial full model. Missingness is low to moderate (1-10%) for the majority of covariates and high for one covariate (70%), so I have created 70 multiply imputed datasets using the mice package in R.

This is my first time modelling multiply imputed data. I have previously used purposeful selection of covariates as described by Hosmer and Lemeshow, but I am not sure how to do this in multiply imputed data, as I don't think it is possible to compare fits using partial likelihood ratio tests in that setting. Would it be reasonable to use fit.mult.impute at each stage of the purposeful selection process (which, as I understand it, fits the model in each imputed dataset and then combines the coefficients using Rubin's rules)? Is there an optimal way to assess and compare each model fit against the last?
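For readers unfamiliar with the pooling step referred to above, here is a minimal sketch of Rubin's rules for a single coefficient. It is written in plain Python purely as an illustration and is not how fit.mult.impute is implemented; the function name `pool_rubin` and the input numbers are hypothetical. It assumes you already have one estimate and one squared standard error per imputed dataset.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: per-imputation coefficient estimates
    variances: per-imputation squared standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least 2 imputations")

    pooled = mean(estimates)          # pooled point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical log-odds ratios and squared standard errors from 3 imputations
est, se = pool_rubin([0.5, 0.7, 0.6], [0.04, 0.04, 0.04])
print(round(est, 3), round(se, 3))
```

The pooled standard error is larger than any single within-imputation standard error whenever the estimates differ across imputations, which is the extra uncertainty due to the missing data.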

Are there other selection procedures that are likely to be simpler or produce better results for my data? I have seen a package called "miselect: variable selection for multiply imputed data", which provides procedures for LASSO and elastic net regression in multiply imputed data. Is this worth exploring?

Many thanks for any suggestions.

asked Dec 12, 2023 at 14:51

donm79

You may be able to kill two birds with one stone by performing variable selection in a "Stability Selection"-like way <arxiv.org/pdf/0809.2932.pdf>, which is to say: run lasso on each set of imputed data, then compute the proportion of times a given variable is selected and keep those variables which appear more than a given proportion of the time.

– Nathan Wycoff
Commented Dec 12, 2023 at 15:24
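The counting step in the comment above can be sketched as follows. This is a minimal illustration in plain Python, assuming the lasso has already been run on each imputed dataset elsewhere (for example in R) and that you have the set of variables with non-zero coefficients from each fit. The function name `selection_frequency`, the variable names and the 0.5 threshold are all hypothetical. Note that this is only "stability selection"-like: the original method resamples the data, whereas here the variation comes from the imputations.

```python
from collections import Counter


def selection_frequency(selected_per_imputation, threshold=0.5):
    """Keep variables selected in more than `threshold` of the imputed datasets.

    selected_per_imputation: one collection of selected variable names per
    imputed dataset (e.g. variables with non-zero lasso coefficients).
    Returns (sorted list of kept variables, dict of selection proportions).
    """
    m = len(selected_per_imputation)
    if m == 0:
        raise ValueError("need at least one imputed dataset")

    # Count each variable at most once per imputed dataset
    counts = Counter(v for sel in selected_per_imputation for v in set(sel))
    proportions = {v: c / m for v, c in counts.items()}
    kept = sorted(v for v, p in proportions.items() if p > threshold)
    return kept, proportions


# Hypothetical selections from 4 imputed datasets
selections = [
    {"age", "bmi"},
    {"age", "bmi", "smoking"},
    {"age"},
    {"age", "bmi"},
]
kept, props = selection_frequency(selections, threshold=0.5)
print(kept)
print(props)
```

The threshold is a tuning choice, and a final model would still need to be refitted on the kept variables in each imputed dataset and pooled.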

If you insist on using an automated method, LASSO isn't as bad as most. But it's better NOT to use automated methods. This has been discussed many times here. It doesn't change if you are using imputed data.

– Peter Flom
Commented Dec 12, 2023 at 16:36
Thanks Nathan and Peter. There are so many different ways to handle this, but I think I would prefer to use a non-automated method, as I have in the past. Is it acceptable to use a procedure like fit.mult.impute for each fit at each stage of the modelling process?

– donm79
Commented Dec 12, 2023 at 21:28
