$\begingroup$

I am currently trying to build a model to link water quality metrics (e.g. biochemical oxygen demand, chemical oxygen demand) with regional characteristics data (e.g. population, GDP) through multiple linear regression.

The problem is that I have more than 200 regional characteristics, and not all of them are likely to be relevant. I therefore need to narrow them down with an appropriate variable selection method.

I have researched some methods and here is what I have so far:

  • univariate/bivariate selection should be avoided (source)
  • sparse PCA can work well, but is difficult to carry out in my case because data are missing for some years/regions, so I would first need to fill the gaps using an R package such as dineof
  • the R package glmnet (which carries out elastic net regularisation, combining the LASSO and ridge penalties)
  • Random forest model
  • Structural equation modelling (SEM)
  • the R package regsem (which combines LASSO with SEM, source)
  • Support vector machine (but sounds very complex)
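
For concreteness, here is a minimal sketch of what the glmnet option might look like. It runs on simulated data, so the dimensions, the variable names (`x`, `y`, `selected`), the choice of `alpha = 0.5`, and the use of `lambda.1se` are all illustrative assumptions rather than recommendations for the real dataset. Dropping incomplete rows is likewise only a placeholder for whatever missing-data handling is appropriate.

```r
# Sketch only: elastic net variable selection with glmnet on simulated data.
# Everything below (n, p, x, y, alpha = 0.5) is an illustrative assumption.
library(glmnet)

set.seed(1)
n <- 100   # observations (e.g. region-year combinations)
p <- 200   # candidate regional characteristics
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("x", seq_len(p))

# In this simulation only the first three predictors affect the response
y <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3] + rnorm(n)

# glmnet does not accept missing values, so keep complete rows only
# (imputation would be an alternative)
keep <- complete.cases(x) & !is.na(y)
x <- x[keep, , drop = FALSE]
y <- y[keep]

# alpha = 1 is the LASSO, alpha = 0 is ridge; 0.5 mixes the two penalties.
# cv.glmnet chooses the penalty strength lambda by cross-validation.
cv_fit <- cv.glmnet(x, y, alpha = 0.5, nfolds = 10)

# Coefficients at the more conservative lambda.1se;
# predictors with non-zero coefficients are the "selected" ones
b <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
print(selected)
```

Predictors whose coefficients are shrunk exactly to zero drop out of `selected`, which is how the elastic net acts as a variable selection step.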

I have also considered simply fitting a model with all 200+ predictors, narrowing them down using VIF and stepwise selection, and then choosing the best model by AIC. Is this okay?
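
To make the VIF-plus-stepwise idea concrete, this is a minimal base-R sketch on simulated data. The data frame `d`, the small number of predictors, and the deliberately collinear column `x8` are assumptions made purely for illustration. The VIF is computed by hand as 1 / (1 - R²) from regressing each predictor on the others, and `step()` performs backward selection by AIC.

```r
# Sketch only: VIF screening followed by backward stepwise selection by AIC.
# Simulated data; p is kept small so the example runs quickly.
set.seed(2)
n <- 120
p <- 8
d <- as.data.frame(matrix(rnorm(n * p), n, p))
names(d) <- paste0("x", seq_len(p))
d$x8 <- d$x1 + rnorm(n, sd = 0.1)   # deliberately collinear with x1
d$y  <- 2 * d$x1 - d$x2 + rnorm(n)

# VIF for predictor j is 1 / (1 - R^2), where R^2 comes from
# regressing x_j on all the other predictors
predictors <- setdiff(names(d), "y")
vif <- sapply(predictors, function(v) {
  f  <- reformulate(setdiff(predictors, v), response = v)
  r2 <- summary(lm(f, data = d))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))

# Backward stepwise selection by AIC, starting from the full model
full    <- lm(y ~ ., data = d)
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
```

Note that the full model here can only be fitted because there are more observations than predictors; with 200+ predictors that condition would need checking first.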

I am still very new to statistical methods and have not carried out variable selection before, so any suggestions would be appreciated.

$\endgroup$
