I am currently trying to build a model to link water quality metrics (e.g. biochemical oxygen demand, chemical oxygen demand) with regional characteristics data (e.g. population, GDP) through multiple linear regression.
The problem is that I have more than 200 regional characteristics and need to narrow them down, as not all of them are relevant. I therefore need to carry out variable selection appropriately.
I have researched some methods and here is what I have so far:
- univariate/bivariate selection should be avoided (source)
- sparse PCA can be good, but is difficult to carry out in my case as there is missing data for some years/regions, meaning I would need to fill in the gaps using R packages such as dineof
- the R package glmnet (which carries out elastic net regularisation, utilising both LASSO and ridge regression)
- Random forest model
- Structural equation modelling (SEM)
- the R package regsem (which combines LASSO with SEM, source)
- Support vector machine (but sounds very complex)
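To check my understanding of how elastic net would actually select variables, here is a minimal sketch of the idea using coordinate descent. It is written in Python with NumPy only to show the mechanics; it is not how glmnet is implemented, and in R I would presumably just call cv.glmnet. The data are synthetic, and the function names (soft_threshold, elastic_net) and the values of lam and alpha are my own assumptions, picked by hand rather than by cross-validation.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(X, y, lam, alpha=0.5, n_iter=200):
    """Illustrative coordinate descent for the elastic net penalty
    lam * (alpha * L1 + (1 - alpha) / 2 * L2) on standardised predictors.
    Returns coefficients on the standardised scale."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, variance 1
    resid = y - y.mean()                       # residual with all coefficients at 0
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid += Xs[:, j] * beta[j]        # add back predictor j's contribution
            rho = Xs[:, j] @ resid / n         # covariance of x_j with partial residual
            beta[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            resid -= Xs[:, j] * beta[j]        # subtract the updated contribution
    return beta

# Synthetic example (assumed data): 20 candidate predictors, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

beta = elastic_net(X, y, lam=0.5, alpha=0.9)
selected = np.flatnonzero(beta != 0)           # indices of predictors that were kept
print(selected)
```

The L1 part of the penalty is what sets coefficients to exactly zero, so the surviving indices are the "selected" variables. As far as I understand, the penalty strength would normally be chosen by cross-validation rather than fixed as it is here.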
I have also thought of simply building a model with all 200+ predictors, narrowing them down using VIF and stepwise selection, and then selecting the best model by AIC. Is this okay?
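To make the VIF step concrete, this is my understanding of how it is computed: the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors, and this equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. Below is a small sketch in Python with NumPy on made-up data; the function name vif and the three example columns are assumptions for illustration only.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are the predictors
    return np.diag(np.linalg.inv(corr))

# Synthetic example (assumed data): x2 is almost a copy of x0, x1 is unrelated.
rng = np.random.default_rng(1)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + 0.1 * rng.normal(size=200)
X = np.column_stack([x0, x1, x2])

vifs = vif(X)
print(vifs)  # x0 and x2 should have large values, x1 a value close to 1
```

One thing I noticed while writing this: the inverse only exists when the correlation matrix is non-singular, so with more predictors than observations (which may happen with 200+ variables) the VIFs cannot be computed this way at all.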
I am still very new to statistical methods and have not carried out variable selection before, so any suggestions would be appreciated.