I am currently doing research on the relationship between wastewater quality (e.g. biochemical oxygen demand, nitrogen content, ...) and regional characteristics of the catchment area (e.g. population, average income, ...). The problem (the absolute pain of this) is that I have only 16 samples and 84 explanatory variables, making this a p > n situation.
I am trying to carry out variable selection in order to narrow down to the main variables that explain the patterns in each wastewater quality measure.
From my research, these are the commonly suggested methods, along with the caveats I have seen:
- LASSO (but with correlated predictors its selection is unstable: it tends to arbitrarily keep one variable from a correlated group and drop the rest; see the sketch after this list)
- Ridge (but it keeps all variables with nonzero coefficients, which doesn't help with the simplicity of my model)
- Elastic net (the selected variable set is unstable given my sample size, i.e. high variance in which variables get chosen)
- Principal component regression (but it squishes the data into PCs, making the model less interpretable, although the loadings can be mapped back to the original variables with some loss of information)
- Sparse PCA (which sounds reasonable, but some posts have mentioned that unsupervised pre-processing performs poorly when the sample size is very low, as in my case. Link, comment by cbeleites)
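For concreteness, here is a minimal sketch (Python/scikit-learn) of one way to address the LASSO instability caveat: pick the penalty by leave-one-out CV (feasible with n = 16) and then use bootstrap stability selection, keeping only predictors that are selected in most resamples. The data below are synthetic placeholders standing in for my 16 × 84 matrix, and the 0.7 selection threshold and 200 resamples are illustrative conventions, not recommendations:

```python
# Sketch: LASSO with leave-one-out CV + bootstrap stability selection.
# X and y are synthetic placeholders for the catchment predictors and
# one wastewater quality outcome; all tuning choices are illustrative.
import numpy as np
from sklearn.linear_model import LassoCV, Lasso
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 84))   # placeholder predictors (n=16, p=84)
y = rng.normal(size=16)         # placeholder response

X_std = StandardScaler().fit_transform(X)  # LASSO is scale-sensitive

# Choose the penalty by leave-one-out CV (cheap with n = 16)
lasso = LassoCV(cv=LeaveOneOut(), max_iter=50_000).fit(X_std, y)

# Stability selection: refit on bootstrap resamples and count how
# often each predictor receives a nonzero coefficient.
n_boot = 200
counts = np.zeros(X.shape[1])
for _ in range(n_boot):
    Xb, yb = resample(X_std, y)
    fit = Lasso(alpha=lasso.alpha_, max_iter=50_000).fit(Xb, yb)
    counts += fit.coef_ != 0

# Keep predictors selected in >70% of resamples (threshold is a convention)
stable = np.where(counts / n_boot > 0.7)[0]
print("predictors selected in >70% of bootstraps:", stable)
```

The selection frequencies themselves are informative here: with n = 16, a predictor that only survives in a minority of resamples probably shouldn't be trusted, whichever method ultimately picks it.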
I have read Professor Harrell's book, and I have also considered removing a few predictors manually, using subject-matter knowledge, before any modelling (e.g. predictors that are not plausibly related to water quality, or predictors that might not be collected from now on).
I understand this isn't an "ideal" statistical situation - but any advice/recommendations regarding variable selection or modelling methods I could use would be greatly appreciated.
Thank you!