
I am currently doing research on the relationship between wastewater quality (e.g., biochemical oxygen demand, nitrogen content, ...) and regional characteristics of the catchment area (e.g., population, average income, ...). The problem (the absolute pain of this) is that I have only 16 samples and 84 explanatory variables, making it a p > n situation.

I am trying to carry out variable selection in order to narrow down to the main variables that can explain the patterns in each wastewater-quality measure.

From my research, these are the suggested methods and the caveats I have seen:

  • LASSO (but it tends to pick arbitrarily among correlated variables)
  • Ridge (but it keeps all of my variables, which doesn't help with the simplicity of my model)
  • Elastic net (unstable at my sample size with respect to its variance)
  • Principal component regression (but it compresses the data into PCs, making it less interpretable, although the components can be mapped back to the original variables with some loss of information)
  • Sparse PCA (which sounds reasonable, but some posts have mentioned that unsupervised pre-processing performs poorly for data with very small sample sizes, as in my case. Link, comment by cbeleites)
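To make the first caveat concrete, here is a minimal sketch (on simulated data with the same shape as mine, n = 16 and p = 84, not my actual census data) of fitting LASSO with leave-one-out cross-validation, which is about the only CV scheme such a small sample allows:

```python
# Hypothetical sketch: LASSO on a simulated p >> n problem (n=16, p=84).
# The data and the "true" predictors here are made up for illustration.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 16, 84
X = rng.normal(size=(n, p))
# In this toy example only 3 of the 84 predictors carry signal.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

X_std = StandardScaler().fit_transform(X)
# cv=n gives leave-one-out cross-validation over the 16 samples.
lasso = LassoCV(cv=n, max_iter=50_000).fit(X_std, y)
selected = np.flatnonzero(lasso.coef_)
print("chosen alpha:", lasso.alpha_)
print("selected predictor indices:", selected)
```

Rerunning this with a different random seed typically changes which (and how many) predictors survive, which illustrates the instability of selection at this sample size rather than a defect in the code.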

I have read Professor Harrell's book, and I have also considered removing a few predictors manually before modelling (e.g., predictors unrelated to wastewater quality, or predictors that might not be collected from now on).

I understand this isn't an "ideal" statistical situation, but any advice or recommendation regarding variable selection or modelling methods I can use would be greatly appreciated.

Thank you!

  • There's no real solution to having a very small sample. Is data very hard/expensive to gather? // Given what you've got, I might just present a table of correlations between the DVs and IVs and use it as a guide for further research. (Commented May 19 at 10:37)
  • Related to @PeterFlom's comment, variable selection doesn't work for very large sample sizes, so you know it cannot work for small ones. Use within-predictor correlations to do data reduction and model only reduced scores (variable clustering, principal components, sparse PCs). (Commented May 19 at 11:32)
  • @PeterFlom the data is already gathered from national census data, so it is impossible to get more data unfortunately. Wrt DV-IV correlation, I can definitely include it in my results. Thank you for the suggestion! (Commented May 19 at 12:10)
  • @FrankHarrell I think I will have to go with those options. After reducing variables through the methods you suggested, is it necessary to carry out e.g. VIF to check for multicollinearity and AIC to select for the best model? (Commented May 19 at 12:11)
  • I don't think so, but these methods will negate the effect of collinearities anyway. (Commented May 20 at 11:49)
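The data-reduction route suggested in the comments can be sketched as follows: compress the predictors to a handful of principal component scores (unsupervised, so y is never consulted during reduction) and fit an ordinary regression on those scores. The data and the number of components here are illustrative assumptions, not a recommendation for the actual analysis:

```python
# Hypothetical sketch of data reduction before regression:
# 84 predictors -> a few PC scores -> ordinary least squares on the scores.
# Simulated data; component count chosen only for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 16, 84
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n)

X_std = StandardScaler().fit_transform(X)
# With n=16, even 3 fitted slopes is generous; keep very few components.
pca = PCA(n_components=3).fit(X_std)
scores = pca.transform(X_std)
model = LinearRegression().fit(scores, y)

# The loadings (components_) map each score back onto the 84 original
# predictors, which is what keeps this approach interpretable.
print("explained variance ratios:", pca.explained_variance_ratio_)
print("loadings shape:", pca.components_.shape)
```

Collinearity among the original 84 predictors is absorbed into the components (the scores are orthogonal by construction), which is why the comment above says a separate VIF check is not needed after this kind of reduction.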
