I've been going over the output of a Monte Carlo model that simulates disease risk as a function of genotype. Under a null model in which genotype confers no disease risk, we have 1000 case and 1000 control individuals. Each individual has 500 loci of interest, and a genotype is randomly assigned at each locus according to its allele frequency. Under this scenario, any association between genotype and disease can only be due to chance.
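For concreteness, here is a minimal R sketch of the kind of null data set I mean (my actual simulation lives in Mathematica; the allele-frequency range and the additive 0/1/2 genotype coding below are placeholders I've chosen for the example):

```r
## Toy version of the null simulation (placeholder allele frequencies;
## the real simulation is in Mathematica).
set.seed(1)
n_ind  <- 2000                      # 1000 cases + 1000 controls
n_loci <- 500
maf    <- runif(n_loci, 0.05, 0.5)  # placeholder minor-allele frequencies

## Genotype = number of minor alleles (0/1/2), drawn independently of disease status
geno <- sapply(maf, function(p) rbinom(n_ind, size = 2, prob = p))
colnames(geno) <- paste0("locus", seq_len(n_loci))

status <- rep(c(1, 0), each = 1000)  # 1 = case, 0 = control; unrelated to geno
```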
Nevertheless, when we run logistic regression of disease status on genotype, as well as several other machine learning classifiers (naive Bayes, neural networks, random forests), we consistently find test-set AUC > 0.5 under this null model. If we simulate a reduced data set with fewer individuals (200) and fewer loci (50), the null-model AUC is even larger.
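This is how I evaluate the classifiers on held-out data in the sketch above (continuing from that code; plain glm stands in for the various classifiers, the 70/30 split is arbitrary, and the rank-based AUC estimate is just to keep the snippet dependency-free):

```r
## Continuing the sketch: train/test split, logistic regression, test-set AUC.
auc <- function(scores, labels) {
  ## Wilcoxon / Mann-Whitney estimate of the AUC
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

train <- sample(n_ind, 0.7 * n_ind)
dat   <- data.frame(status = status, geno)

fit  <- glm(status ~ ., family = binomial, data = dat[train, ])  # may warn with 500 noise predictors
pred <- predict(fit, newdata = dat[-train, ], type = "response")
auc(pred, dat$status[-train])   # without feature selection this sits near 0.5 for me
```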
I can understand how overfitting might give AUC < 0.5, but I can't think of a plausible mechanism that would generate AUC > 0.5 under this null model.
A further anomaly: in the absence of feature selection, the AUC is approximately 0.5 under the null model for every classifier. However, once feature selection (via the LASSO) is introduced, I get AUC > 0.5 for all of them.
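Again for concreteness, the feature-selection step in my sketch looks roughly like the following. Note that glmnet is my assumption about the R side, and running the LASSO on the full data set before the train/test split is likewise an assumption about how the pipeline is wired; the real LASSO call is made from Mathematica.

```r
## Sketch of the feature-selection step, continuing from the code above.
## ASSUMPTIONS: glmnet for the LASSO, and selection run on the full data
## before the train/test split; the real call comes from Mathematica via R.
library(glmnet)

cvfit <- cv.glmnet(geno, status, family = "binomial", alpha = 1)
co    <- as.matrix(coef(cvfit, s = "lambda.min"))
keep  <- which(co[-1, 1] != 0)      # loci with nonzero LASSO coefficients (intercept dropped)

## (assumes the LASSO retains at least one locus; under the null it can retain none)
dat_sel  <- data.frame(status = status, geno[, keep, drop = FALSE])
fit_sel  <- glm(status ~ ., family = binomial, data = dat_sel[train, ])
pred_sel <- predict(fit_sel, newdata = dat_sel[-train, ], type = "response")
auc(pred_sel, dat_sel$status[-train])   # this is where I see AUC > 0.5
```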
Is there some phenomenon that's a "mirror image" of overfitting that could generate these results, and if so, why do I only see it once feature selection is applied?
In case it's relevant: the random assignment of genotypes and the machine learning classifiers were implemented in Mathematica (a colleague's old code), while the LASSO was run in R (called from Mathematica), though I don't think this should matter.