There's a set of methods called "robust" principal component analysis (here, "robust" means resistant to influence from outliers). One example is Hubert et al., "ROBPCA: A new approach to robust principal component analysis," Technometrics (2005): https://doi.org/10.1198/004017004000000563. In that paper, a subset of the observations (say, 75%) is used to estimate the principal components, on the assumption that those observations are non-outliers. The paper then proposes diagnostics intended to flag candidate outliers.
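To make that subset idea concrete, here is a toy illustration in Python/NumPy. To be clear, this is a crude trimmed PCA of my own devising, not the actual ROBPCA algorithm (which combines projection-pursuit ideas with MCD-type covariance estimation); all names and numbers here are mine. It just shows the mechanism: fitting PCA on only the most central 75% of points resists contamination that drags conventional PCA off the bulk's main axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bulk of the data: correlated 2-D Gaussian whose first population PC
# points along (1, 1) / sqrt(2).
n = 200
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Contaminate 10% of the observations with a cluster off the main axis.
n_out = 20
X[:n_out] = rng.multivariate_normal([6.0, -6.0], 0.1 * np.eye(2), size=n_out)

def pca_first_component(Z):
    """First eigenvector of the sample covariance matrix of Z."""
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    return vecs[:, np.argmax(vals)]

# Conventional PCA on all observations.
v_all = pca_first_component(X)

# Crude trimmed variant in the spirit of ROBPCA's h-subset idea
# (NOT the real algorithm): keep the h = 75% of points closest to the
# coordinatewise median and fit PCA on that subset only.
h = int(0.75 * len(X))
d = np.linalg.norm(X - np.median(X, axis=0), axis=1)
X_clean = X[np.argsort(d)[:h]]
v_robust = pca_first_component(X_clean)

print("conventional PC1:", np.round(v_all, 3))    # pulled toward the
                                                  # contamination, ~ +/-(0.71, -0.71)
print("trimmed PC1:     ", np.round(v_robust, 3)) # near the bulk axis,
                                                  # ~ +/-(0.71, 0.71)
```

On this simulated data the conventional first PC rotates toward the (1, −1) contamination direction, while the trimmed fit stays near the bulk's (1, 1)/√2 axis. (Eigenvector signs are arbitrary, hence the ±.)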
I can see the value in a PCA method that helps identify outliers, provided one then investigates the outliers it finds. If some of them are deemed inappropriate to include with the rest of the data (perhaps because they represent contaminated measurements, or come from a population too different from the rest to justify lumping together), they can be removed. But suppose instead that some observations flagged as "outliers" are judged to be legitimately in-sample, and should not be modified or excluded. Then I'm nervous about using the resulting PCs as a substitute for conventional (non-robust) PCs. There is theory for what conventional PCs mean and what they estimate in the population. I don't know what the analogue is for robust PCs: what they estimate, and whether what they estimate is desirable or meaningful.
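To pin down what I mean by "theory" here (this is my gloss, the standard textbook result, not a claim from the paper): under the usual conditions (i.i.d. sampling, finite fourth moments, distinct eigenvalues), the conventional sample PC directions are consistent, up to sign, for the eigenvectors of the population covariance matrix,

$$\Sigma = \operatorname{Cov}(X) = \sum_{j} \lambda_j v_j v_j^\top, \qquad \hat{v}_j \xrightarrow{\;p\;} \pm v_j,$$

so I know exactly which population quantity a conventional PC is pointing at. My question is what the corresponding population target is for a robust PC.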
So suppose that the observations identified by robust PCA as "outliers" are kept in the sample as-is. What is robust PCA then estimating in the population, and why should I care about it? Why should I continue to use robust PCA?