
There's a set of methods called "robust" principal component analysis (here, "robust" means resistant to the influence of outliers). One example is Hubert et al., "ROBPCA: A new approach to robust principal component analysis," Technometrics (2005): https://doi.org/10.1198/004017004000000563. In that paper, in particular, a subset of observations (say, 75%) is used to estimate the principal components, on the assumption that those observations are non-outliers. The paper then proposes methods intended to identify candidate outliers.
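To make the subset idea concrete, here is a minimal, self-contained sketch in Python/NumPy. It is not Hubert et al.'s actual ROBPCA algorithm (which combines projection pursuit with robust covariance estimation); it only illustrates how fitting PCA on a retained fraction of the data can keep the leading component from being pulled toward gross outliers. The data and parameter choices below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 90 "typical" points whose variance is mostly along x,
# plus 10 gross outliers far off in the y direction.
inliers = rng.normal(size=(90, 2)) * np.array([3.0, 0.5])
outliers = rng.normal(loc=(0.0, 15.0), size=(10, 2))
X = np.vstack([inliers, outliers])

def pca_direction(X):
    """First principal component (via SVD of the centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def trimmed_pca_direction(X, keep=0.75, n_iter=10):
    """Crude subset-based PCA: repeatedly refit on the fraction
    `keep` of points closest to the current subset's mean.
    Only a sketch of the subset idea, NOT the ROBPCA algorithm."""
    idx = np.arange(len(X))
    for _ in range(n_iter):
        center = X[idx].mean(axis=0)
        dist = np.linalg.norm(X - center, axis=1)
        idx = np.argsort(dist)[: int(keep * len(X))]
    return pca_direction(X[idx])

v_plain = pca_direction(X)            # dragged toward the outliers
v_trimmed = trimmed_pca_direction(X)  # stays on the inliers' x axis
```

On this toy data the plain first component tilts toward the outlier cluster (the y direction), while the trimmed fit recovers the inliers' dominant x axis.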

I can see the value in having a method for PCA that can help identify outliers if one then investigates the outliers found. If some of the outliers are then deemed to be inappropriate to include with the rest of the data (perhaps because they represent contaminated results or are from a population too different from the rest to justify lumping together), they can be removed. But suppose then that some observations identified as "outliers" are judged to be in-sample, and should not be modified or excluded. Then I'm nervous about using the resulting PCs as a substitute for conventional (let's say non-robust) PCs. There's theory for what conventional PCs mean and estimate in relation to the population. I don't know what the analogue is for robust PCs, what they're estimating, and whether what they estimate is desirable or meaningful.

So suppose that observations identified by robust PCA as "outliers" are kept in the sample as-is. What is robust PCA then estimating in the population, and why should I care about it? Why should I continue to use robust PCA?


1 Answer


@cgmil: But suppose then that some observations identified as "outliers" are judged to be in-sample, and should not be modified or excluded.

Typically there are many more inliers than outliers, so dropping a few "good" data points will not cause serious trouble.


The term "outlier" implies a mixture model. There is perhaps a "typical" generative process for most of the data, and an "atypical" generative process for a smaller subset of the data.

If you can become philosophically comfortable with mixture models, you will likely be comfortable applying PCA to subsets of your data. See, for example,

Tipping, Michael E., and Christopher M. Bishop. "Probabilistic principal component analysis." Journal of the Royal Statistical Society Series B: Statistical Methodology 61.3 (1999): 611-622.

and

Fischler, Martin A., and Robert C. Bolles. "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography." Communications of the ACM 24.6 (1981): 381-395.
See also the Wikipedia article on RANSAC.
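To make the RANSAC connection concrete, here is a hedged sketch for the simplest case, fitting a one-dimensional subspace (a line): repeatedly sample two points, form the line through them, and keep the direction that the largest number of points falls close to. The toy data and thresholds are invented for illustration, not taken from Fischler and Bolles.

```python
import numpy as np

rng = np.random.default_rng(1)

# Contaminated data: 80 points near the line y = x, plus 20
# uniformly scattered outliers.
t = rng.normal(size=80)
line_pts = np.column_stack([t * 3.0, t * 3.0]) + rng.normal(scale=0.2, size=(80, 2))
scatter = rng.uniform(-8, 8, size=(20, 2))
X = np.vstack([line_pts, scatter])

def ransac_line(X, n_trials=200, tol=0.5):
    """RANSAC for a 1-D subspace: sample point pairs, keep the
    direction whose line has the most points within `tol`."""
    best_dir, best_count = None, -1
    for _ in range(n_trials):
        i, j = rng.choice(len(X), size=2, replace=False)
        d = X[j] - X[i]
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue  # degenerate pair, skip
        d = d / norm
        # Perpendicular residual of every point to the line through X[i].
        centered = X - X[i]
        resid = centered - np.outer(centered @ d, d)
        count = int((np.linalg.norm(resid, axis=1) < tol).sum())
        if count > best_count:
            best_dir, best_count = d, count
    return best_dir, best_count

direction, n_inliers = ransac_line(X)
```

The consensus direction comes out close to the inliers' diagonal axis, and the outliers are simply the points outside the consensus set, which is exactly the mixture-model framing above: one generative process for the line points, another for the scatter.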
