Why is 50% the best breakdown point for an estimator?

Question

As stated in Wikipedia:

Intuitively, we can understand that a breakdown point cannot exceed 50% because if more than half of the observations are contaminated, it is not possible to distinguish between the underlying distribution and the contaminating distribution (Rousseeuw & Leroy, 1986). Therefore, the maximum breakdown point is 0.5.

What is the intuition behind this?

When you suppose data can come from two or more sources, at most one source can contribute more than 50% of the data. That's all this is saying. The illustrations in my (closely) related post at stats.stackexchange.com/a/114363/919 might help with the intuition.
– whuber ♦, Commented Jan 14, 2022 at 17:43

Christian Hennig · Accepted Answer · 2022-01-14 11:05:42Z

First, realise that there is a mathematical theorem behind this, and it has assumptions; the statement isn't true in general. A standard assumption is affine equivariance, which roughly means that the result only holds for estimators that "move" with the data in a certain sense. For example, if you compute the sample mean and then add 5 to all observations, the mean moves by 5 as well. In particular, the estimator needs to be able to move to infinity if the data are changed more and more. Technically, estimating the mean as 0 independently of the data defines an "estimator" as well (a very crappy one!), and this estimator has a breakdown point of 100%: regardless of what the data are and how much you change them, it will always be the same!
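As a small illustration of the equivariance idea, here is a minimal sketch in Python. The sample values and the helper names `constant_zero` and `shift_response` are made up for this example; it only checks the "add 5 to all observations" case (a shift, which is one special case of an affine transformation).

```python
import statistics


def constant_zero(data):
    """A degenerate 'estimator' that ignores the data and always returns 0."""
    return 0.0


def shift_response(estimator, data, c):
    """How far the estimate moves when c is added to every observation."""
    shifted = [v + c for v in data]
    return estimator(shifted) - estimator(data)


data = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up illustrative sample

for name, est in [("mean", statistics.mean),
                  ("median", statistics.median),
                  ("constant 0", constant_zero)]:
    # Equivariant estimators move by exactly c = 5; the constant one does not move.
    print(name, shift_response(est, data, 5.0))

# Replacing every observation by a huge value still cannot move the constant estimator.
print(constant_zero([1e12] * len(data)))
```

The mean and the median follow the data, while the constant estimator never moves, which is why it can "survive" 100% contamination without being useful.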

Now imagine you have a (reasonably flexible, see above) estimator $T$ that has a breakdown point of 60%. Say you have a data set $x=(x_1,\ldots,x_{100})$. Having a breakdown point of 60% means that you can keep observations $x_1,\ldots,x_{41}$ and replace the other 59 observations (59% of the data) with something else, and the estimator will still stay in a neighborhood of the original value $T(x)$.

Now imagine a sequence of other data sets $y^k=(y_1^k,\ldots,y^k_{100})$, $k=1,2,\ldots$, such that $T(y^k)\to\infty$ as $k\to\infty$ (which is possible because of the affine equivariance assumption, see above), i.e., $T(y^k)$ can be arbitrarily far away from $T(x)$. If the estimator has a breakdown point of 60%, you can change 59% of the observations of $y^k$ and the resulting estimate will still be arbitrarily far away from $T(x)$.

But this isn't possible, because when replacing 59% of the observations of $y^k$, you may well introduce $x_1,\ldots,x_{41}$ into the data set, and then the estimator needs to be close to $T(x)$, as explained in the paragraph before. So there is one >40% portion of the data that requires the estimator to be in one place, and another >40% portion that requires it to be in a totally different place. Both cannot hold at once.
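The same contradiction can be written a bit more formally. The following is only a sketch, under assumptions that go beyond what is stated above: a one-dimensional location estimator $T$ that ignores the order of the observations and is translation equivariant (a special case of affine equivariance), an even sample size $n=2m$, and the hypothetical notation $b_m(x)$ for the worst-case shift of $T$ after replacing at most $m$ observations of $x$.

$$
b_m(x)=\sup\bigl\{\,|T(z)-T(x)| : z \text{ differs from } x \text{ in at most } m \text{ entries}\,\bigr\},
\qquad
T(x+c)=T(x)+c \;\Longrightarrow\; b_m(x+c)=b_m(x).
$$

For any shift $c$, take $z=(x_1,\ldots,x_m,\;x_1+c,\ldots,x_m+c)$. It differs from $x$ in at most $m$ entries and, up to order, from $x+c=(x_1+c,\ldots,x_n+c)$ in at most $m$ entries, so

$$
|c| = |T(x+c)-T(x)| \;\le\; |T(x+c)-T(z)| + |T(z)-T(x)| \;\le\; b_m(x+c)+b_m(x) \;=\; 2\,b_m(x).
$$

Since $c$ is arbitrary, $b_m(x)=\infty$: under these assumptions, replacing half of the observations is already enough to move $T$ arbitrarily far.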

This can only be avoided by having a breakdown point <50%.
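To see the 50% barrier numerically, here is a minimal sketch in Python that compares the mean and the median under replacement contamination. The data set, the helper names `contaminate` and `max_shift`, and the handful of replacement values are illustrative assumptions; the code tries only a few specific replacements and does not compute a true worst case.

```python
import statistics


def contaminate(data, m, value):
    """Replace the first m observations with `value` (replacement contamination)."""
    return data[m:] + [value] * m


def max_shift(estimator, data, m, values=(1e3, 1e6, 1e9)):
    """Largest change in the estimate over a few increasingly extreme replacements."""
    base = estimator(data)
    return max(abs(estimator(contaminate(data, m, v)) - base) for v in values)


data = [float(i) for i in range(1, 101)]  # 1, 2, ..., 100 (made-up sample)

for m in (1, 49, 50, 51):
    print(
        f"replaced {m:>2} of {len(data)}:",
        "mean shift =", max_shift(statistics.mean, data, m),
        "| median shift =", max_shift(statistics.median, data, m),
    )
```

With 49 replacements, the median's shift is the same however extreme the replacement value is, because it stays within the range of the untouched observations. From 50 replacements on, it grows with the replacement value. The mean is already dragged away by a single replaced observation.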

The claim of "$<50\%$" seems not to be true, at least depending on the definition of breakdown point. The median has a "finite sample breakdown point" of half, rounded up, according to Davies & Gather, "The Breakdown Point — Examples and Counterexamples", REVSTAT, 2007.
– user551504, Commented Jan 14, 2022 at 15:01

@user551504 There are various definitions of breakdown point around. In particular, you can define the breakdown point as the largest proportion at which breakdown does not occur (in which case it's smaller than 50%) or as the smallest proportion at which breakdown occurs (in which case it can be 50% exactly but not larger). The implications are the same. There's even more: for example, there are "replacement" and "addition" breakdown points, and breakdown points for distributions rather than data sets. I thought I'd keep things simple...
– Christian Hennig, Commented Jan 14, 2022 at 16:41

Sure, there's a theorem here, but it's trivial, as I explained in a comment to the question. We have to understand the context of the Wikipedia quotation, which is based on a model in which data coming from one process are "contaminated" by data from another process. Thus, the situation implicitly concerns a mixture. Discussion of various definitions of breakdown, of equivariance, etc. seems beside the point.
– whuber ♦, Commented Jan 14, 2022 at 17:46

@whuber If the theorem were trivial, it should be clear for what kind of estimators it holds and for what kind it doesn't, but that isn't so trivial after all. Note that there are reasonable estimators that are not affine equivariant and can reach a higher breakdown point, which depends on the sample, the specific estimation problem, and the precise breakdown concept in use.
– Christian Hennig, Commented Jan 14, 2022 at 18:10

This has nothing to do with the estimators and everything to do with the model being discussed in the quotation. The quotation holds for all estimators, without exception, under this mixture assumption, precisely because it relies on such a simple mathematical fact.
– whuber ♦, Commented Jan 14, 2022 at 19:27
