Not sure if there's an existing stats concept for this, but I have a dataset that consists mostly of small data points with a few large ones.
e.g. 1 2 1 3 1 2 87 3 2 1 1 1 1 3 1 2 1 1 1 99
How can I filter this dataset down to only the values that disproportionately make up the bulk of the total? I am currently keeping data points that lie a few standard deviations above the mean, but this doesn't tell me what % of the total I am getting (e.g. if I go 2 standard deviations out, am I capturing 70% of the total? If I go 5, is it 95%?). I only know what % of the number of data points I'm keeping, not what % of the total sum. A rough sketch of what I'm doing now is below.
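To make my current approach concrete, here's roughly what I'm doing (Python sketch using the example data above; the threshold `k` is arbitrary). I can compute the share of the total after the fact, but the cutoff itself is chosen in standard-deviation terms, not in terms of the total:

```python
import numpy as np

values = np.array([1, 2, 1, 3, 1, 2, 87, 3, 2, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 99])

# Current approach: keep points more than k standard deviations above the mean.
k = 2
cutoff = values.mean() + k * values.std()
kept = values[values > cutoff]

# What the rule tells me: the fraction of *points* I kept.
print("fraction of points kept:", len(kept) / len(values))

# What I actually care about: the fraction of the *total sum* those points represent.
print("fraction of total kept: ", kept.sum() / values.sum())
```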
EDIT: I want to remove as many data points as possible without removing the important ones. So if I have a mean of 5 and a standard deviation of 20, I filter out data points below 45 (5 + 20 + 20, i.e. two standard deviations above the mean). This removes, say, 95% of the data points, but the remaining dataset can then look like: 50 46 90 80 44 99999 57 87 88. The Pareto principle applies recursively here because of the 99999. In this scenario I'd like to keep only the 99999, since it alone accounts for over 99% of the total, but I can't tell that from a standard-deviation rule of thumb alone.
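In other words, I think what I'm after is something like the sketch below: sort descending, accumulate the running share of the total, and keep the smallest set of points that reaches a target share (the 99% threshold here is just an example I made up):

```python
import numpy as np

def top_contributors(values, share=0.99):
    """Return the smallest set of values whose sum is at least `share` of the total."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]  # largest first
    cum_share = np.cumsum(v) / v.sum()                  # running fraction of the total
    n = np.searchsorted(cum_share, share) + 1           # first prefix reaching the target
    return v[:n]

print(top_contributors([50, 46, 90, 80, 44, 99999, 57, 87, 88]))  # -> [99999.]
```

But this just moves the arbitrariness from "how many standard deviations" to "what share of the total", so I'm wondering whether there's a principled way to pick that cut.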
For example, many people will agree that 1% of people can hold 99% of the wealth. If you slice into that 1% further, you find that 1% of that 1% holds 99% of that wealth, meaning that 0.01% of people hold roughly 98% of the total (99% of 99%). This second piece of information is the surprising one, since it identifies the "big guys" among the "big guys". It might even go further, to the "big guys" of the "big guys" of the "big guys" (big-guys^3): maybe one person holds 95% of all the wealth. How can I analyze my data for this? With a pie or bar chart it would be obvious at a glance.
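The kind of summary I'm imagining (instead of eyeballing a chart) is something like this sketch, which reports what share of the total the top 10%, 1%, 0.1%, ... of points hold; the fractions listed are just placeholders:

```python
import numpy as np

def concentration_profile(values, fractions=(0.10, 0.01, 0.001)):
    """For each fraction f, report the share of the total held by the top f of points."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    total = v.sum()
    for f in fractions:
        n = max(1, int(round(f * len(v))))  # always look at at least one point
        print(f"top {f:.1%} of points hold {v[:n].sum() / total:.1%} of the total")
```

Is there a standard statistic or procedure that formalizes this (i.e. detects this kind of nested concentration automatically)?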