Usefulness of p-value to flag outliers in a data set [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.

Closed last year.

Suppose I have a set of data such that $$y= a\times x + b + \varepsilon $$

I am trying to find $a$ and $b$, but some of the $y$ values are outliers and up to 80% of the data is missing, so I don't have access to $x$. To get around this, I am constructing $x$ with a B&B algorithm, because if I can construct $x$, I can easily find $y$.

Would a p-value be useful for flagging some data points as outliers? If so, how can it be done?

What is a "B&B algorithm"? I find it hard to believe that any algorithm can work well with so much missing data.
– Peter Flom, Commented Jul 24, 2024 at 11:35
(1) "p-value" of what hypothesis? (2) Why are the data missing? That's crucial for giving objective answers. (3) In what sense do you "not have access" to $x$? If you have no values and don't know $a$ or $b,$ then there's nothing you can do. It sounds like you're chasing your tail by trying to "construct" $x$ from $y,$ $a,$ and $b,$ and then estimating $a$ and $b$ from that.
– whuber ♦, Commented Jul 24, 2024 at 18:16
Read whuber's comment again, and then ask a new question about how to set up the analysis of your core problem using the data that you have. Describe the data and hypotheses in detail. Outliers should be a million miles away from your most pressing concerns.
– Michael Lew, Commented Jul 24, 2024 at 21:37

Nick Cox · Accepted Answer · 2024-07-24 18:07:55Z

Typically, a p-value is used to test whether a whole dataset can be explained by a simple model (called the null hypothesis) as opposed to a more complex one. That is not your situation, since you want to talk about single data points.

We can tweak this slightly. I imagine you might want to do something like this:
- Assume that each $y$ is Gaussian, centered at $ax + b$.
- Compute the probability that a given $y$ lies as far from $ax + b$ as observed, or farther, in the same way that a p-value is computed.

As far as I can see, that should be mostly valid, but I think we can do better.
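As a rough illustration of the two steps above, here is a minimal sketch. It assumes that $x$ is known (which is not the asker's situation), that the line is fitted by ordinary least squares, and that the data and the 0.05 cutoff are hypothetical choices made only for this example.

```python
import math

def ols_fit(x, y):
    """Ordinary least-squares slope a and intercept b for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

def gaussian_tail_probs(x, y):
    """Two-sided Gaussian tail probability of each residual."""
    a, b = ols_fit(x, y)
    res = [yi - (a * xi + b) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = math.sqrt(sum(r * r for r in res) / (len(x) - 2))
    # P(|Z| >= |r| / sigma) for a standard normal Z.
    return [math.erfc(abs(r) / (sigma * math.sqrt(2))) for r in res]

# Hypothetical data: y = 2x + 1 with small noise, and one corrupted value at index 5.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 25.0, 12.9, 15.2, 16.8, 19.1]

probs = gaussian_tail_probs(x, y)
flagged = [i for i, p in enumerate(probs) if p < 0.05]  # assumed cutoff
print(flagged)
```

On this toy data only index 5 falls below the cutoff. Note that the corrupted point also inflates the estimate of `sigma`, which makes its own tail probability larger than it would otherwise be.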
There are two difficulties that are worth highlighting:

1. In the presence of outliers, the regression line is "pulled" towards them. The effect is strongest for outliers at very low or very high values of $x$ (high-leverage points), which therefore end up closer to the fitted line than they should be.
2. Your data might be non-Gaussian, in which case converting deviations to p-values under a Gaussian model is not meaningful.
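The "pull" described in the first difficulty can be seen on a small, made-up example. This sketch assumes $x$ is known and uses ordinary least squares; the data are hypothetical. It compares the residual of an outlier at the largest $x$ when the line is fitted to all points with its residual when the line is fitted without it.

```python
def ols_fit(x, y):
    """Ordinary least-squares slope a and intercept b for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

# Hypothetical data: y = 2x + 1 exactly, except the last point
# (largest x), which is shifted up by 10.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2 * xi + 1 for xi in x]
y[-1] += 10

# Residual of the shifted point when it takes part in the fit.
a_all, b_all = ols_fit(x, y)
res_full = y[-1] - (a_all * x[-1] + b_all)

# Residual of the same point when the line is fitted to the other points only.
a_loo, b_loo = ols_fit(x[:-1], y[:-1])
res_loo = y[-1] - (a_loo * x[-1] + b_loo)

print(res_full, res_loo)
```

The residual from the full fit is noticeably smaller than the true shift of 10, because the point has dragged the line towards itself; the fit that excludes the point recovers the full shift.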
Overall, I would advise the following:

- Use a robust regression, such as Huber regression, if you suspect that you have outliers. It prevents outliers from exerting a strong pull.
- For each point, fit the regression line to the rest of the data before computing the error at that point.
- Do not compute probabilities; use absolute deviations instead, and find some way to decide when a deviation is too big.
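One possible way to combine the robust fit with a deviation-based rule is sketched below. It is only an illustration under several assumptions that are not in the answer: $x$ is known, Huber regression is implemented by iteratively reweighted least squares with the conventional tuning constant 1.345, "too big" means more than 3 robust standard deviations, and the robust scale is the median absolute deviation (MAD) times 1.4826. The data are hypothetical, and the leave-one-out step is not included here.

```python
import statistics

def weighted_fit(x, y, w):
    """Weighted least-squares slope and intercept for y = a*x + b."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    a = sxy / sxx
    return a, my - a * mx

def mad_scale(res):
    """Robust scale: 1.4826 * median absolute deviation from the median."""
    med = statistics.median(res)
    return 1.4826 * statistics.median([abs(r - med) for r in res])

def huber_fit(x, y, k=1.345, n_iter=50):
    """Huber regression by iteratively reweighted least squares."""
    w = [1.0] * len(x)
    for _ in range(n_iter):
        a, b = weighted_fit(x, y, w)
        res = [yi - (a * xi + b) for xi, yi in zip(x, y)]
        s = mad_scale(res) or 1e-12  # guard against a zero scale
        # Huber weights: 1 inside the band, k*s/|r| outside it.
        w = [1.0 if abs(r) <= k * s else k * s / abs(r) for r in res]
    return a, b

def flag_outliers(x, y, threshold=3.0):
    """Indices whose absolute residual exceeds threshold * robust scale."""
    a, b = huber_fit(x, y)
    res = [yi - (a * xi + b) for xi, yi in zip(x, y)]
    s = mad_scale(res) or 1e-12
    return [i for i, r in enumerate(res) if abs(r) > threshold * s]

# Hypothetical data: y = 2x + 1 with small noise, and one corrupted value at index 5.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 25.0, 12.9, 15.2, 16.8, 19.1]

flagged = flag_outliers(x, y)
print(flagged)
```

The cutoff of 3 is a judgment call rather than a probability statement, in keeping with the advice to work with deviations instead of p-values.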

The thing is, I don't have access to $x$, so I can't do a robust regression; that is why I am trying to construct $x$. So finding outliers is necessary, otherwise I won't have the right $x$.
– Anatole, Commented Jul 24, 2024 at 10:08
You use the $y$ to reconstruct $x$? I think that missing data is the hardest statistical problem, so I don't have good generic advice for that. My remarks above are still valid. My overall advice would be to collect better data. I think it's always OK to conclude that we can't say anything given the current data.
– Guillaume Dehaene, Commented Jul 24, 2024 at 10:15
