-1
$\begingroup$

Suppose I have a set of data such that $$y= a\times x + b + \varepsilon $$

I am trying to find $a$ and $b$, but some of the $y$ values are outliers and up to 80% of the data is missing, so I don't have access to $x$. To work around this, I am constructing $x$ with a B&B algorithm, because if I can construct $x$, I can easily find $y$.

Would a p-value be useful for flagging some data points as outliers? If so, how can it be done?

$\endgroup$
3
  • 2
    $\begingroup$ What is a "B&B algorithm"? I find it hard to believe that any algorithm can work well with so much missing data. $\endgroup$ Commented Jul 24, 2024 at 11:35
  • 3
    $\begingroup$ (1) "p-value" of what hypothesis? (2) Why are the data missing? That's crucial for giving objective answers. (3) In what sense do you "not have access" to $x$? If you have no values and don't know $a$ or $b,$ then there's nothing you can do. It sounds like you're chasing your tail by trying to "construct" $x$ from $y,$ $a,$ and $b,$ and then estimating $a$ and $b$ from that. $\endgroup$ Commented Jul 24, 2024 at 18:16
  • 1
    $\begingroup$ Read Whuber's comment again, and then ask a new question about how to set up the analysis of your core problem using the data that you have. Describe the data and hypotheses in detail. Outliers should be a million miles away from your most pressing concerns. $\endgroup$ Commented Jul 24, 2024 at 21:37

1 Answer

1
$\begingroup$
  1. Typically, a p-value is used to test whether a whole dataset can be explained by a simple model (called the null hypothesis) as opposed to a more complex model. That is not your situation, since you want to assess single data points.

  2. We can tweak this slightly. I imagine you might want to do something like this:

    • assume that each $y$ is Gaussian centered at $ax + b$
    • compute the probability that a given $y$ is this far or farther from $ax + b$, in much the same way as a p-value is computed

    As far as I can see, that should be mostly valid, but I think we can do better.

  3. There are two difficulties worth highlighting:

    1. In the presence of outliers, the regression line will be "pulled" towards them. The effect is strongest for outliers at very low or very high values of $x$ (high-leverage points), so those outliers will sit closer to the fitted line than they should.
    2. Your data might be non-Gaussian, in which case converting deviations to p-values under a Gaussian model is not meaningful.
  4. Overall, I would advise the following:

    • do a robust regression, such as Huber regression, if you suspect that you have outliers. It prevents outliers from having a strong pull.
    • for each point, fit the regression line to the rest of the data before computing the error at that point
    • do not compute probabilities, but use absolute deviations, and find some way to identify when the deviation is too big.
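
The three steps above can be sketched in plain Python. This is only an illustration under assumptions of my own: it presumes $x$ is actually available, it computes the Huber fit by iteratively reweighted least squares (one common way to do it, not the only one), and the function names, the tuning constant `delta=1.345`, the cutoff `k=3.0`, and the toy data are all hypothetical choices rather than anything from the question.

```python
from statistics import median


def weighted_line(xs, ys, ws):
    """Closed-form weighted least squares for y = a*x + b."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mad_scale(values):
    """Median absolute deviation, rescaled to match a Gaussian std."""
    m = median(values)
    return 1.4826 * median(abs(v - m) for v in values)


def huber_fit(xs, ys, delta=1.345, n_iter=50):
    """Huber regression via iteratively reweighted least squares."""
    ws = [1.0] * len(xs)
    for _ in range(n_iter):
        a, b = weighted_line(xs, ys, ws)
        res = [y - (a * x + b) for x, y in zip(xs, ys)]
        s = mad_scale(res) or 1e-12  # guard against a zero scale
        # Full weight for small residuals, down-weight large ones.
        ws = [1.0 if abs(r) <= delta * s else delta * s / abs(r) for r in res]
    return weighted_line(xs, ys, ws)


def loo_deviations(xs, ys):
    """Absolute error at each point, using a fit on all the other points."""
    devs = []
    for i in range(len(xs)):
        a, b = huber_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        devs.append(abs(ys[i] - (a * xs[i] + b)))
    return devs


def flag_outliers(xs, ys, k=3.0):
    """Flag points whose leave-one-out deviation exceeds k robust scales."""
    devs = loo_deviations(xs, ys)
    scale = 1.4826 * median(devs)  # robust scale of the deviations
    return [d > k * scale for d in devs]


# Toy data (assumed): y = 2x + 1 plus small noise, one gross outlier at index 4.
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, -0.15, 0.1, -0.1]
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]
ys[4] += 15.0

a_hat, b_hat = huber_fit(xs, ys)
devs = loo_deviations(xs, ys)
flags = flag_outliers(xs, ys)
print(a_hat, b_hat, flags)
```

The cutoff works on absolute deviations relative to a robust scale, so no Gaussian assumption is needed to turn deviations into probabilities. Note that Huber weighting limits the pull of outliers in $y$ only; an outlier at an extreme $x$ can still drag the line, which is one reason the leave-one-out step is useful.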
$\endgroup$
2
  • $\begingroup$ The thing is, I don't have access to $x$, so I can't do a robust regression; that is why I am trying to construct $x$. Finding outliers is therefore necessary, otherwise I won't have the right $x$. $\endgroup$ Commented Jul 24, 2024 at 10:08
  • 1
    $\begingroup$ You use the $y$ to reconstruct $x$? I think that missing data is the hardest statistical problem, so I don't have good generic advice for that. My remarks above are still valid. My overall advice would be to collect better data. I think it's always ok to conclude that we can't say anything given the current data. $\endgroup$ Commented Jul 24, 2024 at 10:15
