Influential residual vs. outlier

Question

First, I should state that I have searched on this site for the answer. I either didn't find a question that answered my question or my knowledge level is so low I didn't realize I already read the answer.

I am studying for the AP Statistics Exam. I have to learn linear regression, and one of the topics is residuals. I have a copy of Introduction to Statistics and Data Analysis; on page 253 it states:

Unusual points in a bivariate data set are those that fall away from most of the other points in the scatterplot in either the $x$ direction or the $y$ direction.

An observation is potentially an influential observation if it has an $x$ value that is far away from the rest of the data (separated from the rest of the data in the $x$ direction). To determine if the observation is in fact influential, we assess whether removal of this observation has a large impact on the value of the slope or intercept of the least-squares line.

An observation is an outlier if it has a large residual. Outlier observations fall far away from the least-squares line in the $y$ direction.

Stattrek.com describes four ways a data point can be an outlier:

Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.

It could have an extreme X value compared to other data points.

It could have an extreme Y value compared to other data points.

It could have extreme X and Y values.

It might be distant from the rest of the data, even without extreme X or Y values.

These two sources seem to conflict with each other. Could anyone help clear up my confusion? Also, how does one define "extreme"? AP Statistics uses the rule that a data point outside $(Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR})$ is an outlier. I don't know how to apply that from just a graph of the residuals.

John · Accepted Answer · 2013-01-02 17:58:51Z

The stattrek site seems to have a much better description of outliers and influential points than your textbook, but you've only quoted a short passage, which may be misleading. I don't have that particular book, so I cannot examine it in context. Keep in mind, though, that the textbook passage you quoted says "potentially". It's not exclusive either. With those points in mind, stattrek and your book don't necessarily disagree. But your book does appear to be misleading in the sense that it implies (from this short passage) that the only difference between outliers and influential points is whether they deviate along the x or the y axis. That is incorrect.

The "rule" for outliers varies depending on context. The rule you cite is just a rule of thumb and, yes, not really designed for regression. There are a few ways to use it. It might be easier to visualize if you imagine multiple y-values at each x and examine the residuals. Typical textbook regression examples are too simple to show how that outlier rule might work, and in most real cases it is quite useless. Hopefully, in real life, you collect much more data. If you are actually required to apply the quartile rule for outliers to a regression problem, then you should be given data for which it is appropriate.
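
One way to read "apply the quartile rule to the residuals" is sketched below. This is only an illustration under assumptions: the data are made up, the function name `iqr_outlier_flags` is hypothetical, and NumPy's default percentile interpolation may give slightly different quartiles than the hand method taught for the exam.

```python
import numpy as np

def iqr_outlier_flags(x, y):
    """Fit y = a + b*x by least squares and flag residuals outside the 1.5*IQR fences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    # Quartiles of the residuals (NumPy's default linear interpolation).
    q1, q3 = np.percentile(residuals, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flags = (residuals < lower) | (residuals > upper)
    return residuals, flags

# Made-up data: y is roughly 2*x, except for the point at x = 5.
x = np.arange(1, 11)
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.2, 8.0, -0.1, 0.1, -0.2, 0.0, 0.1])
residuals, flags = iqr_outlier_flags(x, y)
print("x values flagged as residual outliers:", x[flags])
```

With these numbers the rule flags only the point at x = 5. With ten points the fences are very rough, which is part of why the rule is of limited use in small textbook examples.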

Thanks for the answer, it just gets annoying that different books try to state these rules without really saying it honestly depends on the data, as you are saying.
– MaoYiyi, Commented Jan 2, 2013 at 16:23
Actually, I stated it wrong as well... it depends on theory, method, and data... the entire study.
– John, Commented Jan 2, 2013 at 17:58

Placidia · Accepted Answer · 2013-01-02 18:41:24Z

I agree with John. Here are a few more points. An influential observation is (strictly) one that influences the parameter estimates: a small deviation in its Y value gives a big change in the estimated beta parameter(s). In simple regression of one variable on another, influential observations are precisely those whose X value is distant from the mean of the X's. In multiple regression (several independent variables), the situation is more complex. You have to look at the diagonal of the so-called hat matrix $X(X'X)^{-1}X'$, and regression software will give you this. Search for "leverage".
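
As a rough sketch of what "the diagonal of the hat matrix" means in practice, the snippet below computes it directly with NumPy for made-up data. The helper name `leverages` is hypothetical, and regression software would normally report these values for you. For simple regression it also checks the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which is why distance from the mean of the X's is what matters there.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X' for a design matrix X."""
    X = np.asarray(X, dtype=float)
    # Solve (X'X) B = X' instead of forming the inverse explicitly.
    B = np.linalg.solve(X.T @ X, X.T)
    # diag(X @ B) without building the full n-by-n matrix.
    return np.einsum("ij,ji->i", X, B)

# Made-up x values; the last one is far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus x
h = leverages(X)

# Closed form for simple regression with an intercept.
h_closed = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print("leverages:", np.round(h, 3))
print("matches closed form:", np.allclose(h, h_closed))
```

The leverages depend only on the X values, never on Y, and they sum to the number of fitted coefficients (here 2).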

Influence is a function of the design points (the X values), as your textbook states.

Note that influence is power. In a designed experiment, you want influential X values, assuming you can measure the corresponding Y value accurately. You get more bang for the buck that way.

To me, an outlier is basically a mistake: an observation that does not follow the same model as the rest of the data. This may occur because of a data collection error, or because that particular subject was unusual in some way.

I don't much like stattrek's definition of an outlier for several reasons. Regression is not symmetric in Y and X. Y is modelled as a random variable and the X's are assumed to be fixed and known. Weirdness in the Y's is not the same as weirdness in the X's. Influence and outliership mean different things. Influence, in multiple regression, is not detected by looking at residual plots. A good description of outliers and influence for the single variable case should set you up to understand the multiple case as well.

I dislike your textbook even more, for the reasons given by John.

Bottom line, influential outliers are dangerous. They need to be examined closely and dealt with.
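
To make the distinction concrete, here is a small sketch that refits the line with and without one added point, in three made-up scenarios. The helper `refit_without` and all the numbers are assumptions for illustration only; this is the "remove the observation and see what changes" check, not a formal influence statistic.

```python
import numpy as np

def refit_without(x, y, i):
    """Return ((slope, intercept) on all data, (slope, intercept) with point i dropped)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    full = np.polyfit(x, y, 1)  # coefficients ordered [slope, intercept]
    keep = np.arange(len(x)) != i
    reduced = np.polyfit(x[keep], y[keep], 1)
    return tuple(full), tuple(reduced)

# Made-up base data lying roughly on y = x.
base_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
base_y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])

# One extra point is appended in each scenario.
scenarios = {
    "far in x, on the line": (20.0, 20.0),
    "far in x, off the line": (20.0, 5.0),
    "typical x, off the line": (3.5, 10.0),
}

results = {}
for label, (x_new, y_new) in scenarios.items():
    x = np.append(base_x, x_new)
    y = np.append(base_y, y_new)
    full, reduced = refit_without(x, y, len(x) - 1)
    results[label] = (full, reduced)
    print(f"{label}: slope {reduced[0]:.3f} -> {full[0]:.3f}, "
          f"intercept {reduced[1]:.3f} -> {full[1]:.3f}")
```

With these numbers, only the second point moves the slope much. The first has high leverage but agrees with the rest of the data, and the third has a large residual but, sitting at the mean of x, shifts only the intercept.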


Your dislike of the stattrek regression explanation is appropriate if you come from a background where true experiments are the norm. Your reasons all apply there. But if you come from a background where quasi-experimental designs are more common then the stattrek site has more relevance. In those cases both x and y values are often just random samples.
– John, Commented Jan 3, 2013 at 5:52
@John how about the background of wanting to pass the AP Statistics Exam? What is quasi-experimental design? Is that using a random number table for a simulation?
– MaoYiyi, Commented Jan 3, 2013 at 7:41
I don't know anything about the AP Statistics exam. True experiments are ones where you manipulate the predictor variable and make groups to test multiple hypotheses, or control and experimental groups, etc. Quasi-experimental designs are pretty much anything else that looks like an experiment. So, imagine a regression where the x value is weight and the y value is some sport skill. You don't manipulate either variable; you randomly sample both. So, Placidia's criticisms of stattrek are quite valid for true experiments but not as much so for quasi-experiments.
– John, Commented Jan 3, 2013 at 12:29
@John ... I do come from a background where designed experiments are seen as the gold standard. In practice, I know that X and Y are often both random samples, which begs the question of why regression is being used, and not some form of latent variable analysis.
– Placidia, Commented Jan 3, 2013 at 14:17
When you've only got two variables... :) Sometimes you have good theory to suggest one thing predicts another, for example, height and probability of getting into the NBA... both random samples. In cases with one, or a few (especially uncorrelated), linear relationships, regression is good.
– John, Commented Jan 3, 2013 at 14:49
