Questions tagged [outliers]
An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.
1,383 questions
1
vote
0
answers
149
views
Standardized Euclidean Distance over variables distributed as $\chi^2$
I sample $n$-dimensional vectors (each sample is a vector). My objective is to detect outliers. If those elements were normally distributed, I could use Standardized ...
3
votes
0
answers
874
views
Measuring unusual death [closed]
Given the Prussian Horse Data here:
https://www.randomservices.org/random/data/HorseKicks.html
Is there a way to find out which corps has an unusually high number of deaths?
(Note that Prussian horse ...
3
votes
2
answers
555
views
Is it possible for a dataset to be 9% outliers?
I have a dataset about solar panels' output power. After visually inspecting the data distribution, I found that it is not normally distributed but right-skewed, with many zeroes. I used ...
0
votes
0
answers
57
views
Is it appropriate to fit a linear model to my data?
I have a bunch of outcome/exposure relationships I am trying to fit models to:
From these graphs, I am not sure if a simple lm is appropriate. Some of them look ...
3
votes
4
answers
954
views
Why don't we automatically have outliers when mean and median differ strongly?
Assume you have a data set with information on income of all students in the lecture. The mean value is 1500\$. The median value is however only 800\$. Which of the following conclusions is wrong?
The ...
1
vote
1
answer
152
views
Outlier management
I apologize in advance for my novice question. I am a part of an interview committee of eight people. We interview 70 applicants for just six positions. All of the applicants are very accomplished. We ...
4
votes
3
answers
2k
views
Which are outliers?
I am in the process of solving a Machine Learning challenge, and I want to do it the right way.
I did some exploratory data analysis and I wanted to check the distribution of the data.
As displayed in ...
0
votes
0
answers
227
views
How to clean dataset in order to fit to a curve? [duplicate]
I've been trying to fit a dataset to a curve for a while, without success.
The goal is to obtain a curve with an equation that fits the data, so that I can get the parameter x for any value of y.
The blue dataset ...
0
votes
0
answers
487
views
Detecting and dealing with outliers in a sales prediction dataset of "Rossmann"
I have been working on a dataset for which the task is to forecast the sales of the drug sold by 1115 drug stores of the Rossmann chain. The dataset is fairly large with over 1m records and as many as ...
0
votes
1
answer
178
views
Standard deviation estimator without outliers
I have samples that are distributed like this:
I want to calculate the standard deviation (or similar) of the main peak without the outliers. Of course I can do this just applying a cut at, say, -5µ. ...
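One common robust alternative to a hand-picked cut is the scaled median absolute deviation (MAD). The sketch below uses made-up numbers and a hypothetical helper `scaled_mad`; the factor 1.4826 makes the MAD comparable to a standard deviation only under the assumption that the main peak is roughly normal, so it is an illustration rather than a recommendation for the asker's data.

```python
import statistics

def scaled_mad(values):
    """Median absolute deviation, scaled to match the SD of normal data."""
    center = statistics.median(values)
    abs_dev = [abs(v - center) for v in values]
    return 1.4826 * statistics.median(abs_dev)

# Illustrative data: a tight main peak near 10 plus one far-away value.
data = [9.8, 9.9, 10.0, 10.1, 10.2, 50.0]

print("sample SD :", statistics.stdev(data))   # inflated by the value 50.0
print("scaled MAD:", scaled_mad(data))         # reflects the spread of the main peak
```

The ordinary standard deviation is dominated by the single distant value, while the scaled MAD stays close to the spread of the five clustered points.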
0
votes
1
answer
339
views
Why is 50% the best breakdown point for an estimator?
As stated in Wikipedia:
Intuitively, we can understand that a breakdown point cannot exceed 50% because if more than half of the observations are contaminated, it is not possible to distinguish ...
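A minimal sketch of the intuition in the quoted passage, using made-up numbers. The hypothetical helper `contaminate` overwrites the `k` largest values of a ten-point sample with an arbitrary huge value. The median stays with the clean data while fewer than half of the points are replaced, whereas the mean moves after a single replacement.

```python
import statistics

def contaminate(values, k, bad=1e6):
    """Return a sorted copy with the k largest values replaced by `bad`."""
    xs = sorted(values)
    return xs[: len(xs) - k] + [bad] * k

clean = list(range(1, 11))  # 1, 2, ..., 10

for k in (0, 1, 4, 5, 6):
    sample = contaminate(clean, k)
    # Print the number of contaminated points, then the median and mean.
    print(k, statistics.median(sample), statistics.mean(sample))
```

With four of ten points replaced the median is still 5.5; with five replaced it jumps to 500002.5, and with six it equals the contaminating value.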
0
votes
0
answers
1k
views
Winsorizing or taking the logarithm first?
I am testing whether I can describe the StockPRice with EPS (= earnings per share), BookValuePS and ESGscore.
Before I start I winsorized all my variables. Now I want to take the logarithm of e.g. BookValuePS ...
9
votes
10
answers
6k
views
Why is the Median Less Sensitive to Extreme Values Compared to the Mean?
I am sure we have all heard the following argument stated in some way or the other:
For a given set of measurements (e.g. heights of students), the mean of these measurements is more "prone"...
2
votes
1
answer
524
views
Comparing outliers in two distributions
I apologize in advance as I am not well-versed in statistics, but I hope that this question makes sense.
I have 2 populations which are normally distributed and have a near-identical mean. I would ...
1
vote
1
answer
169
views
General Question: Should Legitimate Outliers in the Data be Included or Excluded from Statistical Models? [duplicate]
I have the following (general) question (I know there is no definite answer to this question and it largely depends on the specific data and choice of model): Should Legitimate Outliers in the Data be ...
0
votes
1
answer
650
views
How to detect outliers in skewed data?
I have a dataset I need to use to predict the probability of conversion based on the number of days an individual has spent using my app. I got a list of historical users and the number of session ...
0
votes
1
answer
3k
views
Outliers Logistic Regression
I want to know how to find and remove outliers from my Logistic Regression.
I have tried using a formula from Faraway, but I don't know whether it is applicable to logistic regression.
For example my ...
0
votes
2
answers
4k
views
Running ANOVA - must I remove outliers?
Some people seem to frown on removing outliers. But I've also read many times elsewhere that ANOVAs are sensitive to outliers and you must remove them.
I'm running a 2 x 2 repeated measures within ...
1
vote
2
answers
182
views
Should I remove this outlier?
I am running a multiple variable regression predicting GDP per capita for U.S. states with a bunch of independent variables. Currently I have included the District of Columbia in the data set which ...
-1
votes
1
answer
267
views
Independent Sample T-test or Mann-Whitney U test?
I am a very young stats learner, and I need help understanding the justification of a test choice. I have a sample of 39 participants (20 females and 19 males) who have been measured on task performance, and I ...
2
votes
1
answer
3k
views
Ridge regression for multicollinearity and outliers
I'm wondering about techniques like ridge regression with regard to both multicollinearity and outliers.
My understanding is that ridge regression is primarily used for multicollinearity, but that ...
4
votes
2
answers
3k
views
Does classic outlier detection assume normality?
My classmate told me he was presenting some statistics-based work, and at one point he showed a boxplot and used it for outlier detection. His professor then said 'it's not even correct, the ...
0
votes
0
answers
223
views
Using the IQR method to filter outliers in experimental research, by group or as a whole?
In my current dataset (results from a factorial ANOVA), I know I have outliers (due to qualitative comments participants wrote during an online experiment), thus I'd like to do a filtering process to ...
1
vote
0
answers
191
views
Legitimacy of transforming data before statistical tests
I have two groups of samples (N=4 for each) and found that there is one outlier in each group (both are higher than the rest of the respective samples within the groups).
I have no resources to repeat ...
0
votes
1
answer
192
views
Setting the observation likelihood threshold for outlier detection if you know the percentage of outliers
Let's assume I have a sensor that gives me measurements $z$ and I know that $50\%$ of the measurements I read are outliers (more than 3 standard deviations away from the real measurement distribution)....
0
votes
0
answers
26
views
PCA: does outlier detection make sense with low linear correlation? [duplicate]
I am experimenting with PCA to detect outliers based on the reconstruction error.
What I do: I start with a 6-dimensional dataset and reduce it to 5 dimensions. Then, I reconstruct the initial dataset and ...
1
vote
0
answers
48
views
Outlier Detection in Meta-Analysis Models for Observational Studies of Adverse Drug Outcomes using Distributed Networks
I hope you are in good health. My thesis is on outlier detection in meta-analysis models. I will be using a case study from Canadian Network for Observational Drug Effects Studies (CNODES) to detect ...
0
votes
0
answers
281
views
How to optimize K-means to eliminate outliers and unrelated clusters?
I clustered document embeddings with K-Means. The embeddings have 2048 dimensions. Now, I am trying to optimize the clustering. There are two problems. 1- Some clusters may have outlier samples. 2- Sometimes,...
1
vote
1
answer
318
views
Removing outliers at the start when there are multiple ANOVA and correlational analyses in a single results section [duplicate]
I would be grateful for opinion on which of the two options below (or an alternative) is best:
Summary of study: In a single results section, different ANOVAs are run on the different metrics – raw ...
1
vote
0
answers
127
views
Outliers for N=4
If you have 4 observations, can you have an outlier? Consider any value outside (lower fourth - 1.5 × fourth spread, upper fourth + 1.5 × fourth spread) as an outlier.
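The rule stated in the question can be sketched as follows. The helper names `fourths` and `fence_outliers` are hypothetical, the data are made up, and the fourths are computed as Tukey's hinges (depth of the fourth = (floor(depth of the median) + 1) / 2), which is an assumption about the definition the asker intends.

```python
import math

def fourths(values):
    """Lower and upper fourths (Tukey's hinges) of a sample."""
    xs = sorted(values)
    n = len(xs)
    median_depth = (n + 1) / 2
    fourth_depth = (math.floor(median_depth) + 1) / 2
    # Convert the (possibly half-integer) depth to 0-based indices.
    i, j = math.floor(fourth_depth) - 1, math.ceil(fourth_depth) - 1
    lower = (xs[i] + xs[j]) / 2
    upper = (xs[n - 1 - i] + xs[n - 1 - j]) / 2
    return lower, upper

def fence_outliers(values, k=1.5):
    """Values outside (lower - k * spread, upper + k * spread)."""
    lower, upper = fourths(values)
    spread = upper - lower
    lo_fence, hi_fence = lower - k * spread, upper + k * spread
    return [v for v in values if v < lo_fence or v > hi_fence]

print(fourths([1, 2, 3, 100]))         # fourths of a four-point sample
print(fence_outliers([1, 2, 3, 100]))  # values flagged by the fences
```

For the sample `[1, 2, 3, 100]` the fourths are 1.5 and 51.5, so the fences are -73.5 and 126.5 and nothing is flagged, even though 100 is far from the other three values.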
0
votes
1
answer
286
views
Heteroskedastic time series outlier analysis using machine learning
Is anyone aware of machine learning models that are able to deal with heteroskedasticity in time series, when trying to detect outliers? There are a lot of anomaly detection tools out there (like k-...
1
vote
1
answer
412
views
Detecting outliers in weight measurement
I have weights data of users collected over a period of time. My goal is to find incorrect weight readings. The definition of incorrect readings is purely based on logical reasons (or in other words ...
3
votes
1
answer
3k
views
What to do with outliers? Should you use capping, remove outliers, or use non-parametric tests?
This will be my first question on Cross Validated, and besides, no one has ever taught me statistics. I am completely self-taught in this regard. So please forgive me if my question seems trivial.
I ...
1
vote
1
answer
335
views
Removing instances that decrease accuracy of Machine learning algorithm Methodology
Is it bad practice to run a Machine learning algorithm on an experimental dataset, check the MAE, and remove the instances that have a value of MAE above a certain limit? If we run the algorithm ...
2
votes
2
answers
174
views
Does KNN fail if the test data have no epsilon close nearest neighbors to the training data?
If I have binary-classification data and a Euclidean metric, and I know the best number of nearest neighbors, then I draw circles on my training data based on my K-value which tell me which regions ...
2
votes
2
answers
133
views
How to test assumptions for a large number of statistical tests?
I am running a logistic regression. The outcome is a clinical variable, and there are two predictors: gene expression (continuous), hormone levels (continuous), and the interaction term between them.
...
5
votes
2
answers
342
views
Applications of "Regression Towards the Mean" in Real Life
I was reading about "regression towards the mean". Over here, an explanation of this concept is provided:
"Consider a simple example: a class of students takes a 100-item true/false ...
2
votes
0
answers
80
views
Define outliers in correlation with right-skewed data (log-log plot)
I have a dataset of counts of occurrences of variables in different classes. For each class, I have an equivalent control created by shuffling the dataset.
For instance, this could be words from ...
7
votes
1
answer
7k
views
Tukey's fences for outlier removal
I'm in a biomedical research field, and I see a lot of researchers conducting low-N studies that use Tukey's fences for outlier removal. For anyone who doesn't know, Tukey's fences work as follows:
...
0
votes
0
answers
981
views
Is it valid to remove trials as outliers using IQR?
I have a repeated measures experiment where all participants completed several trials for each condition. My dependent variables are response time and accuracy. I am using the Interquartile Range as ...
0
votes
0
answers
52
views
Remove outliers from mostly linear data
I have a cumulative sum of battery charges that is mostly extremely linear, apart from some faulty data in the beginning. See this image as an example:
In order to get the most accurate linear ...
10
votes
2
answers
2k
views
Why is the maximum likelihood estimator susceptible to outliers?
I'm new to statistics and currently learning about MLE.
Some of the papers I read, such as Robust Graph Embedding with Noisy Link Weights, mentioned that MLEs are susceptible to contamination in the data, but didn't ...
2
votes
1
answer
336
views
Method for outlier detection in noisy seasonal time series data?
I have around 1000 time series of around 1000 samples each, where the samples are 5 minutes apart.
An example of a time series after performing seasonal decomposition is
As we can see the data is very ...
0
votes
1
answer
176
views
How to measure if a point of data is a deviation from other data points?
I have a data set that consists of many single data points. They are measurements of network traffic, so they include e.g. '1403021', '1402341', '1399312'... values that are labeled as 'label1' and ...
1
vote
2
answers
693
views
AUC measure for Local outlier detection in Python?
I'm using the Local Outlier Factor algorithm provided by scikit-learn for outlier detection. For the evaluation I want to use the AUC measure.
...
4
votes
1
answer
425
views
In anomaly detection of time series, should global outliers and contextual outliers be separated?
I am trying to create a pipeline in Python which automatically identifies global and contextual anomalies of a time series.
Which one of these approaches do you believe is more correct?
Method 1)
...
0
votes
0
answers
503
views
Many Outlier Handling in Logistic Regression
I am working on telecom data for churn modelling. I have 18 categorical and 2 numeric variables (total charges and monthly charges) in my data set. After handling the missing values, I checked the ...
0
votes
0
answers
80
views
Relative anomalies in multiple multivariate times series with different lengths?
I have a set of highly correlated time series (similar peaks and trend). I want to find relative anomalies, e.g. say there are 20 time series. At a snapshot date, all values increase, but one ...
1
vote
2
answers
81
views
When is a realization from a bivariate distribution surprising?
I really need a hint here.
Suppose I want to be able to detect unusual events and express the likelihood of it occurring. Suppose that I know that two events usually move in a given association such ...
0
votes
2
answers
2k
views
Is it wrong to remove outliers from dependent variable when adjusting a model?
I'm beginning to study Generalized Linear Models and I was trying to adjust a model to the dataset NMES1988. More specifically, my goal is to adjust a Poisson Regression to this dataset considering ...