Questions tagged [outliers]
An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.
1,383 questions
4
votes
2
answers
2k
views
Is calculating skewness necessary before using the z-score to find outliers?
For example, if I specify a z-value of 3, then I would look at both sides and know its position in the distribution (99.73%).
Would this change if I have a left or right skewed distribution? Would I ...
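The 99.73% figure in this excerpt can be checked with a short sketch. It computes the two-sided coverage of a ±3 standard deviation band for a normal distribution and, as a contrast, for a right-skewed one. The Exponential(1) distribution is an assumption chosen only for illustration and does not come from the question; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def normal_coverage(z: float) -> float:
    """Probability that a normal value lies within z standard deviations of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(z) - std_normal.cdf(-z)


def exponential_cdf(x: float) -> float:
    """CDF of an Exponential(rate=1) variable, which has mean 1 and sd 1."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def exponential_coverage(z: float) -> float:
    """Probability that an Exponential(1) value lies within z sd of its mean."""
    mean, sd = 1.0, 1.0
    return exponential_cdf(mean + z * sd) - exponential_cdf(mean - z * sd)


print(f"normal, |z| <= 3:      {normal_coverage(3.0):.4%}")
print(f"exponential, |z| <= 3: {exponential_coverage(3.0):.4%}")
```

Under these assumptions the same cutoff of 3 leaves a noticeably larger share of the skewed distribution outside the band, and all of it sits in the right tail, so the 99.73% interpretation is tied to normality.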
1
vote
1
answer
254
views
Can I trust the results of a t-test on 4-point Likert scale data which hides outliers?
I want to use an unpaired two-sample t-test of random samples of $n=40$ each. The sample data is from 4-point Likert scale assessments. I understand the t-test is not very robust to outliers, which I ...
3
votes
1
answer
453
views
Why should I split the data when searching for outliers? (pyod)
I am using pyod to detect outliers in data, and I came across this official example: https://github.com/yzhao062/pyod/blob/master/examples/comb_example.py
I have a question regarding the need to split ...
0
votes
0
answers
41
views
How can I detect univariate outliers multivariately?
Chose hopefully a catchy title :-) I am looking for a simple algorithm to detect outliers caused by measurement errors.
So assume I am given a multivariate sample (30 dimensions) and I want to detect ...
0
votes
0
answers
171
views
Why do residuals cluster in two groups?
I am running a logistic regression in a sample with ~150,000 observations. I am predicting three different outcomes, x, y, and z, that occur in ~10,000, ~4,000, and ~2,000 cases respectively (for each ...
0
votes
1
answer
112
views
How to identify outliers and drop rows in train splits of all folds, when using StratifiedKFold in GridSearchCV?
For predicting whether a subject has liver disease or not, I'm using StratifiedKFold CV in GridSearch for AdaBoost and RandomForest classifiers.
For outlier analysis, I've identified all feature ...
0
votes
0
answers
125
views
What did Grubbs mean when he "cautioned against interpreting probabilities too literally when normality of the data is not assured"?
In his 1969 paper, Grubbs mentioned that "Until such time as criteria not sensitive to the normality assumption are developed, the experimenter is cautioned against interpreting probabilities too ...
3
votes
1
answer
249
views
Modeling outliers in maximum likelihood estimation with gradient descent
Consider a set of 3D points $X = \{x_1, x_2, ...x_n\} $ with $ x_i\in\mathbb{R}^3$ on which we want to fit an arbitrary probability distribution. The distribution we want to fit models some ...
1
vote
0
answers
44
views
Shuffling data changes OPTICS outlier results
I am trying to use sklearn.cluster.OPTICS to identify outliers, but found an issue:
I use 2 examples with exactly the same data but different orders. They give different results:
1st example
/////////...
0
votes
0
answers
62
views
What is it called when an outlier falls out of a rolling window statistical calculation?
I have a time series $X_t \sim N(0, 1)$. There is a single outlier at index 347, at 8.5 standard deviations from the mean. If I now compute a rolling window standard deviation of $X_t$ with window ...
2
votes
1
answer
235
views
Adjust the "Threshold" in a robust regression
I am trying to perform a robust regression using the lmrob function in R.
I am getting this error message:
...
0
votes
1
answer
135
views
Do outliers begin from or above the whisker-limit? [duplicate]
Do outliers begin on the whisker limit or above it?
In the (Python) example below the calculated upper whisker limit is 64.8125. Is a value of ...
4
votes
1
answer
837
views
Should I be concerned about outliers in NB GLMM with an offset term?
I'm working on a negative binomial model for count data. Unfortunately I can't provide a more detailed description because I wasn't explicitly allowed to. All I can say now is that the data is about ...
1
vote
1
answer
184
views
Suggestions on dealing with outliers when sample size is very small AND you must order the results
I run competitive events. In our normal event, we have 8 adjudicators split between two categories: Skill and Artistry.
For each category we throw out the high and low scores and average the remaining ...
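The scoring rule described in this excerpt (drop the highest and lowest score, average the rest) can be sketched as follows. The function name and the four example scores are hypothetical and are not taken from the question.

```python
def trimmed_score(scores: list[float]) -> float:
    """Drop one highest and one lowest score, then average what remains."""
    if len(scores) < 3:
        raise ValueError("need at least 3 scores to drop the high and the low")
    ordered = sorted(scores)
    kept = ordered[1:-1]  # discard the single lowest and single highest value
    return sum(kept) / len(kept)


# Hypothetical scores from the four adjudicators of one category.
skill_scores = [7.5, 8.0, 9.5, 8.5]
print(trimmed_score(skill_scores))  # 8.25
```

With only four adjudicators per category, this rule averages just two scores, which is worth keeping in mind when results must be ranked.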
0
votes
1
answer
567
views
Can high standard deviations explain my non-significant & low effect size results? (please read description)
I'm trying to analyse bullying experiences across three age groups. The DV is scored on a 5-point Likert, and the IV is categorical (ages 11, 13, and 15).
Initially I ran an ANOVA to see if there was ...
1
vote
0
answers
129
views
Standardization of out-of-sample data
I have a panel (N firms across 10 years) dataset on which I want to estimate and test a prediction model $f$:
\begin{equation}
y = f(x).
\end{equation}
Following common practice, I split my data into ...
0
votes
1
answer
276
views
Outlier Detection using OutlierTest
I found an outlier using the outlierTest function in the car package. However, I can see from the results that the Externally Studentized Residual and p-values. This is a result for the full model.
<...
1
vote
0
answers
75
views
How to find anomalies for a non-normal distribution with seasonality?
I have a time series broken down by day, and there are gaps in it that I have marked in red:
The distribution there is not normal.
How do we approach modeling a system that will look for anomalies ...
0
votes
1
answer
1k
views
How to deal with Covid outlier in time series/machine learning forecasting?
Disclaimer: I checked some similar questions but I could not find anything in particular that would work for my case.
I am dealing with a time series going from 2015 to 2023. The data points are the ...
1
vote
0
answers
41
views
How to deal with outliers after heterogeneity test in microarray expression datasets?
I have performed a meta-analysis using five microarray datasets. After performing the meta-analysis I visualized the heterogeneity using a funnel plot and a forest plot (using two up-regulated and two down-...
1
vote
1
answer
595
views
Robustification in lavaan: Difference between M, MV and MVS?
In lavaan, I am running a two-factor CFA on a questionnaire with 28 items, all of which are scored on a 6-point Likert scale. In total I have ~350 participants who completed the questionnaire.
Because ...
2
votes
1
answer
597
views
Replacing outliers with the median value of the preceding 5 observations
In the paper Implications of dynamic factor models for VAR analysis the authors propose a technique for removing outliers in variables used for dynamic factor analysis:
"The outlier ...
2
votes
1
answer
322
views
R Tukey ANOVA: Can non-overlapping boxplots share the same letter of significance in ANOVA / Tukey test?
I conducted a one-way ANOVA followed by a Tukey test in RStudio and used a compact letter display to add letters of significance to a ggplot.
After a positive Grubbs-outlier-test I removed an outlier ...
0
votes
0
answers
341
views
Outliers for right heavy-tailed distributions
There is plenty of information on how to detect outliers in a sample when assuming that this sample was derived from a normal distribution.
Sometimes it seems to me as if when we talk about outliers ...
0
votes
1
answer
71
views
Evaluate outliers of strictly non-decreasing sequences
Say I have the following sequence:
Is there a way to get a probability for each point indicating whether or not it is an outlier of the underlying strictly non-decreasing sequence?
I suppose the ...
1
vote
0
answers
53
views
Which metric for neural network should I try for time series data with sudden peaks?
I am doing time series forecasting with a neural network (feedforward for now, but I will also test RNNs) and my problem is that, even though the network learned general patterns, it doesn't forecast ...
0
votes
0
answers
43
views
How to impute additive outliers in time series data
I need to forecast daily electricity demand. It seems that the outliers in my dataset are additive as they are affected by an anomalous behavior and are not induced by a random process that also ...
5
votes
1
answer
368
views
Boxplot | 5-Number-Summary
I have a question regarding the boxplot. On some web pages, the Minimum and the Maximum of the 5-Number-Summary correspond to the whiskers. However, regarding this definition, my question is:
how is ...
3
votes
0
answers
43
views
How to identify individuals that don't belong to a training class?
The frequency of 8 cell types is measured in 100 patients (the frequencies do not sum up to 1). The patients form 4 pathologies established by the physicians. As there might be better markers (cell ...
2
votes
2
answers
2k
views
Should I treat these data points as outliers?
Currently, I am building my analytics portfolio as part of the Google Data Analytics course. I chose the option to analyze Divvy Bike Sharing data for the year 2021. But now I'm currently stuck in the ...
5
votes
1
answer
4k
views
How to define the line to fit in Q-Q plot?
I'm trying to figure out if my data follows a normal distribution and if it contains outliers. I have plotted the histogram and now I would like to plot the quantile-quantile (Q-Q) plot. My point is, ...
0
votes
1
answer
196
views
What is a suitable technique for detecting anomalies in time series data?
I have a problem where I try to identify if a machine performs an activity when it is not supposed to, or performs it an unusual number of times.
I am attempting to do this using an anomaly detection ...
1
vote
0
answers
96
views
F1 Score vs PR Curve
If I understood correctly, the PR curve is just the mean of the F1 score computed multiple times with different thresholds.
In the task of outlier detection those are two suggested metrics given the fact ...
1
vote
2
answers
988
views
Heavy vs light tail distributions when modelling with outliers
I am reading these lecture notes on using the MLEs from other distributions (such as the Laplace) rather than a Gaussian when dealing with outliers. The lecture notes come from Oxford University: https://www.cs....
1
vote
1
answer
314
views
Do I remove outliers within training set or duplicate of original?
I want to predict on a test set.
I have created a binary logistic regression using my current training set and have predicted on the test set. The dataset I used to split has 299 observations.
What if ...
0
votes
0
answers
67
views
Cluster a set of files by the number of points
I have a large set of aerial images with herds of elephants in it. The number of elephants in a single image can range from ~ 20 elephants to 1.
I have created a dataset of ~ 2,000 png image files ...
3
votes
2
answers
569
views
Do I want to overfit, when doing outlier detection based on regression?
Imagine we have speed data for a car and we would like to detect whether the car speeds up or slows down more than it should.
Do I want to just overfit my model, so the outlier (higher or lower speed) would lead me ...
0
votes
0
answers
766
views
Can I use normalization and standardization on the same dataset?
I'm working on an ML project to predict wine quality from a wine's physical characteristics. The features of my data are on vastly different scales so I've been experimenting with different ...
1
vote
0
answers
125
views
Inverse-variance weighting not related to meta-analysis?
I've been reading about inverse-variance weighting and every reference I find to it is related to meta-analysis. However, I wonder if inverse-variance weighting can be used to reduce the bias produced ...
2
votes
3
answers
359
views
Should variables be dropped according to their skewness?
I am creating a classification model to predict the credit score of a person based on lots of factors. I got the dataset from kaggle. When I started doing the EDA part, I noticed that the skewness ...
4
votes
1
answer
1k
views
What's the best method to identify outliers and influential cases for linear mixed models?
I've seen many many many different questions on how to extract Leverage and Cook's distance for Lmers. I'm able to do that with different packages and functions by now, but how should I interpret them ...
1
vote
1
answer
2k
views
DHARMa outlier test is significant, what are my next steps?
I'm looking for information and guidance to help me understand the outlier test in DHARMa for negative binomial regression in R. Here is the diagnostic plot from DHARMa using the function ...
1
vote
0
answers
178
views
Outliers in a PCA score plot [closed]
I have this dataset of 104 tissue samples from two different types of tumors (B and C) along with 182 observations (gene expression profile). I do not need to understand the underlying biological ...
3
votes
1
answer
989
views
Outliers and possible dispersion in neg. binomial glmm residuals (DHARMa package)
I need help fixing the model I landed on through backwards step-wise elimination. I chose a negative binomial model because my variance seems much larger than the mean, with random intercepts from the ...
0
votes
0
answers
53
views
Scaling outliers in a dataset and reverse scaling
I have a data set with lots of small integer values and occasional large integers. For instance 1,1,1,3,2,1,320,2,3,4. I would like to scale my outlier values such that I can perform regression on my ...
4
votes
2
answers
2k
views
Should outliers be removed for goodness-of-fit tests?
If you allow a bit of digression about the context: I am on a journey to better understand the power and usefulness of parametric distributions; I am a bit scared of them. Maybe due to the fact that I've ...
0
votes
1
answer
3k
views
Interquartile range finding more than 10 times as many outliers as z-score
I'm learning about outlier detection and I wrote these two methods to get the row indexes of the instances that have outliers so I can drop them later. The problem is I'm getting two numbers very far ...
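As a rough illustration of why the two rules can disagree, the sketch below applies a 1.5 × IQR fence and a |z| > 3 cutoff to the same small right-skewed sample. The data, thresholds, and function names are assumptions made for this example; they are not the asker's methods or data.

```python
from statistics import mean, pstdev, quantiles


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey-style fences)."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' quartile method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[float]:
    """Values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical right-skewed sample: many small values and a few large ones.
sample = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 12, 15, 40]

print(iqr_outliers(sample))     # [12, 15, 40]
print(zscore_outliers(sample))  # [40]
```

In this sample the largest value inflates the standard deviation, which shrinks every z-score, while the quartiles barely move. That is one plausible reason the IQR rule flags many more points on skewed data.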
0
votes
1
answer
117
views
Detecting outliers in financial time series taking into account related time series
I would like to seek advice on how to build an efficient approach to identify outliers in a financial series, also taking into account related series.
For example, let's assume there is a very ...
3
votes
3
answers
254
views
Problem with a single outlier, non-normal data, and unequal sample distributions
I want to compare two independent groups on a Likert-like item. To explain, the dependent variable is structured so that a 1 = <1 units, 2 = 1-<2, 3 = 2-<3, all the way up to option 7 = ...
0
votes
0
answers
106
views
Help needed for outlier detection after a paired t-test
I don't know if this is a standard way of doing things, so I am open to any suggestions. Basically, I have done random sampling from my population to create 2 groups, Treatment & Control. I also have few ...