Questions tagged [outliers]
An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.
1,383 questions
0
votes
0
answers
23
views
Detection of Multivariate Outliers (in a multiple linear regression problem) [duplicate]
In a multiple regression problem, suppose we have responses $Y_1, Y_2, \cdots , Y_n$ corresponding to data $\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_n$ where each $\mathbf{X}_i$ is a $d$-...
0
votes
0
answers
70
views
Should you use mean difference between measurements or min-max difference to detect outliers?
I have a dataset which has temperature measurements for every minute in a certain time period.
I want to focus on 10 minute intervals and determine whether two adjacent 10 minute intervals differ ...
1
vote
0
answers
45
views
A theorem or result relating the probability of occurrence of outliers to population size
I was wondering if there is a theorem or a result that relates the size of the population to the probability of the occurrence of outliers of various degrees, relating the z-score to the size of the ...
0
votes
0
answers
34
views
Identify outliers of river levels that change continuously over time
This is a time-dependent measure of the water level of a river measured by an instrument that measures the water level every five minutes. However, due to some interference and other factors, there ...
3
votes
1
answer
380
views
Detecting multivariate outliers with Minimum Covariance Determinant and Mahalanobis distance
I've read in some papers (such as this) and CrossValidated questions (such as this) that people are using Mahalanobis distance based on robust estimates of location and scatter using minimum ...
0
votes
0
answers
125
views
The literature on the impact of outliers on ordinary least squares (OLS) regression
I remember encountering a paper from the 1960s or 1970s that explores the impact of outliers on ordinary least squares (OLS) regression. The paper shows that adding just one outlier will ...
2
votes
0
answers
407
views
How to detect low and high flow outliers with seasonal time series data in R?
I have a dataset recording daily river flow from 1976 to 2017. I want to find unusually high (potential flood) or low (potential drought) flow values in that dataset. What's the best way to ...
0
votes
0
answers
241
views
What is the right order in dealing with outliers, missing values and log transformation?
I am currently working on a project involving banking stock price data. I have around 3000 observations, some columns have a lot of missing values (null value); they can account for 5 to 50% of the ...
0
votes
0
answers
337
views
Dealing with outliers for Multimodal distribution
Say the distribution of underlying data points is multi-modal and we have an extremely large data point that has been confirmed to be an outlier. If it is not acceptable to simply remove the outlier ...
0
votes
0
answers
32
views
How do we account best for outliers with applied statistics? [duplicate]
Suppose we have a set of data on how long one watches YouTube, where the data points only include the raw number of minutes watched. If it is known that some of those data points include situations where you ...
2
votes
2
answers
231
views
Algorithm for detecting collective outliers
What algorithm should I go for if I want to determine collective outliers within a dataset?
By collective outliers, I mean a series of data points that differ significantly from the trends in the rest of ...
3
votes
2
answers
478
views
Fixing outliers and normalizing a vector using R
I am trying to do factor analysis on a few variables and one particular variable (given in the example below) is covering/ explaining all the variance due to some outliers. I am not sure what else I ...
3
votes
1
answer
634
views
Why am I getting strange upper & lower limits on a gamma distribution?
I am working on a time series dataset. I understand it has a gamma distribution. I want to use a 99% probability threshold to establish upper & lower limits/cut-offs and find anomalies. However, I ...
3
votes
2
answers
213
views
Standardize dataset with high outliers
Is there a better way to standardize a dataset with outliers than the normalized value (z-score) based on the mean and standard deviation? I am using the Excel STANDARDIZE function.
I have two datasets ...
-1
votes
0
answers
15
views
Centroid and Outlier calculation [duplicate]
I have this question, but to be honest I am stuck.
1. Considering a set of 60 users, and a maximum number of objects that a user can own equal to 4000, which approach would you choose to calculate the ...
7
votes
1
answer
1k
views
Outlier/anomaly detection on histograms
So, the idea is that I have many histograms, each one representing results for something. So, I have histogram_1 for object_1, histogram_2 for object_2,...,histogram_20 for object_20. How can throw ...
5
votes
2
answers
6k
views
MAE vs MSE for Linear regression
Several articles say that MAE is robust to outliers but MSE is not, and that MSE can hamper the model if errors are too large. My question is: MSE and MAE are both error metrics, and our priority is to ...
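The sensitivity claim in this excerpt can be illustrated with a small sketch. The error values below are made up for illustration and do not come from the question. It shows only that a single large error inflates MSE far more than MAE, because MSE squares each error while MAE takes its absolute value.

```python
def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean squared error of a list of residuals."""
    return sum(e * e for e in errors) / len(errors)

# Hypothetical residuals: identical except for the last one.
clean = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 10.0]

# How much each metric grows when one residual becomes an outlier.
mae_ratio = mae(with_outlier) / mae(clean)
mse_ratio = mse(with_outlier) / mse(clean)

print(f"MAE grows by a factor of {mae_ratio:.1f}")
print(f"MSE grows by a factor of {mse_ratio:.1f}")
```

Under these assumed residuals, the MSE ratio is much larger than the MAE ratio, which is the usual sense in which MAE is called less sensitive to outliers.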
2
votes
1
answer
1k
views
How to use box plots to detect outliers?
Suppose for simplicity that we have Gaussian distributed data with some outliers, whose typical characteristic is getting values that are far from the mean. Suppose my sample size is ...
1
vote
0
answers
97
views
Understanding an outlier detection technique for fraud detection
I came across this article:
http://projetoaprendizagemgrupo4.pbworks.com/f/03.03%20-%20Unsupervised%20Profiling%20Methods%20Fraud%20Detection.pdf
since I am interested in detecting abnormal behavior (...
3
votes
1
answer
329
views
Is applying dimension reduction to mixed type data valid for outlier detection after that?
I'm facing an anomaly detection (outlier detection) task with a mixed (numerical and categorical) multi-feature data set. I understand that many of the possible multivariate outlier detection methods ...
2
votes
2
answers
5k
views
Does IQR method for outliers work for non-normal data?
Any observations that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers. However, does this rule still hold when a data set is not normally distributed?
Outlier ...
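The 1.5 IQR rule as stated in this excerpt can be sketched as follows. This is a minimal illustration, not an answer to the question. The sample data are made up, and the "inclusive" quartile convention is an assumption, since different software packages compute quartiles slightly differently.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values more than k*IQR below Q1 or more than k*IQR above Q3."""
    # Quartile convention is an assumption; other methods give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one extreme value.
data = [2, 3, 3, 4, 4, 5, 5, 6, 30]
flagged = iqr_outliers(data)
print(flagged)
```

The computation itself uses only quartiles, so it runs on any numeric sample. Whether the flagged points deserve the label "outlier" for skewed or heavy-tailed data is the question being asked.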
7
votes
1
answer
1k
views
Why does univariate Mahalanobis distance not match z-score?
I am using Mahalanobis distance for outlier detection. Sometimes my dataset only has 1 feature, sometimes many more. I believe the univariate Mahalanobis distance should be equal to the z-score of the ...
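For a single feature, the Mahalanobis distance reduces to |x - mean| / sd, so it should match the absolute z-score when both use the same variance estimate. The sketch below, on made-up data, shows one possible source of a mismatch: mixing the sample variance (divide by n - 1) with the population standard deviation (divide by n). Whether this is what happened in the question is an assumption.

```python
import math
import statistics

def mahalanobis_1d(x, data):
    """Univariate Mahalanobis distance using the sample variance (n - 1)."""
    mu = statistics.mean(data)
    var = statistics.variance(data)
    return math.sqrt((x - mu) ** 2 / var)

def abs_z(x, data, sample=True):
    """Absolute z-score using the sample (n - 1) or population (n) standard deviation."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data) if sample else statistics.pstdev(data)
    return abs(x - mu) / sd

# Hypothetical one-feature data.
data = [1.0, 2.0, 3.0, 4.0, 10.0]
d = mahalanobis_1d(10.0, data)
z_sample = abs_z(10.0, data)            # same variance estimate: matches d
z_pop = abs_z(10.0, data, sample=False)  # different estimate: differs from d

print(d, z_sample, z_pop)
```

With n observations, the two versions differ by a factor of sqrt(n / (n - 1)), which is noticeable for small samples and negligible for large ones.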
2
votes
1
answer
624
views
Is winsorizing limited to the usage of a certain percentile cutoff?
The Context
My dataset consists of 68 groups, each with 4 data points inside it.
As a robustness test, I am looking to see how the type of average/mean I use impacts the analysis that I will ...
2
votes
1
answer
79
views
Statistical method to detect possible electoral frauds
In Colombia there are 12.000 voting centers that consist of one or more electoral tables (the number of electoral tables depends on the number of registered voters in the voting center, and voting ...
2
votes
0
answers
69
views
If my dataset is standardized but has outliers, should I remove them and re-standardize? [closed]
I have a data set named Geographical Original of Music Data Set from the UCI repository. The data is given standardized but I think it has outliers and I do not know the best way to handle them.
...
0
votes
0
answers
39
views
Outliers in regression: selection of a specific region of the samples
I have a set of points/samples like the ones in blue in the image below: there is a bunch of wiggly nonsense here and there, and somewhere in the middle there is a region of almost perfect linear fit (...
1
vote
1
answer
813
views
How to detect outliers in linear regression
I am studying the relationship between the concentration of metals in organisms (Y axis in the image) and the environment (X axis). The regressions are not very good due to some outliers, and I want ...
0
votes
0
answers
48
views
Outlier Treatment and Forecasting
I have come across multiple methods regarding outlier treatment:
(features = my input/regressor/... matrix)
Treat outliers in the entire sample (both features and the variable to be forecasted).
...
8
votes
2
answers
5k
views
Can I remove sample outliers using standard deviation?
I am looking to find clinical and other measurements to predict a blood metabolite with Elastic-Net Regression models.
Can I remove samples with values greater than 1.96 SD from the mean as ...
1
vote
1
answer
223
views
Should I trim/winsorize raw data or computed metric used in models?
Question: Should I winsorize (or trim, where relevant) my raw data, or rather the intermediary metric I use in my models?
Context: My analysis consists of 3 steps:
Collect raw data,
Compute ...
1
vote
0
answers
220
views
Bayesian approach to removing outliers from a normal distribution
A lot of what I've seen for Bayesian approaches to removing outliers is for a linear model, not a normal distribution. Is there a way we can take a Bayesian approach to remove outliers from a normal ...
0
votes
0
answers
56
views
Modification of Outliers
I have a practical / applied statistics question. I'm dealing with a specialized dataset with a very small sample (i.e. n < 10). In the sequence of observations, it is possible that a new ...
0
votes
0
answers
86
views
How to conduct EM algorithm when there are some outliers in GMM Models?
I'm just confused about the problem of adding an outlier component directly to the primary form of GMM models:
Suppose that the observed data contains several outliers. The mixture model could be:
$$
...
0
votes
0
answers
2k
views
Log Transformation to treat outliers [duplicate]
I am trying to replicate a research paper as part of my Applied Econometrics course, and I came across a particularly vague statement in the reference paper.
"Following Malmendier and Tate (2005),...
0
votes
0
answers
71
views
Smoothing time series with Adjusted R2-weighted averages
I have two parameters (a,b) resulting from an exponential estimation of a curve. I have estimated this curve every hour for one month. In other words, I have a total of 720 parameters a and b, and I ...
1
vote
1
answer
532
views
Detecting outliers in a multiple time-series
I have some broker prices incoming in real-time for several products. Sometimes a broker makes a typo and sends a wrong price, or my parsing engine assigns the price to the wrong product - these are ...
0
votes
0
answers
222
views
Detect and remove outliers from unknown distribution
I have completed a range of steady-state CFD simulations on building roofs. A contour map of the resulting variable is displayed in the Figure below with the corresponding values on the left side. ...
2
votes
2
answers
2k
views
Do we need to split the data for Unsupervised Anomaly Detection?
I'm struggling with understanding the concept of splitting data for unsupervised anomaly/outlier detection. You can find all approaches here. I found some papers and implementations that didn't split ...
1
vote
1
answer
73
views
Which raw data to include for heterogeneous autoregressive (HAR) model
I constructed the realized variance of bitcoin returns per day from 8-10-2015 to today. The realized variance is calculated by taking the cumulative squared intra-day returns. 5-minute high frequency ...
0
votes
0
answers
61
views
Sample of Runners - Can the Group Run 2.5 Miles in 20 mins?
I have a dataset where there are 6 runners. Each runner runs as far as they can for 20 mins, and a watcher records their distance (to the nearest 0.1 miles) at certain times, precisely on the minute ...
1
vote
1
answer
768
views
Identify outliers in chi-squared goodness of fit test
I am performing a chi-square goodness of fit test to compare an observed value with an expected value. The expected value is calculated from theory. p-value suggests statistical significance. How do I ...
1
vote
1
answer
196
views
Do I need to transform/standardise my dependent variable?
Attached are the results and the residual plot for my regression of control variables on CEO compensation (TDC1). When I look at the plot my main concerns are the outliers (which I checked to be ...
1
vote
1
answer
267
views
Detecting Spikes in a 1-D discrete time series data with unknown underlying distribution
I have a discrete 1-D data set with a value range of 0-100. The underlying distribution is unknown (although we have enough data to fit a model). To summarize, it is a highly right-skewed data set, ...
0
votes
0
answers
114
views
How to decide which "outliers" to get rid of?
I have been thinking about this problem for a while but couldn't quite formulate a proper solution myself. I am also not even sure if it is appropriate to speak of "outliers" or if the term "...
-1
votes
1
answer
148
views
Is it normal for 6% of your dataset to be outliers?
My dataset has 80,886 obs and 16 variables. I am using Mahalanobis distance to detect outliers, with a p-value less than 0.001 as the cut-off. I am getting 5,423 obs flagged as outliers, which is 6% of total ...
2
votes
1
answer
472
views
Flagging bad time series behavior (Pattern Recognition and Outlier Detection)
I want to get some opinions on how to approach the following problem to do with detecting "unhealthy" behavior in time series data (either using a statistical/analytical model or ML/DL, I do ...
0
votes
0
answers
207
views
Outlier in grouped data
Existential crisis here xD.
When you want to determine outliers with the IQR and plot a box plot, what do you plot if your data is structured in the following manner:
n-dependent variables (n=6) (...
0
votes
0
answers
271
views
Non-parametric outlier estimation
Are there ways to automatically detect outliers (we can fix uni-dimensional datasets) when the underlying distribution is difficult to model?
Intuitively, resampling techniques could help.
(1) You ...
1
vote
0
answers
540
views
Should I have more trees than dimensions for the Isolation Forest?
I have a dataset which has 200 dimensions after pre-processing. I read multiple times that 100 is the recommended number of trees for the Isolation Forest. Since each tree chooses one feature randomly,...
0
votes
1
answer
278
views
How to deal with a large number of outliers in biological data?
I'm working on a marine species dataset with R. I would like to compare the biomass and abundance between different sites but I'm not sure how to deal with the large number of outliers. I am aware ...