Skip to main content

Questions tagged [outliers]

An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.

Filter by
Sorted by
Tagged with
0 votes
0 answers
23 views

In a multiple regression problem, suppose we have responses $Y_1, Y_2, \cdots , Y_n$ corresponding to data $\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_n$ where each $\mathbf{X}_i$ is a $d$-...
JRC's user avatar
  • 619
0 votes
0 answers
70 views

I have a dataset which has temperature measurements for every minute in a certain time period. I want to focus on 10 minute intervals and determine whether two adjacent 10 minute intervals differ ...
Jamess11's user avatar
1 vote
0 answers
45 views

I was wondering if there is a theorem or a result that relates the size of the population to the probability of the occurrence of outliers of various degrees, relating the z-score to the size of the ...
Tekko's user avatar
  • 11
0 votes
0 answers
34 views

This is a time-dependent measure of the water level of a river measured by an instrument that measures the water level every five minutes. However, due to some interference and other factors, there ...
yongchuang's user avatar
3 votes
1 answer
380 views

I've read in some papers (such as this) and CrossValidated questions (such as this, that people are using mahalanobis distance based on robust estimations of location and scatter using minimum ...
ira's user avatar
  • 461
0 votes
0 answers
125 views

I remembered I have encountered a paper in 1960s or 1970s that explore the impact of outliers on ordinary least square (OLS) regression. In the paper, it is shown that just adding one outlier will ...
Alex Cicco's user avatar
2 votes
0 answers
407 views

I have a dataset recording daily river flow from 1976 to 2017. I want to find out unusually high (potential flood) or low (potential drought) flow values from that datatset. What's the best way to ...
CyG's user avatar
  • 181
0 votes
0 answers
241 views

I am currently working on a project involving banking stock price data. I have around 3000 observations, some columns have a lot of missing values (null value); they can account for 5 to 50% of the ...
MINH NHỰT NGUYỄN TRẦN's user avatar
0 votes
0 answers
337 views

Say the distribution of underlying data points is multi-modal and we have an extremely large data point that has been confirmed to be an outlier. If it is not acceptable to simply remove the outlier ...
NMA's user avatar
  • 19
0 votes
0 answers
32 views

If we have a set of data of how long one watches youtube, these data points only include the raw number of minutes watched. If it is known that some of those data points include situations where you ...
NMA's user avatar
  • 19
2 votes
2 answers
231 views

What algorithm should I go for if I want to determine collective outliers within a dataset? By collective outliers, I mean a series of data points differ significantly from the trends in the rest of ...
Iamtrying's user avatar
3 votes
2 answers
478 views

I am trying to do factor analysis on a few variables and one particular variable (given in the example below) is covering/ explaining all the variance due to some outliers. I am not sure what else I ...
Saurabh's user avatar
  • 175
3 votes
1 answer
634 views

I am working on a time series dataset. I understand it has a gamma distribution. I want to use a 99% probability threshold to establish upper & lower limits/cut-offs and find anomalies. However, I ...
S2DEN8's user avatar
  • 31
3 votes
2 answers
213 views

Is there a better way to standardize a dataset with outliers than to normalized value (z-score) based on the mean and standard deviation? I am using the Excel STANDARDIZE function. I have two datasets ...
Patricia Nunes's user avatar
-1 votes
0 answers
15 views

I have this question, but to be honest i am stuck 1.Considering a set of 60 users, an a maximum number of objects that a user can own equal to 4000, which approach would you choose to calculate the ...
De Une's user avatar
  • 1
7 votes
1 answer
1k views

So, the idea is that I have many histograms, each one representing results for something. So, I have histogram_1 for object_1, histogram_2 for object_2,...,histogram_20 for object_20. How can throw ...
nowhere's user avatar
  • 157
5 votes
2 answers
6k views

Several articles says that MAE is robust to outliers but MSE is not and MSE can hamper the model if errors are too huge. My question is that MSE and MAE both are error matrices, our priority is to ...
Parth Sharma's user avatar
2 votes
1 answer
1k views

Suppose for simplicity that we have Gaussian distributed data with some outliers, whose typical characteristic is getting values that are far from the mean. Suppose my sample size is ...
Thomas's user avatar
  • 1,137
1 vote
0 answers
97 views

I came across this article: http://projetoaprendizagemgrupo4.pbworks.com/f/03.03%20-%20Unsupervised%20Profiling%20Methods%20Fraud%20Detection.pdf since I am interested in detecting abnormal behavior (...
Thomas's user avatar
  • 1,137
3 votes
1 answer
329 views

I'm facing with anomaly detection (outlier detection) task with mixed (numerical and categorical) multi-feature data set. I understand that many of the possible multivariate outlier detection methods ...
Hendrik's user avatar
  • 253
2 votes
2 answers
5k views

Any observations that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers. However does this theory still hold when a data set is not normally distributed? Outlier ...
maximus's user avatar
  • 131
7 votes
1 answer
1k views

I am using Mahalanobis distance for outlier detection. Sometimes my dataset only has 1 feature, sometimes many more. I believe the univariate Mahalanobis distance should be equal to the z-score of the ...
kwinkunks's user avatar
  • 369
2 votes
1 answer
624 views

The Context My dataset consists of 68 groups, each with 4 data points inside it. As means of a robustness test, I am looking to see how the type of average/mean I use impacts the analysis that I will ...
Son18's user avatar
  • 21
2 votes
1 answer
79 views

In Colombia there are 12.000 voting centers that consist of one or more electoral tables (the number of electoral tables depends on the number of registered voters in the voting center, and voting ...
user2246905's user avatar
2 votes
0 answers
69 views

I have a data set named Geographical Original of Music Data Set from the UCI repository. The data is given standardized but I think it has outliers and I do not know the best way to handle them. ...
Dazckel's user avatar
  • 81
0 votes
0 answers
39 views

I have a set of points/samples like the ones in blue in the image below: there is a bunch of wiggly nonsense here and there, and somewhere in the middle the is a region of almost perfect linear fit (...
user1384636's user avatar
1 vote
1 answer
813 views

I am studying the relationship between the concentration of metals in organisms (Y axis in the image) and the environment (X axis). The regressions are not very good due to some outliers, and I want ...
Antón's user avatar
  • 43
0 votes
0 answers
48 views

I have come across multiple methods regarding outlier treatment: (features = my input/regressor/... matrix) Treat outliers in the entire sample (both features and the variable to be forecasted). ...
shenflow's user avatar
  • 1,149
8 votes
2 answers
5k views

I am looking to find find clinical and other measurements to predict a blood metabolite with Elastic-Net Regression models. Can I remove samples with values greater than 1.96 SD from the mean as ...
Molly_K's user avatar
  • 203
1 vote
1 answer
223 views

Question: Should I rather winsorise (or trim, where relevant) my raw data, or the intermediary metric I use in my models? Context: My analysis consists in 3 steps: Collect raw data, Compute ...
ebosi's user avatar
  • 138
1 vote
0 answers
220 views

A lot of what I've seen for Bayesian approaches to removing outliers is for a linear model, not a normal distribution. Is there a way we can take a Bayesian approach to remove outliers from a normal ...
bme-programmer's user avatar
0 votes
0 answers
56 views

I have a practical / applied statistics question. I'm dealing with a specialized dataset with a very small sample (i.e. n < 10). In the sequence of observations, it is possible that a new ...
logisticregress's user avatar
0 votes
0 answers
86 views

I'm just confused about the problem of adding an outlier component directly to the primary form of GMM models: Suppose that the observed data contains several outliers. The mixture model could be: $$ ...
Iris88's user avatar
  • 1
0 votes
0 answers
2k views

I am trying to replicate a research paper as part of my Applied Econometrics course, and I came across a particularly vague statement in the reference paper. "Following Malmendier and Tate (2005),...
Madhav Bajaj's user avatar
0 votes
0 answers
71 views

I have two parameters (a,b) resulting from an exponential estimation of a curve. I have estimated this curve every hour for one month. In other words, I have a total of 720 parameters a and b, and I ...
angelavtc's user avatar
1 vote
1 answer
532 views

I have some broker prices incoming in real-time for several products. Sometimes a broker makes a typo and sends a wrong price, or my parsing engine assigns the price to the wrong product - these are ...
MilTom's user avatar
  • 369
0 votes
0 answers
222 views

I have completed a range of steady-state CFD simulations on building roofs. A contour map of the resulting variable is displayed in the Figure below with the corresponding values on the left side. ...
JimiChango's user avatar
2 votes
2 answers
2k views

I'm struggling with understanding the concept of splitting data for unsupervised anomaly/outlier detection. You can find all approaches here. I found some papers and implementations that didn't split ...
Mario's user avatar
  • 579
1 vote
1 answer
73 views

I constructed the realized variance of bitcoin returns per day from 8-10-2015 to today. The realized variance is calculated by taking the cumulative squared intra-day returns. 5-minute high frequency ...
Elise's user avatar
  • 51
0 votes
0 answers
61 views

I have a dataset where there are 6 runners. Each runner runs as far as they can for 20 mins, and a watcher records their distance (to the nearest 0.1 miles) at certain times, precisely on the minute ...
user267587's user avatar
1 vote
1 answer
768 views

I am performing a chi-square goodness of fit test to compare an observed value with an expected value. The expected value is calculated from theory. p-value suggests statistical significance. How do I ...
Jalan's user avatar
  • 11
1 vote
1 answer
196 views

Attached are the results and the residual plot for my regression of control variables on CEO compensation (TDC1). When I look at the plot my main concerns are the outliers (which I checked to be ...
user3129800's user avatar
1 vote
1 answer
267 views

I have a discrete 1-D data set with a value range of 0-100. The underlying distribution is unknown --although we have enough data to fit a model-- to summarize it is a highly right-skewed data set, ...
Ninja Bug's user avatar
  • 111
0 votes
0 answers
114 views

I have thinking about this problem for a while but couldn't quite formulate a proper solution myself. I am also not even sure if it is appropriate to speak of "outliers" or if the term "...
rememberhthename94's user avatar
-1 votes
1 answer
148 views

My dataset has 80,886 obs and 16 variables. I am using Mahalanobis Distance to detect outliers. And use P-value less than 0.001 as the cut-off. I am getting 5,423 obs as outlier which is 6% of total ...
surfffffffff's user avatar
2 votes
1 answer
472 views

I want to get some opinions on how to approach the following problem to do with detecting "unhealthy" behavior in time series data (either using a statistical/analytical model or ML/DL, I do ...
User_13's user avatar
  • 49
0 votes
0 answers
207 views

Existencial crysis here xD. When you want to determine outliers with IQR, and plotting a box-plot what do you plot if your data is structure in the following manner: n-dependent variables (n=6) (...
Leonardo Mendes-Silva's user avatar
0 votes
0 answers
271 views

Are there ways to automatically detect outliers ( we can fix uni-dimensional datasets ) when the underlying distribution is difficult to model ? Intuitively, resampling techniques could help. (1) You ...
Thomas's user avatar
  • 1,137
1 vote
0 answers
540 views

I have a dataset which has 200 dimensions after pre-processing. I read multiple times that 100 is the recommended number of trees for the Isolation Forest. Since each tree chooses one feature randomly,...
2much2code's user avatar
0 votes
1 answer
278 views

I´m working on a marine species dataset with R. I would like to compare the biomass and abundance between different sites but I´m not sure how to deal with the large number of outliers. I am aware ...
Florian B.'s user avatar

1 2 3
4
5
28