Questions tagged [outliers]
An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.
1,383 questions
5 votes, 2 answers, 500 views
Extreme outlier in real data
I'm looking at the amount of carbon in seven forest pools. For dead trees left on the landscape across many locations and over several harvest retention (logging) treatments, there is an extreme value ...
0 votes, 0 answers, 42 views
Winsorizing outliers across multiple analyses: once or multiple times? (SPSS)
I have a 2×2 experimental design with four conditions and eight outcome variables. I’m supposed to winsorize outliers, but I’m confused about how many times this needs to be done because I’m ...
1 vote, 2 answers, 275 views
Outlier detection in classification
I am curious if there are any methods of outlier detection [read: NOT high leverage point detection] that can be used in classification problems without fitting a model.
As I understand it, some commonly ...
1 vote, 0 answers, 34 views
How to assign an observation to a group but include an out-group option?
I have collected data from a number of known groups, and from individuals that I would like to assign to a group but may be from an unknown group.
For simplicity's sake, I have created an example with ...
5 votes, 3 answers, 533 views
How to handle outliers when some predictors perform better with them and others without
I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score.
I’ve ...
8 votes, 4 answers, 1k views
Should I transform my data before or after removing outliers? (Highly skewed cortisol example)
I am analyzing cortisol data collected over multiple days, with three samples per day (Cortisol_1, Cortisol_2, Cortisol_3). My data are extremely skewed:
Skewness of Cortisol_1: 26.3
Skewness of ...
2 votes, 0 answers, 30 views
Hypothesis testing for a weekly seasonal effect in the presence of outliers
Suppose that I have a time series where the mean usually changes smoothly over time, and I want a hypothesis test for whether there is a weekly seasonal pattern to the data. The time series also ...
0 votes, 0 answers, 65 views
A simple-ish way of estimating the number of modes, and the 'pronounced'-ness of said modes of a discrete, finite distribution
Intuitively, let's say we're given a price $p$ for some product, and we want to compare the prices with what's available on the market (ex: to determine if we're being ripped off or not).
We come back ...
0 votes, 0 answers, 66 views
What does iteration in sigma clipping do?
If I only want the high-SNR data, I apply sigma clipping to an array.
As this link says
Suppose you have a set of data. Compute its median m and its standard deviation ...
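For readers unfamiliar with the procedure, here is a minimal Python sketch of iterative sigma clipping. The quoted description above is truncated, so the median-centred form used here is an assumption, and the function name `sigma_clip`, its parameters, and the sample data are illustrative rather than taken from the question. Each pass recomputes the median and standard deviation from the surviving points only, which is what iteration adds over a single pass.

```python
from statistics import median, pstdev


def sigma_clip(values, n_sigma=3.0, max_iter=10):
    """Repeatedly drop points farther than n_sigma standard deviations from the median.

    Hypothetical helper for illustration; stops when a pass removes nothing
    or when max_iter passes have been made.
    """
    kept = list(values)
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        m = median(kept)
        s = pstdev(kept)  # population standard deviation of the current survivors
        survivors = [v for v in kept if abs(v - m) <= n_sigma * s]
        if len(survivors) == len(kept):
            break  # converged: this pass removed nothing
        kept = survivors
    return kept


# Made-up sample: ten values near 10, plus a moderate and an extreme outlier.
data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 9.9, 14.0, 50.0]
one_pass = sigma_clip(data, n_sigma=2.0, max_iter=1)
converged = sigma_clip(data, n_sigma=2.0)
```

In this made-up sample a single pass removes only 50.0, because that point inflates the standard deviation enough to shelter 14.0. Once 50.0 is gone, the recomputed standard deviation is much smaller and the next pass removes 14.0 as well.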
8 votes, 1 answer, 378 views
Does the presence of outliers always mean that robust regression analysis should be used?
I revised my question to be more specific, as suggested by the community. Since my knowledge of statistics is limited, I'm not entirely sure what it means to specialize in this subject—but I'll give ...
3 votes, 2 answers, 124 views
How to test if a single value in a set of values is higher than the remaining values
I have a set of $8$ participants $P_1, \ldots, P_8$. Each participant takes two tasks $A$ and $B$, and each task results in an ordered vector of $6$ positive values. I'll denote the vector recorded ...
0 votes, 0 answers, 68 views
Should varIdent be used in a linear model with outliers in nlme in R?
I am unsure whether/how to use varIdent from the nlme package to allow different variances across factor levels when analysing a dataset which has outliers.
I am specifically interested in mixed ...
3 votes, 1 answer, 167 views
What is the difference between the Theil–Sen estimator and repeated median regression?
I am currently learning about robust regression and came across two variants: the Theil–Sen estimator and Repeated Median Regression. However, I got confused when comparing these two algorithms. Both ...
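As a rough illustration of how the two slope estimators differ, here is a minimal Python sketch using their standard definitions. The function names and the sample data are hypothetical and not from the question. Theil–Sen takes a single median over all pairwise slopes, while the repeated median takes, for each point, the median slope to every other point and then the median of those per-point medians.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of the slopes over all pairs of points with distinct x."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]
    return median(slopes)


def repeated_median_slope(x, y):
    """Median over i of the median over j != i of the pairwise slopes."""
    n = len(x)
    per_point = []
    for i in range(n):
        slopes_i = [
            (y[j] - y[i]) / (x[j] - x[i])
            for j in range(n)
            if j != i and x[j] != x[i]
        ]
        per_point.append(median(slopes_i))
    return median(per_point)


# Made-up data: y = 2x + 1 with the last observation corrupted.
x = [0, 1, 2, 3, 4, 5]
y = [1, 3, 5, 7, 9, 100]
ts = theil_sen_slope(x, y)
rm = repeated_median_slope(x, y)
```

On this toy data both estimators recover a slope of 2.0 despite the corrupted point. The nested medians are what give the repeated median its higher breakdown point (about 50%, versus roughly 29% for Theil–Sen), at the cost of lower efficiency on clean data.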
6 votes, 1 answer, 214 views
What regression method should I use for non-normal, outlier-heavy biomedical data with a continuous outcome?
I'm working with a large dataset of about 50,000 patients and trying to understand how protein expression levels influence erythrocyte (red blood cell) counts. The outcome variable — erythrocyte count ...
5 votes, 1 answer, 282 views
Moderation analysis assumption: univariate outliers after centering
I am conducting a moderation analysis for my thesis and am performing assumption testing.
I found a few univariate outliers and transformed any scores with a z-score beyond ±3.29. I then ...
0 votes, 1 answer, 129 views
Dataset with outliers: Kendall's tau or Spearman's rho?
I am analyzing some data and in particular I want to test for the presence of a monotonic relationship between two random variables whose values don't appear normally distributed. I know about the ...
0 votes, 1 answer, 98 views
Outlier removal from only one class in a binary classification problem
Can outlier removal be done on only one class in a binary classification problem?
When facing class imbalance, for example, can it be done only on the majority class?
If so, is there any paper on this ...
5 votes, 2 answers, 602 views
How can I use unsupervised methods to recommend an “ideal” number of managers for companies when no labels exist?
I have a dataset of around 100,000 companies. For each company, I have a bunch of features such as:
Number of employees,
Number of customers,
Number of complaints,
other additional company attributes ...
3 votes, 2 answers, 299 views
DFBETA in regression model diagnostics of influential points
Belsley (1980) described how DFBETA values are calculated for linear regression models: "DFBETA values are usually calculated via equations that relate the least-squares fit of a model calculated with $n$...
6 votes, 2 answers, 652 views
Can you "dummy-out" an outlier on the independent variable?
I want to run a regression where one of the regressors has a single outlier. I wonder if I can include a dummy variable to rule out this outlier without losing information from other regressors, as ...
0 votes, 1 answer, 116 views
Outlier detection, is it appropriate to take the mean of Z scores? [closed]
Simple backstory: I have a few crypto tokens that I want to look at. I want to do some outlier detection and look for which token could be susceptible to a rugpull or scam.
Let's say we get 10 tokens. I ...
0 votes, 0 answers, 54 views
How to handle an extreme outlier (clinical setting)
I am currently analyzing data from cancer patients and plan on running Cox regression and assessing survival times. I also want to correlate certain tumor-related data with different markers.
One of ...
1 vote, 1 answer, 141 views
Feature selection and outlier detection in panel regression with fixed effects
I am trying to fit the following panel regression with fixed entity effects
$$Y_{it} = \alpha_i + \sum_j \beta_jX^{(j)}_{it} + \epsilon_{it},$$
where the index $j$ labels the different features. Some ...
0 votes, 0 answers, 37 views
How do you identify "important" changes between 2 or 3 time periods?
I am comparing sales by Customer for a company for 2 years in a row (sometimes for 3 years) and would like to highlight to my sales team the customers they should be looking into: customers who have ...
2 votes, 0 answers, 36 views
What do I do with a time series that has a large, strong, trend-violating glitch? [duplicate]
I have data (a few hundred thousand points) from 1 January 2017 up to a few days ago. I can create a time series by day (or even by time to the minute) if I so wish. However, this data is of public ...
2 votes, 2 answers, 316 views
Checking for an increase in outliers over time
I've been asked to test if there has been an increase in the number and size specifically of the high outliers over the years. The purpose is to show that there are more and higher extreme cases as ...
7 votes, 4 answers, 880 views
Can you remove outliers if they are less than 10% of the datapoints? [duplicate]
I am currently attending my first data analysis class and we do some simple hypothesis tests like the t-test. Our teacher told us that we can remove outliers, as long as they are not more than the 10% ...
2 votes, 3 answers, 153 views
Testing forecasting accuracy - outliers [with example]
I have a simple model that produces forecast values. The model works on hourly data. Now, I am only interested in observations with flags. I would like to identify where the forecasts are ...
0 votes, 1 answer, 59 views
How can I filter outliers in data that is manually recorded?
Different people have to write down values of a certain type of parameter in order to fill out a table, and people obviously tend to write them down incorrectly, sometimes by a factor of 1000. This creates a lot of ...
5 votes, 3 answers, 401 views
Understanding heuristic-based outlier detection: concerns about scoring, weighting, and validity
I am trying to understand the mathematics and methodology behind a newly published outlier detection algorithm in the journal Computers & Security. This algorithm uses heuristic-based approaches, ...
2 votes, 1 answer, 236 views
Finding outliers in mostly zero data
Background
I'm working on an algorithm to find short pieces of DNA sequence in a long DNA sequence. I won't go into detail about how it actually works, but let me state it more formally to provide ...
1 vote, 0 answers, 69 views
How can I identify the distribution of a series of Mahalanobis distances?
If my dataset follows a multivariate t-distribution, what is the cdf of the Mahalanobis distance of a datapoint outside the sample? In other words, if I want to calculate the probability that a ...
1 vote, 1 answer, 272 views
Local Outlier Factor for time series
I hope this makes sense. I have discovered LOF and tried it in R. However, since I am dealing with time series, the neighbors cannot be "future" neighbors of the current observation(s). I am ...
0 votes, 0 answers, 47 views
Latent variable demonstration with only 3 variables
I collected data for anxiety (ANX), depression (DEP), and posttraumatic stress syndrome (PTSD) symptoms. Spearman's correlation results are the following (...
1 vote, 0 answers, 41 views
How to know which features contribute the most to the outlier score after applying GMM detector?
I have a dataset with 100+ features, upon which I test GMM to detect anomalies. For example, I add some Gaussian noise to 5-6 features of 100 points. GMM detects the points easily, but the next ...
2 votes, 1 answer, 163 views
MSE gets better but $R^2$ gets worse
Consider the following small dataset (around 569 data points), where Uptake is the regression target:
As you can see, most of the variables are skewed, with some of them having only 2 or 3 data ...
1 vote, 1 answer, 70 views
Determining the multiplier in limits for spotting Outliers
I want to determine the chance of having above-the-expected sales orders for products; then I could use this (along with my gut feeling and other business analysis) to determine whether or not I should keep safety ...
0 votes, 0 answers, 49 views
Bayesian model missing outliers at cutoff in data
I am having trouble getting the model to fit. I have ED50 values of chlorophyll in corals during a heating experiment. I have 4 reef sites and 4 species of coral with ~14 corals per site-species group....
3 votes, 1 answer, 343 views
Using a paired t-test when the data are not normally distributed and there are outliers
I have a sample of 190 observations, but there are a few outliers and the data are not normally distributed. I intend to use a paired t-test to compare pre- and post-treatment measurements over time. What should I do?
In addition,...
5 votes, 1 answer, 1k views
Why does modified z-score not pick up an obvious outlier?
I'm looking to draw on some of your wisdom around modified z-scores as used for detecting outliers.
As far as I can tell from my research, when a distribution might not be normal (e.g. skewed), a modified ...
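For reference, here is a minimal Python sketch of the modified z-score in the form usually attributed to Iglewicz and Hoaglin, which replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). The function name, the sample data, and the 3.5 cutoff applied below are illustrative assumptions, not details from the question.

```python
from statistics import median


def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each x (hypothetical helper)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Happens when more than half of the values equal the median;
        # the score is then undefined and a fallback scale is needed.
        raise ValueError("MAD is zero; modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


# Made-up sample with one obvious outlier.
data = [2.1, 2.3, 2.2, 2.4, 2.0, 2.2, 9.5]
scores = modified_z_scores(data)
flagged = [v for v, z in zip(data, scores) if abs(z) > 3.5]
```

Here only 9.5 exceeds the commonly quoted cutoff of 3.5. Note that the MAD is a symmetric measure of scale, so for strongly skewed data it can be large relative to the short tail and small relative to the long one, which is one possible reason a point that looks extreme by eye may not be flagged.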
0 votes, 1 answer, 64 views
Outlier detection for data set comparison
I have two data sets with similar columns, one numerical and the rest categorical.
col_1= categorical: city_name,
col_2= categorical: company_name,
col_3 = categorical: product_name,
col_4 = numerical ...
-1 votes, 1 answer, 81 views
Usefulness of p-value to flag outliers in a data set [closed]
Suppose I have a set of data such that $$y= a\times x + b + \varepsilon $$
I am trying to find $a$ and $b$, but some $y$'s are outliers and up to 80% of the data is missing, so I don't have access to $...
2 votes, 1 answer, 124 views
How can I show statistically that one of my replicates is likely contaminated?
I have a dataset that looks like the below: five replicate samples, each of which is composed of 4 different fractions that sum to 100%. The fifth sample clearly looks visually distinct from the other ...
0 votes, 0 answers, 68 views
Determining the p-value of a test statistic, which is not distributed according to a commonly known distribution under the null hypothesis
Currently I am working in R on a project that aims to identify Dragon King events (massive outliers) in large datasets. These outliers appear for example in the city sizes in England, where London is ...
1 vote, 1 answer, 93 views
What is the interpretation of outlier-robust principal component analysis?
There's a set of methods called "robust" principal component analysis (here, "robust" means resistant to influence from outliers). One example is Hubert et al., "ROBPCA: A new ...
1 vote, 1 answer, 882 views
How to determine the appropriate z-score threshold for non-normally distributed data
I am interested in CPI, and I need to identify outliers in the series. For that, my instructor mentioned the number of standard deviations that a data point lies from the mean, that is, the z-score.
I ...
2 votes, 2 answers, 613 views
Methods for Detecting outliers in a time series
I have a question on detecting outliers in a time series (such as PPI, CPI, inflation, etc.).
Which method should I use? How can I precisely detect these outliers with a test or a method?
Please ...
2 votes, 1 answer, 103 views
Calculate the confidence that the data point is NOT explained by the regression
I have $n$ independent variables $x_i$ and dependent variables $y_i$ with uncertainties for both $x$ and $y$. I did a linear regression to get a model $\hat y = \beta x$.
Now I want to use this ...
1 vote, 0 answers, 198 views
How to deal with outliers in panel data? [closed]
When we have cross-sectional data, we can easily detect and remove outliers. But how should one approach outliers when we are dealing with panel data? Since we have $i$ entities and $t$ time periods, ...
4 votes, 1 answer, 215 views
Interpreting Mass-Volume as an evaluation criterion for unsupervised anomaly detection
I have found the paper "How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms?" by Nicolas Goix, which talks about the evaluation of unsupervised anomaly scoring functions by the use of ...