
Questions tagged [outliers]

An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.

5 votes
2 answers
500 views

I'm looking at the amount of carbon in seven forest pools. For dead trees left on the landscape across many locations and over several harvest retention (logging) treatments, there is an extreme value ...
Declan's user avatar
  • 51
0 votes
0 answers
42 views

I have a 2×2 experimental design with four conditions and eight outcome variables. I’m supposed to winsorize outliers, but I’m confused about how many times this needs to be done because I’m ...
mk0's user avatar
  • 21
1 vote
2 answers
275 views

I am curious if there are any methods of outlier detection [read: NOT high leverage point detection] that can be used in classification problems without fitting a model. As I understand it, some commonly ...
plotmaster473's user avatar
1 vote
0 answers
34 views

I have collected data from a number of known groups, and from individuals that I would like to assign to a group but may be from an unknown group. For simplicity's sake, I have created an example with ...
AnneA's user avatar
  • 11
5 votes
3 answers
533 views

I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score. I’ve ...
QualityX's user avatar
8 votes
4 answers
1k views

I am analyzing cortisol data collected over multiple days, with three samples per day (Cortisol_1, Cortisol_2, Cortisol_3). My data are extremely skewed: Skewness of Cortisol_1: 26.3 Skewness of ...
Aaliya Ahamed's user avatar
2 votes
0 answers
30 views

Suppose that I have a time series where the mean usually changes smoothly over time, and I want a hypothesis test for whether there is a weekly seasonal pattern to the data. The time series also ...
Alex's user avatar
  • 817
0 votes
0 answers
65 views

Intuitively, let's say we're given a price $p$ for some product, and we want to compare it with the prices available on the market (e.g., to determine whether we're being ripped off). We come back ...
MergeMonster's user avatar
0 votes
0 answers
66 views

If I only want the high-SNR data, I apply sigma-clipping to an array. As this link says: "Suppose you have a set of data. Compute its median m and its standard deviation ...
Firestar-Reimu's user avatar
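For readers unfamiliar with the sigma-clipping recipe quoted in the excerpt above, here is a minimal sketch in plain Python. The function name `sigma_clip`, the `n_sigma` threshold, and the `max_iter` cap are illustrative assumptions; the excerpt is truncated, so the linked recipe may differ in its details.

```python
import statistics


def sigma_clip(values, n_sigma=3.0, max_iter=10):
    """Repeatedly drop points farther than n_sigma standard deviations from the median.

    Hypothetical helper for illustration; the threshold and iteration cap are assumptions.
    """
    kept = list(values)
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        m = statistics.median(kept)   # robust centre
        s = statistics.pstdev(kept)   # spread (itself inflated by outliers)
        if s == 0:
            break
        clipped = [v for v in kept if abs(v - m) <= n_sigma * s]
        if len(clipped) == len(kept):  # nothing removed: converged
            break
        kept = clipped
    return kept


data = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0]
result = sigma_clip(data, n_sigma=2.0)
```

Because the standard deviation is computed from the same data being clipped, a single extreme value can inflate it enough to survive a wide threshold, which is why the loop repeats until no more points are removed.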
8 votes
1 answer
378 views

I revised my question to be more specific, as suggested by the community. Since my knowledge of statistics is limited, I'm not entirely sure what it means to specialize in this subject—but I'll give ...
Ertan's user avatar
  • 141
3 votes
2 answers
124 views

I have a set of $8$ participants $P_1, \ldots P_8$. Each participant takes two tasks $A$ and $B$, and each task results in an ordered vector of $6$ positive values. I'll denote the vector recorded ...
chesslad's user avatar
  • 241
0 votes
0 answers
68 views

I am unsure whether/how to use varIdent from the nlme package to allow different variances across factor levels when analysing a dataset which has outliers. I am specifically interested in mixed ...
Pratorum's user avatar
3 votes
1 answer
167 views

I am currently learning about robust regression and came across two variants: the Theil–Sen estimator and Repeated Median Regression. However, I got confused when comparing these two algorithms. Both ...
Olivia's user avatar
  • 191
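As a point of reference for the comparison asked about above, the two slope estimators can be sketched as follows. This is a minimal illustration in plain Python, assuming distinct `x` values; the function names are hypothetical and only the slope (not the intercept) is computed.

```python
import statistics
from itertools import combinations


def theil_sen_slope(x, y):
    """Theil-Sen: a single median taken over all pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]
    return statistics.median(slopes)


def repeated_median_slope(x, y):
    """Repeated median: for each point, take the median slope to every other
    point, then take the median of those per-point medians."""
    per_point = []
    for i in range(len(x)):
        slopes = [
            (y[j] - y[i]) / (x[j] - x[i])
            for j in range(len(x))
            if j != i and x[j] != x[i]
        ]
        per_point.append(statistics.median(slopes))
    return statistics.median(per_point)


# Points on y = 2x + 1, with the last observation replaced by an outlier.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 100]
ts = theil_sen_slope(xs, ys)
rm = repeated_median_slope(xs, ys)
```

The structural difference is one median over all pairs versus a nested median of medians; on this small example both recover the slope of the uncontaminated line.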
6 votes
1 answer
214 views

I'm working with a large dataset of about 50,000 patients and trying to understand how protein expression levels influence erythrocyte (red blood cell) counts. The outcome variable — erythrocyte count ...
Nikimiskata's user avatar
5 votes
1 answer
282 views

I am conducting a moderation analysis for my thesis and am performing assumption testing. I found a few univariate outliers and transformed any scores with a z-score beyond ±3.29. I then ...
Emily's user avatar
  • 51
0 votes
1 answer
129 views

I am analyzing some data and, in particular, I want to test for the presence of a monotonic relationship between two random variables whose values don't appear to be normally distributed. I know about the ...
Jamilo's user avatar
  • 1
0 votes
1 answer
98 views

Can outlier removal be done on only one class in a binary classification problem? When faced with class imbalance, for example, can it be done only on the majority class? If so, is there any paper on this ...
vhd's user avatar
  • 25
5 votes
2 answers
602 views

I have a dataset of around 100,000 companies. For each company, I have a bunch of features such as: Number of employees, Number of customers, Number of complaints, other additional company attributes ...
B_fig's user avatar
  • 63
3 votes
2 answers
299 views

Belsley (1980) described how DFBETA values are calculated for linear regression models: "DFBETA values are usually calculated via equations that relate the least-squares fit of a model calculated with $n$...
user27842288's user avatar
6 votes
2 answers
652 views

I want to run a regression where one of the regressors has a single outlier. I wonder if I can include a dummy variable to rule out this outlier without losing information from other regressors, as ...
Victor Hugo Schieck Terziani's user avatar
0 votes
1 answer
116 views

Simple backstory: I have a few crypto tokens that I want to look at. I want to do some outlier detection and look for which token could be susceptible to a rugpull or scam. Let's say we get 10 tokens. I ...
myts999's user avatar
  • 13
0 votes
0 answers
54 views

I am currently analyzing data from cancer patients and plan on running Cox regression and assessing survival times. I also want to correlate certain tumor-related data to different markers. One of ...
Maria Nieves Arredondo Lasso's user avatar
1 vote
1 answer
141 views

I am trying to fit the following panel regression with fixed entity effects $$Y_{it} = \alpha_i + \sum_j \beta_jX^{(j)}_{it} + \epsilon_{it},$$ where the index $j$ labels the different features. Some ...
Mark Dubin's user avatar
0 votes
0 answers
37 views

I am comparing sales by Customer for a company for 2 years in a row (sometimes for 3 years) and would like to highlight to my sales team the customers they should be looking into: customers who have ...
Adriana's user avatar
2 votes
0 answers
36 views

I have data (a few hundred thousand points) from 1 January 2017 up to a few days ago. I can create a time series by day (or even by time to the minute) if I so wish. However, this data is of public ...
Bryan's user avatar
  • 1,541
2 votes
2 answers
316 views

I've been asked to test if there has been an increase in the number and size specifically of the high outliers over the years. The purpose is to show that there are more and higher extreme cases as ...
Woolynik's user avatar
7 votes
4 answers
880 views

I am currently attending my first data analysis class and we do some simple hypothesis tests like the t-test etc. Our teacher told us that we can remove outliers, as long as they are not more than 10% ...
Maria's user avatar
  • 71
2 votes
3 answers
153 views

I have a simple model that produces forecast values. The model works on hourly data. Now, I am only interested in observations with flags. I would like to identify where the forecasts are ...
Lohengrin's user avatar
0 votes
1 answer
59 views

Different people have to write down values of a certain type of parameter in order to fill out a table, and people inevitably make mistakes, sometimes by a factor of 1000. This creates a lot of ...
Huragok's user avatar
5 votes
3 answers
401 views

I am trying to understand the mathematics and methodology behind a newly published outlier detection algorithm in the Computers & Security journal. This algorithm uses heuristic-based approaches, ...
Mario's user avatar
  • 579
2 votes
1 answer
236 views

Background: I'm working on an algorithm to find short pieces of DNA sequence in a long DNA sequence. I won't go into detail on how it actually works, but let me state it more formally to provide ...
CodeNoob's user avatar
  • 231
1 vote
0 answers
69 views

If my dataset follows a multivariate t-distribution, what is the cdf of the Mahalanobis distance of a datapoint outside the sample? In other words, if I want to calculate the probability that a ...
Andreas Ierodiaconou's user avatar
1 vote
1 answer
272 views

I hope this makes sense. I have discovered LOF and tried it in R. However, since I am dealing with time series, the neighbors cannot be "future" neighbors of the current observation(s). I am ...
umbe1987's user avatar
  • 307
0 votes
0 answers
47 views

I collected data for anxiety (ANX), depression (DEP), and posttraumatic stress syndrome (PTSD) symptoms. Spearman's correlation results are the following (...
pdeli's user avatar
  • 161
1 vote
0 answers
41 views

I have a dataset with 100+ features, upon which I test GMM to detect anomalies. For example, I add some Gaussian noise to 5-6 features of 100 points. GMM detects the points easily, but the next ...
AlisherAliev's user avatar
2 votes
1 answer
163 views

Consider the following small dataset (around 569 data points), where Uptake is the regression target: As you can see, most of the variables are skewed, with some of them having only 2 or 3 data ...
AnotherSherlock's user avatar
1 vote
1 answer
70 views

I want to determine the chance of having above-the-expected sales orders for products; then I could use this (along with my gut feeling and other business analysis) to determine whether I should (or should not) keep safety ...
Simonates's user avatar
0 votes
0 answers
49 views

I am having trouble getting the model to fit. I have ED50 values of chlorophyll in corals during a heating experiment. I have 4 reef sites and 4 species of coral with ~14 corals per site-species group....
Michael's user avatar
  • 11
3 votes
1 answer
343 views

I have a data sample of 190, but I have a few outliers and my data is not normally distributed. I intend to use a paired t-test to evaluate the pre-post treatment over time. What should I do? In addition,...
Aurelia 's user avatar
5 votes
1 answer
1k views

Looking to draw on some of your wisdom around modified z-scores as used for detecting outliers. As far as I can tell from my research, when a distribution might not be normal (e.g. skewed), a modified ...
gecko's user avatar
  • 53
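For context on the question above, the modified z-score as commonly defined (following Iglewicz and Hoaglin) replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). The sketch below is a minimal illustration; the helper name is hypothetical, and the 3.5 cutoff mentioned in the comment is the commonly cited convention, not something stated in the excerpt.

```python
import statistics


def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD.

    The constant 0.6745 rescales the MAD so that it is comparable to the
    standard deviation when the data are normally distributed.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


scores = modified_z_scores([1, 2, 3, 4, 100])
# A commonly cited rule flags points with |score| > 3.5 as potential outliers.
flagged = [abs(s) > 3.5 for s in scores]
```

Note that the MAD is zero whenever more than half of the values are identical, so the sketch raises an error in that case rather than dividing by zero.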
0 votes
1 answer
64 views

I have two data sets with similar columns, one numerical and the rest categorical. col_1= categorical: city_name, col_2= categorical: company_name, col_3 = categorical: product_name, col_4 = numerical ...
Jens123's user avatar
-1 votes
1 answer
81 views

Suppose I have a set of data such that $$y= a\times x + b + \varepsilon $$ I am trying to find $a$ and $b$, but some $y$'s are outliers and up to 80% of the data is missing, so I don't have access to $...
Anatole's user avatar
2 votes
1 answer
124 views

I have a dataset that looks like the below: five replicate samples, each of which is composed of 4 different fractions that sum to 100%. The fifth sample clearly looks visually distinct from the other ...
Dubukay's user avatar
  • 298
0 votes
0 answers
68 views

Currently I am working in R on a project that aims to identify Dragon King events (massive outliers) in large datasets. These outliers appear for example in the city sizes in England, where London is ...
user25936873's user avatar
1 vote
1 answer
93 views

There's a set of methods called "robust" principal component analysis (here, "robust" means resistant to influence from outliers). One example is Hubert et al., "ROBPCA: A new ...
cgmil's user avatar
  • 1,633
1 vote
1 answer
882 views

I am interested in CPI, and I need to identify outliers in the series. For that, my instructor mentioned the number of standard deviations that a data point lies from the mean, i.e., the z-score. I ...
1190's user avatar
  • 1,160
2 votes
2 answers
613 views

I have a question on detecting outliers in a time series (like PPI, CPI, inflation, etc.). Which method should I use? How can I precisely detect these outliers with a test or a method? Please ...
2 votes
1 answer
103 views

I have $n$ independent variables $x_i$ and dependent variables $y_i$ with uncertainties for both $x$ and $y$. I did a linear regression to get a model $\hat y = \beta x$. Now I want to use this ...
Tibor's user avatar
  • 155
1 vote
0 answers
198 views

When we have cross-sectional data, we can easily detect and remove outliers. But how should one approach outliers when we are dealing with panel data? Since we have $i$ entities and $t$ time periods, ...
TFT's user avatar
  • 345
4 votes
1 answer
215 views

I have found the paper "How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms?" by Nicolas Goix, which discusses the evaluation of unsupervised anomaly scoring functions by the use of ...
deblue's user avatar
  • 399
