
Questions tagged [outliers]

An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.

1 vote
0 answers
149 views

I sample $n$-dimensional vectors (each sample is a vector). My objective is to detect outliers. If those elements were normally distributed, I could use Standardized ...
Gideon Kogan's user avatar
3 votes
0 answers
874 views

Given the Prussian Horse Data here: https://www.randomservices.org/random/data/HorseKicks.html Is there a way to find out which corps has an unusually high number of deaths? (Note that Prussian horse ...
william007's user avatar
  • 1,097
3 votes
2 answers
555 views

I have a dataset of solar panels' output power. After visually inspecting the data, I found that it is not normally distributed; it is right-skewed with many zeroes. I used ...
graphicart86's user avatar
0 votes
0 answers
57 views

I have a bunch of outcome/exposure relationships I am trying to fit models to: From these graphs, I am not sure if a simple lm is appropriate. Some of them look ...
Hank Lin's user avatar
  • 529
3 votes
4 answers
954 views

Assume you have a data set with information on income of all students in the lecture. The mean value is 1500\$. The median value is however only 800\$. Which of the following conclusions is wrong? The ...
StatisticsNoobie's user avatar
1 vote
1 answer
152 views

I apologize in advance for my novice question. I am a part of an interview committee of eight people. We interview 70 applicants for just six positions. All of the applicants are very accomplished. We ...
Joe Davey's user avatar
4 votes
3 answers
2k views

I am in the process of solving a Machine Learning challenge, and I want to do it the right way. I did some exploratory data analysis and I wanted to check the distribution of the data. As displayed in ...
Spicy strike's user avatar
0 votes
0 answers
227 views

I've been trying to fit a curve to a dataset for a while, without success. The goal is to obtain a curve with an equation that fits the data so I can get the parameter x for any value of y. The blue dataset ...
JCV's user avatar
  • 153
0 votes
0 answers
487 views

I have been working on a dataset for which the task is to forecast the sales of the drug sold by 1115 drug stores of the Rossmann chain. The dataset is fairly large with over 1m records and as many as ...
Ritik P. Nayak's user avatar
0 votes
1 answer
178 views

I have samples that are distributed like this: I want to calculate the standard deviation (or similar) of the main peak without the outliers. Of course I can do this just applying a cut at, say, -5µ. ...
user171780's user avatar
0 votes
1 answer
339 views

As stated in Wikipedia: Intuitively, we can understand that a breakdown point cannot exceed 50% because if more than half of the observations are contaminated, it is not possible to distinguish ...
JustBlaze's user avatar
0 votes
0 answers
1k views

I am testing whether I can describe StockPrice with EPS (= earnings per share), BookValuePS, and ESGscore. Before starting, I winsorized all my variables. Now I want to take the logarithm of e.g. BookValuePS ...
wrangjangler's user avatar
9 votes
10 answers
6k views

I am sure we have all heard the following argument stated in some way or the other: For a given set of measurements (e.g. heights of students), the mean of these measurements is more "prone"...
stats_noob's user avatar
2 votes
1 answer
524 views

I apologize in advance as I am not well-versed in statistics, but I hope that this question makes sense. I have 2 populations which are normally distributed and have a near-identical mean. I would ...
octopuslegs11's user avatar
1 vote
1 answer
169 views

I have the following (general) question (I know there is no definite answer to this question and it largely depends on the specific data and choice of model): Should Legitimate Outliers in the Data be ...
stats_noob's user avatar
0 votes
1 answer
650 views

I have a dataset I need to use to predict the probability of conversion based on the number of days an individual has spent using my app. I got a list of historical users and the number of session ...
Andrei Budaes's user avatar
0 votes
1 answer
3k views

I want to know how to find and remove outliers from my logistic regression. I have tried using a formula from Faraway, but I don't know whether it is applicable to logistic regression. For example, my ...
Jasmine Helen's user avatar
0 votes
2 answers
4k views

Some people seem to frown on removing outliers. But I've also read many times elsewhere that ANOVAs are sensitive to outliers and you must remove them. I'm running a 2 x 2 repeated measures within ...
Statsquestionboy's user avatar
1 vote
2 answers
182 views

I am running a multiple variable regression predicting GDP per capita for U.S. states with a bunch of independent variables. Currently I have included the District of Columbia in the data set which ...
Jeremy's user avatar
  • 13
-1 votes
1 answer
267 views

I am a very young stats learner, and I need help understanding the justification of a test choice. I have a sample of 39 participants (20 females and 19 males) who have been measured on task performance, and I ...
marth's user avatar
  • 1
2 votes
1 answer
3k views

I'm wondering about techniques like ridge regression with regard to both multicollinearity and outliers. My understanding is that ridge regression is primarily used for multicollinearity, but that ...
fmtcs's user avatar
  • 575
4 votes
2 answers
3k views

My classmate told me he was presenting some statistics-based work, and at one point he showed a boxplot and used it for outlier detection. His professor then said 'it's not even correct, the ...
Davi Américo's user avatar
0 votes
0 answers
223 views

In my current dataset (results from a factorial ANOVA), I know I have outliers (due to qualitative comments participants wrote during an online experiment), thus I'd like to do a filtering process to ...
JoeyyyFunk's user avatar
1 vote
0 answers
191 views

I have two groups of samples (N=4 for each) and found that there is one outlier in each group (both are higher than the rest of the respective samples within the groups). I have no resources to repeat ...
William Wong's user avatar
0 votes
1 answer
192 views

Let's assume I have a sensor that gives me measurements $z$ and I know that $50\%$ of the measurements I read are outliers (more than 3 standard deviations away from the real measurement distribution)....
MattSt's user avatar
  • 350
0 votes
0 answers
26 views

I am experimenting with PCA to detect outliers based on the reconstruction error. What I do: I start with a 6-dimensional dataset and reduce it to 5 dimensions. Then, I reconstruct the initial dataset and ...
savoga's user avatar
  • 16
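The reconstruction-error idea described in the excerpt above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `pca_reconstruction_errors`, the synthetic 6-dimensional data, and the planted outlier in row 0 are all hypothetical and do not come from the question.

```python
import numpy as np


def pca_reconstruction_errors(X, n_components):
    """Squared reconstruction error per row after keeping the top principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Project onto the retained directions, then map back to the original space.
    reconstructed = centered @ components.T @ components + mean
    return np.sum((X - reconstructed) ** 2, axis=1)


# Hypothetical data: five informative dimensions plus one nearly constant dimension.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 5)), 0.01 * rng.normal(size=(200, 1))])
X[0, 5] = 3.0  # planted outlier: unusual only along the low-variance direction

errors = pca_reconstruction_errors(X, n_components=5)
print("row with the largest reconstruction error:", int(np.argmax(errors)))
```

Note that this approach only flags points that are unusual along the discarded direction(s); a point that is extreme within the retained subspace would reconstruct well and go unnoticed.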
1 vote
0 answers
48 views

I hope you are in good health. My thesis is on outlier detection in meta-analysis models. I will be using a case study from Canadian Network for Observational Drug Effects Studies (CNODES) to detect ...
HRH's user avatar
  • 11
0 votes
0 answers
281 views

I clustered document embeddings with K-Means. Embeddings have 2048 dimensions. Now, I am trying to optimize clustering. There are two problems. 1- Some clusters may have outlier samples. 2- Sometimes,...
Alper M.'s user avatar
1 vote
1 answer
318 views

I would be grateful for opinion on which of the two options below (or an alternative) is best: Summary of study: In a single results section, different ANOVAs are run on the different metrics – raw ...
Pop's user avatar
  • 13
1 vote
0 answers
127 views

If you have 4 observations, can you have an outlier? Consider any value outside (lower fourth - 1.5 × fourth spread, upper fourth + 1.5 × fourth spread) as an outlier.
Aleph's user avatar
  • 11
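The fence rule quoted in the question above can be tried out directly on four observations. The sketch below assumes that "fourths" means Tukey's hinges (the medians of the lower and upper halves of the sorted data); other quartile definitions can give different fences. The function names and the sample values are hypothetical.

```python
from statistics import median


def fourths(values):
    """Lower and upper fourths (Tukey hinges): medians of the lower and upper halves."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # for odd n, the overall median belongs to both halves
    return median(xs[:half]), median(xs[n - half:])


def fence_outliers(values, k=1.5):
    """Values outside (lower fourth - k * spread, upper fourth + k * spread)."""
    lower, upper = fourths(values)
    spread = upper - lower
    low_fence, high_fence = lower - k * spread, upper + k * spread
    return [x for x in values if x < low_fence or x > high_fence]


sample = [1, 2, 3, 100]  # hypothetical data with one extreme value
print(fourths(sample))
print(fence_outliers(sample))
```

With four sorted values, the upper fourth is the midpoint of the two largest, so the largest value sits only half the top gap above it, while the fourth spread is at least that whole gap. Under this hinge definition the 1.5 multiplier therefore cannot be exceeded with n = 4, however extreme the value is.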
0 votes
1 answer
286 views

Is anyone aware of machine learning models that are able to deal with heteroskedasticity in time series, when trying to detect outliers? There are a lot of anomaly detection tools out there (like k-...
SimonDude's user avatar
1 vote
1 answer
412 views

I have weights data of users collected over a period of time. My goal is to find incorrect weight readings. The definition of incorrect readings is purely based on logical reasons (or in other words ...
monte's user avatar
  • 121
3 votes
1 answer
3k views

This will be my first question on Cross Validated, and besides, no one has ever taught me statistics. I am completely self-taught in this regard. So please forgive me if my question seems trivial. I ...
Marek Fiołka's user avatar
1 vote
1 answer
335 views

Is it bad practice to run a Machine learning algorithm on an experimental dataset, check the MAE, and remove the instances that have a value of MAE above a certain limit? If we run the algorithm ...
RandML000's user avatar
2 votes
2 answers
174 views

If I have binary-classification data and a Euclidean metric, and I know the best number of nearest neighbors, then I draw circles on my training data based on my K-value which tell me which regions ...
2 votes
2 answers
133 views

I am running a logistic regression. The outcome is a clinical variable, and there are two predictors: gene expression (continuous), hormone levels (continuous), and the interaction term between them. ...
Sam's user avatar
  • 679
5 votes
2 answers
342 views

I was reading about "regression towards the mean". Over here, an explanation of this concept is provided: "Consider a simple example: a class of students takes a 100-item true/false ...
2 votes
0 answers
80 views

I have a dataset of counts of occurrences of variables in different classes. For each class, I have an equivalent control created by shuffling the dataset. For instance, this could be words from ...
mm523's user avatar
  • 85
7 votes
1 answer
7k views

I'm in a biomedical research field, and I see a lot of researchers conducting low N studies that use Tukey's fences for outlier removal. For anyone who doesn't know, Tukey's fences works as such: ...
torpedo_cantankerous_softener's user avatar
0 votes
0 answers
981 views

I have a repeated measures experiment where all participants completed several trials for each condition. My dependent variables are response time and accuracy. I am using the Interquartile Range as ...
john connor's user avatar
0 votes
0 answers
52 views

I have a cumulated sum of battery charges that is mostly extremely linear, apart from some faulty data in the beginning. See this image as an example: In order to get the most accurate linear ...
mneumann's user avatar
  • 101
10 votes
2 answers
2k views

I'm new to statistics and currently learning about MLE. Some of the papers I read, such as Robust Graph Embedding with Noisy Link Weights, mentioned that MLEs are susceptible to contamination in data, but didn't ...
port trum's user avatar
  • 103
2 votes
1 answer
336 views

I have around 1000 time series of around 1000 samples each, where consecutive samples are 5 minutes apart. An example of a time series after performing seasonal decomposition is: As we can see, the data is very ...
kspr's user avatar
  • 231
0 votes
1 answer
176 views

I have a data set that consists of many single data points. They are the measurements of network traffic, so they include e.g. '1403021', '1402341', '1399312'... values that are labeled as 'label1' and ...
norivotset's user avatar
1 vote
2 answers
693 views

I'm using the Local Outlier Factor algorithm provided by scikit-learn for outlier detection. For the evaluation, I want to use the AUC measure. ...
Imen F's user avatar
  • 11
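One possible way to combine Local Outlier Factor with an AUC evaluation, as asked in the excerpt above, is sketched below. It assumes ground-truth outlier labels are available; the synthetic data, the label vector `y_true`, and `n_neighbors=20` are illustrative assumptions rather than details from the question.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a Gaussian cloud of inliers plus ten points on a distant ring.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, size=10)
outliers = 8.0 * np.column_stack([np.cos(angles), np.sin(angles)])

X = np.vstack([inliers, outliers])
y_true = np.concatenate([np.zeros(200, dtype=int), np.ones(10, dtype=int)])  # 1 = outlier

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# negative_outlier_factor_ is more negative for more abnormal points,
# so flip the sign to get a score that increases with outlyingness.
scores = -lof.negative_outlier_factor_

auc = roc_auc_score(y_true, scores)
print(f"ROC AUC: {auc:.3f}")
```

The key design choice is to pass the continuous LOF scores to `roc_auc_score` rather than the -1/+1 labels from `fit_predict`, since AUC measures how well the scores rank outliers above inliers across all thresholds.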
4 votes
1 answer
425 views

I am trying to create a pipeline in Python which automatically identifies global and contextual anomalies of a time series. Which one of these approaches do you believe is more correct? Method 1) ...
kspr's user avatar
  • 231
0 votes
0 answers
503 views

I am working on Telecom data for churn modelling. I have 18 categorical and 2 numeric variables (total charges and monthly charges) in my data set. After handling the missing values, I checked the ...
newbie-data-student's user avatar
0 votes
0 answers
80 views

I have a set of highly correlated time series (similar peaks and trend). I want to find relative anomalies, e.g. say there are 20 time series. At a snapshot date, all values increase, but one ...
Soom's user avatar
  • 11
1 vote
2 answers
81 views

I really need a hint here. Suppose I want to be able to detect unusual events and express the likelihood of it occurring. Suppose that I know that two events usually move in a given association such ...
Eugene's user avatar
  • 111
0 votes
2 answers
2k views

I'm beginning to study Generalized Linear Models and I was trying to fit a model to the dataset NMES1988. More specifically, my goal is to fit a Poisson regression to this dataset considering ...
mathguy_666's user avatar
