Recently Active 'winsorizing' Questions

0 votes

1 answer

237 views

Can I apply both winsorization and CUPED to my experiment results?

Our experimentation platform currently has winsorization implemented to reduce "whale effects" on metrics like revenue and volume. We are also interested in applying CUPED to further ...

CommunityBot

1

modified yesterday

0 votes

0 answers

42 views

Winsorizing outliers across multiple analyses: once or multiple times? (SPSS)

I have a 2×2 experimental design with four conditions and eight outcome variables. I’m supposed to winsorize outliers, but I’m confused about how many times this needs to be done because I’m ...

kjetil b halvorsen♦

85.6k

modified Nov 24 at 19:32

42 votes

5 answers

37k views

What are the relative merits of Winsorizing vs. Trimming data?

Winsorizing data means replacing the extreme values of a data set with a certain percentile value from each end, while Trimming or Truncating involves removing those extreme values. I always see ...

Pietro Battiston

281

modified Mar 9 at 21:05
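
The distinction drawn in the excerpt above can be sketched in a few lines. This is a minimal illustration, not taken from the question or its answers; it assumes NumPy, and the function names `winsorize` and `trim`, the 10th/90th percentile cutoffs, and the sample data are all hypothetical choices made for the example.

```python
import numpy as np


def winsorize(x, lower_pct=10.0, upper_pct=90.0):
    """Replace values beyond the chosen percentiles with those percentile values."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    # Clipping keeps every observation but caps the extremes.
    return np.clip(x, lo, hi)


def trim(x, lower_pct=10.0, upper_pct=90.0):
    """Remove values beyond the chosen percentiles."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    # Boolean masking drops the extremes, so the sample gets smaller.
    return x[(x >= lo) & (x <= hi)]


# Hypothetical sample with one large outlier.
data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 250])

winsorized = winsorize(data)
trimmed = trim(data)

print("original mean:  ", data.mean())
print("winsorized mean:", winsorized.mean(), "n =", winsorized.size)
print("trimmed mean:   ", trimmed.mean(), "n =", trimmed.size)
```

Winsorizing keeps the sample size fixed and only caps the tails, whereas trimming shrinks the sample. Which is preferable depends on the analysis, which is what the linked question asks about.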

1 vote

0 answers

157 views

Valid approach: Winsorizing data for main analysis and then doing sensitivity analysis without winsorizing?

I've got a variable with psychological data (N=75) which is distributed fairly symmetrically but has very few cases with very extreme values, more extreme in the left tail. Nevertheless, this data ...

Malea Dondé

11

asked Mar 12, 2024 at 16:57

4 votes

2 answers

3k views

Removing outliers from asymmetric data

I have a data set that includes the number of visits to a website. Here are some descriptive statistics for my data: Median: 4, Mean: 14.1352, SD: 121.8119. Clearly, there are some huge values (...

deq2

1

modified Feb 1, 2024 at 17:09

1 vote

1 answer

163 views

Removing outliers in several groups and for several features

I'm unsure how to remove or winsorize outliers. Let's say I have 2 groups, treated and control, and I measure feature1 and feature2 for both. How should I handle outliers? For each group and each ...

Peter Flom

141k

answered Dec 22, 2023 at 14:11

24 votes

4 answers

70k views

Should the mean be used when data are skewed?

Often introductory applied statistics texts distinguish the mean from the median (often in the context of descriptive statistics and motivating the summarization of central tendency using the mean,...

Glen_b

298k

modified Mar 29, 2023 at 18:02

0 votes

0 answers

51 views

Identify ARMA model with no autocorrelation in residuals [duplicate]

I have a set of log-return data for a commodity and am unable to identify an appropriate ARMA model. I used auto.arima() function, and the optimized model is (4,0,4) with zero mean. However, when I ...

Richard Hardy

71.5k

modified Mar 5, 2023 at 8:01

8 votes

1 answer

5k views

Use and misuse of Winsorization

I am doing research on Winsorization (and trimming), which has been broadly applied in many fields, but I think many researchers didn't do it in a "rigorous" way. Or maybe even worse, they misuse it. ...

Roger V.

5,091

modified Dec 18, 2022 at 11:29

2 votes

1 answer

624 views

Is winsorizing limited to the usage of a certain percentile cutoff?

The Context: My dataset consists of 68 groups, each with 4 data points inside it. As a means of robustness testing, I am looking to see how the type of average/mean I use impacts the analysis that I will ...

Nick Cox

62.1k

modified Jun 21, 2022 at 15:59

1 vote

1 answer

223 views

Should I trim/winsorize raw data or computed metric used in models?

Question: Should I winsorise (or trim, where relevant) my raw data, or the intermediary metric I use in my models? Context: My analysis consists of 3 steps: Collect raw data, Compute ...

frank

11.5k

answered May 19, 2022 at 6:56

15 votes

5 answers

45k views

How to correct outliers once detected for time series data forecasting?

I'm trying to find a way of correcting outliers once I find/detect them in time series data. Some methods, like nnetar in R, give some errors for time series with big/large outliers. I already managed ...

Melanie Shebel

349

modified Jan 22, 2022 at 5:07

0 votes

0 answers

1k views

Winsorizing or taking the logarithm first?

I am testing whether I can describe the StockPrice with EPS (= earnings per share), BookValuePS and ESGscore. Before starting, I winsorized all my variables. Now I want to take the logarithm of e.g. BookValuePS ...

Richard Hardy

71.5k

modified Jan 12, 2022 at 18:12

1 vote

0 answers

482 views

How to optimally choose winsorization thresholds for different metrics in large scale A/B testing platform

I work on our A/B testing platform where we have implemented one-sided winsorization broadly across all continuous variables (capped at 95th percentile). While that's a common cut-off, some of our ...

Kevin

11

asked Dec 10, 2021 at 5:42

34 votes

8 answers

43k views

Replacing outliers with mean

This question was asked by my friend who is not internet savvy. I've no statistics background and I've been searching around the internet for this question. The question is: is it possible to replace ...

Nick Cox

62.1k

modified Sep 16, 2021 at 9:11

2 votes

0 answers

297 views

Winsorizing and ratios [closed]

Say I have a ratio c = a/b. Should I winsorize both a and b and then ...

folderj

115

modified Apr 19, 2021 at 19:39

2 votes

0 answers

2k views

Dealing with outliers: Interquartile range normalization vs. Winsorization

According to this page -- "When a data set has outliers, variability is often summarized by a statistic called the interquartile range, which is the difference between the first and third ...

Nick Cox

62.1k

modified Jan 16, 2021 at 12:29

1 vote

1 answer

815 views

functional differences between using huber loss and winsorizing/trimming

Curious what the functional differences are between using a Huber loss function/ regression and Winsorizing data and then running a classic least squares regression. Will the resulting outputs be ...

CommunityBot

1

modified Oct 26, 2020 at 23:07

4 votes

1 answer

2k views

Removing outliers renders a new distribution that has its own outliers

I'm trying to remove all the outliers from a data set. However, after removing them, data points that weren't outliers before are now outliers due to the new distribution. What is the correct ...

BruceET

59.9k

modified Oct 21, 2020 at 22:39

13 votes

1 answer

4k views

Downweight outliers in mean

I have a bunch of points $x_i$ and would like to calculate a kind of weighted mean that deemphasizes outliers. My first idea was to weight each point by $1/ (x_i - \mu)^2$. However, the problem is ...

CommunityBot

1

modified Sep 11, 2020 at 2:02

2 votes

1 answer

1k views

Winsorizing propensity scores

Is it kosher? Inverse propensity weighting (IPW) has been shown to perform poorly when selection probabilities are small (Kang and Schafer, 2007). Are there any standard solutions to this issue?

Noah

40.2k

modified Jul 15, 2020 at 21:16

2 votes

2 answers

2k views

What is the difference between GAS (Generalized Autoregressive Score) model and a GARCH?

I am trying to analyze some data about Brent Oil volatility. So far I have managed to fit a GARCH(1,1) model and an EGARCH. However, someone has recommended to use a GAS model, Generalized ...

user34884

56

answered Jun 18, 2020 at 13:02

1 vote

0 answers

120 views

Name for the opposite of Winsorizing?

For some regressions we find it useful to focus on extreme values, and so we discard middling dependent values (which we might call "noise") from data in order to find relationships that hold at data ...

feetwet

1,176

asked Mar 12, 2020 at 4:57

1 vote

2 answers

2k views

Is Winsorization performed on test data as well?

I know what Winsorization is and why it is applied. My understanding was that it is applied only on the train data to reduce the effect of outliers. But! Recently I came across a kernel where Min, ...

Mann

131

modified Oct 29, 2019 at 9:46

20 votes

5 answers

16k views

Which robust correlation methods are actually used?

I plan to do a simulation study where I compare the performance of several robust correlation techniques with different distributions (skewed, with outliers, etc.). With robust, I mean the ideal case ...

O.rka

1,502

answered Jul 19, 2019 at 4:37

0 votes

0 answers

701 views

How to choose cut off for winsorization/ flooring- capping? What is the impact of variable distribution on the decision

To perform logistic regression I wish to winsorize outliers in independent/ explanatory variables by flooring and capping independent variables. Can you suggest how I should choose cut-off for ...

user9291966

1

modified Apr 8, 2019 at 22:41

0 votes

0 answers

901 views

Winsorizing data in small sample

I have a relatively small sample of panel data (quarterly data for 68 firms over 7 years). My dependent variable is positively skewed. In order to limit the influence of observations with large values,...

Nick Cox

62.1k

modified Apr 7, 2019 at 14:51

4 votes

1 answer

2k views

Linear regression with violated assumptions

I am trying to find out the determinants of cognitive function. The outcome variable is the mini–mental state examination which is a 30 point questionnaire response that has score values from 0 to 30(...

Hessian

201

modified Feb 20, 2019 at 15:39

6 votes

3 answers

8k views

Extreme values in the data

I have a very general statistical question. If a variable has some extreme values, then for the purpose of statistical inferences for example OLS regression, is it better to detect these extreme ...

Matthew Gunn

23.6k

modified Aug 28, 2018 at 15:26

1 vote

1 answer

1k views

Greater than 30% outliers in small dataset - what to do? Standard test? Test with outliers removed? Robust statistics?

I have a small-sample dataset representing observations from a longitudinal study. My principal interest is in 'change scores' across three parameters (A, B, C). This requires simple paired t-tests. ...

pomodoro

833

modified Aug 27, 2018 at 8:51

0 votes

1 answer

5k views

Winsorizing data

I am currently working on my bachelor thesis in finance and I faced some problems regarding my dataset. I wanted to analyze the effect of leverage on the performance of companies and as many ...

Sal Mangiafico

11.8k

modified Jun 10, 2018 at 17:35

4 votes

0 answers

2k views

Treating outliers for time series forecasting in Python

What is the best way to treat outliers in a time series forecasting model? In particular, for product demand modeling? Based on what I've read so far, the following methods can be applied: ...

hxlaclhemy

168

modified May 5, 2018 at 12:43

0 votes

0 answers

722 views

Winsorization to remove spikes in time series

In product demand forecasting, is it valid to use winsorization to remove large outliers (spikes) in the data? I understand that the spikes may be due to holiday effects (e.g. people will buy more ...

meraxes

749

modified May 5, 2018 at 9:56

3 votes

1 answer

2k views

Treatment of outliers in financial data

I have a data set with financial panel data from 150 companies. I want to analyse the data using linear repeated measures ANOVA and OLS Regression (so far). For this, I want to use the absolute values ...

Dave Harris

7,920

answered Mar 22, 2018 at 5:47

2 votes

0 answers

424 views

Does pre winsorising of a variable help for a logistic regression?

I am wondering if winsorising makes a difference in a logistic regression. In a situation where I am looking at the individual contribution, looking at their individual discriminatory power (...

R. Prost

210

modified Jan 12, 2018 at 12:19

0 votes

1 answer

71 views

Winsorizing a forecasting dataset

I have performed a logistic regression to estimate the default probability of a dataset of firms based on some basic balance-sheet ratios. I have winsorized all the ratios at the 1st and 99th ...

Björn

38k

answered Nov 22, 2017 at 16:45

2 votes

2 answers

713 views

Removing outliers and calculating a "lowest" attainable price from a pre-determined/fixed time series of prices

Just a foreword, I'm not a mathematician or otherwise statistically skilled. I know my way around calculating standard deviations, but it's all self taught. I'm a programmer with limited stats ...

CommunityBot

1

modified Jun 12, 2017 at 17:08

4 votes

3 answers

453 views

In a "bursty" dataset, how do you filter for the few important values that make up the bulk of the information?

Not sure if there is an existing stats concept for this, but I have a dataset that consists of mostly small data points with a few large ones, e.g. 1 2 1 3 1 2 87 3 2 1 1 1 1 3 1 2 1 1 1 99 How can ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 17:03

2 votes

1 answer

793 views

Ensemble time series prediction from two separate models

I have two different forecasts that are produced by ARMA models using two different data samples. The difference between the two data sets is their size: one used data from 2013-2014 and another used ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 17:02

9 votes

2 answers

3k views

Alternatives to using Coefficient of Variation to summarize a set of parameter distributions?

Background I have a model with 17 parameters, and I currently use the coefficient of variation ($\text{CV}=\sigma/\mu$) to summarize the prior and posterior distributions of each parameter. All of ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 17:00

3 votes

1 answer

724 views

Combining similarity scores

I have a list of m x n similarity score matrix, something like ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 16:59

3 votes

2 answers

834 views

How to best estimate the time remaining for a variable-length questionnaire?

The greatest gain of the statistics classes in both school and university seems to be that I now have an inkling of which QA site to use for this question. :) I'm a programmer and I'm making a ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 16:58

2 votes

1 answer

1k views

Scale independent forecast error metric that works with changing signs

I am trying to analyze a quite large (~25,000 rows) dataset of cash flow forecasts. Receipts and expenses are aggregated, thus I may end up with the following data: ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 16:57

5 votes

3 answers

2k views

How to describe the differences in skewed data with same median but statistically different distribution?

I am comparing length of stay after laparoscopic and open appendectomy in over 160000 patients. LOS is typically a skewed variable so I use the median and interquartile range and ranksum test to ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 14:08

1 vote

1 answer

141 views

How to evaluate a curve considering outliers?

I have data on runners who run marathons; for each runner I have their final times on a number of races. I would like to predict how fast they are running considering outliers i.e. he's running faster ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 14:05

1 vote

2 answers

910 views

How do I calculate the ranking of some galleries based on the rankings of the artists represented by them?

The mean is not good in this case, because there are galleries that have an artist with a high rank and several other artists with way lower ranks. I'm thinking about doing a weighted mean, but I don'...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 14:03

3 votes

2 answers

3k views

Measure of closeness

Given a list of numbers, is it possible to find out (or in other words, is there a statistical measure that tells) the closeness of the numbers (do note that I am not talking about correlation - ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 14:02

3 votes

3 answers

3k views

Robust standardization of data

I have some data where I want to determine whether the shape of the probability distribution has changed compared to 10 years ago. One example is that I have for various automobiles multiple measures ...

kjetil b halvorsen♦

85.6k

modified Jun 12, 2017 at 14:00

2 votes

2 answers

349 views

Multilevel modeling for limited dependent variable

I am doing research using multilevel modeling with a limited dependent variable, number of days, which is bounded below (0) and above (30). Is it necessary to use a multilevel logit model? Or is it ...

Michael R. Chernick

43.8k

modified Apr 23, 2017 at 16:32

6 votes

2 answers

756 views

How to average quantized and truncated data?

So I have data that has been quantized by an analogue to digital converter. (Continuous data has been turned into discrete data and the values range from 0 to the saturation value, which is 127 in ...

kjetil b halvorsen♦

85.6k

modified Apr 23, 2017 at 16:16
