
Questions tagged [winsorizing]

Winsorizing is a kind of data transformation used in robust/resistant statistics. Extreme values in the sample are replaced by chosen data quantile(s). See https://en.wikipedia.org/wiki/Winsorizing
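
As a rough illustration of the definition above, here is a minimal sketch assuming NumPy is available. The function name `winsorize`, the 10%/90% cut-offs, and the sample values are all made up for the example; they are not taken from any question on this page, and the choice of quantiles is a judgment call in practice.

```python
import numpy as np

def winsorize(values, lower_q=0.10, upper_q=0.90):
    """Replace values beyond the chosen sample quantiles with those quantiles.

    Hypothetical helper for illustration only; the cut-offs are assumptions.
    """
    x = np.asarray(values, dtype=float)
    # Sample quantiles (NumPy's default linear interpolation).
    lo, hi = np.quantile(x, [lower_q, upper_q])
    # Values below lo become lo, values above hi become hi; nothing is dropped.
    return np.clip(x, lo, hi)

# Illustrative data with one extreme value.
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)
capped = winsorize(sample)

print("raw mean:       ", sample.mean())
print("winsorized mean:", capped.mean())
```

Unlike trimming, the sample size is unchanged: the extreme observations are pulled in to the cut-off values rather than removed.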

0 votes
1 answer
237 views

Our experimentation platform currently has winsorization implemented to reduce "whale effects" on metrics like revenue and volume. We are also interested in applying CUPED to further ...
0 votes
0 answers
42 views

I have a 2×2 experimental design with four conditions and eight outcome variables. I’m supposed to winsorize outliers, but I’m confused about how many times this needs to be done because I’m ...
42 votes
5 answers
37k views

Winsorizing data means to replace the extreme values of a data set with a certain percentile value from each end, while Trimming or Truncating involves removing those extreme values. I always see ...
1 vote
0 answers
157 views

I've got a variable with psychological data (N=75) which is distributed fairly symmetrically, but has very few cases with very extreme values, more extreme in the left tail. Nevertheless, this data ...
4 votes
2 answers
3k views

I have a data set that includes the number of visits to a website. Here are some descriptive statistics for my data: Median: 4, Mean: 14.1352, SD: 121.8119. Clearly, there are some huge values (...
1 vote
1 answer
163 views

I'm unsure on how to remove or winsorize outliers. Let's say I have 2 groups, treated and control. And I measure feature1 and feature2 for both. How should I handle outliers? For each group and each ...
24 votes
4 answers
70k views

Often introductory applied statistics texts distinguish the mean from the median (often in the context of descriptive statistics and motivating the summarization of central tendency using the mean,...
0 votes
0 answers
51 views

I have a set of log-return data for a commodity and am unable to identify an appropriate ARMA model. I used auto.arima() function, and the optimized model is (4,0,4) with zero mean. However, when I ...
8 votes
1 answer
5k views

I am doing research on Winsorization (and trimming), which has been broadly applied in many fields, but I think many researchers didn't do it in a "rigorous" way. Or maybe even worse, they misuse it. ...
2 votes
1 answer
624 views

The Context: My dataset consists of 68 groups, each with 4 data points inside it. As a robustness test, I am looking to see how the type of average/mean I use impacts the analysis that I will ...
1 vote
1 answer
223 views

Question: Should I winsorise (or trim, where relevant) my raw data, or rather the intermediary metric I use in my models? Context: My analysis consists of 3 steps: Collect raw data, Compute ...
15 votes
5 answers
45k views

I'm trying to find a way of correcting outliers once I find/detect them in time series data. Some methods, like nnetar in R, give some errors for time series with big/large outliers. I already managed ...
0 votes
0 answers
1k views

I am testing whether I can describe the StockPrice with EPS (= earnings per share), BookValuePS and ESGscore. Before I started, I winsorized all my variables. Now I want to take the logarithm of e.g. BookValuePS ...
1 vote
0 answers
482 views

I work on our A/B testing platform where we have implemented one-sided winsorization broadly across all continuous variables (capped at 95th percentile). While that's a common cut-off, some of our ...
34 votes
8 answers
43k views

This question was asked by my friend who is not internet savvy. I have no statistics background and I've been searching around the internet for this question. The question is: is it possible to replace ...
2 votes
0 answers
297 views

Say I have a ratio c = a/b. Should I winsorize both a and b and then ...
2 votes
0 answers
2k views

According to this page -- "When a data set has outliers, variability is often summarized by a statistic called the interquartile range, which is the difference between the first and third ...
1 vote
1 answer
815 views

Curious what the functional differences are between using a Huber loss function/regression and Winsorizing data and then running a classic least squares regression. Will the resulting outputs be ...
4 votes
1 answer
2k views

I'm trying to remove all the outliers from a data set. However, after removing them, data points that weren't outliers before are now outliers due to the new distribution. What is the correct ...
13 votes
1 answer
4k views

I have a bunch of points $x_i$ and would like to calculate a kind of weighted mean that deemphasizes outliers. My first idea was to weight each point by $1/ (x_i - \mu)^2$. However, the problem is ...
2 votes
1 answer
1k views

Is it kosher? Inverse propensity weighting (IPW) has been shown to perform poorly when selection probabilities are small (Kang and Schafer, 2007). Are there any standard solutions to this issue?
2 votes
2 answers
2k views

I am trying to analyze some data about Brent Oil volatility. So far I have managed to fit a GARCH(1,1) model and an EGARCH. However, someone has recommended to use a GAS model, Generalized ...
1 vote
0 answers
120 views

For some regressions we find it useful to focus on extreme values, and so we discard middling dependent values (which we might call "noise") from data in order to find relationships that hold at data ...
1 vote
2 answers
2k views

I know what Winsorization is and why it is applied. My understanding was that it is applied only on the train data to reduce the effect of outliers. But! Recently I came across a kernel where Min, ...
20 votes
5 answers
16k views

I plan to do a simulation study where I compare the performance of several robust correlation techniques with different distributions (skewed, with outliers, etc.). By robust, I mean the ideal case ...
0 votes
0 answers
701 views

To perform logistic regression, I wish to winsorize outliers in the independent/explanatory variables by flooring and capping them. Can you suggest how I should choose the cut-off for ...
0 votes
0 answers
901 views

I have a relatively small sample of panel data (quarterly data for 68 firms over 7 years). My dependent variable is positively skewed. In order to limit the influence of observations with large values,...
4 votes
1 answer
2k views

I am trying to find out the determinants of cognitive function. The outcome variable is the mini–mental state examination which is a 30 point questionnaire response that has score values from 0 to 30(...
6 votes
3 answers
8k views

I have a very general statistical question. If a variable has some extreme values, then for the purpose of statistical inferences for example OLS regression, is it better to detect these extreme ...
1 vote
1 answer
1k views

I have a small-sample dataset representing observations from a longitudinal study. My principal interest is in 'change scores' across three parameters (A, B, C). This requires simple paired t-tests. ...
0 votes
1 answer
5k views

I am currently working on my bachelor thesis in finance and I faced some problems regarding my dataset. I wanted to analyze the effect of leverage on the performance of companies and as many ...
4 votes
0 answers
2k views

What is the best way to treat outliers in a time series forecasting model? In particular, for product demand modeling? Based on what I've read so far, the following methods can be applied: ...
0 votes
0 answers
722 views

In product demand forecasting, is it valid to use winsorization to remove large outliers (spikes) in the data? I understand that the spikes may be due to holiday effects (e.g. people will buy more ...
3 votes
1 answer
2k views

I have a data set with financial panel data from 150 companies. I want to analyse the data using linear repeated measures ANOVA and OLS Regression (so far). For this, I want to use the absolute values ...
2 votes
0 answers
424 views

I am wondering if winsorising makes a difference in a logistic regression. In a situation where I am looking at the individual contribution, looking at their individual discriminatory power (...
0 votes
1 answer
71 views

I have performed a logistic regression to estimate the default probability of a dataset of firms based on some basic balance-sheet ratios. I have winsorized all the ratios at the 1st and 99th ...
2 votes
2 answers
713 views

Just a foreword, I'm not a mathematician or otherwise statistically skilled. I know my way around calculating standard deviations, but it's all self taught. I'm a programmer with limited stats ...
4 votes
3 answers
453 views

Not sure if there is an existing stats concept for this but I have a dataset that consists of mostly small data points with a few large ones. e.g. 1 2 1 3 1 2 87 3 2 1 1 1 1 3 1 2 1 1 1 99 How can ...
2 votes
1 answer
793 views

I have two different forecasts that are produced by ARMA models using two different data samples. The difference between the two data sets is their size: one used data from 2013-2014 and another used ...
9 votes
2 answers
3k views

Background: I have a model with 17 parameters, and I currently use the coefficient of variation ($\text{CV}=\sigma/\mu$) to summarize the prior and posterior distributions of each parameter. All of ...
3 votes
1 answer
724 views

I have a list of m x n similarity score matrix, something like ...
3 votes
2 answers
834 views

The greatest gain of the statistics classes in both school and university seems to be that I now have an inkling of which QA site to use for this question. :) I'm a programmer and I'm making a ...
2 votes
1 answer
1k views

I am trying to analyze a quite large (~25,000 rows) dataset of cash flow forecasts. Receipts and expenses are aggregated, thus I may end up with the following data: ...
5 votes
3 answers
2k views

I am comparing length of stay after laparoscopic and open appendectomy in over 160000 patients. LOS is typically a skewed variable so I use the median and interquartile range and ranksum test to ...
1 vote
1 answer
141 views

I have data on runners who run marathons; for each runner I have their final times on a number of races. I would like to predict how fast they are running considering outliers i.e. he's running faster ...
1 vote
2 answers
910 views

The mean is not good in this case, because there are galleries that have an artist with a high rank and several other artists with way lower ranks. I'm thinking about doing a weighted mean, but I don'...
3 votes
2 answers
3k views

Given a list of numbers, is it possible to find out (or in other words, is there a statistical measure that tells) the closeness of the numbers (do note that I am not talking about correlation - ...
3 votes
3 answers
3k views

I have some data where I want to determine whether the shape of the probability distribution has changed compared to 10 years ago. One example is that I have for various automobiles multiple measures ...
2 votes
2 answers
349 views

I am doing research using multilevel modeling with a limited dependent variable, number of days, which is bounded below (0) and above (30). Is it necessary to use a multilevel logit model? Or is it ...
6 votes
2 answers
756 views

So I have data that has been quantized by an analogue-to-digital converter. (Continuous data has been turned into discrete data and the values range from 0 to the saturation value, which is 127 in ...