$\begingroup$

I am wondering if winsorising makes a difference in a logistic regression.

Suppose I am assessing each candidate variable's individual discriminatory power by regressing it alone against the dependent variable. Inspecting a variable, one might see that there are some outliers and/or heavy tails on both sides of the distribution.

For example, this is the distribution of one variable:

[histogram of the variable, showing extreme values in both tails]

We see that there are some extreme values. Now, running the regression on that variable only, I can see whether it has some potential explanatory power.

Running the single-variable regression glm(Outcome ~ Variable, family = "binomial") I get an AUC of 71.77.
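Since AUC depends only on how the fitted scores rank the observations, it can be computed directly from the rank-based (Mann-Whitney) formula. A minimal Python sketch on hypothetical data (the post itself uses R's glm; this just illustrates the statistic):

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    case scores higher than a randomly chosen negative case,
    counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: a predictor that separates the classes perfectly
print(auc([0.2, 1.1, 2.5, 3.0], [0, 0, 1, 1]))  # -> 1.0
```

Note that the fitted probabilities from a one-variable logistic regression are a monotone function of that variable, so the model's AUC is determined by the variable's ranks alone (up to the sign of the coefficient); winsorising can only change it by introducing ties in the clipped tails, which is consistent with the tiny differences reported below.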

So it seems this variable might be useful for my general model.

Now maybe I can improve the predictive power of that single variable. I do not want to throw away the extreme values, as they also potentially contain information, so I will choose to squeeze them instead (i.e. winsorising). For this example I'll do it on both sides, at the 5th and 95th percentiles.
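As a concrete illustration of clipping at the 5th and 95th percentiles, a minimal Python sketch (nearest-rank quantiles; R's quantile() interpolates slightly differently, so exact cut points may differ):

```python
def winsorize(xs, p=0.05):
    """Clamp values below the p-th and above the (1-p)-th empirical
    quantile (nearest-rank) to those quantiles."""
    s = sorted(xs)
    n = len(s)
    lo = s[round(p * (n - 1))]
    hi = s[round((1 - p) * (n - 1))]
    return [min(max(x, lo), hi) for x in xs]

# Hypothetical example: values 0..100 clipped at the 5th/95th percentiles
w = winsorize(list(range(101)))
print(min(w), max(w))  # -> 5 95
```

Every clipped observation in a tail receives the same value, which is why the winsorised histogram shows point masses at both ends.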

[histogram of the winsorised variable, with point masses at both clipping limits]

So we see the variable has been squeezed into a much smaller range, with the clipped observations piled up at the left and right limits.

Running the one-variable regression again, the AUC is 71.85.

Thus, technically, this is "better". On this basis I would consider winsorising my variable before adding it to the general model (with more variables), so as to maximise its effectiveness...

But does it really help, or is it just an artifact? One issue is that you do not apply the same transformation to the other variables when you put more variables in the model, so the observations that have been squeezed on this variable won't be squeezed the same way on the others.

So the question comes down to: would you recommend winsorising? Is it a waste of time to try to maximise the discriminatory power of an individual variable using winsorising? Is it conceptually wrong because you don't squeeze the same observations on the other variables?

$\endgroup$
  • $\begingroup$ Many questions are given here but few details on precisely what you're doing. But as far as I can gather you are winsorizing one (or more?) predictor variables. The only easy advice is to compare results with and without winsorizing and see how much difference it makes. If little difference, winsorizing is redundant; if a big difference, it is harder to say if it's a good idea. $\endgroup$ Commented Jan 11, 2018 at 10:46
  • $\begingroup$ Thank you for your time. I have added some explanations; I hope it is clearer. I am wondering whether, even if it seems to increase the power of the single variable (slightly), it might just be conceptually wrong to do that. Even if the increase in performance were bigger. $\endgroup$ Commented Jan 11, 2018 at 13:30
  • $\begingroup$ We should mention that assessing one-predictor regressions is not a good way to select predictors for a multi-predictor model. $\endgroup$ Commented Jan 11, 2018 at 15:01
  • $\begingroup$ Your predictor is positive and negative in different tails. Fine, but that rules out logarithms. Not fatal, as you can consider the cube root, neglog $\text{sign}(x) \log(1 + |x|)$, or asinh as possible transforms. Much less blunt and arbitrary compared with Winsorizing. $\endgroup$ Commented Jan 11, 2018 at 17:35
  • $\begingroup$ I guess those three transforms make more sense, as they keep the relative separation between the extreme points while reducing the outward spread, whereas winsorising just "glues" them together... so indeed a rather blunt transform. Doing so I obtain the same "extra" power. $\endgroup$ Commented Jan 12, 2018 at 7:17
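The sign-preserving transforms suggested in the comments above can be sketched as follows (a hypothetical Python illustration; all three are odd functions, so they map negatives to negatives while compressing both tails smoothly instead of clipping them):

```python
import math

def cube_root(x):
    # odd function: preserves sign, compresses large magnitudes
    return math.copysign(abs(x) ** (1 / 3), x)

def neglog(x):
    # sign(x) * log(1 + |x|): roughly linear near 0, logarithmic in the tails
    return math.copysign(math.log1p(abs(x)), x)

# asinh(x) = log(x + sqrt(x^2 + 1)) is also odd and is available
# directly as math.asinh

print(round(cube_root(-8), 6))   # -> -2.0
print(neglog(0), math.asinh(0))  # -> 0.0 0.0
```

Unlike winsorising, these transforms keep the extreme observations distinct (no ties are introduced), which matches the observation in the last comment.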

