I am wondering if winsorising makes a difference in a logistic regression.
The situation is that I am looking at the individual contribution of each variable, i.e. its individual discriminatory power (regressing the dependent variable on that variable alone). One would inspect the variable and see that there are some outliers and/or tails on both sides of the distribution.
For example, this is the distribution of one variable:
We see that there are some extreme values. Now, running the regression on that variable only, I can see whether it potentially has some explanatory power.
Running the single-variable regression glm(Outcome ~ Variable, family="binomial") I get an AUC of 71.77.
So it seems this variable might be useful for my general model.
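To make the step above concrete, here is a minimal sketch in R of fitting the single-variable model and computing its AUC. The data are simulated purely for illustration (they are not the data from my problem), and the `auc` helper is a hypothetical name: it uses the rank (Mann-Whitney) formula in base R rather than any particular package.

```r
# Simulated data for illustration only (assumption, not the real data).
set.seed(1)
n <- 1000
Variable <- rt(n, df = 3)                          # heavy-tailed predictor
Outcome  <- rbinom(n, 1, plogis(0.8 * Variable))   # binary response

# Single-variable logistic regression, as in the question.
fit   <- glm(Outcome ~ Variable, family = "binomial")
score <- predict(fit, type = "response")           # fitted probabilities

# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic.
# rank() assigns average ranks to ties by default.
auc <- function(score, outcome) {
  r  <- rank(score)
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(score, Outcome)   # on a 0-1 scale; multiply by 100 for a percentage
```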
Now maybe I can improve the predictive power of that single variable. I do not want to throw away the extreme values, as they potentially contain information too. Instead I will squeeze them (i.e. winsorise). For the example I'll do it on both sides at 95%.
So we see the variable has been squeezed to a much smaller range and we see aggregated data on the left and the right.
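For reference, the squeezing can be sketched in base R as below. This assumes that "both sides at 95%" means clamping at the 5th and 95th percentiles; the function name `winsorise` and its `lower`/`upper` arguments are hypothetical, and the data are again simulated.

```r
# Hypothetical helper: clamp values outside the [lower, upper] quantiles
# to those quantiles. Cut-offs of 5% and 95% are an assumption.
winsorise <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

set.seed(1)
x   <- rt(1000, df = 3)   # simulated heavy-tailed variable, illustration only
x_w <- winsorise(x)

range(x)                    # original range
range(x_w)                  # squeezed range
length(x_w) == length(x)    # no observations are dropped, only clamped
```

Note that the cut-offs are estimated from whatever data is passed in, so scoring new observations would presumably need the same stored cut-offs rather than freshly computed ones.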
Rerunning the single-variable regression on the winsorised variable, the AUC is 71.85.
Thus technically this is "better". From this I would consider winsorising my variable before adding it to the general model (with more variables) so as to maximise its effectiveness...
But does it really help, or is it just an artifact? There is the issue that you do not apply the same transformation to the other variables when you put more variables in the model. So the observations that have been squeezed on this variable won't be squeezed the same way on the other variables.
So the question comes down to: would you recommend winsorising? Is it a waste of time trying to maximise the discriminatory power of an individual variable by winsorising? Is it conceptually wrong because you don't squeeze the observations the same way on the other variables?

