Predictive modeling on biased features

Asked 9 months ago

Modified 9 months ago

Viewed 51 times

Some features I want to use for modeling have distributions like below:

High values of these features occur frequently in my data, and I can easily identify the subset of data points that causes this polarization. There is no special phenomenon behind it: these are simply the samples associated with big cities. The question is how I should tackle this. Should I build a separate model for big cities, or would you recommend a transformation that reduces the polarity? I know there is no general recipe in predictive modeling, but perhaps you have experience and good practices for datasets like this. How would you suggest incorporating these features into a model?

asked Feb 11 at 15:19

Jakub Małecki

What is the "problem" you believe needs tackling? If big cities have high values of predictors, then that may simply mean that your target also has a different distribution in big cities, and that your predictor is useful in predicting for a specific city.

– Stephan Kolassa, commented 2025-02-11 15:33:56 +00:00

My question is how distributions like these should be handled in the modeling process in general. I see there is a distinguishing subset of samples. In my case it's related to big cities. But in general, when you see a distribution like that, what would be your next step? Would you transform the features somehow to make the distribution less polarized? Or would you extract the subset and create a separate model only for these samples?

– Jakub Małecki, commented 2025-02-11 16:13:25 +00:00

I would first try to understand whether there is anything to be concerned about. Are the six features actually correlated, i.e., is it the same instances that score high on all six features? If so, separate your data into these instances and "the rest" and run separate diagnostics. If there is an issue, address the issue. It may be helpful to add a "city size" predictor or similar. Or not, because these features may already be carrying all the information. But I would certainly not start transforming anything just because of these histograms.

– Stephan Kolassa, commented 2025-02-11 16:16:38 +00:00
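
The checks suggested in the comment above could be sketched roughly as follows. This is only an illustration on synthetic data, using the Python standard library: the row count, the 15% share of "big city" rows, the feature means, the midpoint threshold, and every name in the code are assumptions made for the example, not details from the question.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic stand-in for the real data (all numbers here are assumptions):
# 6 features, roughly 15% "big city" rows whose features are all shifted up.
N_ROWS, N_FEATURES = 1000, 6
is_big_city = [random.random() < 0.15 for _ in range(N_ROWS)]
rows = [
    [random.gauss(20.0 if big else 2.0, 1.0) for _ in range(N_FEATURES)]
    for big in is_big_city
]

# Crude per-feature threshold: the midpoint between the minimum and maximum,
# which separates the two modes when they are far apart.
columns = list(zip(*rows))
thresholds = [(min(col) + max(col)) / 2 for col in columns]
flags = [[value > thresholds[j] for value in columns[j]] for j in range(N_FEATURES)]


def overlap(a, b):
    """Jaccard overlap between two boolean masks (1.0 means identical sets)."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return both / either if either else 1.0


# Step 1: is it the same instances that score high on every feature?
min_overlap = min(
    overlap(flags[i], flags[j])
    for i in range(N_FEATURES)
    for j in range(i + 1, N_FEATURES)
)
print("smallest pairwise overlap of 'high' rows:", min_overlap)

# Step 2: split into the high-on-all instances and "the rest", then look at
# each group separately (here only a per-feature mean as a simple diagnostic).
high_on_all = [all(f[r] for f in flags) for r in range(N_ROWS)]
group_high = [row for row, h in zip(rows, high_on_all) if h]
group_rest = [row for row, h in zip(rows, high_on_all) if not h]
for name, group in (("high", group_high), ("rest", group_rest)):
    means = [round(mean(col), 2) for col in zip(*group)]
    print(name, len(group), means)

# Step 3: a candidate "city size"-style indicator appended as a 7th column.
# Whether it helps is an empirical question for out-of-sample validation.
rows_with_indicator = [
    row + [1.0 if h else 0.0] for row, h in zip(rows, high_on_all)
]
```

If the overlap is close to 1, the six histograms are most likely showing one underlying split rather than six separate issues, which is the situation the comment describes. Whether the indicator or a separate model improves predictions would still need to be checked on held-out data.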
Many similar questions: stats.stackexchange.com/questions/264119/…, stats.stackexchange.com/questions/621402/…, stats.stackexchange.com/questions/478734/…, stats.stackexchange.com/questions/580216/…, stats.stackexchange.com/questions/660818/…

– kjetil b halvorsen ♦, commented 2025-05-13 00:48:58 +00:00
