Assigning average outcome values to categorical variables

Question

Let's say we have a regression problem in which we would like to predict some score $y_i$ for person $i$.
As predictor data we have two variables: country $c_i$ and gender $g_i$. Suppose gender has three classes (male, female, miscellaneous), but we only have historical score data for males and females, and only for some of the existing countries. How do we make predictions for people whose gender is miscellaneous?

I have a (rather philosophical) idea that I have personally never seen applied before:
What if we replace the values male/female/miscellaneous with the average historical score for each of those categories and, where we have no data for miscellaneous, replace it with the average score over all instances? (We could do the same for the country variable.)

In other words, what is the effect of replacing the categorical variable itself with a numerical variable whose values are the historical average scores per category of the discarded categorical variable? It seems to me that this is an efficient trick for converting the categorical variable into a numerical one and for making better future predictions (since, intuitively speaking, future scores will depend more on the historical scores for that type of person/country than on the gender and country labels themselves).
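To make the proposal concrete, here is a minimal sketch of the replacement, assuming made-up scores; the function names `fit_mean_encoding` and `apply_mean_encoding` are hypothetical and not taken from any library. This kind of replacement is commonly called mean (or target) encoding.

```python
from collections import defaultdict

def fit_mean_encoding(categories, scores):
    """Return (per-category mean score, overall mean score) from historical data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cat, score in zip(categories, scores):
        totals[cat] += score
        counts[cat] += 1
    category_means = {cat: totals[cat] / counts[cat] for cat in totals}
    overall_mean = sum(scores) / len(scores)
    return category_means, overall_mean

def apply_mean_encoding(categories, category_means, overall_mean):
    """Replace each category by its historical mean; unseen categories get the overall mean."""
    return [category_means.get(cat, overall_mean) for cat in categories]

# Hypothetical historical data: there are no "miscellaneous" rows.
train_gender = ["male", "female", "male", "female"]
train_score = [10.0, 20.0, 14.0, 16.0]

means, overall = fit_mean_encoding(train_gender, train_score)
encoded = apply_mean_encoding(["male", "female", "miscellaneous"], means, overall)
print(encoded)  # [12.0, 18.0, 15.0]
```

The encoded column can then be used as an ordinary numerical predictor; the same two functions would apply unchanged to the country variable.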

Gregg H · Accepted Answer · 2018-03-23 19:46:58Z

I generally do not like to simply say something cannot be done...but in this case, it really can’t be done.

The issue is with the categorical predictors in a multiple regression (MR) setting. You may have a single categorical variable coded with three options, but when it is analyzed in the MR context, this “single” variable becomes two dummy variables. And if data are completely missing for one of those categories, then the model cannot be run.

If you think about having only two categories for a moment, then you only need one variable (say female, which is 1 if female and 0 if male). In the regression model, you would have an intercept and a slope for this variable, female, and maybe other variables. The coefficient for female is simply the difference between the males and the females (adjusting for the other variables in the model). Another way to think about it is that the intercept is the average for the males, and the intercept plus the female slope is the average for the females. Now, if you try to include two variables in the model (male and female), the regression will “crash” because your independent variables are linearly dependent. Specifically, in this context, if you know the value of female (0 or 1), you automatically know the value of male (1-female).
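The two-category case can be checked numerically. The following is a small sketch with made-up scores (the data and the helper `simple_ols` are illustrative assumptions, and there are no other variables in the model), showing that the intercept equals the male average, that intercept plus slope equals the female average, and that a `male` column is fully determined by `female`.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b*x with one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: female = 1 for females, 0 for males.
female = [0, 0, 1, 1]
score = [10.0, 14.0, 20.0, 16.0]

intercept, slope = simple_ols(female, score)
print(intercept)          # 12.0, the average score of the males
print(intercept + slope)  # 18.0, the average score of the females

# Adding a male dummy brings no new information: male = 1 - female,
# so male + female reproduces the intercept column of ones in every row.
male = [1 - f for f in female]
print(all(m + f == 1 for m, f in zip(male, female)))  # True
```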

Returning to the scenario you are suggesting: now we have three categories. Let’s keep male and female as dummy variables, so that the intercept represents the miscellaneous/other category you proposed. If you have even one individual with this classification, then they will have male = female = 0 as their coded data, and the model will run. However, if you have none in this category, then you really only have two categories, and the model won't run.
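One way to see this is through the rank of the design matrix (intercept, male, female). The sketch below uses hypothetical rows and a small hand-written `matrix_rank` helper (an assumption for illustration, not a library function): without any miscellaneous individual the three columns are linearly dependent, while a single such individual restores full column rank.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design_row(gender):
    """Columns: intercept, male dummy, female dummy (miscellaneous is the reference)."""
    return [1, int(gender == "male"), int(gender == "female")]

no_misc = [design_row(g) for g in ["male", "female", "male", "female"]]
with_misc = [design_row(g) for g in ["male", "female", "male", "miscellaneous"]]

print(matrix_rank(no_misc))    # 2: three columns, but only two are independent
print(matrix_rank(with_misc))  # 3: full column rank, coefficients are identifiable
```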

Lastly, without having any information about this subset of the population, you really can't make any guess as to what their data might look like. Thus, any form of data imputation would be just short of impossible to justify.

I recognize this may not be the response you were looking for, but I do hope it is useful.

The issue that you are addressing is in fact the problem I tried to solve with this method ;)! The method I was proposing is to let the script in which I'm running the regression analysis replace "male"/"female" in the gender variable with, e.g., the historical average score for males/females, and assign the overall average score to "miscellaneous". We have now turned the categorical variable into a numerical one, but what is the effect of this systematic replacement (maybe nothing, but as a benefit we can now use only "recent scores")?
– Bas van der Reijden, Commented Mar 25, 2018 at 14:26
