I'm building a logistic regression model and am trying to select the right variables. One thing that puzzles me is that the AUC score decreases as I add more variables, even within the training set. I would expect this to happen in the test set due to over-fitting, but I don't understand why or how it would occur within the training set.
I am using the sklearn library in Python. Below I use three predictor variables and get a training-set AUC of ~0.83:
predictors = ["lsat", "gpa", "urm"]
X = train[predictors] #select predictor variables
y = train[["was_accepted"]] #select target variable
logreg = linear_model.LogisticRegression() #create logistic regression model
logreg.fit(X, y) #fit model to the data
predictions = logreg.predict_proba(X)[:,1] #get predictions
auc_score = roc_auc_score(y, predictions)
print(auc_score)
#output = 0.8341757855809823
However, when I run the same code with four additional predictor variables, I get a training AUC of only ~0.72.
predictors = ["lsat", "gpa", "urm", "is_military", "softs", "is_international", "years_out"]
X = train[predictors] #select predictor variables
y = train[["was_accepted"]] #select target variable
logreg = linear_model.LogisticRegression() #create logistic regression model
logreg.fit(X, y) #fit model to the data
predictions = logreg.predict_proba(X)[:,1] #get predictions
auc_score = roc_auc_score(y, predictions)
print(auc_score)
#output = 0.7205734302381707
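For reference, here is a quick sketch of how one might check whether the added variables are actually being zeroed out, by pairing each predictor name with its fitted coefficient (this reuses the fitted logreg from the block above):

for name, coef in zip(predictors, logreg.coef_[0]): #coef_ has shape (1, n_features) for a binary problem
    print(name, coef)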
I'm confused as to how the AUC could go lower with the addition of more variables. Even if the new variables have zero predictive power, why wouldn't their coefficients just be set to zero and the AUC stay at ~0.83?
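One hypothesis I have not been able to rule out is sklearn's default L2 regularization (LogisticRegression uses penalty='l2' with C=1.0 unless told otherwise), which shrinks coefficients rather than fitting them freely. Below is a minimal sketch of a check, assuming the same train DataFrame as above: refit both predictor sets with a very large C so the penalty is effectively disabled, then compare the training AUCs.

from sklearn import linear_model
from sklearn.metrics import roc_auc_score

for preds in [["lsat", "gpa", "urm"],
              ["lsat", "gpa", "urm", "is_military", "softs", "is_international", "years_out"]]:
    X, y = train[preds], train["was_accepted"]
    unpenalized = linear_model.LogisticRegression(C=1e9) #very large C ~ effectively no L2 penalty
    unpenalized.fit(X, y)
    probs = unpenalized.predict_proba(X)[:,1] #predicted probability of the positive class
    print(len(preds), "predictors, training AUC:", roc_auc_score(y, probs))

If the gap disappears here, the penalty (rather than the new variables themselves) would explain the drop, but I'd still appreciate an explanation of the mechanism.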
I did see this post, which provides some helpful context with a similar issue, but I'm hoping someone here could provide a more definitive answer or direct me to materials that could.
Thank you.