
I'm building a logistic regression model and am attempting to select the correct variables. One thing that puzzles me is that the AUC score decreases as I add more variables, even within the training set. I would expect this to happen in the testing set due to over-fitting, but I don't understand why or how it would occur within the training set.

I am using the sklearn library in Python. Below I use three predictor variables and get an AUC score of ~.83 within the training set:

from sklearn import linear_model #sklearn imports needed for the snippet
from sklearn.metrics import roc_auc_score

predictors = ["lsat", "gpa", "urm"]
X = train[predictors] #select predictor variables
y = train[["was_accepted"]] #select target variable
logreg = linear_model.LogisticRegression() #create logistic regression model 
logreg.fit(X, y) #fit model to the data
predictions = logreg.predict_proba(X)[:,1] #get predictions
auc_score = roc_auc_score(y, predictions)
print(auc_score)
#output = 0.8341757855809823

However, when I run the same code again with four additional predictor variables, I get an AUC of only ~.72.

predictors = ["lsat", "gpa", "urm", "is_military", "softs", "is_international", "years_out"]
X = train[predictors] #select predictor variables
y = train[["was_accepted"]] #select target variable
logreg = linear_model.LogisticRegression() #create logistic regression model 
logreg.fit(X, y) #fit model to the data
predictions = logreg.predict_proba(X)[:,1] #get predictions
auc_score = roc_auc_score(y, predictions)
print(auc_score)
#output = 0.7205734302381707

I'm confused as to how the AUC could go lower with the addition of more variables. Even if the new variables have zero predictive power, why wouldn't the coefficients just be set to zero and the AUC stay at ~.83?

I did see this post, which provides some helpful context with a similar issue, but I'm hoping someone here could provide a more definitive answer or direct me to materials that could.

Thank you.


2 Answers


I think there are a couple of reasons:

  1. Logistic regression does not have a deterministic closed-form solution and must be solved iteratively, starting from some initialization. The solutions are stochastic and depend on the random seed (hence the random_state argument). There's no guarantee that a given search will converge on the lowest-cost (i.e. optimal) solution.
  2. The algorithm, as implemented in sklearn anyway, has L2 regularization applied by default. This penalty shrinks the coefficients of the logistic function toward zero and essentially prevents the model from finding a perfect (and possibly overfit) solution. A quick way to check whether this is the culprit is sketched just after this list.
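To test whether the default L2 penalty is what's costing you training-set AUC, you could refit with the penalty effectively switched off and compare. This is only a rough sketch reusing the train data frame and column names from your question; I don't have your data, so I can't promise the gap actually closes:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

few = ["lsat", "gpa", "urm"]
many = few + ["is_military", "softs", "is_international", "years_out"]

for predictors in (few, many):
    X, y = train[predictors], train["was_accepted"]
    # C is the *inverse* regularization strength, so a very large C
    # effectively disables the default L2 penalty
    model = LogisticRegression(C=1e9, max_iter=5000).fit(X, y)
    probs = model.predict_proba(X)[:, 1]  # predicted probability of acceptance
    print(len(predictors), "predictors, training AUC:", roc_auc_score(y, probs))

If the two training AUCs come out (nearly) equal with the penalty disabled, reason 2 explains most of what you're seeing.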

I just tried some experiments, and I reckon that if you switch to L1 regularization, which pushes as many coefficients as possible to 0, and ramp up the regularization penalty, the model will stop behaving this way. For example, try instantiating the model like this:

LogisticRegression(penalty='l1', solver='liblinear', C=0.001)
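Dropping that instantiation into the code from your question also lets you see which coefficients the L1 penalty zeroes out (again just a sketch; X, y and predictors are the ones from your second snippet, and in sklearn a smaller C means a stronger penalty, so with one this strong most, possibly all, coefficients will land at exactly 0):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# X, y, predictors as defined in the question's second snippet
logreg = LogisticRegression(penalty='l1', solver='liblinear', C=0.001)
logreg.fit(X, y)
print(dict(zip(predictors, logreg.coef_[0])))  # coefficients the L1 penalty leaves non-zero
print(roc_auc_score(y, logreg.predict_proba(X)[:, 1]))  # training AUC under the strong penalty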

Is it a 'better' model? That's the $64,000 question!


Not sure whether to include my experiment or not, since this isn't really a programming question. But it's here if you're interested.

  • Thank you - this is incredibly helpful. (comment, Jun 17, 2022)

Logistic regression optimizes the log loss between the predicted probabilities and the observed categories (coded as $0$ and $1$). The true observations are the $y_i$, and the predictions are the $\hat y_i = \dfrac{1}{ 1 + \exp\left(-x_i^T\hat\beta\right) }$, where $\hat\beta$ is the vector of estimated parameters for the regression model.

$$ L(y,\hat y) = -\dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}}\left[ y_i\log(\hat y_i) + (1 - y_i)\log(1 - \hat y_i) \right] $$

Notice that this is not the $AUC$. That is, logistic regression parameter estimation does not seek out the largest $AUC$. If some parameter values decrease the $AUC$ yet also decrease the log-loss, those parameter values will be preferred, despite the lower $AUC$. This can happen if some ability to discriminate between the categories is exchanged for better calibration of the predicted probabilities, as $AUC$ only cares about the ability of the model to distinguish between the two categories.
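To make that tradeoff concrete, here is a toy illustration with made-up predicted probabilities (nothing to do with the question's data): set A attains the lower log-loss, but it ranks one negative above one positive; set B ranks the two categories perfectly ($AUC = 1$), yet its timid probabilities near $0.5$ give it the worse log-loss. A fit driven by log-loss would therefore prefer A, the set with the lower $AUC$.

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y = np.array([0, 0, 1, 1])

# Set A: confident and mostly right, but one negative (0.6) outranks one positive (0.4)
p_a = np.array([0.1, 0.6, 0.4, 0.9])

# Set B: every positive outranks every negative, but all probabilities hug 0.5
p_b = np.array([0.49, 0.49, 0.51, 0.51])

print(log_loss(y, p_a), roc_auc_score(y, p_a))  # log-loss ~0.51, AUC = 0.75
print(log_loss(y, p_b), roc_auc_score(y, p_b))  # log-loss ~0.67, AUC = 1.00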

Therefore...

I'm confused as to how the AUC could go lower with the addition of more variables. Even if the new variables have zero predictive power, why wouldn't the coefficients just be set to zero and the AUC stay at ~.83?

If you train by maximizing the $AUC$, you will observe the behavior you expect (added variables never lowering the in-sample $AUC$), short of numerical issues that arise from doing this kind of unusual optimization. However, the default for logistic regression is to find the parameters that optimize the log-loss, not the $AUC$.

Another answer mentions that logistic regression lacks a closed-form solution and must be solved numerically, putting some of the blame on the merely approximate "solution" given as the logistic regression parameters. While it is true that logistic regression lacks a closed-form solution, modern implementations of the numerical optimization are so good that it almost might as well have one.
