$\begingroup$

I am trying to do binary classification on ticket cancellation data from Kaggle.

I know this question has been asked before, for example here and here.

Summary of what I learned in those references:

  1. Perfect scores can occur if the data is imbalanced.
  2. Data leakage: one of the input features is actually a direct proxy for the target variable.
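The second point can be probed one feature at a time. The following is only a minimal sketch, not a method taken from the referenced answers: it measures how well each feature alone predicts the target on held-out rows, which can flag a proxy even when its linear correlation with the target looks unremarkable. The helper `single_feature_accuracy`, the toy rows, and the columns `refund_issued` and `weekday` are all hypothetical; only the target name "Cancel" comes from the question.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(train_rows, test_rows, feature, target):
    """Hold-out accuracy of predicting `target` from one feature alone.

    Each feature value is mapped to the majority target seen for it in
    the training rows; unseen values fall back to the overall majority.
    """
    by_value = defaultdict(Counter)
    overall = Counter()
    for row in train_rows:
        by_value[row[feature]][row[target]] += 1
        overall[row[target]] += 1
    fallback = overall.most_common(1)[0][0]
    lookup = {value: counts.most_common(1)[0][0]
              for value, counts in by_value.items()}
    hits = sum(lookup.get(row[feature], fallback) == row[target]
               for row in test_rows)
    return hits / len(test_rows)


# Toy data (made up for illustration): "refund_issued" is a deliberate
# proxy for "Cancel", while "weekday" carries no information about it.
pattern = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0] * 20
rows = [
    {"refund_issued": c, "weekday": i % 7, "Cancel": c}
    for i, c in enumerate(pattern)
]
train_rows, test_rows = rows[:150], rows[150:]

scores = {
    feature: single_feature_accuracy(train_rows, test_rows, feature, "Cancel")
    for feature in ("refund_issued", "weekday")
}
for feature, score in scores.items():
    print(f"{feature}: {score:.2f}")
```

A feature whose solo accuracy is close to 1.0 deserves a closer look. Continuous features would need to be binned first, since this sketch treats every distinct value as its own category.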

My data is imbalanced, but not extremely so. For the binary target:

y.sum()/len(y) = 0.151

So about 15% of the samples are in the positive class. This is a noticeable imbalance, but not an extreme one. To check for data leakage, I looked at the correlation matrix: [image: correlation matrix]

The feature importances are: [image: feature importance plot]

The target variable is "Cancel". None of the features has an extremely high correlation with it. The model is:

   model = XGBClassifier(objective='multi:softmax', num_class=3)

Yet my classification report is perfect:

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00     21455
               1       1.00      1.00      1.00      3781

        accuracy                           1.00     25236
       macro avg       1.00      1.00      1.00     25236
    weighted avg       1.00      1.00      1.00     25236
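As a sanity check on the imbalance explanation, here is a small sketch that uses only the support counts from the report above. It computes what a model that always predicts the majority class would score. The variable names are illustrative.

```python
# Support counts copied from the classification report above.
support = {0: 21455, 1: 3781}
total = sum(support.values())

positive_rate = support[1] / total

# A model that always predicts the majority class (0) is right on every
# class-0 row and wrong on every class-1 row.
baseline_accuracy = support[0] / total
baseline_recall_class_1 = 0 / support[1]

print(f"positive rate:      {positive_rate:.3f}")
print(f"baseline accuracy:  {baseline_accuracy:.3f}")
print(f"baseline recall(1): {baseline_recall_class_1:.3f}")
```

Such a baseline reaches roughly 85% accuracy but zero recall on class 1, so imbalance alone would not produce a report with 1.00 in every cell.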

How can I solve this?

$\endgroup$
  • $\begingroup$ Is this classification report calculated using a separate test set? $\endgroup$ Commented Oct 18, 2023 at 20:37
  • $\begingroup$ Yes, it is computed on the test portion of a train/test split. $\endgroup$ Commented Oct 19, 2023 at 0:36
