I am trying to do binary classification on ticket cancellation data from Kaggle.
I know this question has been asked before, for example here and here.
Summary of what I learned in those references:
- class imbalance: this can happen if the data is unbalanced.
- data leakage: one of the input features is actually a direct proxy for the target variable.
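One way the leakage point above could be checked is to see how well each feature predicts the target on its own. The following is only a minimal, dependency-free sketch: the helper `single_feature_accuracy` and the column names `refund_issued` and `weekday` are hypothetical and not taken from the Kaggle dataset.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(values, target):
    """Accuracy of predicting the target from one feature alone,
    by mapping each distinct feature value to its most common label."""
    labels_by_value = defaultdict(Counter)
    for value, label in zip(values, target):
        labels_by_value[value][label] += 1
    # For each feature value, count the rows that carry its majority label.
    correct = sum(c.most_common(1)[0][1] for c in labels_by_value.values())
    return correct / len(target)

# Toy data with hypothetical column names (not from the real dataset).
rows = {
    "refund_issued": [0, 0, 1, 0, 1, 0],  # mirrors the target exactly
    "weekday":       [1, 2, 1, 2, 3, 3],
}
cancel = [0, 0, 1, 0, 1, 0]

for name, values in rows.items():
    print(name, round(single_feature_accuracy(values, cancel), 3))
```

A feature that scores close to 1.0 by itself is a candidate proxy for the target. The score is computed on the same rows it is fitted on, so a unique identifier column would also score 1.0, and continuous features would need to be binned first.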
My data is unbalanced, but not extremely so. Since the target is binary (0/1), the fraction of positive samples is:
y.sum()/len(y) = 0.151
Thus about 15% of the samples are in one category. This is imbalanced but not extreme. To check for data leakage, I looked at the correlation matrix, which is as follows:

The target variable is "Cancel". None of the variables has an extremely high correlation with it. The model is:
model = XGBClassifier(objective='multi:softmax', num_class=3)
Yet my classification report is perfect:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     21455
           1       1.00      1.00      1.00      3781

    accuracy                           1.00     25236
   macro avg       1.00      1.00      1.00     25236
weighted avg       1.00      1.00      1.00     25236
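As a sanity check on the imbalance explanation, the support counts in the report can be used to work out what a trivial "always predict the majority class" classifier would score. This is a minimal sketch; the helper name `majority_baseline` is hypothetical, and only the two support counts come from the report.

```python
# Support counts copied from the classification report.
support = {0: 21455, 1: 3781}

def majority_baseline(support):
    """Metrics of a classifier that always predicts the most frequent class."""
    total = sum(support.values())
    majority = max(support, key=support.get)
    accuracy = support[majority] / total
    # Recall is 1 for the majority class and 0 for every other class.
    recall = {label: 1.0 if label == majority else 0.0 for label in support}
    return majority, accuracy, recall

majority, accuracy, recall = majority_baseline(support)
print(f"always predict {majority}: accuracy={accuracy:.3f}, recall={recall}")
```

Such a baseline reaches roughly 85% accuracy but has zero recall on class 1, so imbalance by itself would not seem to produce a recall of 1.00 on both classes.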
How can I find out what is causing this, and how do I fix it?
