I have a conceptual question: after dividing a dataset into a training and a test set (80:20 in my case), both balanced and shuffled, should I build the Confusion Matrix and the ROC curve of a model selected by k-fold cross validation on the training set?
I ask this question because every time I get the accuracy from the grid search, it's always around 61%. When I build the confusion matrix with the test set, I get an accuracy of 57%. But when I build the ROC curve and the confusion matrix with the training set, the accuracy is always near 100%. Is this a sign of overfitting, or should I just consider the accuracy values from the cross validation and the test set (61% and 57%)?
Here is my code:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y, shuffle=True)
param_grid = {
    'n_estimators': [50, 100, 200, 250],  # number of trees in the forest
    'max_depth': [3, 5, 8, 10],           # maximum depth of the trees
    'min_samples_split': [2, 5, 10],      # minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],        # minimum number of samples required at a leaf node
    'max_features': ['sqrt'],             # number of features considered for the best split
    'oob_score': [True]
}
rf = RandomForestClassifier(random_state=0)
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=10,               # 10-fold cross-validation
    n_jobs=-1,           # use all available CPU cores
    verbose=2,           # show the search progress
    scoring='accuracy',  # evaluation metric
    return_train_score=True
)
grid_search.fit(x_train, y_train)
print("Best hyperparams:")
print(grid_search.best_params_)
#Best hyperparams:
#{'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'oob_score': True}
print("Best Accuracy Cross Validation:")
print(grid_search.best_score_)
#Best Accuracy Cross Validation: 0.6137931034482758
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(x_test, y_test)
print(test_accuracy)
#0.5684931506849316
train_accuracy = best_model.score(x_train, y_train)
print(train_accuracy)
#0.9844827586206897
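For reference, this is roughly how I'm computing the confusion matrix and ROC numbers I mentioned above (a sketch, assuming a binary target and the standard sklearn.metrics functions; it may not be exactly the right way, which is part of my question):

from sklearn.metrics import confusion_matrix, roc_auc_score

# Test set: out-of-sample performance (this is where I get ~57%)
y_pred_test = best_model.predict(x_test)
print(confusion_matrix(y_test, y_pred_test))
print(roc_auc_score(y_test, best_model.predict_proba(x_test)[:, 1]))

# Training set: in-sample performance (this is where everything is near 100%)
y_pred_train = best_model.predict(x_train)
print(confusion_matrix(y_train, y_pred_train))
print(roc_auc_score(y_train, best_model.predict_proba(x_train)[:, 1]))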
PS: I looked at the scikit-learn docs here, and the example stops at validating on the test set after the k-fold cross validation, but I didn't find any mention of validating again on the training dataset.
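In case it helps clarify what I mean by "Confusion Matrix and ROC curve of a model generated by k-fold cross validation with the training set", this is the alternative I'm considering: building them from out-of-fold predictions instead of from the refitted model's predictions on the training data. A sketch using cross_val_predict (I'm not sure this is the correct approach, which is the core of my question):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, roc_auc_score

# Out-of-fold predictions on the training set, refitting the best
# hyperparameters in each of the 10 folds
cv_pred = cross_val_predict(best_model, x_train, y_train, cv=10)
print(confusion_matrix(y_train, cv_pred))

cv_proba = cross_val_predict(best_model, x_train, y_train, cv=10, method='predict_proba')
print(roc_auc_score(y_train, cv_proba[:, 1]))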