$\begingroup$

I have a conceptual question: after dividing a dataset into a training set and a test set (70:30), both balanced and shuffled, should I build the confusion matrix and the ROC curve of a model produced by k-fold cross-validation using the training set?

I ask this question because every time I get the accuracy from the grid search, it is always around 61%. When I build the confusion matrix with the test set, I get an accuracy of 57%. But when I build the ROC curve and the confusion matrix with the training set, the accuracy is always near 100%. Is this a sign of overfitting, or should I just rely on the accuracy values from cross-validation and the test set (61% and 57%)?

Here is my code:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y, shuffle=True)

param_grid = {
    'n_estimators': [50, 100, 200, 250],  # Number of trees in the forest
    'max_depth': [3, 5, 8, 10],           # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],      # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],        # Minimum number of samples required at a leaf node
    'max_features': ['sqrt'],             # Number of features to consider for the best split
    'oob_score': [True]
}

rf = RandomForestClassifier(random_state=0)

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=10,                # 10-fold cross-validation
    n_jobs=-1,            # Use all available CPU cores
    verbose=2,            # Show the search progress
    scoring='accuracy',   # Evaluation metric
    return_train_score=True
)

grid_search.fit(x_train, y_train)

print("Best hyperparams:")
print(grid_search.best_params_)
#Best hyperparams:
#{'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'oob_score': True}

print("Best Accuracy Cross Validation:")
print(grid_search.best_score_)
#Best Accuracy Cross Validation: 0.6137931034482758

best_model = grid_search.best_estimator_
test_accuracy = best_model.score(x_test, y_test)
test_accuracy
#0.5684931506849316

train_accuracy = best_model.score(x_train, y_train)
train_accuracy
#0.9844827586206897

PS: I saw in the scikit-learn docs here that the workflow stops with validation on the test set after the k-fold procedure, but I didn't find any mention of evaluating again on the training dataset.

$\endgroup$
  • $\begingroup$ The accuracy on the training set can be ignored; it's highly overfit. Just rely on the CV score or the test score. $\endgroup$ Commented Sep 3, 2024 at 18:37
  • $\begingroup$ I would not split the data 70:30 and would instead use k-fold exclusively, because splitting is "inefficient." The 0.61 from RF based on a grid search with 10-fold CV is appropriate. Also, RF is usually pessimistic, such that accuracy is lower. And while the RF grid search uses 10-fold CV, RF itself does not need CV, since in-bag and out-of-bag evaluation is taken care of by all the trees. Recall that, for RF, test objects dropped down each trained tree are left out of that tree's training with probability 0.37 when the bootstrapping is performed, so RF essentially does its own CV (see the sketch after these comments). $\endgroup$ Commented Sep 4, 2024 at 13:27
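A minimal sketch of the out-of-bag estimate mentioned in the second comment (assuming the x_train/y_train from the question; oob_score_ is the attribute scikit-learn exposes after fitting with oob_score=True):

from sklearn.ensemble import RandomForestClassifier

# With oob_score=True, each tree is scored on the ~37% of training rows
# it did not see during bootstrapping, giving an internal accuracy estimate.
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf_oob.fit(x_train, y_train)
print(rf_oob.oob_score_)  # out-of-bag accuracy, comparable in spirit to a CV score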

1 Answer

$\begingroup$

Here are the steps that you should take:

  1. Split the dataset into a training set and a test set (e.g. an 80/20 or 70/30 split).
  2. Feed the training set into a cross-validated hyperparameter tuning method (GridSearchCV, RandomizedSearchCV, etc.) and get the model with the "best" hyperparameters.
  3. Take the model with the "best" hyperparameters and evaluate it on the test set.

The final results/performance of your model should be based on how it performs on the test set. For example, your ROC curve, confusion matrix, and all of your metrics (accuracy, AUC, recall, etc.) should come from evaluating your model on the test set.
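A minimal sketch of that final evaluation, assuming the fitted grid_search from your code, a binary target, and scikit-learn >= 1.0 (for RocCurveDisplay):

from sklearn.metrics import confusion_matrix, classification_report, RocCurveDisplay

best_model = grid_search.best_estimator_

# Report final metrics on the held-out test set only
y_pred = best_model.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ROC curve computed from the test set
RocCurveDisplay.from_estimator(best_model, x_test, y_test)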

Since you do seem to be overfitting, you could try some techniques to combat that. For example, you could try simpler models (e.g. logistic regression), implement regularization (L1, L2, etc.), get more data (which may improve generalization), or perform feature selection or dimensionality reduction (e.g. PCA); there are many algorithms for this.
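A rough sketch of the "simpler, regularized model" idea, reusing your x_train/y_train (C is the inverse regularization strength, so smaller C means stronger L2 regularization):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# L2-regularized logistic regression as a simpler baseline
logreg = make_pipeline(StandardScaler(),
                       LogisticRegression(penalty='l2', C=1.0, max_iter=1000))
scores = cross_val_score(logreg, x_train, y_train, cv=10, scoring='accuracy')
print(scores.mean())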

To improve the overall performance of your model, if you are not doing it already, try scaling numerical features and encoding categorical features. It is best to do everything in a pipeline to remove any chance of data leakage (in Python, you could use a scikit-learn or imblearn pipeline).
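A sketch of such a pipeline (the column names here are made up and assume X is a pandas DataFrame; replace them with your own numeric/categorical columns):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

numeric_cols = ['num_feature_1', 'num_feature_2']        # hypothetical
categorical_cols = ['cat_feature_1', 'cat_feature_2']    # hypothetical

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('clf', RandomForestClassifier(random_state=0)),
])

# Prefix hyperparameters with the step name so GridSearchCV can route them
param_grid = {'clf__n_estimators': [100, 200], 'clf__max_depth': [5, 10]}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy', n_jobs=-1)
grid.fit(x_train, y_train)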

Also, from your code, you are actually doing an 80:20 split, not the 70:30 split you describe (since you set test_size=0.2).
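If you do want 70:30, set test_size accordingly:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y, shuffle=True)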

$\endgroup$
