I used random oversampling to handle imbalanced positive and negative classes. Without oversampling, I got 88% accuracy; when I oversampled only the training data, I got 87%; and when I oversampled both the training and test data, accuracy dropped to 84%. Here's the oversampling code:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
# oversample the training split
Train_X2_rose, Train_Y2_rose = ros.fit_resample(Train_X2_Tfidf, Train_Y2)
# oversample the test split as well
Test_X2_rose, Test_Y2_rose = ros.fit_resample(Test_X2_Tfidf, Test_Y2)

# classification model trial 10
import random as python_random
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

def reset_seeds():
    np.random.seed(0)
    python_random.seed(0)
    tf.random.set_seed(0)

reset_seeds()

model10 = Sequential()
model10.add(Dense(10, input_dim=Train_X2_rose.shape[1], activation='sigmoid'))
model10.add(Dense(1, activation='sigmoid'))
opt = Adam(learning_rate=0.01)
model10.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model10.summary()

es = EarlyStopping(monitor="val_loss", mode='min', patience=10)
history10 = model10.fit(Train_X2_rose, Train_Y2_rose, epochs=1000, verbose=1,
                        validation_split=0.2, batch_size=64, callbacks=[es])

# prediction and confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

y_pred = model10.predict(Test_X2_rose) > 0.5

cm = confusion_matrix(Test_Y2_rose, y_pred)
print('Confusion matrix:')
print(cm)
print(classification_report(Test_Y2_rose, y_pred))

f, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(cm, annot=True, fmt=".0f", ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()
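
For comparison, here is a minimal sketch of the usual workflow: oversample only the training split and evaluate on the untouched, still-imbalanced test set. It uses a logistic-regression stand-in (not the Keras model above) and a hand-rolled oversampler, so the only dependency is scikit-learn; names like `X_ros` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Imbalanced toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random oversampling of the TRAINING split only: duplicate minority-class
# rows (sampled with replacement) until both classes have equal counts.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
idx = np.concatenate([neg, pos, extra])
X_ros, y_ros = X_tr[idx], y_tr[idx]

clf = LogisticRegression(max_iter=1000).fit(X_ros, y_ros)

# Evaluate on the UNTOUCHED, still-imbalanced test set.
print(balanced_accuracy_score(y_te, clf.predict(X_te)))
```

Balanced accuracy (the mean of per-class recalls) is reported instead of plain accuracy, since plain accuracy on an imbalanced test set is dominated by the majority class.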

Why does this decrease in accuracy occur? Is there something wrong with my code? I ran hyperparameter tuning with 18 models, and some of the models with oversampling showed very strange loss and accuracy plots, where the training and validation curves split apart, one above the other.

Comments:
  • Accuracy is a highly problematic KPI, even (but not only) in the "unbalanced" case: stats.stackexchange.com/q/312780/1352. "Imbalance" is usually not a problem, and oversampling is usually not a solution: stats.stackexchange.com/q/357466/1352 – Commented Jun 11, 2023 at 12:25
  • Why are you oversampling the test data? The purpose of oversampling is to ensure that the positives and negatives across the training data are balanced so the model can make more accurate predictions. The test data should not be touched: the entire purpose of comparing predictions to the test data is to see how the model would perform on completely unseen data, which may well also be unbalanced in a real-world situation. – Commented Jun 11, 2023 at 12:45
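
The second comment also explains the 84% figure numerically: for a fixed classifier, accuracy is a class-prior-weighted average of the per-class recalls, so changing the test set's class mix by oversampling changes the accuracy even though the model itself is unchanged. A tiny sketch with hypothetical recall values:

```python
# Suppose a fixed classifier has 95% recall on negatives and 60% on
# positives (hypothetical numbers). Overall accuracy is a weighted
# average of the two, with weights equal to the class proportions
# in the test set.
recall_neg, recall_pos = 0.95, 0.60

def accuracy(p_pos):
    """Expected accuracy when a fraction p_pos of test samples is positive."""
    return (1 - p_pos) * recall_neg + p_pos * recall_pos

print(accuracy(0.10))  # original imbalanced test set -> 0.915
print(accuracy(0.50))  # oversampled to 50/50         -> 0.775
```

Oversampling the test set to 50/50 gives the weaker class (positives, here) more weight, so measured accuracy drops, without the model getting any worse.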
