I used random oversampling to handle imbalanced positive and negative classes. Without oversampling, I got 88% accuracy; when I oversampled only the training data, I got 87%; and when I oversampled both the training and test data, accuracy dropped to 84%. Here's the oversampling code:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
# oversample the training split
Train_X2_rose, Train_Y2_rose = ros.fit_resample(Train_X2_Tfidf, Train_Y2)
# oversample the test split as well
Test_X2_rose, Test_Y2_rose = ros.fit_resample(Test_X2_Tfidf, Test_Y2)

# classification model trial 10
import random as python_random
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

def reset_seeds():
    np.random.seed(0)
    python_random.seed(0)
    tf.random.set_seed(0)

reset_seeds()

model10 = Sequential()
model10.add(Dense(10, input_dim=Train_X2_rose.shape[1], activation='sigmoid'))
model10.add(Dense(1, activation='sigmoid'))
opt = Adam(learning_rate=0.01)
model10.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model10.summary()

es = EarlyStopping(monitor="val_loss", mode='min', patience=10)
history10 = model10.fit(Train_X2_rose, Train_Y2_rose, epochs=1000, verbose=1,
                        validation_split=0.2, batch_size=64, callbacks=[es])

# prediction and confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

y_pred = model10.predict(Test_X2_rose) > 0.5

cm = confusion_matrix(Test_Y2_rose, y_pred)
print('Confusion matrix:')
print(cm)
print(classification_report(Test_Y2_rose, y_pred))

f, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(cm, annot=True, fmt=".0f", ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()
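
For comparison, here is a minimal sketch of the usual workflow: oversample only the training split and evaluate on the untouched, still-imbalanced test set. It uses a logistic-regression stand-in (not the Keras model above) and a hand-rolled oversampler, so the only dependency is scikit-learn; names like `X_ros` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Imbalanced toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random oversampling of the TRAINING split only: duplicate minority-class
# rows (sampled with replacement) until both classes have equal counts.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
idx = np.concatenate([neg, pos, extra])
X_ros, y_ros = X_tr[idx], y_tr[idx]

clf = LogisticRegression(max_iter=1000).fit(X_ros, y_ros)

# Evaluate on the UNTOUCHED, still-imbalanced test set.
print(balanced_accuracy_score(y_te, clf.predict(X_te)))
```

Balanced accuracy (the mean of per-class recalls) is reported instead of plain accuracy, since plain accuracy on an imbalanced test set is dominated by the majority class.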

Why does this decrease in accuracy occur? Is there something wrong with my code? I ran hyperparameter tuning with 18 models, and some of the models with oversampling showed very strange loss and accuracy plots, where the training and validation curves split apart, one above the other.

Comments:
  • Accuracy is a highly problematic KPI, even (but not only) in the "unbalanced" case: stats.stackexchange.com/q/312780/1352. "Imbalance" is usually not a problem, and oversampling is usually not a solution: stats.stackexchange.com/q/357466/1352 – Commented Jun 11, 2023 at 12:25
  • Why are you oversampling the test data? The purpose of oversampling is to ensure that the positives and negatives across the training data are balanced so the model can make more accurate predictions. The test data should not be touched: the entire purpose of comparing predictions to the test data is to see how the model would perform on completely unseen data, which may well also be unbalanced in a real-world situation. – Commented Jun 11, 2023 at 12:45
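
The second comment also explains the 84% figure numerically: for a fixed classifier, accuracy is a class-prior-weighted average of the per-class recalls, so changing the test set's class mix by oversampling changes the accuracy even though the model itself is unchanged. A tiny sketch with hypothetical recall values:

```python
# Suppose a fixed classifier has 95% recall on negatives and 60% on
# positives (hypothetical numbers). Overall accuracy is a weighted
# average of the two, with weights equal to the class proportions
# in the test set.
recall_neg, recall_pos = 0.95, 0.60

def accuracy(p_pos):
    """Expected accuracy when a fraction p_pos of test samples is positive."""
    return (1 - p_pos) * recall_neg + p_pos * recall_pos

print(accuracy(0.10))  # original imbalanced test set -> 0.915
print(accuracy(0.50))  # oversampled to 50/50         -> 0.775
```

Oversampling the test set to 50/50 gives the weaker class (positives, here) more weight, so measured accuracy drops, without the model getting any worse.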
