This illustrates quite well that, in general, it is important to read a dataset's documentation and understand its overall context. Here, it seems you're talking about the "ASL Alphabet" dataset available on Kaggle, which provides 87,000 images.
For one, according to its documentation, the dataset was collected from just one adult. So in the first place, any generalization from this data could be difficult, as hands and fingers can vary quite a bit between people, e.g. in size, mobility, color, wrinkles, scars, etc. I visually inspected some of the images, and many of them were so similar that in some instances I wondered whether I was looking at the same image cropped differently or artificially darkened or blurred. For instance, here are the first 10 images from the dataset for the letter "A":

You can see that they are extremely similar; for some of them I can't even tell the difference with the naked eye. So I don't find it very surprising that you got high accuracy when testing your model on data originating from the same dataset, but it's unlikely to generalize well.
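If you want to reproduce this visual check yourself, a minimal sketch along these lines should do it (the `asl_alphabet_train/A` folder path and the `A1.jpg`, `A2.jpg`, ... file naming are assumptions about how the extracted Kaggle archive is laid out; adjust them to your local copy):

```python
import matplotlib.pyplot as plt
from PIL import Image

# Assumed local path to the extracted letter "A" folder of the Kaggle dataset
DATA_DIR = "asl_alphabet_train/A"

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat, start=1):
    # Assumed file naming: A1.jpg, A2.jpg, ...
    img = Image.open(f"{DATA_DIR}/A{i}.jpg")
    ax.imshow(img)
    ax.set_title(f"A{i}")
    ax.axis("off")
plt.tight_layout()
plt.show()
```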
To explore my first impression a bit more systematically, I randomly sampled 500 images for the letter "A" and compared them to each other automatically, using their structural similarity index (SSIM) as the measure of similarity. I then generated the following (quick and dirty) heatmap, where the maximum value "1" means that the two compared images are identical. The horizontal and vertical axes represent the identifiers of the images, sorted in order from "A4" to "A2995" (only some of the identifiers appear on the heatmap, but it is really a 500x500 table):

While there are no perfect matches among the randomly sampled images, at first glance it looks like there are perhaps 4 main clusters of quite similar images, with some variation inside each cluster. This hints at a systematic lack of diversity in the dataset. Depending on your use case, this is something you may want to investigate further, perhaps by testing other letters, using other similarity metrics, or using other methods for detecting clusters.
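In case you want to run a similar analysis yourself, here is a rough sketch of how the pairwise comparison could be done with scikit-image's `structural_similarity`. The folder path, file naming, and the total of 3,000 images per letter are assumptions based on the dataset as I downloaded it; I also convert to grayscale and downscale purely to keep the 500x500 comparison fast, which your own run may or may not need:

```python
import random
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from skimage.metrics import structural_similarity as ssim

DATA_DIR = "asl_alphabet_train/A"   # assumed path to the letter "A" folder
N_SAMPLES = 500                     # a smaller number is fine for a quick test

# Assumed file naming: A1.jpg ... A3000.jpg
ids = sorted(random.sample(range(1, 3001), N_SAMPLES))

# Load each image as a small grayscale array to keep the comparison cheap
images = [
    np.asarray(Image.open(f"{DATA_DIR}/A{i}.jpg").convert("L").resize((64, 64)))
    for i in ids
]

# Pairwise SSIM matrix: 1.0 on the diagonal (an image compared with itself)
sim = np.ones((N_SAMPLES, N_SAMPLES))
for a in range(N_SAMPLES):
    for b in range(a + 1, N_SAMPLES):
        s = ssim(images[a], images[b], data_range=255)
        sim[a, b] = sim[b, a] = s

plt.figure(figsize=(8, 7))
plt.imshow(sim, vmin=0, vmax=1, cmap="viridis")
plt.colorbar(label="SSIM")
plt.title('Pairwise SSIM for 500 random "A" images')
plt.show()
```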
Secondly, it seems that the dataset contains incorrect data, with some letters not coming from American Sign Language (ASL), as stated in this related Kaggle thread:
[...] it looks like several of the letters in the data set are not ASL. Some, like M and N, appear to be Italian Sign Language. Others, like T, I'm not sure what language they come from, but it isn't ASL. Overall, G, K, T, M, N, and P are all not ASL.
If we want to check this for ourselves and visually compare the Kaggle images to other sources of information, we see that there is indeed a problem. For instance, here is how "T" is fingerspelled in the Kaggle dataset:

Compare it with the version from the American Society for Deaf Children:

I'm not an ASL practitioner or expert, and I can't say with 100% certainty which one is correct, even though I'd bet a lot of money that it is not the American Society for Deaf Children that is wrong here. But in any case, there is some sort of disagreement between the two versions, which makes it harder to model the problem correctly with the data you have.
If you're interested in solving this fingerspelling recognition problem, discussing the issue with sign language experts and practitioners would almost certainly be fruitful. For instance, a quick online search will teach you that people with some physical limitations may fingerspell a bit differently from other people. So discussing the problem in depth with experts will give you a good idea of how to model it correctly and what kind of data you need for that. In particular, note that a person on the Kaggle forum says that still images are not suitable for this kind of task, so you might find that you have to change your approach altogether, depending on what you ultimately want to do with your model.