I am working on a project that uses training data selection techniques, i.e., sampling the training set in some smart way rather than uniformly at random. The goal is to compare different data selection techniques by their downstream-task accuracy, which requires sampling many training sets.
Suppose I have a large dataset (much larger than 12000 examples) to sample from, and I want splits of train : validation : test = 10000 : 1000 : 1000. After I randomly sample the test set, I have two choices for constructing the training and validation sets:
- Option 1: First smartly sample 11000 examples, then randomly hold out 1000 of them as the validation set, leaving 10000 for training.
- Option 2: Independently and randomly sample a validation set of 1000 first, then smartly sample a training set of 10000 from the remainder.
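To make the two procedures concrete, here is a minimal sketch in NumPy. The `smart_sample` function is a hypothetical placeholder for whatever data selection technique is being compared (here it just samples uniformly, so the code runs); the dataset is represented by integer indices.

```python
import numpy as np

rng = np.random.default_rng(0)

def smart_sample(pool, k, rng):
    # Placeholder for a data selection technique; a real one would
    # score examples and pick the top k. Here: uniform random choice.
    return rng.choice(pool, size=k, replace=False)

N = 50_000  # assumed full-dataset size, much larger than 12000
all_idx = np.arange(N)

# Shared first step: randomly sample the test set.
test = rng.choice(all_idx, size=1000, replace=False)
pool = np.setdiff1d(all_idx, test)

# Option 1: smartly sample 11000, then hold out a random 1000 for validation.
train_val = smart_sample(pool, 11_000, rng)
val1 = rng.choice(train_val, size=1000, replace=False)
train1 = np.setdiff1d(train_val, val1)

# Option 2: fix a random validation set of 1000 first, then smartly
# sample 10000 training examples from what remains.
val2 = rng.choice(pool, size=1000, replace=False)
train2 = smart_sample(np.setdiff1d(pool, val2), 10_000, rng)

print(len(train1), len(val1), len(train2), len(val2))  # 10000 1000 10000 1000
```

Under Option 1, `val1` is drawn from the smartly selected pool, so it inherits the selected distribution but changes with every resampled training set; under Option 2, `val2` can be drawn once and reused across all runs.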
Though the two options may not look very different at first glance, they have two practical implications:
- Option 1 makes the validation distribution match the training distribution. However, since I need to sample multiple training sets, the validation sets will all differ across runs.
- Option 2 keeps the validation set fixed across all runs. However, its distribution differs from that of the (smartly sampled) training set.
Which option should I take, and why?