
I am working on a project that uses training-data selection techniques, i.e., sampling the training set in some smart way rather than uniformly at random. The goal is to compare different data-selection techniques by the accuracy they yield on downstream tasks, which requires sampling many training sets.

Suppose I have a large dataset (much larger than 12,000 examples) from which I want to draw splits of train : validation : test = 10000 : 1000 : 1000. After randomly sampling the test set, I have two choices for constructing the training and validation sets:

  • Option 1: First smartly sample a set of 11000 examples, then randomly hold out 1000 of them as the validation set (leaving 10000 for training).
  • Option 2: First randomly sample a validation set of 1000, then smartly sample a training set of 10000 from the remainder.
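To make the two procedures concrete, here is a minimal sketch in Python. The `smart_sample` function is a hypothetical placeholder (here it just samples uniformly); a real data-selection technique would score and select examples in some informed way. The toy dataset of integer IDs stands in for actual examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def smart_sample(pool, n, rng):
    """Hypothetical placeholder for a data-selection technique.
    Here it samples uniformly; a real method would select smartly."""
    idx = rng.choice(len(pool), size=n, replace=False)
    return pool[idx], np.delete(pool, idx)

# Toy "dataset": integer IDs standing in for examples.
data = np.arange(20000)

# Shared first step: randomly draw the test set.
test_idx = rng.choice(len(data), size=1000, replace=False)
test = data[test_idx]
pool = np.delete(data, test_idx)

# Option 1: smartly sample 11000, then randomly split off 1000 for validation.
train_val, _ = smart_sample(pool, 11000, rng)
val_idx = rng.choice(len(train_val), size=1000, replace=False)
val1 = train_val[val_idx]
train1 = np.delete(train_val, val_idx)

# Option 2: randomly sample 1000 for validation, then smartly sample 10000.
val2_idx = rng.choice(len(pool), size=1000, replace=False)
val2 = pool[val2_idx]
rest = np.delete(pool, val2_idx)
train2, _ = smart_sample(rest, 10000, rng)

print(len(train1), len(val1))  # 10000 1000
print(len(train2), len(val2))  # 10000 1000
```

Note that under Option 1 the validation set is drawn from the smartly selected pool (so it shares the training distribution but changes with each training sample), whereas under Option 2 it is drawn once from the raw pool.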

Though the two options may not look very different at first glance, they have two practical implications:

  • Option 1 makes the validation distribution similar to the training distribution. However, since I need to sample multiple training sets, the validation sets will all be different.
  • Option 2 keeps the validation set the same across all samples. However, its distribution differs from that of the training set.

Which option should I take, and why?

  • Do you not have the option of smartly sampling both? – Commented Sep 20, 2023 at 5:21
  • The other question: what information do you have that you can use for sampling smartly, and what information do you not have until you sample? – Commented Sep 20, 2023 at 5:28

