
What can I do to assess a classifier's accuracy when one of the classes is very rare?

Setup 1: I have 1000 boxes, of which 500 contain gold. I build an automated device to find the gold.

The usual recommendation would be to open N boxes at random and compare the contents with the device's predictions. Stratified sampling would be a better alternative: open N/2 empty boxes and N/2 gold boxes and calculate the accuracy for each class separately, which yields a balanced accuracy estimate, the quantity I am more interested in.
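
A minimal sketch of this stratified evaluation, as a simulation. The device's per-class accuracies (0.9 on gold, 0.8 on empty) are assumed values chosen only so the estimator can be checked; they are not given in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Setup 1: 1000 boxes, 500 with gold.
n_boxes, n_gold = 1000, 500
truth = np.zeros(n_boxes, dtype=bool)
truth[:n_gold] = True
acc_gold, acc_empty = 0.9, 0.8  # assumed device accuracies per class

# Simulate the device: correct on gold boxes with prob. 0.9,
# correct on empty boxes with prob. 0.8.
pred = np.where(truth,
                rng.random(n_boxes) < acc_gold,    # says "gold" on gold boxes
                rng.random(n_boxes) >= acc_empty)  # says "gold" on empty boxes
correct = pred == truth

# Stratified evaluation: open N/2 gold and N/2 empty boxes, score per class.
N = 100
gold_sample = rng.choice(np.flatnonzero(truth), N // 2, replace=False)
empty_sample = rng.choice(np.flatnonzero(~truth), N // 2, replace=False)
acc_g = correct[gold_sample].mean()
acc_e = correct[empty_sample].mean()
print(f"accuracy on gold:  {acc_g:.2f}")
print(f"accuracy on empty: {acc_e:.2f}")
print(f"balanced accuracy: {(acc_g + acc_e) / 2:.2f}")
```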

Setup 2: I have 1000 boxes, and only 5 contain gold. For robust estimates I would need to open hundreds of boxes. Would it not be much easier to let the device point at the 5 boxes it believes contain gold and simply check whether it was right?

However, the last approach would introduce bias, correct? What else could I do?
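
To make the suspected bias concrete, here is a minimal simulation. The device's recall and false-positive rate below are assumed values, not given above. Checking only the flagged boxes estimates precision, i.e. P(gold | device says gold), but it says nothing about the gold boxes the device missed, so recall and overall accuracy stay unknown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Setup 2: 1000 boxes, 5 with gold (error rates assumed).
n_boxes, n_gold = 1000, 5
truth = np.zeros(n_boxes, dtype=bool)
truth[:n_gold] = True
recall_true, fpr = 0.8, 0.01
pred = np.where(truth,
                rng.random(n_boxes) < recall_true,  # flags 80% of gold boxes
                rng.random(n_boxes) < fpr)          # falsely flags 1% of empty

# Device-guided checking: open only the boxes the device points at.
flagged = np.flatnonzero(pred)
print(f"boxes opened: {flagged.size}")
print(f"fraction containing gold (precision): {truth[flagged].mean():.2f}")

# Gold boxes the device missed are never opened, so the misses are invisible.
print(f"gold boxes never inspected: {(truth & ~pred).sum()}")
```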

  • Look at calibration and threshold tuning for cost-sensitive ML (hint: the precision-recall trade-off, since the cost here is the gold; a sketch follows these comments). Commented Oct 25, 2024 at 10:34
  • 1
    $\begingroup$ Your situations look like you already know that exactly $k$ out of $N$ samples are of the "positive" class, and the task is to find out which ones they are. Are you sure that this correctly reflects your situation? This is not what is typically done in "classification" tasks, where for a given instance we need to predict (probabilistically or not) whether it belongs to the positive class. This is a difference. That said, I would argue that the initial problem is using accuracy as a KPI. $\endgroup$ Commented Oct 25, 2024 at 12:13
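
A minimal sketch of the threshold tuning the first comment hints at: the device outputs a score per box and the decision cut-off is swept to trade precision against recall. The score distributions and threshold values below are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical scores: gold boxes tend to score higher (assumed distributions).
n_boxes, n_gold = 1000, 5
truth = np.zeros(n_boxes, dtype=bool)
truth[:n_gold] = True
scores = np.where(truth,
                  rng.normal(2.0, 1.0, n_boxes),  # scores for gold boxes
                  rng.normal(0.0, 1.0, n_boxes))  # scores for empty boxes

# Sweep the decision threshold and report the precision-recall trade-off.
for thr in (0.5, 1.0, 1.5, 2.0):
    pred = scores > thr
    tp = int((pred & truth).sum())
    precision = tp / max(int(pred.sum()), 1)
    recall = tp / n_gold
    print(f"threshold {thr:.1f}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold opens more boxes but misses less gold; the right cut-off depends on the cost of a needless opening versus a missed gold box.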
