$\begingroup$

I have a table containing elements in $[1,c]$. The elements may be repeated in the table. I want to sample $m$ unique elements from this table.

I can reduce this problem to weighted sampling without replacement. This would require me to a) count the number of times each element occurs (say element $i$ occurs $n_i$ times), b) generate keys $U_i^{1/n_i}$, $1 \le i \le c$, where the $U_i \sim \mathrm{Unif}([0,1])$ are independent, and c) pick the elements corresponding to the top $m$ keys (reference).
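A minimal R sketch of steps a) to c), assuming the table is held in a plain vector. The names `sample_by_keys` and `tab` are hypothetical, introduced only for illustration:

```r
# Hypothetical helper: pick m distinct elements using keys U_i^(1/n_i),
# where n_i is the number of times element i occurs in `tab`.
sample_by_keys <- function(tab, m) {
  counts <- table(tab)                                     # a) frequency n_i of each distinct element
  stopifnot(m <= length(counts))
  keys <- runif(length(counts))^(1 / as.numeric(counts))   # b) one key per distinct element
  top <- order(keys, decreasing = TRUE)[seq_len(m)]        # c) positions of the m largest keys
  names(counts)[top]                                       # element labels, returned as character strings
}

set.seed(1)
tab <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4)   # toy table with repeated elements
sample_by_keys(tab, 3)
```

Note that `table()` is exactly the counting pass the question hopes to avoid.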

If I want to do this without counting the frequency of all elements, can I use the following algorithm?

  • Generate a uniform random number for each row in the table.

  • Sort these numbers.

  • Walk down the sorted list from the largest value and keep the first $m$ distinct elements encountered.

Notice that instead of generating one uniform random number per element, this method generates $n_i$ random numbers for element $i$. Is the above algorithm equivalent to weighted sampling without replacement?
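The row-based procedure described above could be sketched in R roughly as follows. The names `sample_by_rows` and `tab` are hypothetical, and the table is again assumed to be a plain vector:

```r
# Hypothetical helper: one uniform per row, then keep the first m distinct
# elements met when rows are visited from largest to smallest uniform.
sample_by_rows <- function(tab, m) {
  stopifnot(m <= length(unique(tab)))
  u <- runif(length(tab))               # one uniform random number per row
  ord <- order(u, decreasing = TRUE)    # row indices, largest uniform first
  unique(tab[ord])[seq_len(m)]          # unique() keeps first occurrences, in order
}

set.seed(1)
tab <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4)   # toy table with repeated elements
sample_by_rows(tab, 3)
```

No per-element counts are computed here; the frequencies enter only through how many rows each element has.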

$\endgroup$
  • $\begingroup$ You describe a situation, but you never directly ask a question. Do you want a way to do the sampling that avoids duplications? Or do you want the probability of avoiding duplications if you do weighted sampling without replacement? $\endgroup$ Commented May 16, 2020 at 22:40
  • $\begingroup$ Sorry if my question is unclear. I want to know a way to do the sampling that avoids duplications, without having to first count the number of times each element occurs in the table. Does that clarify my question? If I could count the number of times $n_i$ each element $i$ occurs, I could use the $U_i^{1/n_i}$ method. $\endgroup$ Commented May 17, 2020 at 2:51

2 Answers

$\begingroup$

Your initial approach doesn't actually work, due to a subtlety in the definition used by your reference (though it's a good approximation if none of the elements are repeated very often). Their weighted random sample is a sequential weighted random sample: at each draw, item $i$ is chosen from the items still remaining with probability $$p_i=\frac{w_i}{\sum_{j\in\text{remaining}} w_j}$$ which does not result in an overall selection probability proportional to $w_i$. Efraimidis and Spirakis don't claim that you get probability proportional to $w_i$, but they also don't go out of their way to make it explicit that you don't. It does work in the sense that it samples $m$ unique values, of course; it just doesn't do so with uniform probability for each element.
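To make the gap concrete, here is a small R sketch that enumerates every ordered sequential draw and accumulates each item's overall inclusion probability. The function name `sequential_inclusion` and the toy weights are assumptions chosen for illustration; the enumeration is only practical for tiny examples:

```r
# Hypothetical helper: exact inclusion probabilities under sequential
# weighted sampling without replacement, by brute-force enumeration.
sequential_inclusion <- function(w, m) {
  n <- length(w)
  incl <- numeric(n)
  recurse <- function(chosen, prob) {
    if (length(chosen) == m) {
      incl[chosen] <<- incl[chosen] + prob   # credit every item in this sample
      return(invisible(NULL))
    }
    remaining <- setdiff(seq_len(n), chosen)
    for (i in remaining) {
      # next draw: weight of i relative to the weights still remaining
      recurse(c(chosen, i), prob * w[i] / sum(w[remaining]))
    }
  }
  recurse(integer(0), 1)
  incl
}

w <- c(2, 1, 1, 1)   # toy weights (assumed)
m <- 2
rbind(sequential   = sequential_inclusion(w, m),
      proportional = m * w / sum(w))
```

With these weights the first item is included with probability $0.7$ under sequential sampling, whereas inclusion proportional to $w_i$ would require $0.8$.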

Your proposed method is equivalent to the Efraimidis and Spirakis method. You pick the first value with probability proportional to its frequency, the second with probability proportional to its frequency among the remaining elements, and so on.
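One way to see the equivalence, assuming the row-level uniforms are independent: let $M_i$ be the largest of the $n_i$ uniforms $U_{i1},\dots,U_{in_i}$ attached to element $i$ (the symbols $M_i$ and $U_{ik}$ are introduced here only for illustration). Then, for $0 \le x \le 1$,

$$P(M_i \le x) = \prod_{k=1}^{n_i} P(U_{ik} \le x) = x^{n_i} = P\left(U_i \le x^{n_i}\right) = P\left(U_i^{1/n_i} \le x\right),$$

so $M_i$ has the same distribution as the key $U_i^{1/n_i}$, and the $M_i$ are independent across elements. Walking down the sorted rows and keeping the first $m$ distinct elements is the same as ranking the elements by $M_i$ and taking the top $m$.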

True probability-proportional-to-size sampling without replacement is surprisingly difficult. A good reference is Yves Tillé's book Sampling Algorithms and his R package with Alina Matei, sampling. There don't seem to be any algorithms that are exact for large samples and populations. (I looked at this about a year ago because I wanted to add an algorithm to R's sample() function, which currently does the sequential sampling.)

$\endgroup$
  • $\begingroup$ There's a part missing in your last sentence. $\endgroup$ Commented Oct 2 at 6:09
  • $\begingroup$ Thanks! There was an internet outage at a critical time. $\endgroup$ Commented Oct 2 at 7:04
$\begingroup$

Suppose you have $c = 100$ balls with colors (coded as numbers 1 through 5) as follows: 40 Red(=1), 20 Blue(=2), 20 Green(=3), 10 Brown(=4), and 10 Purple(=5).

In R you can express the population of $c$ balls as follows:

pop = rep(1:5, c(40,20,20,10,10))

In R, you can choose a random sample of size $m = 3$ from this population as follows:

draw = sample(pop, 3)

Three replications of the experiment look like this:

pop = rep(1:5, c(40,20,20,10,10))
draw = sample(pop, 3); draw
[1] 1 1 2
draw = sample(pop, 3); draw
[1] 1 3 2
draw = sample(pop, 3); draw
[1] 5 4 1

A simulation to approximate the probability of getting $m = 3$ balls of $m$ different colors draws $B = 100\,000$ samples and tabulates the number of distinct colors in each.

set.seed(515)
pop = rep(1:5, c(40,20,20,10,10))
uniq = replicate(10^5, length(unique(sample(pop, 3))))
mean(uniq==3)
[1] 0.39415
2*sd(uniq==3)/sqrt(10^5)
[1] 0.003090619

table(uniq)/10^5
uniq
      1       2       3 
0.07675 0.52910 0.39415 

So the probability of getting three different colors is $0.394 \pm 0.003.$

The combinatorial paths to an exact solution that occur to me right now seem tediously intricate.
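For this particular event, though, one possible shortcut is to count the unordered samples that take one ball from each of three different colors and divide by the number of all unordered samples of size 3. This is a sketch for the specific color counts used above, not a general solution; the variable names are chosen here for illustration:

```r
counts <- c(40, 20, 20, 10, 10)   # balls per color, as in pop

# For every set of three colors, the number of ways to take one ball of each
# is the product of the three counts; sum these over all color triples.
favorable <- sum(combn(counts, 3, FUN = prod))

# Divide by the number of ways to choose any 3 of the 100 balls.
p_exact <- favorable / choose(sum(counts), 3)
p_exact
```

This gives $64000/161700 \approx 0.3958$, which lies within the simulation's margin of error.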

Note: The alternative code illustrated below always chooses three distinct colors, using the specified color probabilities renormalized over the colors still available each time a ball is chosen. (If 1 has already been chosen, then a lower-weighted color must be chosen afterward.)

sample(1:5, 3, prob=c(.4,.2,.2,.1,.1))
[1] 1 4 2
sample(1:5, 3, prob=c(.4,.2,.2,.1,.1))
[1] 1 3 2
sample(1:5, 3, prob=c(.4,.2,.2,.1,.1))
[1] 2 5 4
sample(1:5, 3, prob=c(.4,.2,.2,.1,.1))
[1] 3 2 1
$\endgroup$
