Sample unique elements from an array containing repeated values

Question

I have a table containing elements in $[1,c]$. The elements may be repeated in the table. I want to sample $m$ unique elements from this table.

I can reduce this problem to weighted sampling without replacement. This would require me to a) count the number of times each element occurs - say element $i$ occurs $n_i$ times, b) generate random numbers $U_i^{1/n_i}, 1 \le i \le c$ , and c) pick elements corresponding to the top $m$ values (reference). Here $U_i \sim Unif([0,1])$.
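As a minimal sketch of steps a) to c), assuming the table is an R vector (the names tab, m, counts, keys and picked are illustrative and not from the reference):

```r
# Hypothetical example table: values in 1..5 with repeats
tab <- rep(1:5, c(40, 20, 20, 10, 10))
m <- 3

set.seed(1)
counts <- table(tab)                                      # a) n_i for each distinct element
keys <- runif(length(counts))^(1 / as.numeric(counts))    # b) U_i^(1/n_i)
top <- order(keys, decreasing = TRUE)[1:m]                # c) positions of the m largest keys
picked <- as.integer(names(counts)[top])
picked
```

This assumes the table holds at least $m$ distinct values.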

If I want to do this without counting the frequency of all elements, can I use the following algorithm?

Generate a uniform random number for each row in the table.
Sort these numbers.
Pick the top $m$ values such that their corresponding elements are unique.

Notice that instead of generating one uniform random number per element, this method generates $n_i$ random numbers for element $i$. Is the above algorithm equivalent to weighted sampling without replacement?
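A short sketch of the proposed algorithm in R, again assuming the table is a vector (tab, m, u, ord and picked are illustrative names):

```r
# Hypothetical example table: values in 1..5 with repeats
tab <- rep(1:5, c(40, 20, 20, 10, 10))
m <- 3

set.seed(1)
u <- runif(length(tab))              # one uniform random number per row
ord <- order(u, decreasing = TRUE)   # sort the rows by their random number
picked <- unique(tab[ord])[1:m]      # first m distinct elements in sorted order
picked
```

Sorting rows by independent uniforms is just a random permutation of the rows, so under the same assumptions unique(sample(tab))[1:m] should behave the same way. No frequency counts are needed, but every row gets a random number.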

You describe a situation, but you never directly ask a question. Do you want a way to do the sampling that avoids duplications? Or do you want the probability of avoiding duplications if you do weighted sampling without replacement?
– BruceET, Commented May 16, 2020 at 22:40
Sorry if my question is unclear. I want to know a way to do the sampling that avoids duplications, without having to first count the number of times each element occurs in the table. Does that clarify my question? If I could count the number of times $n_i$ each element $i$ occurs, I could use the $U_i^{1/n_i}$ method.
– elexhobby, Commented May 17, 2020 at 2:51

Thomas Lumley · Accepted Answer · 2025-10-02 07:02:44Z

Your initial approach doesn't actually work, due to a subtlety in the definition used by your reference [it's a good approximation if none of the elements are repeated very often]. Their weighted random sample definition is a sequential weighted random sample: at each draw, a remaining element $i$ is picked with probability $$p_i=\frac{w_i}{\sum_{j \in \text{remaining}} w_j}$$ which does not result in an overall inclusion probability $\propto w_i$, and so doesn't give equal probability to each element. Efraimidis and Spirakis don't claim that you get probability proportional to $w_i$, but they also don't go out of their way to make it explicit that you don't. It does work in the sense that it samples $m$ unique values, of course; it just doesn't do so with uniform probability for each element.
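To illustrate the point, here is a small sketch that computes exact inclusion probabilities under sequential sampling for $m = 2$. The weights are an assumed example (the 40/20/20/10/10 split used elsewhere in this thread), and w, k and incl are illustrative names:

```r
w <- c(0.4, 0.2, 0.2, 0.1, 0.1)   # assumed example weights, summing to 1
k <- length(w)

# P(i is in the sample) = P(i drawn first)
#   + sum over j != i of P(j drawn first) * P(i drawn second | j removed)
incl <- sapply(seq_len(k), function(i) {
  others <- setdiff(seq_len(k), i)
  w[i] + sum(w[others] * w[i] / (1 - w[others]))
})

round(incl, 4)    # inclusion probabilities under sequential sampling
round(2 * w, 4)   # what inclusion probability proportional to w_i would require
```

With these weights the heaviest element is included with probability about 0.689 rather than 0.8, and the lighter elements are included more often than proportionality would give.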

Your proposed method is equivalent to the Efraimidis and Spirakis method. You pick the first value with probability proportional to its frequency, the second with probability proportional to its frequency among the remaining elements, and so on.
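One way to see the equivalence is sketched below. The notation $U_{i,1},\dots,U_{i,n_i}$ for the uniforms attached to the $n_i$ rows holding element $i$, and $V_i$ for the largest of them, is introduced here for illustration. For $0 \le x \le 1$,

$$P(V_i \le x) = P(U_{i,1} \le x, \dots, U_{i,n_i} \le x) = x^{n_i} = P\left(U_i^{1/n_i} \le x\right).$$

So $V_i$ has the same distribution as the key $U_i^{1/n_i}$, and the $V_i$ are independent across elements because they are built from disjoint sets of rows. Taking the top $m$ unique elements from the sorted rows is the same as ranking the elements by $V_i$, which matches ranking them by $U_i^{1/n_i}$ in distribution.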

True probability-proportional-to-size sampling without replacement is surprisingly difficult. A good reference is Yves Tillé's book Sampling Algorithms and his R package with Alina Matei, sampling. There don't seem to be any algorithms that are exact for large samples and populations [I looked at this about a year ago because I wanted to add an algorithm to R's sample() function, which currently does the sequential sampling].

There's a part missing in your last sentence.
– J-J-J, Commented Oct 2 at 6:09
Thanks! There was an internet outage at a critical time.
– Thomas Lumley, Commented Oct 2 at 7:04

BruceET · Accepted Answer · 2020-05-16 23:42:43Z

Suppose you have $c = 100$ balls with colors (coded as numbers 1 through 5) as follows: 40 Red(=1), 20 Blue(=2), 20 Green(=3), 10 Brown(=4), and 10 Purple(=5).

In R you can express the population of $c$ balls as follows:

pop = rep(1:5, c(40,20,20,10,10))

In R, you can choose a random sample of size $m = 3$ from this population as follows:

draw = sample(pop, 3)

Three replications of the experiment look like this:

pop = rep(1:5, c(40,20,20,10,10))
draw = sample(pop, 3); draw
[1] 1 1 2
draw = sample(pop, 3); draw
[1] 1 3 2
draw = sample(pop, 3); draw
[1] 5 4 1

A simulation to approximate the probability of getting $m = 3$ balls of three different colors draws $B = 100\,000$ samples and summarizes the results.

set.seed(515)
pop = rep(1:5, c(40,20,20,10,10))
uniq = replicate(10^5, length(unique(sample(pop, 3))))
mean(uniq==3)
[1] 0.39415
2*sd(uniq==3)/sqrt(10^5)
[1] 0.003090619

table(uniq)/10^5
uniq
      1       2       3 
0.07675 0.52910 0.39415

So the probability of getting three different colors is $0.394 \pm 0.003.$

The combinatorial paths to an exact solution that occur to me right now seem tediously intricate.
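For this particular setup, one possible shortcut is to count unordered samples of three balls directly. The sketch below is not part of the original simulation, and counts, N, p_three, p_one and p_two are illustrative names. The number of three-ball samples with three different colors is the sum, over all triples of colors, of the product of their ball counts:

```r
counts <- c(40, 20, 20, 10, 10)   # balls per color, as in the example
N <- sum(counts)

# Three different colors: choose 3 colors, then one ball of each
p_three <- sum(combn(counts, 3, prod)) / choose(N, 3)
# All three balls the same color
p_one <- sum(choose(counts, 3)) / choose(N, 3)
# Exactly two colors is whatever probability is left
p_two <- 1 - p_three - p_one

round(c(one = p_one, two = p_two, three = p_three), 4)
```

This gives $64000/161700 \approx 0.396$ for three different colors, which falls inside the simulation's margin of error.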

Note: The alternative code below always chooses three different colors, using the specified probabilities for the colors still available each time a ball is chosen. (If 1 has already been chosen, then a lower-weighted ball must be chosen afterward.)

sample(1:5, 3, p=c(.4,.2,.2,.1,.1))
[1] 1 4 2
sample(1:5, 3, p=c(.4,.2,.2,.1,.1))
[1] 1 3 2
sample(1:5, 3, p=c(.4,.2,.2,.1,.1))
[1] 2 5 4
sample(1:5, 3, p=c(.4,.2,.2,.1,.1))
[1] 3 2 1
