Info & Theory
Reservoir sampling picks k items uniformly
at random from a stream whose length N you do not
know in advance, using only one pass and O(k)
memory.
Algorithm R
- Put the first
kitems into the reservoir. -
For each later item
i > k, pick a random integerjin1…i. Ifj ≤ k, replace slotjwith itemi; otherwise discard it.
So item i enters the reservoir with probability
exactly k/i.
Why it is uniform
Item i survives to the end if it is chosen on step
i (prob k/i) and is then never evicted.
The probability of not being evicted on a later step
m is 1 − 1/m. Multiplying gives
(k/i) · ∏(m=i+1..N) (1 − 1/m) = (k/i)·(i/N) = k/N
so every element ends up in the sample with the same
probability k/N. The histogram on the right is the
empirical inclusion rate per item over many runs — it converges
to the flat dashed line at k/N.