Sketchy Polytopes

Estimating uniques - sampling with replacement

While exploring ways to estimate multiset union and intersection cardinalities, I discovered an interesting result mentioned in a paper by Shukla [1], who in turn cites Feller [2]: If r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is n−n(1−1/n)r.

Given an element a from the set, the probability that there are none of the elements after r draws is a is (1−1/n)r. Thus the probability that there is at least one element is a (or equivalently that there is at least one unique item) is 1−(1−1/n)r, and hence the expected number of uniques is n−n(1−1/n)r. Alternatively, as Feller explains, the probability of a single unique at the ith step is ((1−1/n)/n)i−1, hence the expected number of uniques after r draws is

r∑i=1((1−1/n)/n)i−1

A geometric series that converges to n−n(1−1/n)r.

This is an instance of the Birthday problem. Stackexchange has a couple of other more complicated but interesting solutions.

  • 1 A. Shukla et al, Storage estimation for multidimensional aggregates in the presence of hierarchies
  • [2] W. Feller, An introduction to probability theory and its applications, vol. 1, 1957