On Thu, Dec 09, 2021 at 03:45:06PM -0700, David Fifield wrote:
We can use a capture–recapture technique to estimate the total population size. https://en.wikipedia.org/wiki/Mark_and_recapture#Lincoln%E2%80%93Petersen_es... Divide the 1000 images into 2 equal halves, and count the unique images in each half: n = 488 in the first half, K = 492 in the second. The number of images in the second half that were already seen in the first half is k = 23. The estimate is N = n*K/k = 488*492/23 = 10439, so I guess the captcha cache dir on the BridgeDB server holds only about 10000 images.
>>> pop = list(open("bridgedb.hashes"))
>>> s1, s2 = set(pop[:len(pop)//2]), set(pop[len(pop)//2:])
>>> len(s1)
488
>>> len(s2)
492
>>> len(s1.intersection(s2))
23
>>> len(s1)*len(s2)/len(s1.intersection(s2))
10438.95652173913
BridgeDB should have 10,000 CAPTCHAs; at least it did when I last generated a batch, in January 2020: https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/issues/24607#no...
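As a rough sanity check that the two numbers agree, here is a minimal simulation sketch. It assumes (and this is only an assumption, not something confirmed about BridgeDB) that each request is served an image chosen uniformly at random from a fixed pool of 10,000. The helper names `lincoln_petersen` and `simulate` are made up for illustration.

```python
import random
import statistics


def lincoln_petersen(samples):
    """Split samples into two halves and return n*K/k.

    n and K are the unique counts in each half, k is their overlap.
    Returns None when the halves share nothing (estimate undefined).
    """
    half = len(samples) // 2
    s1, s2 = set(samples[:half]), set(samples[half:])
    overlap = len(s1 & s2)
    if overlap == 0:
        return None
    return len(s1) * len(s2) / overlap


def simulate(true_size=10_000, draws=1000, trials=200, seed=0):
    """Repeat the experiment under the uniform-sampling assumption."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Each draw stands in for one fetched CAPTCHA image hash.
        samples = [rng.randrange(true_size) for _ in range(draws)]
        est = lincoln_petersen(samples)
        if est is not None:
            estimates.append(est)
    return estimates


if __name__ == "__main__":
    ests = simulate()
    print("median estimate:", statistics.median(ests))
    print("min, max:", min(ests), max(ests))
```

With only about two dozen overlapping images per run, individual estimates scatter quite a bit, so the single figure of 10439 is best read as "consistent with 10,000" rather than as a precise measurement.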