For the Moat and HTTPS distributors, BridgeDB uses a cache of pregenerated captcha images. It does not generate a fresh captcha for every challenge.

https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d...
> ...The second method uses a local cache of pre-made CAPTCHAs,
> created by scripting Gimp using gimp-captcha. The latter
> cannot easily be run on headless server, unfortunately,
> because Gimp requires an X server to be installed.

https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d...
    imageFilename = random.SystemRandom().choice(os.listdir(self.cacheDir))
    imagePath = os.path.join(self.cacheDir, imageFilename)
    with open(imagePath, 'rb') as imageFile:
        self.image = imageFile.read()
It may be that there are simply too few pregenerated captcha images. If the cache holds N images in total and an adversary invests the effort to solve n of them, then the adversary will be served a captcha it already knows in a fraction n / N of later bridge queries, until the cache of pregenerated images is regenerated.
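To make the n / N argument concrete, here is a minimal sketch that computes the expected fraction and checks it with a small simulation. It assumes each challenge is drawn uniformly at random from the cache, as the random.SystemRandom().choice() call suggests; the function names and the example numbers are hypothetical, not part of BridgeDB.

```python
import random


def expected_hit_fraction(solved, total):
    """Fraction of later queries that return an already-solved captcha."""
    if total <= 0 or not 0 <= solved <= total:
        raise ValueError("need 0 <= solved <= total and total > 0")
    return solved / total


def simulate_hit_fraction(solved, total, queries, seed=0):
    """Monte Carlo check of the same quantity.

    Assumption: the adversary has solved images 0..solved-1 and every
    query draws one image uniformly, with replacement.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(queries) if rng.randrange(total) < solved)
    return hits / queries


if __name__ == "__main__":
    # Hypothetical numbers: 1000 solved images out of a cache of 10000.
    print(expected_hit_fraction(1000, 10000))
    print(simulate_hit_fraction(1000, 10000, queries=20000))
```

Under these assumptions, the adversary's success rate grows linearly with the number of captchas it solves by hand, and it only resets when the cache is replaced.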
I downloaded 1000 captcha images from the Moat API and hashed them:

    for a in $(seq 1 1000); do
        curl -s -x socks5h://127.0.0.1:9050/ https://bridges.torproject.org/moat/fetch \
            -H 'Content-type: application/vnd.api+json' \
            --data-raw '{"data": [{"version": "0.1.0", "type": "client-transports"}]}' \
            | jq '.data[0].image' | sha256sum
    done | tee bridgedb.hashes
Out of 1000 images drawn randomly with replacement,
    916 appeared 1 time
     39 appeared 2 times
      2 appeared 3 times
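A tally like this one can be produced from the hash file with a few lines of Python. This is a sketch of one way to do it, not necessarily how the counts in this post were obtained; multiplicity_histogram is a hypothetical helper name.

```python
from collections import Counter


def multiplicity_histogram(hashes):
    """Map 'times seen' -> 'number of distinct images seen that many times'.

    `hashes` is any iterable of hash strings, for example the lines of
    bridgedb.hashes; surrounding whitespace is stripped before counting.
    """
    per_image = Counter(h.strip() for h in hashes)
    return dict(sorted(Counter(per_image.values()).items()))


# Usage, assuming bridgedb.hashes is in the current directory:
#     with open("bridgedb.hashes") as f:
#         for times, images in multiplicity_histogram(f).items():
#             print(images, "appeared", times, "time(s)")
```

The number of distinct images seen is the sum of the histogram values.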
We can use a capture–recapture technique to estimate the total population size.
https://en.wikipedia.org/wiki/Mark_and_recapture#Lincoln%E2%80%93Petersen_es...

Divide the 1000 samples into 2 equal halves of 500 draws each, and count the unique images in each half: n = 488, k = 492. The number of images in the second half that were already seen in the first half is K = 23. The estimate is N = n*k/K = 488*492/23 = 10439, so I guess the captcha cache dir on the BridgeDB server holds only about 10000 images.

    >>> pop = list(open("bridgedb.hashes"))
    >>> s1, s2 = set(pop[:len(pop)//2]), set(pop[len(pop)//2:])
    >>> len(s1)
    488
    >>> len(s2)
    492
    >>> len(s1.intersection(s2))
    23
    >>> len(s1)*len(s2)/len(s1.intersection(s2))
    10438.95652173913
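The same estimate can be wrapped in a small function. The sketch below also includes the Chapman variant of the estimator, which is not used in this post but is a standard correction that is less biased when the overlap is small. Function and parameter names are hypothetical.

```python
def lincoln_petersen(first_unique, second_unique, overlap):
    """Lincoln-Petersen estimate: first_unique * second_unique / overlap."""
    if overlap <= 0:
        raise ValueError("estimate is undefined without any overlap")
    return first_unique * second_unique / overlap


def chapman(first_unique, second_unique, overlap):
    """Chapman's variant; it stays finite even when overlap is 0."""
    return (first_unique + 1) * (second_unique + 1) / (overlap + 1) - 1


if __name__ == "__main__":
    # Counts from the two halves of the 1000 downloaded images.
    print(round(lincoln_petersen(488, 492, 23)))
    print(round(chapman(488, 492, 23)))
```

With an overlap of only 23 images, either figure is an order-of-magnitude estimate rather than an exact count; both are close to 10000.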
It would be best to generate a fresh captcha image for each challenge, but if that's not possible, we should increase the number of cached images or regenerate the cache periodically.