For the Moat and HTTPS distributors, BridgeDB uses a cache of pregenerated captcha images. It does not generate a fresh captcha for every challenge.

https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d...
> ...The second method uses a local cache of pre-made CAPTCHAs,
> created by scripting Gimp using gimp-captcha. The latter
> cannot easily be run on headless server, unfortunately,
> because Gimp requires an X server to be installed.

https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d...
    imageFilename = random.SystemRandom().choice(os.listdir(self.cacheDir))
    imagePath = os.path.join(self.cacheDir, imageFilename)
    with open(imagePath, 'rb') as imageFile:
        self.image = imageFile.read()
It may be that there are simply too few pregenerated captcha images. If there are N total, and an adversary invests effort to solve n of them, then the adversary will get a captcha it already knows in a fraction n / N of later bridge queries, until the cache of pregenerated images is regenerated.
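As a rough illustration of the n / N argument, here is a minimal Python sketch. The function names and the example numbers (a 10,000-image cache with 1,000 images solved) are assumptions for illustration only, and the sketch assumes every challenge is drawn uniformly at random from the cache.

```python
import random


def known_fraction(n_solved, cache_size):
    """Fraction of queries expected to return an already-solved captcha,
    assuming each challenge is a uniform draw from the cache."""
    return n_solved / cache_size


def simulate_known_fraction(n_solved, cache_size, queries, seed=0):
    """Monte Carlo check of known_fraction().

    Images are modelled as integers 0..cache_size-1; which ones the
    adversary solved does not matter under uniform selection, so we
    simply treat the first n_solved as known.
    """
    rng = random.Random(seed)
    hits = sum(rng.randrange(cache_size) < n_solved for _ in range(queries))
    return hits / queries


# Hypothetical numbers: 1,000 solved images out of a 10,000-image cache.
analytic = known_fraction(1000, 10000)
simulated = simulate_known_fraction(1000, 10000, queries=20000)
```

Under these assumptions, doubling the cache size halves the adversary's hit rate for the same solving effort, while regenerating the cache resets it to zero.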
I downloaded 1000 captcha images from the Moat API and hashed them:

    for a in $(seq 1 1000); do
        curl -s -x socks5h://127.0.0.1:9050/ https://bridges.torproject.org/moat/fetch \
            -H 'Content-type: application/vnd.api+json' \
            --data-raw '{"data": [{"version": "0.1.0", "type": "client-transports"}]}' \
            | jq '.data[0].image' | sha256sum
    done | tee bridgedb.hashes
Out of 1000 images drawn randomly with replacement,
    916 appeared 1 time
     39 appeared 2 times
      2 appeared 3 times
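A tally like the one above could be computed from bridgedb.hashes with something like the following sketch. The function names are hypothetical, and it assumes the file holds one sha256sum output line per downloaded image.

```python
from collections import Counter


def load_hashes(path):
    """Read one hash line per downloaded image, dropping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def multiplicity_histogram(hashes):
    """Map 'number of times an image appeared' -> 'number of such images'."""
    per_image = Counter(hashes)          # image hash -> times seen
    return Counter(per_image.values())   # times seen -> number of images


def print_histogram(hist):
    for times, images in sorted(hist.items()):
        print(f"{images} appeared {times} time{'s' if times != 1 else ''}")


# Usage (assumes the file produced by the download loop exists):
#   print_histogram(multiplicity_histogram(load_hashes("bridgedb.hashes")))
```

The sum of images times appearances over all rows should equal the number of downloads, which is a quick check that no line was lost.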
We can use a capture–recapture technique to estimate the total population size.
https://en.wikipedia.org/wiki/Mark_and_recapture#Lincoln%E2%80%93Petersen_es...
Divide the 1000 images into 2 equal halves, and count the unique images in each half: n = 488, k = 492. The number of images in the second half that were already seen in the first half is K = 23. The estimate for N = n*k/K = 488*492/23 = 10439, so I guess the captcha cache dir on the BridgeDB server holds only about 10000 images.

    >>> pop = list(open("bridgedb.hashes"))
    >>> s1, s2 = set(pop[:len(pop)//2]), set(pop[len(pop)//2:])
    >>> len(s1)
    488
    >>> len(s2)
    492
    >>> len(s1.intersection(s2))
    23
    >>> len(s1)*len(s2)/len(s1.intersection(s2))
    10438.95652173913
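The same estimate can be wrapped in a small reusable function. The sketch below is not part of the original analysis. Alongside the Lincoln-Petersen estimate it adds two things: the Chapman variant, a standard adjustment that is less biased for small overlap counts, and the expected number of distinct images in a uniform sample, as a sanity check on the estimate. All function names are hypothetical.

```python
def lincoln_petersen(first_unique, second_unique, overlap):
    """Lincoln-Petersen estimate: first_unique * second_unique / overlap."""
    if overlap == 0:
        raise ValueError("no overlap between the samples; estimate is unbounded")
    return first_unique * second_unique / overlap


def chapman(first_unique, second_unique, overlap):
    """Chapman's variant of the estimator, defined even when overlap is 0."""
    return (first_unique + 1) * (second_unique + 1) / (overlap + 1) - 1


def estimate_from_samples(first, second, estimator=lincoln_petersen):
    """Apply an estimator to two samples given as iterables of image hashes."""
    s1, s2 = set(first), set(second)
    return estimator(len(s1), len(s2), len(s1 & s2))


def expected_unique(population, draws):
    """Expected number of distinct items in `draws` uniform draws
    with replacement from `population` items."""
    return population * (1 - (1 - 1 / population) ** draws)


# Counts reported above: 488 and 492 unique images, 23 seen in both halves.
lp = lincoln_petersen(488, 492, 23)
ch = chapman(488, 492, 23)
```

As a sanity check, expected_unique(10000, 1000) can be compared with the 957 distinct images observed (916 + 39 + 2). With only 23 images seen in both halves, either estimate carries a wide margin of error.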
It would be best to generate a fresh captcha image for each challenge, but if that's not possible, we should increase the number of cached images or regenerate the cache periodically.
On Thu, Dec 09, 2021 at 03:45:06PM -0700, David Fifield wrote:
> We can use a capture–recapture technique to estimate the total population size.
> https://en.wikipedia.org/wiki/Mark_and_recapture#Lincoln%E2%80%93Petersen_es...
> Divide the 1000 images into 2 equal halves, and count the unique images in each half: n = 488, k = 492. The number of images in the second half that were already seen in the first half is K = 23. The estimate for N = n*k/K = 488*492/23 = 10439, so I guess the captcha cache dir on the BridgeDB server holds only about 10000 images.
>
>     >>> pop = list(open("bridgedb.hashes"))
>     >>> s1, s2 = set(pop[:len(pop)//2]), set(pop[len(pop)//2:])
>     >>> len(s1)
>     488
>     >>> len(s2)
>     492
>     >>> len(s1.intersection(s2))
>     23
>     >>> len(s1)*len(s2)/len(s1.intersection(s2))
>     10438.95652173913
BridgeDB should have 10,000 CAPTCHAs; at least it did when I last generated a batch, in January 2020: https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/issues/24607#no...
Quoting David Fifield (2021-12-09 23:45:06)
> It would be best to generate a fresh captcha image for each challenge, but if that's not possible, we should increase the number of cached images or regenerate the cache periodically.
Our current mechanism to generate captchas is: https://github.com/isislovecruft/gimp-captcha
gimp-captcha requires Gimp and might not be fast enough to generate a captcha per request. Besides, I'm not sure how TPA will feel about installing Gimp on the server. We could consider other options.
I see there are a few libraries in Go or Python for it (it will require some investigation to see whether their captchas are much easier to break than the Gimp ones). Or we could use reCAPTCHA (which the BridgeDB README says is supported) or hCaptcha, though I guess those would raise doubts about the privacy of the users.
There is a conversation about deprecating the captchas (as they are broken in many situations and are hard for many people), and we are setting up a new API[0] that will not have captchas to see how it goes.
Anyway, I would prefer not to change how we serve captchas until we reimplement moat in rdsys. But we could regenerate the captchas; I don't think anybody has done it since phw did it over a year ago. I created an issue to do it: https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/issues/24607#no...
[0] https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/issues/40025
anti-censorship-team@lists.torproject.org