Insufficiently many pregenerated BridgeDB captchas? - anti-censorship-team

9 Dec 2021


For the Moat and HTTPS distributors, BridgeDB uses a cache of
pregenerated captcha images. It does not generate a fresh captcha for
every challenge.
https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d...
    > ...The second method uses a local cache of pre-made CAPTCHAs,
    > created by scripting Gimp using gimp-captcha. The latter
    > cannot easily be run on headless server, unfortunately,
    > because Gimp requires an X server to be installed.
https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d...
    imageFilename = random.SystemRandom().choice(os.listdir(self.cacheDir))
    imagePath = os.path.join(self.cacheDir, imageFilename)
    with open(imagePath, 'rb') as imageFile:
        self.image = imageFile.read()
It may be that there are simply too few pregenerated captcha images. If
there are N total, and an adversary invests effort to solve n of them,
then the adversary will be served a captcha it already knows in a
fraction n / N of later bridge queries, until the cache of pregenerated
images is regenerated.
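To make the n / N argument concrete, here is a minimal sketch in Python.
The cache size and the number of solved captchas are made-up
illustrative values, not measurements, and the function names are
hypothetical. The simulation assumes each challenge is drawn uniformly
at random from the cache, which matches the random choice in the code
quoted above.

```python
import random

def known_fraction(solved, cache_size):
    """Expected fraction of challenges the adversary has already solved."""
    return solved / cache_size

def simulate(solved, cache_size, queries, seed=0):
    """Draw `queries` challenges uniformly from the cache and count hits."""
    rng = random.Random(seed)
    known = set(range(solved))  # pretend the adversary solved images 0..solved-1
    hits = sum(rng.randrange(cache_size) in known for _ in range(queries))
    return hits / queries

# Illustrative numbers only: 1000 solved out of a cache of 10000.
print(known_fraction(1000, 10000))
print(simulate(1000, 10000, 100000))
```

The simulated fraction should land close to n / N, and it stays there
for as long as the same cache is in use.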
I downloaded 1000 captcha images from the Moat API and hashed them:
    # Fetch 1000 challenges through the local SOCKS proxy and hash each image field.
    for a in $(seq 1 1000); do
        curl -s -x socks5h://127.0.0.1:9050/ https://bridges.torproject.org/moat/fetch \
            -H 'Content-type: application/vnd.api+json' \
            --data-raw '{"data": [{"version": "0.1.0", "type": "client-transports"}]}' \
        | jq '.data[0].image' | sha256sum
    done | tee bridgedb.hashes
Out of 1000 images drawn randomly with replacement,
    916 appeared 1 time
     39 appeared 2 times
      2 appeared 3 times
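A tally like the one above can be produced from the list of hashes with
two nested counters. This is only a sketch of one way to do it, not
necessarily how the numbers were originally computed. The function names
are hypothetical, and the short sample list stands in for the lines of
bridgedb.hashes.

```python
from collections import Counter

def multiplicity_table(hashes):
    """Map 'times seen' -> 'number of distinct images seen that many times'."""
    per_image = Counter(hashes)          # hash -> how often it was drawn
    return Counter(per_image.values())   # count -> how many hashes had that count

def total_draws(table):
    """Recover the number of draws from a multiplicity table."""
    return sum(times * images for times, images in table.items())

def distinct_images(table):
    """Number of distinct images represented in a multiplicity table."""
    return sum(table.values())

# Tiny stand-in sample; in practice pass list(open("bridgedb.hashes")).
sample = ["a", "b", "a", "c", "b", "a"]
for times, images in sorted(multiplicity_table(sample).items()):
    print(f"{images:7d} appeared {times} time(s)")

# Sanity check on the reported table: it accounts for all 1000 draws.
reported = {1: 916, 2: 39, 3: 2}
print(total_draws(reported), distinct_images(reported))
```

The reported counts are consistent: 916 + 2*39 + 3*2 = 1000 draws,
covering 957 distinct images.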
We can use a capture–recapture technique to estimate the total
population size.
https://en.wikipedia.org/wiki/Mark_and_recapture#Lincoln%E2%80%93Petersen_es...
Divide the 1000 images into two equal halves of 500, and count the
unique images in each half: n = 488, k = 492. The number of images in
the second half that were already seen in the first half is K = 23. The
estimate is N = n*k/K = 488*492/23 ≈ 10439, so I guess the captcha cache
dir on the BridgeDB server holds only about 10000 images.
    >>> pop = list(open("bridgedb.hashes"))
    >>> s1, s2 = set(pop[:len(pop)//2]), set(pop[len(pop)//2:])
    >>> len(s1)
    488
    >>> len(s2)
    492
    >>> len(s1.intersection(s2))
    23
    >>> len(s1)*len(s2)/len(s1.intersection(s2))
    10438.95652173913
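The same calculation can be wrapped in a function and cross-checked with
two other estimators. Neither cross-check is part of the original
analysis; they are added here as an assumption-laden sanity check.
Chapman's variant of the Lincoln-Petersen estimator is commonly
preferred when the number of recaptures is small. The collision-based
estimate uses the fact that, for D uniform draws with replacement from N
items, the expected number of colliding pairs is C(D, 2) / N. All
function names are hypothetical.

```python
from math import comb

def lincoln_petersen(n, k, K):
    """n, k: distinct items in each sample; K: items seen in both."""
    if K == 0:
        raise ValueError("no recaptures, estimate is unbounded")
    return n * k / K

def chapman(n, k, K):
    """Chapman's variant; defined even when K == 0."""
    return (n + 1) * (k + 1) / (K + 1) - 1

def collision_estimate(table):
    """table maps 'times seen' -> 'number of distinct images'."""
    draws = sum(times * images for times, images in table.items())
    # An image seen t times contributes C(t, 2) colliding pairs.
    pairs = sum(images * comb(times, 2) for times, images in table.items())
    if pairs == 0:
        raise ValueError("no collisions, estimate is unbounded")
    return comb(draws, 2) / pairs

print(lincoln_petersen(488, 492, 23))              # 10438.95652173913
print(chapman(488, 492, 23))                       # 10043.875
print(collision_estimate({1: 916, 2: 39, 3: 2}))   # 11100.0
```

All three land in the neighborhood of 10000 to 11000, which supports the
rough size estimate without pinning it down precisely.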
It would be best to generate a fresh captcha image for each challenge,
but if that's not possible, we should increase the number of cached
images or regenerate the cache periodically.