CAPTCHA Monitoring Project Updates & Findings - tor-dev

10 Jul 2020


Hi everyone,

I have made progress on the Cloudflare CAPTCHA Monitoring project since my
last email, and I wanted to share some updates & findings. This year the Tor
Project is participating in GSoC under the DIAL umbrella, and I have
already been posting weekly updates to the DIAL blog [1]. I started
mirroring these updates [2] to my project's wiki page, and I will be
posting more frequent updates here.

a) Updates:

Firstly, I moved the wiki page that explains the project, the code base,
and the issue tracker to Tor Project's GitLab. They are all in the same
GitLab project. You can find detailed information about the project on the
wiki page [3] and leave comments & suggestions within that repository by
creating issues.

Secondly, I got a fully functioning system up and running. The system
fetches various URLs with Tor Browser & Firefox over Tor and checks for
CAPTCHAs. The system also checks if any third-party code was injected by
comparing the hash of the received page with an expected hash value. It
repeats these experiments using different exit relays and records results.
You can view the results on the dashboard [4] I created. I'm looking for
more URLs to track for CAPTCHAs. Feel free to share websites that you
frequently visit and that show you CAPTCHAs, so that I can track them with
this tool as well. I want to experiment with all types of CAPTCHAs, so
these URLs don't have to be fronted by Cloudflare.
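
The hash comparison described above can be illustrated with a minimal
sketch. It assumes SHA-256 and a static test page whose expected hash was
recorded from a trusted fetch; the hash function actually used by the
project is not stated here, and the function names are hypothetical.

```python
import hashlib


def sha256_hex(body: bytes) -> str:
    """Return the SHA-256 digest of a page body as a hex string."""
    return hashlib.sha256(body).hexdigest()


def page_was_modified(received_body: bytes, expected_hash: str) -> bool:
    """Return True if the received body does not hash to the expected value."""
    return sha256_hex(received_body) != expected_hash.lower()


# Hypothetical example: the expected hash comes from a trusted fetch
# of a static page (assumption: the page content never changes).
reference_body = b"<html><body>static test page</body></html>"
expected = sha256_hex(reference_body)

clean = page_was_modified(reference_body, expected)
tampered = page_was_modified(reference_body + b"<script>x()</script>", expected)
print(clean, tampered)
```

A check like this only makes sense for pages that are byte-for-byte stable
between fetches; dynamic pages would need a different comparison.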

b) Findings:

So far, I have observed that using the Tor Browser Bundle out of the box,
without changing its configuration, doesn't lead to a high CAPTCHA rate on
Cloudflare-fronted websites (assuming the website owners don't explicitly
block exit relays [5]). That said, changing the user agent, or making any
other modification that makes your browser's fingerprint deviate from that
of a typical Tor Browser user, significantly increases the chance of
getting CAPTCHAs. For example, using regular Firefox over Tor resulted in
CAPTCHAs in ~90% of the measurements. I believe Cloudflare is very
aggressive toward "Firefox over Tor" users because many people,
unfortunately, use Chromium/Firefox + Selenium + Tor to scrape web pages
and bypass IP-based rate limits. That's why I'm interested in hearing
about your specific
browser/Tor configurations to test them with the CAPTCHA Monitor. Not
everyone is affected in the same way because of these differences in the
way we use Tor, but we can understand which differences affect the CAPTCHA
rate more than others by experimenting.
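
To make the "Firefox over Tor" configuration concrete, here is a minimal
sketch of the Firefox preferences that route a regular Firefox through a
local Tor SOCKS proxy, with an optional user-agent override. The helper
name is hypothetical, and port 9050 assumes a standalone tor daemon with
its default SocksPort (Tor Browser's bundled tor usually listens on 9150).
This is not necessarily how the CAPTCHA Monitor configures its browsers.

```python
def firefox_over_tor_prefs(socks_host="127.0.0.1", socks_port=9050,
                           user_agent=None):
    """Build a dict of Firefox preferences for browsing through Tor."""
    prefs = {
        "network.proxy.type": 1,                 # 1 = manual proxy settings
        "network.proxy.socks": socks_host,
        "network.proxy.socks_port": socks_port,
        "network.proxy.socks_version": 5,
        "network.proxy.socks_remote_dns": True,  # resolve DNS via the proxy
    }
    if user_agent is not None:
        # Overriding the user agent changes the browser's fingerprint.
        prefs["general.useragent.override"] = user_agent
    return prefs


default_prefs = firefox_over_tor_prefs()
# "ExampleAgent/1.0" is a made-up value for illustration only.
custom_prefs = firefox_over_tor_prefs(user_agent="ExampleAgent/1.0")
print(default_prefs)
```

With Selenium, each pair could be applied to a Firefox Options object
through its set_preference(name, value) method before starting the driver.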

Additionally, I observed that the TLS fingerprint plays a significant role
in whether someone gets a CAPTCHA or not. As a part of the project, I
decided to capture the HTTP headers during measurements to understand how
they affect the CAPTCHA rates. Initially, I was using a Python library
called seleniumwire to capture the HTTP headers by intercepting the traffic
between the Tor Browser and Tor. With this setup, I got CAPTCHAs about 98%
of the time. seleniumwire forwards the traffic transparently, but it has a
different TLS fingerprint than Tor Browser. I figured out that the
difference in the TLS fingerprints was triggering the MITM detection on the
Cloudflare side, resulting in very high CAPTCHA rates.
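
As an illustration of why an intercepting proxy looks different on the
wire, the sketch below computes a JA3-style fingerprint, a publicly
documented way of summarizing a TLS ClientHello. It is an assumption used
for illustration only: which fingerprinting method Cloudflare uses is not
known here, and the ClientHello values are made up, not captured from Tor
Browser or seleniumwire.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Join ClientHello fields in JA3 order: values '-'-separated, fields ','-separated."""
    fields = [
        str(tls_version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in curves),
        "-".join(str(v) for v in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Return the MD5 hex digest of the JA3 string."""
    text = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Illustrative values only: the two clients differ just in cipher order.
browser_hello = (771, [4865, 4866, 49195], [0, 10, 11], [29, 23], [0])
proxy_hello = (771, [49195, 4865, 4866], [0, 10, 11], [29, 23], [0])

browser_fp = ja3_hash(*browser_hello)
proxy_fp = ja3_hash(*proxy_hello)
print(browser_fp != proxy_fp)
```

Even a small difference, such as the order of offered cipher suites, yields
a different fingerprint, which is consistent with a proxy's TLS stack being
distinguishable from the browser's own.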

Interestingly, I tried the exact same Tor Browser & seleniumwire setup, but
without Tor, and I got practically no CAPTCHAs. I believe the MITM
detection is more aggressive if the traffic is coming through an exit
relay. So, I stopped using seleniumwire to capture headers because it
didn't reflect what a real human Tor Browser user usually experiences.
Please feel free to use the sample code [6] that I used to combine
seleniumwire and Tor if you are interested in experimenting further with
this.

c) Next:

I will work on collecting more metrics by testing more configurations and
websites. I will create a "Relay Search" section on the dashboard, where
CAPTCHA statistics for the relays (exit relays for now) will be available.
I will also work on using the collected data to predict the probability of
getting CAPTCHAs with a given exit relay and configuration/setup.
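
One simple way such a prediction could start is a smoothed per-relay,
per-configuration estimate. The sketch below uses add-one (Laplace)
smoothing so that relays with few measurements are not assigned extreme
probabilities. This is only one possible approach, not the method the
project will use, and the measurements are made-up toy data, not results
from the dashboard.

```python
from collections import defaultdict


def estimate_captcha_probability(measurements, prior_captcha=1.0,
                                 prior_clean=1.0):
    """Estimate P(CAPTCHA) per (relay, config) pair with additive smoothing."""
    counts = defaultdict(lambda: [0, 0])  # key -> [captcha count, total count]
    for relay, config, got_captcha in measurements:
        entry = counts[(relay, config)]
        entry[0] += 1 if got_captcha else 0
        entry[1] += 1
    return {
        key: (captchas + prior_captcha) / (total + prior_captcha + prior_clean)
        for key, (captchas, total) in counts.items()
    }


# Hypothetical toy measurements: (exit relay, configuration, got CAPTCHA?)
toy_measurements = [
    ("relayA", "tor_browser", False),
    ("relayA", "tor_browser", False),
    ("relayA", "firefox_over_tor", True),
    ("relayB", "firefox_over_tor", True),
    ("relayB", "firefox_over_tor", False),
]

estimates = estimate_captcha_probability(toy_measurements)
for key, probability in sorted(estimates.items()):
    print(key, round(probability, 3))
```

A real model would likely also need features such as the target website
and the time of measurement, which this sketch ignores.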

Best,
Barkin

[1] https://hub.osc.dial.community/t/tor-project-cloudflare-captcha-monitoring/1...
[2] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/Updates
[3] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/home
[4] https://dashboard.captcha.wtf/
[5] Cloudflare has a setting to block all traffic originating from the Tor
network, but that setting is not turned on by default.
[6] https://gist.github.com/woswos/38b921f0b82de009c12c6494db3f50c5