I want to search OONI reports for cases of Tor exits being blocked by the server (things like the CloudFlare 403 captcha). The http_requests test is great for that because it fetches a bunch of web pages with Tor and without.
The attached script is my first-draft attempt at finding block pages. Its output is at the end of this message. You can see it finds a lot of CloudFlare captchas and other blocks.
First, I ran ooniprobe to get a report-http_requests.yamloo file. Then I ran the script, which does this:

1. Skip the first YAML document, because it's a header.
2. For all other documents:
   a. Skip it if it has non-None control_failure or experiment_failure (there are a few of these).
   b. Look for exactly two non-failed requests, one with is_tor:false and one with is_tor:true. Skip it if it lacks these.
   c. Classify the blocked status of the is_tor:false and is_tor:true responses. 400-series and 500-series status codes are classified as blocked and all others are unblocked.
   d. Print an output line if the blocked status of is_tor:false does not match the blocked status of is_tor:true.
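To make the steps above concrete, here is a minimal sketch of that filtering logic. It is not the attached script. It assumes each report entry has already been parsed into a dict, and the key layout (`requests`, `request.tor.is_tor`, `response.code`, `input`) is my guess at the shape of an http_requests entry. Treating "non-failed" as "has a response code" is also an assumption. The function names are hypothetical.

```python
def is_blocked(code):
    """4xx and 5xx status codes count as blocked; all others as unblocked."""
    return 400 <= code <= 599


def tor_pair(requests):
    """Return (non_tor, tor) if there are exactly two non-failed requests,
    one with is_tor false and one with is_tor true; otherwise None."""
    # Assumption: a "non-failed" request is one that carries a response code.
    ok = [r for r in requests
          if (r.get("response") or {}).get("code") is not None]
    if len(ok) != 2:
        return None
    by_tor = {}
    for r in ok:
        # Assumed key layout: request -> tor -> is_tor.
        is_tor = ((r.get("request") or {}).get("tor") or {}).get("is_tor")
        by_tor[is_tor] = r
    if set(by_tor) != {False, True}:
        return None
    return by_tor[False], by_tor[True]


def find_mismatches(documents):
    """Yield (non_tor_code, tor_code, input) for every entry whose blocked
    status differs between the non-Tor and the Tor request."""
    docs = iter(documents)
    next(docs, None)  # the first YAML document is the report header
    for entry in docs:
        if entry.get("control_failure") is not None:
            continue
        if entry.get("experiment_failure") is not None:
            continue
        pair = tor_pair(entry.get("requests") or [])
        if pair is None:
            continue
        non_tor_code = pair[0]["response"]["code"]
        tor_code = pair[1]["response"]["code"]
        if is_blocked(non_tor_code) != is_blocked(tor_code):
            yield non_tor_code, tor_code, entry.get("input")
```

With PyYAML installed, something like `yaml.safe_load_all(open(path))` should be able to supply the `documents` iterable, though I have not checked whether every report loads cleanly that way.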
I have a few questions.
Is this a reasonable way to process reports? Is there a more standard way to process YAMLOO files?
I know there are many reports at https://ooni.torproject.org/reports/. Is that all of them? I think I heard from Arturo that some reports are not online because of storage issues.
What's the best way for me to get the reports for processing? Just download all *http_requests* files from the web server?
Here is the output of the script. 403-CLOUDFLARE is the famous "Attention Required!" captcha page. I investigated some of the others manually and they are mostly custom block pages or generic web server 403s. (There are also a couple of CloudFlare pages that have a different form.) Overall, almost 4% of the 1000 URLs scanned by ooniprobe served a block page over Tor.
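For reference, the labels in the output could be produced by something like the sketch below. Matching on the literal "Attention Required!" string is an assumption about how to recognize the CloudFlare captcha page, and it would miss the CloudFlare pages that have a different form. The function name `label` is hypothetical.

```python
def label(code, body):
    """Turn a status code and response body into an output label.

    Unblocked codes are printed bare (e.g. "301"); blocked ones get a
    suffix saying whether the body looks like the CloudFlare captcha.
    """
    if not 400 <= code <= 599:
        return str(code)
    # Assumption: the captcha page is identified by its title text.
    if code == 403 and "Attention Required!" in (body or ""):
        return "403-CLOUDFLARE"
    return "{}-OTHER".format(code)
```

A generic web server 403 or a custom block page would come out as 403-OTHER under this scheme.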
I'm not sure what's up with the non-Tor 503s from Amazon. They just look like localized internal service error pages ("ist ein technischer Fehler aufgetreten", "une erreur de système interne a été décelée"). The one for blog.com is a generic Nginx "Bad Gateway" page.
non-Tor    Tor             domain
302        403-OTHER       yandex.ru
302        403-OTHER       craigslist.org
301        403-CLOUDFLARE  thepiratebay.se
503-OTHER  301             amazon.de
200        403-CLOUDFLARE  adf.ly
301        403-OTHER       squidoo.com
301        410-OTHER       myspace.com
303        503-OTHER       yelp.com
302        403-CLOUDFLARE  typepad.com
503-OTHER  301             amazon.fr
301        403-CLOUDFLARE  digitalpoint.com
301        403-CLOUDFLARE  extratorrent.com
200        403-OTHER       ezinearticles.com
200        403-OTHER       hubpages.com
200        403-OTHER       2ch.net
200        403-OTHER       hdfcbank.com
302        403-CLOUDFLARE  meetup.com
302        403-CLOUDFLARE  1channel.ch
200        403-CLOUDFLARE  multiply.com
301        403-CLOUDFLARE  clixsense.com
301        403-OTHER       zillow.com
301        403-CLOUDFLARE  odesk.com
301        403-CLOUDFLARE  elance.com
301        403-CLOUDFLARE  youm7.com
200        403-CLOUDFLARE  jquery.com
200        403-CLOUDFLARE  sergey-mavrodi.com
301        403-CLOUDFLARE  templatemonster.com
302        403-CLOUDFLARE  4tube.com
301        403-CLOUDFLARE  mp3skull.com
301        403-CLOUDFLARE  porntube.com
200        403-OTHER       tutsplus.com
200        403-CLOUDFLARE  bitshare.com
301        403-OTHER       sears.com
200        403-CLOUDFLARE  zwaar.net
502-OTHER  200             blog.com
302        403-CLOUDFLARE  myegy.com
301        400-OTHER       mercadolibre.com.ve
302        403-OTHER       jabong.com
301        403-CLOUDFLARE  free-tv-video-online.me
302        403-CLOUDFLARE  traidnt.net