I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy to measure how often web sites treat Tor users differently (by serving them a block page or a captcha, for example). We used OONI reports for part of the project. This post is about running our code and some general tips about working with OONI data. I hope it can be of some use to the ADINA15 participants :)
The source code I'm talking about is here:
    git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the highest blocking rates against Tor users: https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL downloads with Tor and without Tor. The code processes OONI reports, compares the Tor and non-Tor HTTP responses, and notes whenever Tor appears to be blocked while non-Tor appears to be unblocked.
Our code is focused on the task of finding Tor blocks, but parts of it are generic and will be useful to others who are working with OONI data. The ooni-report-urls program gives you the URLs of every OONI report published at api.ooni.io. The ooni.py Python module provides an iterator over OONI YAML files that deals with encoding errors and compression. The classify.py Python module is able to identify many common types of block pages (e.g. CloudFlare, Akamai).
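To give an idea of what "deals with encoding errors and compression" involves, here is a minimal sketch. It is not the code from ooni.py; the function name open_report is made up for illustration, and it assumes that compressed reports are gzip files and that the text is meant to be UTF-8.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_report(path):
    """Open a report as text, whether or not it is gzip-compressed.

    Hypothetical helper, not part of ooni.py.
    """
    # Look at the magic bytes rather than trusting the file extension.
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        raw = gzip.open(path, "rb")
    else:
        raw = open(path, "rb")
    # errors="replace" turns undecodable bytes into U+FFFD instead of
    # raising UnicodeDecodeError in the middle of a large file.
    return io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
```

Replacing bad bytes loses information, so a real tool might prefer to log or skip the affected record instead.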
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
    http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all reports, so you can download many of them in bulk (thanks to Arturo for pointing this out). Our source code contains a program called ooni-report-urls that extracts the URLs from the JSON file so you can pipe them to Wget or whatever. (Check before you start downloading, because there are a lot of files and some of them are big!)
    wget -O ooni-reports.json http://api.ooni.io/api/reports
    ./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
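If you want to do the extraction step yourself, the following is a rough sketch of the idea. It is not the ooni-report-urls program, and since the layout of the JSON index isn't described here, it makes no assumption about field names: it simply walks the whole structure and keeps every string that looks like an HTTP(S) URL. The function names are made up for illustration.

```python
import json


def iter_urls(node):
    """Yield every string in a decoded JSON structure that starts with
    http:// or https://, wherever it is nested."""
    if isinstance(node, str):
        if node.startswith(("http://", "https://")):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_urls(item)


def report_urls(json_text):
    """Return the sorted, de-duplicated URLs found in the index
    (the same effect as piping through sort | uniq)."""
    return sorted(set(iter_urls(json.loads(json_text))))
```

Because it is schema-agnostic, this may also pick up URLs that are not report files, so check the output before feeding it to a downloader.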
The choice of YAML parser really matters: it can make a 30× difference in performance. See here:
    https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow; yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use instead. yaml.CSafeLoader differs slightly in its handling of certain invalid Unicode escapes that can appear in OONI's representation of HTTP bodies, for example separately encoded UTF-16 surrogates: "\uD83D\uDD07". ooni.py has a way to skip over records like that (there aren't very many of them). With yaml.CSafeLoader, findblocks takes about 2 hours to process 2.5 years of http_requests reports (about 33 GB compressed).
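For background on that escape: "\uD83D\uDD07" is the two UTF-16 halves (a surrogate pair) of a single code point, U+1F507, written as if they were two separate characters. The sketch below shows, using only the Python standard library, how such a pair can be joined back together. This is an illustration only; it is not what ooni.py does (it skips such records), and join_surrogates is a made-up name.

```python
def join_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into the code points
    they encode. Raises UnicodeDecodeError if a surrogate is unpaired."""
    # "surrogatepass" lets the lone surrogates through the encoder; the
    # decoder then sees a well-formed UTF-16 pair and joins it.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


halves = "\ud83d\udd07"           # two separate surrogate code points
joined = join_surrogates(halves)  # one code point, U+1F507
```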
There are some inconsistencies and format differences in some OONI reports, particularly very early ones. For example, the test_name field of reports is not always the same for the same test. We were looking for http_requests tests, and we had to match all of the following test_names:
    http_requests
    http_requests_test
    tor_http_requests_test
    HTTP Requests Test
In addition, the YAML format is occasionally different. In http_requests reports, for example, the way of indicating that Tor is in use for a request can be any of:
    tor: true
    tor: {is_tor: true}
    tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
And even in some requests, the special URL scheme "shttp" indicates a Tor request; e.g. "shttp://example.com/". The ooni.py script fixes up some of these issues, but only for the http_requests test. You'll have to figure it out on your own for other tests.
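The variations listed above can be normalized along these lines. This is a sketch, not the actual code in ooni.py: the function names are made up, and it assumes each request has already been parsed into a dict with optional "tor" and "url" keys.

```python
# All the test_name values seen for the http_requests test.
HTTP_REQUESTS_NAMES = {
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
}


def is_http_requests(test_name):
    """True if test_name is any known spelling of http_requests."""
    return test_name in HTTP_REQUESTS_NAMES


def request_uses_tor(request):
    """True if a request dict (assumed layout) was made over Tor."""
    tor = request.get("tor")
    if isinstance(tor, dict):
        # tor: {is_tor: true} or tor: {exit_ip: ..., is_tor: true}
        if tor.get("is_tor"):
            return True
    elif tor is True:
        # tor: true
        return True
    # Some reports mark Tor requests only with the "shttp" scheme.
    url = request.get("url") or ""
    return url.startswith("shttp://")
```

Usage would be something like filtering reports with is_http_requests(header["test_name"]) and then splitting each report's requests into Tor and non-Tor groups with request_uses_tor.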
A very early version of this processing code appeared here: https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
On Sat, Sep 26, 2015 at 07:15:58PM -0700, David Fifield wrote:
> I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy to measure how often web sites treat Tor users differently (by serving them a block page or a captcha, for example). We used OONI reports for part of the project. This post is about running our code and some general tips about working with OONI data. I hope it can be of some use to the ADINA15 participants :)
> The source code I'm talking about is here:
>     git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
> One of its outputs is here, a big poster showing the web sites with the highest blocking rates against Tor users: https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
This OONI-processing code is now updated to work with JSON reports.
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git