I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
The choice of YAML parser matters a great deal: it can make a 30×
difference in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow.
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
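To illustrate the idea of skipping bad records, here is a minimal sketch.
It is not the actual code in ooni.py, just one way to get the same effect:
split the stream on "---" document markers and parse each document
separately, so that one unparseable record does not abort the whole file.
The names split_documents and iter_records are hypothetical, and the
fallback to yaml.SafeLoader is an assumption for installations where
PyYAML was built without libyaml.

```python
import yaml

# CSafeLoader only exists when PyYAML was built against libyaml.
# Falling back to the (much slower) pure-Python SafeLoader is an assumption
# of this sketch, not something ooni.py is known to do.
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)


def split_documents(lines):
    """Yield the text of each YAML document in an iterable of text lines."""
    chunk = []
    for line in lines:
        # A line consisting of exactly "---" starts a new document.
        if line.rstrip("\r\n") == "---" and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


def iter_records(lines):
    """Parse each document on its own, skipping any that fail to parse."""
    for text in split_documents(lines):
        try:
            yield yaml.load(text, Loader=Loader)
        except yaml.YAMLError:
            # Skip the bad record and carry on with the next one.
            continue
```

Any iterable of text lines works as input, for example an open file
object in text mode.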
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
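A minimal sketch of that matching, assuming the report header has already
been parsed into a dict with a "test_name" key (the function name
is_http_requests is made up for illustration):

```python
# All the test_name values listed above that denote an http_requests test.
HTTP_REQUESTS_TEST_NAMES = frozenset([
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
])


def is_http_requests(header):
    """Return True if a parsed report header belongs to an http_requests test."""
    return header.get("test_name") in HTTP_REQUESTS_TEST_NAMES
```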
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
In some requests, a special URL scheme, "shttp", is what indicates a
Tor request; e.g. "shttp://example.com/". The ooni.py script fixes up
some of these issues, but only for the http_requests test. You'll have
to figure it out on your own for other tests.
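As a rough sketch of the kind of fix-up involved (this is not the code in
ooni.py), the variants above can be collapsed into a single boolean. It
assumes each request has been parsed into a dict; the "url" key and the
function name uses_tor are assumptions made for illustration.

```python
def uses_tor(request):
    """Return True if a parsed http_requests request was made over Tor."""
    tor = request.get("tor")
    # Newer format: "tor" is a mapping that carries an is_tor flag.
    if isinstance(tor, dict):
        tor = tor.get("is_tor", False)
    # Older format: "tor" is a plain boolean.
    if tor is True:
        return True
    # Oldest format: the "shttp" URL scheme marks a Tor request.
    return str(request.get("url", "")).startswith("shttp://")
```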
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
# What we did in October 2015
* Restored the OONI website
* Restored the OONI GitHub page
* Restored OpenObservatory.org, .net, and .com
* Re-keyed the OTF Greenhost cloud
* Mini hackathon in Rome:
  https://lists.torproject.org/pipermail/ooni-dev/2015-October/000353.html
* Development of measurement-kit: cleanup of DNS and HTTP code, documentation
* Attended OTF Summit
* Attended Princeton University conference on internet censorship, interference, and control
* Development of munin master and slave monitoring roles
We also held the following OONI dev meetings:
http://meetbot.debian.net/ooni/2015/ooni.2015-10-05-17.00.log.html
http://meetbot.debian.net/ooni/2015/ooni.2015-10-12-16.59.log.html
http://meetbot.debian.net/ooni/2015/ooni.2015-10-19-17.01.log.html
http://meetbot.debian.net/ooni/2015/ooni.2015-10-26-16.59.log.html
~ Vasilis
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CET, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo
As requested by Vasilis I am forwarding the progress report (originally posted on tor-reports: https://lists.torproject.org/pipermail/tor-reports/2015-December/000949.html) for the work we did in November on OONI to the ooni-dev mailing list as well.
# What we did in November 2015
* Released ooniprobe v1.3.2:
https://lists.torproject.org/pipermail/ooni-dev/2015-November/000360.html
* Wrote documentation for the ooni-pipeline (http://docs.openobservatory.org/ooni/pipeline/)
* Continued development on the OONI-API to interface with the collected measurement data
* Updated deployment scripts for ooni-pipeline to support changes in AWS
* Held various OONI weekly meetings
* Worked on OTF proposal
* Worked on migrating away from Greenhost infrastructure
* Set up the "puppet master" (the machine responsible for running the batch jobs and handling orchestration between OONI collectors) for the pipeline on GRNet infrastructure.
* Implemented regexp-based parsing of strings in network-meter
Fun fact: ooniprobe has now been run in 85 different countries around the world!
# Next steps
* Figuring out the ideal deployment strategy for the ooni-api in production
* Working on implementing the frontend to the ooni-api
* Writing tasks for automatic execution of the spark tasks to generate views to be used by the ooni-api
* Various devops tasks
* Reviewing the feedback on the OTF proposal and submitting it
* Coming up with design and content for OpenObservatory.org
* Making mockups of the GUI for network-meter
~ Arturo
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CET, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
For those of you in Berlin, some of us will also be meeting at C-Base (http://www.c-base.org/) at the time of the meeting, so you can participate both virtually and physically!
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo