I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
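
The comparison described above can be sketched roughly as follows. This is not the actual findblocks code: the (url, used_tor, response) tuple layout and the is_blocked predicate are assumptions made up for illustration, and the caller has to supply a real classifier.

```python
from collections import defaultdict

def tor_only_blocks(responses, is_blocked):
    """Return URLs whose Tor response looks blocked while the non-Tor one does not.

    responses:  iterable of (url, used_tor, response) tuples (assumed layout).
    is_blocked: caller-supplied predicate on a response (hypothetical).
    """
    by_url = defaultdict(dict)
    for url, used_tor, response in responses:
        by_url[url][bool(used_tor)] = response

    flagged = []
    for url, pair in by_url.items():
        # Only compare URLs that have both a Tor and a non-Tor response.
        if True in pair and False in pair:
            if is_blocked(pair[True]) and not is_blocked(pair[False]):
                flagged.append(url)
    return sorted(flagged)
```

A URL that is blocked both with and without Tor is deliberately not flagged, since that says nothing specific about Tor.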
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
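
To give an idea of what "deals with encoding errors and compression" can mean in practice, here is a minimal sketch. It is not the code from ooni.py; it assumes that compressed reports are gzip files and that replacing undecodable bytes is acceptable. The function name open_report is made up.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"

def open_report(path):
    """Open a report as text, decompressing it if it starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        magic = f.read(2)
    raw = gzip.open(path, "rb") if magic == GZIP_MAGIC else open(path, "rb")
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
```

Checking the magic bytes rather than the file extension means a misnamed file is still read correctly.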
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
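
If you don't have ooni-report-urls handy, something along these lines may be enough. It is only a sketch and not the real program: I am not assuming anything about the layout of the JSON index, so it just walks the whole structure and collects every string that looks like an HTTP(S) URL. The function names are made up.

```python
import json

def iter_urls(node):
    """Recursively yield every string in a decoded JSON value that looks like a URL."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_urls(item)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield node

def report_urls(path):
    """Return the sorted, de-duplicated URLs found in a JSON file."""
    with open(path, encoding="utf-8") as f:
        return sorted(set(iter_urls(json.load(f))))
```

Calling report_urls("ooni-reports.json") and printing one URL per line gives a list you can feed to wget -i. Because it matches any URL-like string, check the output for links that are not report files.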
The choice of YAML parser matters a great deal: it can make a 30×
difference in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow.
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
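
Here is a minimal sketch of using the fast loader while skipping records that fail to parse. It is not how ooni.py does it; it assumes each record starts with a "---" line at column 0, and it falls back to the pure-Python loader when PyYAML was built without libyaml. The function names are made up.

```python
import yaml

# Prefer the libyaml-backed loader; fall back to the pure-Python one
# if this PyYAML build does not provide it.
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

def split_documents(f):
    """Yield the text of each YAML document in a text stream, splitting on '---' lines."""
    lines = []
    for line in f:
        if line.rstrip("\r\n") == "---":
            if lines:
                yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if lines:
        yield "".join(lines)

def iter_records(f):
    """Yield parsed records, silently skipping any document that fails to parse."""
    for text in split_documents(f):
        try:
            doc = yaml.load(text, Loader=Loader)
        except yaml.YAMLError:
            continue
        if doc is not None:
            yield doc
```

Parsing one document at a time is what makes skipping possible: with yaml.load_all, the first parse error ends the whole iteration.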
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
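
In code, matching these names can be as simple as a set lookup. This sketch assumes the report header has already been parsed into a dict with a "test_name" key; the function name is made up.

```python
# All test_name values seen for the http_requests test (from the list above).
HTTP_REQUESTS_TEST_NAMES = frozenset([
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
])

def is_http_requests_report(header):
    """Return True if a parsed report header belongs to an http_requests test."""
    return header.get("test_name") in HTTP_REQUESTS_TEST_NAMES
```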
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
In some requests, a special URL scheme, "shttp", also indicates a
Tor request; e.g. "shttp://example.com/". The ooni.py script fixes up
some of these issues, but only for the http_requests test. You'll have
to figure it out on your own for other tests.
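
To illustrate, here is a rough sketch of folding these variants into a single boolean. It is not the code from ooni.py; it assumes each request has been parsed into a dict with an optional "tor" key and a "url" key (the "url" key name is my assumption).

```python
def is_tor_request(request):
    """Return True if a parsed request record appears to have been made over Tor."""
    tor = request.get("tor")
    if isinstance(tor, dict):
        # Covers "tor: {is_tor: true}" with or without exit_ip and exit_name.
        if tor.get("is_tor"):
            return True
    elif tor:
        # Covers the plain "tor: true" form.
        return True
    # Some reports only signal Tor through the "shttp" URL scheme.
    url = request.get("url") or ""
    return url.startswith("shttp://")
```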
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CET, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo
Dear Oonitarians,
We are pleased to announce that the new version of ooni-probe is out!
Here is the juicy changelog for your pleasure:
v1.3.2 (Fri, 20 Nov 2015)
-------------------------
* Implement third party test template
* Add tutorial for using TCP test
* Add tests for censorship resistance
* Add meek test
* Add lantern test
* Support for Twisted 15.0
* Various stability and bug fixes
Please note that I have revoked my previous PGP key (150FE210) and will now sign packages with my new key (67EF 3966 5099 86E9 6ACE E84E 5D67 CD18 7022 87F4).
As usual you can find it on pypi at the following address:
https://pypi.python.org/pypi/ooniprobe/1.3.2
Debian packages and other platforms to come.
Happy measurements!
~ Arturo
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CET, 12:00 EST, 09:00 PST).
You can join via the web from:
https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is
blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
- -sbs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAEBCAAGBQJWSeqdAAoJEIC2kSd3M9lb6yIQAK0AMHJFz1tVuGpFjygXDUES
0jFFDqJtucPKuWef5npiOg4UtRUDgdJLOKwu8CHk6CYyxknZ3azGfp5gbrNYKKCH
d6Aq/vs4Crm+pnNFrXdbQulY5hkgUPcNJJ90c8tGRu0N96fF9xn0LDUAf0oyWLjB
JrinU55sWgaY6r3IZr13jwrbIoqg8bloH6B2lQUeG4DL6g5WaeEhBcmLPlFu344N
IULpPa8M7E4H22OyvLU7WfWg/IS0LxVy2iHRd2xaYk9Lhu8UkqdvG4Zpbd5tqgGE
bRUHhPx5BkPnKDrKx00W8IJLzVNKvU0uG2Gf8q9ErvCxWS884hPYnah8q8skSKgv
rjXoyrbJsJB+sotAH8YSouex3BX602UMWMJu37M9Bu73vp/C6AGGRT23SgBW+iWC
e4g0derf7TaQfkvy2DmSyei4ANSOOcDWc01B2oT2OZlulaFJX4hpM3Oh7/c7jOe0
QGNxdlWVRlKPi+r/0VnxnpeT725frwQ5wrjVeo2/K/LR3Ni6F/3G/+N2ZoBD1fDp
V/Pn13OeYdBaFvLnRL7/t7La0PKH+ZBVuUQvbkAXbo4On3fO6S2wZV5AVWLxLN1F
jsE95X9q7JPY+4XM1g8pZyPQbS6qn6n67d07B2cgO1ChoRy1jDEs2hiD5lU89+Ej
GfS0Jqr78W+iwXQmBiXE
=S2HO
-----END PGP SIGNATURE-----
Hi,
Pabs made me aware of the github repository being quite ahead of the one
on git.torproject.org. Having 'Source code[link] is available (GitHub
mirror[link]).' is confusing, because I didn't expect such a discrepancy.
3 possible solutions to avoid this confusion:
- have some little automatic bot pushing changes from git.torproject to
github (post-hook for example)
- write a little mail to Github, to get the same treatment as, for
example, most Apache[1] projects so they pull (twice a day) changes from
git.torproject to github
- change the wording, and make the one on git.torproject the 'mirror
which reflects the state of a bygone era'
Ciao,
kwadro
[1] https://github.com/apache/parquet-mr