I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
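The kind of heuristic classify.py implements can be sketched as signature matching over response bodies. The signature strings below are illustrative assumptions, not classify.py's actual patterns:

```python
# Hypothetical block-page signatures; classify.py's real patterns differ.
BLOCK_PAGE_SIGNATURES = {
    "cloudflare": "Attention Required! | Cloudflare",
    "akamai": "Access Denied",
}

def classify_block_page(body):
    """Return the name of a matching block-page signature, or None."""
    for name, signature in BLOCK_PAGE_SIGNATURES.items():
        if signature in body:
            return name
    return None
```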
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
The choice of YAML parser really matters: it can mean a 30× difference
in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow;
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
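The fast-loader pattern looks like this; the getattr fallback is an assumption for installations where PyYAML was built without libyaml (in which case CSafeLoader does not exist):

```python
import io

import yaml

# Prefer the C-backed loader when PyYAML was built with libyaml;
# fall back to the pure-Python SafeLoader otherwise.
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

data = "---\na: 1\n---\nb: 2\n"
# load_all iterates over every document in a multi-document stream.
docs = list(yaml.load_all(io.StringIO(data), Loader=Loader))
```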
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
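In practice this comes down to matching against a set of known aliases. A sketch (the helper name is ours, not from the repository):

```python
# All test_name values we observed for the http_requests test.
HTTP_REQUESTS_TEST_NAMES = {
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
}

def is_http_requests(report_header):
    """Return True if a report header belongs to an http_requests test."""
    return report_header.get("test_name") in HTTP_REQUESTS_TEST_NAMES
```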
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
In still other requests, Tor is indicated by the special URL scheme
"shttp"; e.g. "shttp://example.com/". The ooni.py module fixes up
some of these issues, but only for the http_requests test. You'll have
to figure it out on your own for other tests.
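Normalizing these variants can be sketched as follows. This is a hypothetical helper, not the actual ooni.py code, and it assumes the request entry's URL lives under request["request"]["url"]:

```python
def request_uses_tor(request):
    """Return True if an http_requests request entry was made over Tor,
    normalizing the several formats seen in the wild."""
    url = request.get("request", {}).get("url", "")
    if url.startswith("shttp://"):
        return True                     # legacy "shttp" scheme marks Tor
    tor = request.get("tor")
    if isinstance(tor, dict):
        return bool(tor.get("is_tor"))  # tor: {is_tor: true, ...}
    return bool(tor)                    # tor: true / tor: false
```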
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
Hello everyone,
This is a reminder for today's weekly OONI IRC meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
Event Times around the world:
https://www.timeanddate.com/worldclock/fixedtime.html?iso=20160321T17
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis
--
Fingerprint: 8FD5 CF5F 39FC 03EB B382 7470 5FBF 70B1 D126 0162
Pubkey: https://pgp.mit.edu/pks/lookup?op=get&search=0x5FBF70B1D1260162
I just downloaded all the http_requests reports from
https://measurements.ooni.torproject.org/. It took quite a long time and
I wonder if we can make things more efficient by compressing the reports
on the server.
This is the command I ran to download the reports:
wget -c -r -l 2 -np --no-directories -A '*http_requests*' --no-http-keep-alive https://measurements.ooni.torproject.org/
This resulted in 309 GB and 6387 files.
If I compress the files with xz,
xz -v *.json
they only take up 29 GB (9%).
Processing xz-compressed files is pretty easy, as long as you don't have
to seek. Just do something like this:
import json
import subprocess

def open_xz(filename):
    # Decompress as a stream via the xz tool; fine as long as we
    # only read sequentially and never seek.
    p = subprocess.Popen(["xz", "-dc", filename], stdout=subprocess.PIPE, bufsize=-1)
    return p.stdout

for line in open_xz("report.json.xz"):
    doc = json.loads(line)
    ...
Of course you can do the same thing with gzip.
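For gzip, the standard library can stream directly, with no subprocess needed. A sketch (the helper name is ours):

```python
import gzip
import json

def iter_json_gz(filename):
    """Yield one JSON document per line from a gzip-compressed file."""
    # "rt" gives text mode, so iteration yields decoded lines.
    with gzip.open(filename, "rt") as f:
        for line in f:
            yield json.loads(line)
```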
Parts of the ooni-pipeline use datetime.fromtimestamp instead of
datetime.utcfromtimestamp. This means the timestamp is parsed
differently depending on the time zone in which the code is run:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3b…
entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3b…
start_time = datetime.fromtimestamp(header.get('start_time', 0))
Now, it appears that there's no actual harm done by this bug, because it
looks like the timestamps were read and written by someone in UTC+1, so
the conversion evens out in the final JSON. But it could go wrong if run
by someone in a different time zone.
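The difference is easy to demonstrate by pinning the process time zone (Unix-only, via time.tzset; the zone choice here is just to simulate a UTC+1 machine):

```python
import os
import time
from datetime import datetime

# Simulate running in a UTC+1 zone (Unix-only: time.tzset).
os.environ["TZ"] = "Europe/Rome"
time.tzset()

ts = 1420070400  # 2015-01-01 00:00:00 UTC
print(datetime.fromtimestamp(ts))     # local wall clock: 2015-01-01 01:00:00
print(datetime.utcfromtimestamp(ts))  # UTC:              2015-01-01 00:00:00
```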
I noticed this because when I process the new JSON files, I get
timestamps that are an hour different than I got parsing the YAML files.
To parse the JSON times, I use datetime.datetime.strptime(..., "%Y-%m-%d %H:%M:%S"),
which gives the correct answer because of the reason in the previous
paragraph. My old YAML times were off by an hour, because I used
datetime.datetime.utcfromtimestamp(...), parsing a UTC+1 timestamp as if
it were UTC.
Here's some related information I found while researching the above. It
looks like individual test results are *also* using local timestamps, or
have their clocks set wrong, or something. Because when I compare the
"test_start_time" field to the "Date" header (which is always supposed
to be UTC) in a response to an http_requests test, I get some widely
divergent values (I saw up to around 9 hours).
For example, consider
https://measurements.ooni.torproject.org/2015-01-01/20150101T060029Z-BE-AS2…
It has a difference of about 6 hours:
{
  "test_start_time": "2015-01-01 07:00:29",
  "test_keys": {
    "requests": [
      {
        "response": {
          "headers": {
            "Date": "Thu, 01 Jan 2015 01:00:35 GMT",
            ...
          }
        }
      }
    ]
  },
  "probe_cc": "BE"
}
probe_cc is "BE", Belgium, which is supposed to be UTC+1, so I don't
know where the 6-hour discrepancy is coming from. Note also that the
timestamp in the file name is not the same: 060029.
Here's a summary of the http_requests reports from 2015-01-01. Most of
them match, but some are off by 6 or 1 hour:
CC: CH test_start_time: 2015-01-01 03:00:28 Date: Thu, 01 Jan 2015 02:59:14 GMT +0
CC: BE test_start_time: 2015-01-01 07:00:29 Date: Thu, 01 Jan 2015 01:00:35 GMT +6
CC: FR test_start_time: 2015-01-01 09:41:08 Date: Thu, 01 Jan 2015 09:41:14 GMT +0
CC: FR test_start_time: 2015-01-01 10:47:20 Date: Thu, 01 Jan 2015 10:47:23 GMT +0
CC: NL test_start_time: 2015-01-01 12:42:16 Date: Thu, 01 Jan 2015 12:42:21 GMT +0
CC: NL test_start_time: 2015-01-01 15:41:32 Date: Thu, 01 Jan 2015 15:41:44 GMT +0
CC: CZ test_start_time: 2015-01-01 15:46:27 Date: Thu, 01 Jan 2015 15:46:34 GMT +0
CC: NL test_start_time: 2015-01-01 17:19:39 Date: Thu, 01 Jan 2015 17:19:43 GMT +0
CC: HK test_start_time: 2015-01-01 18:03:57 Date: Thu, 01 Jan 2015 17:04:10 GMT +1
CC: NL test_start_time: 2015-01-01 21:47:32 Date: Thu, 01 Jan 2015 21:47:37 GMT +0
I used this script:
import json
import sys

for line in sys.stdin:
    doc = json.loads(line)
    try:
        probe_cc = doc["probe_cc"]
        test_start_time = doc["test_start_time"]
        for request in doc["test_keys"]["requests"]:
            response_time = request["response"]["headers"]["Date"]
            print "CC: %s test_start_time: %s Date: %s" % (probe_cc, test_start_time, response_time)
    except KeyError:
        pass
The YAML reports had two time fields:
start_time: timestamp of the start of ooni-probe run
test_start_time: timestamp of the start of each individual test
Within a single report file, start_time was constant, while
test_start_time would advance with each successive test, depending on
how long each test took to run.
The JSON format reports have just one of the fields, test_start_time,
but it confusingly appears to have the same meaning as start_time in the
old YAML reports (it doesn't change within a report file):
test_start_time: timestamp of the start of ooni-probe run
It might be because of this code:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3b…
entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
Some tests can take many minutes or hours to run, so by the end, the
JSON test_start_time might be far off from the real time when it was
run. Is there a way we could get both timestamps in each record again?
What I'm doing now is incrementing a counter according to the
test_runtime field of each record, and adding that counter to the
test_start_time in order to estimate the individual test's start time.
But that is only an approximation, and some older reports do not have
test_runtime.
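The running-counter estimate can be sketched like this. The field names follow the JSON reports; as noted above, treat the result as approximate:

```python
from datetime import datetime, timedelta

def estimate_start_times(entries):
    """Given one report's entries in order, estimate each test's start
    time by accumulating test_runtime onto the report-level
    test_start_time (which is constant within a report file)."""
    offset = 0.0
    for entry in entries:
        base = datetime.strptime(entry["test_start_time"], "%Y-%m-%d %H:%M:%S")
        yield base + timedelta(seconds=offset)
        # Older reports may lack test_runtime; assume 0 in that case.
        offset += entry.get("test_runtime", 0.0)
```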
Dear oonitarians,
Following the discussion during the last meeting, we have created a
design document [0] that outlines the future plans for network meter.
It's still very much a living document, and is updated often.
Feel free to edit or add suggestions.
[0] https://pad.riseup.net/p/network-meter
--
poly
@0xPoly
https://darkdepths.net/pages/contact-keys.html
Hello ooni people,
The official OONI reports webserver is up and running!
Canonical location: https://reports.ooni.torproject.org
The sanitized public reports hosted there are currently updated hourly.
Hope this helps.
~Vasilis
Hello Oonitarians,
I was having a discussion with vasilis about some changes I recently made to the lantern
and psiphon tests, around exposing some extra configuration options, and would like to know
what your opinion is on the matter.
Basically, these tests try to run the lantern or psiphon tool, attempt to connect to
a certain website through it, and verify that the response they get from the website is the
expected one.
In the past it was possible to configure the URL to be fetched and the expected response
body, with sane defaults.
I have changed this so that the tests instead use a hardcoded website that we expect not
to change in the future, together with its expected result.
In the specific case it’s http://www.google.com/humans.txt
The reason for doing this is that I want to avoid the possibility of a user misconfiguring
the URL and expected body to something that is not true (e.g. claiming foo.com returns
“bar” when it actually returns “foo”), leading to inconsistent results.
The argument against this is that the website we use for testing may change in the future
and if we don’t notice then we can still have inconsistent results.
Even if that were to happen, I believe it’s more likely that we, the developers of the
tool, will notice and ship an update than that users would tweak their ooniprobe to
provide valid measurements.
I believe that exposing some settings that can lead to measurements that are not true is
sub-optimal, but I would like to hear contrasting opinions.
~ Arturo
Hello,
This is a reminder for today's weekly OONI IRC meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
Event Times around the world:
https://www.timeanddate.com/worldclock/fixedtime.html?iso=20160314T17
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis