I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
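To give a rough flavor of how these pieces could fit together, here is a hypothetical sketch; iter_records and classify below are placeholder names I am using for illustration, not the modules' actual interfaces (see the README for those):

# Hypothetical sketch only: iter_records and classify are made-up
# placeholder names, not the real ooni.py / classify.py interfaces.
import ooni, classify

for record in ooni.iter_records("some-http_requests-report.yaml.gz"):
    for entry in record.get("requests", []):
        body = entry.get("response", {}).get("body", "")
        print(classify.classify(body))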
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
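If you decide to mirror everything, you can feed that list straight to Wget, for example (standard Wget options; adjust paths to taste):
wget --continue --input-file=ooni-report-urls.txt --directory-prefix=ooni-reports/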
The choice of YAML parser really matters: it can make a 30× difference
in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow;
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
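For reference, here is a minimal sketch of that loading loop. It is a simplification, not the exact logic in ooni.py; in particular it just gives up on the rest of a file when a bad record aborts the stream, and it assumes PyYAML was built with libyaml so that CSafeLoader is available:

import gzip
import yaml

def iter_yaml_documents(path):
    # Reports from api.ooni.io may be gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        try:
            # CSafeLoader is dramatically faster than the pure-Python loader.
            for doc in yaml.load_all(f, Loader=yaml.CSafeLoader):
                yield doc
        except (yaml.YAMLError, UnicodeError):
            # A record with invalid Unicode escapes (such as the lone UTF-16
            # surrogates mentioned above) can abort the whole stream; such
            # records are rare enough that skipping the remainder is tolerable.
            return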
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
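In a script, the simplest thing is to match against the whole set. A small sketch, assuming header is the dict parsed from a report's first YAML document:

HTTP_REQUESTS_TEST_NAMES = {
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
}

def is_http_requests_report(header):
    # The report header carries the test_name field.
    return header.get("test_name") in HTTP_REQUESTS_TEST_NAMES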
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
And in some requests, a special URL scheme "shttp" indicates a Tor
request; e.g. "shttp://example.com/". The ooni.py script fixes up
some of these issues, but only for the http_requests test. You'll have
to figure it out on your own for other tests.
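As an illustration, a normalization along these lines covers the variants we saw. It is only a sketch, assuming each entry in a record's "requests" list has a "request" dict carrying the "tor" field and the URL; it is not the exact code in ooni.py:

def request_uses_tor(entry):
    # entry is assumed to be one element of a record's "requests" list.
    request = entry.get("request", {})
    tor = request.get("tor")
    if tor is True:
        return True
    if isinstance(tor, dict) and tor.get("is_tor"):
        return True
    # Some early reports mark Tor requests with the fake "shttp" URL scheme.
    if request.get("url", "").startswith("shttp://"):
        return True
    return False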
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
Hi Arturo,
---
Currently for our data processing needs we have begun to bucket
reports by date (every date corresponds to when a certain report has
been submitted to the collector). What I would like to know is of the
two following options what would be most convenient to you for
accessing the data.
The options are:
OPTION A:
Have 1 JSON stream for every day of measurements (either gzipped or plain)
ex.
- https://ooni.torproject.org/reports/json/2016-01-01.json
- https://ooni.torproject.org/reports/json/2016-01-02.json
- https://ooni.torproject.org/reports/json/2016-01-03.json
etc.
OPTION B:
Have 1 JSON stream for every ooni-probe test run and publish them
inside of a directory with the timestamp of when it was collected
ex.
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-NL-AS3265-http_requests-v1-probe.json.gz
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-US-AS3265-dns_consistency-v1-probe.json.gz
etc.
Since we are internally using the daily batches for doing the
processing and analysis of reports unless there is an explicit request
to publish them on a test run basis we will probably end up going for
option A, so don’t be shy to reply :)
---
I agree with David in that it will be easier to access specific
ooni-probe test results using option (B) (i.e. the current solution).
What benefits did you identify when considering to switch to option (A)?
A few reasons to stick with option (B) include:
- Retaining the ability to run ooni-pipeline on a subset of reports
associated with a given time period by filtering by date prefix, and
substrings within key names;
- Retaining the ability to distribute small units of work easily among
subprocesses; and
- Retaining the idempotent nature of ooni-pipeline, and the luigi
framework - switching from lots of small files to a single large file
for a given day will invariably increase the time required to recover
from failures (i.e. if a small dnst-based test fails to normalise,
you'll have to renormalise everything as opposed to a single test);
- Developers will not have to download hundreds of megabytes of data
in order to access a traceroute test result that is only a few
kilobytes in size; and
- It's generally easier to work with smaller files than it is to work
with big files.
Cheers,
Tyler
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB
(tyler(a)tylerfisher.org)
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo
Hello Oonitarians!
In the past months we have been working on re-engineering the data processing pipeline for OONI. As a result you may have noticed that the publishing of reports via https://ooni.torproject.org/reports/ has been paused.
Do not fear: we have not been losing reports, and we will soon resume publishing them, but I would like to ask what would be the most convenient way to do so.
The major improvement is that the reports will from now on be published in JSON as opposed to YAML. The report format will change slightly to make the task of parsing them a bit easier. Each report will be a JSON stream (that is, a series of JSON documents separated by newlines) where every document also contains every key present in the report header. This adds a little bit of overhead to the file size, but allows you to store offsets into the files and not have to always seek back to the header to get all the common information related to that measurement.
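For consumers this should make parsing trivial; something along these lines would read a daily file one measurement at a time:

import json

def iter_measurements(path):
    # Each line of the stream is a self-contained JSON document.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)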
Running some benchmarks on a small sample of the reports collected in one day we can see that the performance increase is huge:
87M 2015-12-22.json
97M 2015-12-22.yaml
vanilla json: 1.37932395935
ultra json: 0.421966075897
simple json: 1.23581790924
pyyaml (without CLoader): 193.864903927
pyyaml (with CLoader): 4.40925312042
Currently for our data processing needs we have begun to bucket reports by date (every date corresponds to when a certain report has been submitted to the collector). What I would like to know is of the two following options what would be most convenient to you for accessing the data.
The options are:
OPTION A:
Have 1 JSON stream for every day of measurements (either gzipped or plain)
ex.
- https://ooni.torproject.org/reports/json/2016-01-01.json
- https://ooni.torproject.org/reports/json/2016-01-02.json
- https://ooni.torproject.org/reports/json/2016-01-03.json
etc.
OPTION B:
Have 1 JSON stream for every ooni-probe test run and publish them inside of a directory with the timestamp of when it was collected
ex.
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-NL-AS3…
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-US-AS3…
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-IT-AS3…
- https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-NL-AS3…
- https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-US-AS3…
- https://ooni.torproject.org/reports/json/2016-01-03/20160103T204732Z-IT-AS3…
etc.
Since we are internally using the daily batches for doing the processing and analysis of reports unless there is an explicit request to publish them on a test run basis we will probably end up going for option A, so don’t be shy to reply :)
~ Arturo
Hello,
I am working on normalisation for all of the DNS based tests right now
(i.e. dns_consistency, and dns_injection) and was wondering if any of
you had any suggestions with regards to how we should be normalising
these results.
So far, what I have come up with looks like this:
{'data_format_version': None,
'input': 'www.ignored.ch',
'options': ['-f', 'citizenlab-urls-global.txt', '-T',
'dns-server-ch.txt'],
'probe_asn': 'AS41715',
'probe_cc': 'CH',
'probe_ip': '127.0.0.1',
'report_filename': 's3://ooni-private/reports-raw/yaml/2016-01-01/dns_consistency-2015-12-31T220031Z-AS41715-probe.yamloo',
'report_id':
'bWEWmX6oEftSSJq9yEF5oH0VPOU5VZJooX06gQENo136sSoj9MzlTBk7EjhfH1Td',
'software_name': 'ooniprobe',
'software_version': '1.3.2',
'test_helpers': {'backend': '213.138.109.232:57004'},
'test_keys': {'annotations': None,
'backend_version': '1.1.4',
'control_resolver': '213.138.109.232:57004',
'errors': {'130.60.128.3': 'dns_lookup_error',
'130.60.128.5': 'dns_lookup_error',
'194.158.230.53': False,
'194.230.1.5': False,
'82.195.224.5': 'no_answer'},
'failed': {'130.60.128.3',
'130.60.128.5',
'82.195.224.5'},
'input_hashes':
['3f786850e387550fdab836ed7e6dc881de23001b'],
'queries': [{'failure': None,
'hostname': 'www.ignored.ch',
'query_type': 'A',
'resolver_hostname': '213.138.109.232',
'resolver_port': 57004},
{'failure': None,
'hostname': 'www.ignored.ch',
'query_type': 'A',
'resolver_hostname': '212.147.10.10',
'resolver_port': 53}],
'successful': {'194.158.230.53',
'194.230.1.5',
'195.186.1.111',
'81.221.252.10'}},
'test_name': 'dns_consistency',
'test_runtime': 32.54842686653137,
'test_start_time': 1451605073.0,
'test_version': '0.6'}
After looking into the source code for the DNS consistency test and
the dnst template, I was able to determine the subject of the DNS
query; however, I am not sure how to handle the addr section, which
changes depending on whether the associated DNS query has a type of
A/SOA/NS (see:
https://github.com/TheTorProject/ooni-probe/blob/master/ooni/templates/dnst.py#L153).
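One shape I have been toying with, purely as a sketch (the field names here are guesses on my part based on reading the dnst template, and may well not be what ooni-pipeline should emit):

def normalise_answers(query):
    # Guessed field names: keep each answer's record type plus either an
    # address (A) or a hostname (SOA/NS), since the meaning of "addr" in
    # the dnst template depends on the record type.
    normalised = []
    for answer in query.get("answers", []):
        answer_type = answer.get("answer_type")
        normalised.append({
            "answer_type": answer_type,
            "ipv4": answer.get("addr") if answer_type == "A" else None,
            "hostname": answer.get("addr") if answer_type in ("SOA", "NS") else None,
        })
    return normalised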
In case you have any suggestions on how to normalise dnst results,
I've linked to the raw and normalised reports below.
Gist: https://gist.github.com/TylerJFisher/7372f9c31c54b5207d2a
Normalisation routine:
https://gist.github.com/TylerJFisher/7372f9c31c54b5207d2a#file-normalise-py
---
Cheers,
Tyler Fisher
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB
(tyler(a)tylerfisher.org)
# What we did in December 2015
* Write deployment scripts for ooni-api
* More work on implementing the frontend to explore the ooni reports (ooni-api)
* Review and submit final version of OTF proposal
* Work on ETL pipeline for OONI reports (focus on implementing an MVP where the analysis and anomaly detection is done at the database layer)
* Configuration and setup of the server at Humboldt University
* Fixes to the lantern network test
* Add support for latest version of twisted
* Do a roadmapping/brainstorming meetup at CCC
* Do the following weekly dev gatherings:
http://meetbot.debian.net/ooni/2015/ooni.2015-12-07-16.59.log.html
http://meetbot.debian.net/ooni/2015/ooni.2015-12-14-16.59.log.html
http://meetbot.debian.net/ooni/2015/ooni.2015-12-21-17.00.log.html
# What we plan to do in January 2016
* Merge the work done on the ooni-pipeline into master
* Tag an alpha release of the ooni reports explorer (ooni-api)
* Set up a redundant copy of the pipeline on the Humboldt server
* Start review of the ooni data formats
* Start writing the specification for the new test, called “web_connectivity”, that should replace http_requests, dns_consistency and tcp_connect
~ Arturo
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo
Hello,
This is a reminder for today's weekly OONI meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis
Hello all,
This is a reminder that today there will be the weekly OONI meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
The meeting will take place at C-base (https://www.c-base.org/). For
all of you that happen to be around, you are more than welcome to join
us.
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis