I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
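The kind of heuristic classify.py implements can be sketched as signature matching over response bodies. The signature strings below are illustrative assumptions, not classify.py's actual patterns:

```python
# Hypothetical block-page signatures; classify.py's real patterns differ.
BLOCK_PAGE_SIGNATURES = {
    "cloudflare": "Attention Required! | Cloudflare",
    "akamai": "Access Denied",
}

def classify_block_page(body):
    """Return the name of a matching block-page signature, or None."""
    for name, signature in BLOCK_PAGE_SIGNATURES.items():
        if signature in body:
            return name
    return None
```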
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
The choice of YAML parser really matters: it can mean a 30× difference
in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow;
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
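The fast-loader pattern looks like this; the getattr fallback is an assumption for installations where PyYAML was built without libyaml (in which case CSafeLoader does not exist):

```python
import io

import yaml

# Prefer the C-backed loader when PyYAML was built with libyaml;
# fall back to the pure-Python SafeLoader otherwise.
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

data = "---\na: 1\n---\nb: 2\n"
# load_all iterates over every document in a multi-document stream.
docs = list(yaml.load_all(io.StringIO(data), Loader=Loader))
```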
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
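In practice this comes down to matching against a set of known aliases. A sketch (the helper name is ours, not from the repository):

```python
# All test_name values we observed for the http_requests test.
HTTP_REQUESTS_TEST_NAMES = {
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
}

def is_http_requests(report_header):
    """Return True if a report header belongs to an http_requests test."""
    return report_header.get("test_name") in HTTP_REQUESTS_TEST_NAMES
```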
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
In still other requests, Tor is indicated by the special URL scheme
"shttp"; e.g. "shttp://example.com/". The ooni.py module fixes up
some of these issues, but only for the http_requests test. You'll have
to figure it out on your own for other tests.
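Normalizing these variants can be sketched as follows. This is a hypothetical helper, not the actual ooni.py code, and it assumes the request entry's URL lives under request["request"]["url"]:

```python
def request_uses_tor(request):
    """Return True if an http_requests request entry was made over Tor,
    normalizing the several formats seen in the wild."""
    url = request.get("request", {}).get("url", "")
    if url.startswith("shttp://"):
        return True                     # legacy "shttp" scheme marks Tor
    tor = request.get("tor")
    if isinstance(tor, dict):
        return bool(tor.get("is_tor"))  # tor: {is_tor: true, ...}
    return bool(tor)                    # tor: true / tor: false
```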
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
Hello everyone,
This is a reminder for today's weekly OONI IRC meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
Event Times around the world:
https://www.timeanddate.com/worldclock/fixedtime.html?iso=20160321T17
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis
--
Fingerprint: 8FD5 CF5F 39FC 03EB B382 7470 5FBF 70B1 D126 0162
Pubkey: https://pgp.mit.edu/pks/lookup?op=get&search=0x5FBF70B1D1260162
I just downloaded all the http_requests reports from
https://measurements.ooni.torproject.org/. It took quite a long time and
I wonder if we can make things more efficient by compressing the reports
on the server.
This is the command I ran to download the reports:
wget -c -r -l 2 -np --no-directories -A '*http_requests*' --no-http-keep-alive https://measurements.ooni.torproject.org/
This resulted in 309 GB and 6387 files.
If I compress the files with xz,
xz -v *.json
they only take up 29 GB (9%).
Processing xz-compressed files is pretty easy, as long as you don't have
to seek. Just do something like this:
import json
import subprocess

def open_xz(filename):
    # Decompress as a stream via the xz tool; fine as long as we
    # only read sequentially and never seek.
    p = subprocess.Popen(["xz", "-dc", filename], stdout=subprocess.PIPE, bufsize=-1)
    return p.stdout

for line in open_xz("report.json.xz"):
    doc = json.loads(line)
    ...
Of course you can do the same thing with gzip.
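For gzip, the standard library can stream directly, with no subprocess needed. A sketch (the helper name is ours):

```python
import gzip
import json

def iter_json_gz(filename):
    """Yield one JSON document per line from a gzip-compressed file."""
    # "rt" gives text mode, so iteration yields decoded lines.
    with gzip.open(filename, "rt") as f:
        for line in f:
            yield json.loads(line)
```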
Parts of the ooni-pipeline use datetime.fromtimestamp instead of
datetime.utcfromtimestamp. This means the timestamp is parsed
differently depending on the time zone in which the code is run:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3b…
entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3b…
start_time = datetime.fromtimestamp(header.get('start_time', 0))
Now, it appears that there's no actual harm done by this bug, because it
looks like the timestamps were read and written by someone in UTC+1, so
the conversion evens out in the final JSON. But it could go wrong if run
by someone in a different time zone.
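The difference is easy to demonstrate by pinning the process time zone (Unix-only, via time.tzset; the zone choice here is just to simulate a UTC+1 machine):

```python
import os
import time
from datetime import datetime

# Simulate running in a UTC+1 zone (Unix-only: time.tzset).
os.environ["TZ"] = "Europe/Rome"
time.tzset()

ts = 1420070400  # 2015-01-01 00:00:00 UTC
print(datetime.fromtimestamp(ts))     # local wall clock: 2015-01-01 01:00:00
print(datetime.utcfromtimestamp(ts))  # UTC:              2015-01-01 00:00:00
```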
I noticed this because when I process the new JSON files, I get
timestamps that are an hour different than I got parsing the YAML files.
To parse the JSON times, I use datetime.datetime.strptime(..., "%Y-%m-%d %H:%M:%S"),
which gives the correct answer because of the reason in the previous
paragraph. My old YAML times were off by an hour, because I used
datetime.datetime.utcfromtimestamp(...), parsing a UTC+1 timestamp as if
it were UTC.
Here's some related information I found while researching the above. It
looks like individual test results are *also* using local timestamps, or
have their clocks set wrong, or something. Because when I compare the
"test_start_time" field to the "Date" header (which is always supposed
to be UTC) in a response to an http_requests test, I get some widely
divergent values (I saw up to around 9 hours).
For example, consider
https://measurements.ooni.torproject.org/2015-01-01/20150101T060029Z-BE-AS2…
It has a difference of about 6 hours:
{
  "test_start_time": "2015-01-01 07:00:29",
  "test_keys": {
    "requests": [
      {
        "response": {
          "headers": {
            "Date": "Thu, 01 Jan 2015 01:00:35 GMT",
            ...
          }
        }
      }
    ]
  },
  "probe_cc": "BE"
}
probe_cc is "BE", Belgium, which is supposed to be UTC+1, so I don't
know where the 6-hour discrepancy is coming from. Note also that the
timestamp in the file name is not the same: 060029.
Here's a summary of the http_requests reports from 2015-01-01. Most of
them match, but some are off by 6 or 1 hour:
CC: CH test_start_time: 2015-01-01 03:00:28 Date: Thu, 01 Jan 2015 02:59:14 GMT +0
CC: BE test_start_time: 2015-01-01 07:00:29 Date: Thu, 01 Jan 2015 01:00:35 GMT +6
CC: FR test_start_time: 2015-01-01 09:41:08 Date: Thu, 01 Jan 2015 09:41:14 GMT +0
CC: FR test_start_time: 2015-01-01 10:47:20 Date: Thu, 01 Jan 2015 10:47:23 GMT +0
CC: NL test_start_time: 2015-01-01 12:42:16 Date: Thu, 01 Jan 2015 12:42:21 GMT +0
CC: NL test_start_time: 2015-01-01 15:41:32 Date: Thu, 01 Jan 2015 15:41:44 GMT +0
CC: CZ test_start_time: 2015-01-01 15:46:27 Date: Thu, 01 Jan 2015 15:46:34 GMT +0
CC: NL test_start_time: 2015-01-01 17:19:39 Date: Thu, 01 Jan 2015 17:19:43 GMT +0
CC: HK test_start_time: 2015-01-01 18:03:57 Date: Thu, 01 Jan 2015 17:04:10 GMT +1
CC: NL test_start_time: 2015-01-01 21:47:32 Date: Thu, 01 Jan 2015 21:47:37 GMT +0
I used this script:
import json
import sys

for line in sys.stdin:
    doc = json.loads(line)
    try:
        probe_cc = doc["probe_cc"]
        test_start_time = doc["test_start_time"]
        for request in doc["test_keys"]["requests"]:
            response_time = request["response"]["headers"]["Date"]
            print "CC: %s test_start_time: %s Date: %s" % (probe_cc, test_start_time, response_time)
    except KeyError:
        pass
The YAML reports had two time fields:
start_time: timestamp of the start of ooni-probe run
test_start_time: timestamp of the start of each individual test
Within a single report file, start_time was constant, while
test_start_time would advance with each successive test, depending on
how long each test took to run.
The JSON format reports have just one of the fields, test_start_time,
but it confusingly appears to have the same meaning as start_time in the
old YAML reports (it doesn't change within a report file):
test_start_time: timestamp of the start of ooni-probe run
It might be because of this code:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3b…
entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
Some tests can take many minutes or hours to run, so by the end, the
JSON test_start_time might be far off from the real time when it was
run. Is there a way we could get both timestamps in each record again?
What I'm doing now is incrementing a counter according to the
test_runtime field of each record, and adding that counter to the
test_start_time in order to estimate the individual test's start time.
But that is only an approximation, and some older reports do not have
test_runtime.
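The running-counter estimate can be sketched like this. The field names follow the JSON reports; as noted above, treat the result as approximate:

```python
from datetime import datetime, timedelta

def estimate_start_times(entries):
    """Given one report's entries in order, estimate each test's start
    time by accumulating test_runtime onto the report-level
    test_start_time (which is constant within a report file)."""
    offset = 0.0
    for entry in entries:
        base = datetime.strptime(entry["test_start_time"], "%Y-%m-%d %H:%M:%S")
        yield base + timedelta(seconds=offset)
        # Older reports may lack test_runtime; assume 0 in that case.
        offset += entry.get("test_runtime", 0.0)
```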
Dear oonitarians,
Following the discussion during the last meeting, we have created a
design document [0] that outlines the future plans for network meter.
It's still very much a living document, and is updated often.
Feel free to edit or add suggestions.
[0] https://pad.riseup.net/p/network-meter
--
poly
@0xPoly
https://darkdepths.net/pages/contact-keys.html
Hello ooni people,
The official OONI reports webserver is up and running!
Canonical location: https://reports.ooni.torproject.org
The sanitized public reports hosted there are currently updated hourly.
Hope this helps.
~Vasilis
Hello Oonitarians,
I was having a discussion with vasilis about some changes I recently made to the lantern
and psiphon tests, around exposing some extra configuration options, and would like to know
what your opinion is on the matter.
Basically, these tests try to run the lantern or psiphon tool, attempt to connect to
a certain website through it, and verify that the response they get from the website is the
expected one.
In the past it was possible to configure the URL to be fetched and the expected response
body, with sane defaults.
I have changed this so that the tests instead use a hardcoded website that we expect not
to change in the future, together with its expected result.
In the specific case it’s http://www.google.com/humans.txt
The reason for doing this is that I want to avoid the possibility of a user misconfiguring
the URL and expected body to something that is not true (e.g. claiming foo.com returns
“bar” when it actually returns “foo”), leading to inconsistent results.
The argument against this is that the website we use for testing may change in the future
and if we don’t notice then we can still have inconsistent results.
Even if that were to happen, I believe it’s more likely that we, the developers of the
tool, will notice and ship an update than that users would tweak their ooniprobe to
provide valid measurements.
I believe that exposing some settings that can lead to measurements that are not true is
sub-optimal, but I would like to hear contrasting opinions.
~ Arturo
Hello,
This is a reminder for today's weekly OONI IRC meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
Event Times around the world:
https://www.timeanddate.com/worldclock/fixedtime.html?iso=20160314T17
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis