I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
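To give a rough flavor of how these pieces could fit together, here is a hypothetical sketch; iter_records and classify below are placeholder names I am using for illustration, not the modules' actual interfaces (see the README for those):

# Hypothetical sketch only: iter_records and classify are made-up
# placeholder names, not the real ooni.py / classify.py interfaces.
import ooni, classify

for record in ooni.iter_records("some-http_requests-report.yaml.gz"):
    for entry in record.get("requests", []):
        body = entry.get("response", {}).get("body", "")
        print(classify.classify(body))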
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
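If you decide to mirror everything, you can feed that list straight to Wget, for example (standard Wget options; adjust paths to taste):
wget --continue --input-file=ooni-report-urls.txt --directory-prefix=ooni-reports/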
The choice of YAML parser really matters: it can make a 30× difference
in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow;
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
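For reference, here is a minimal sketch of that loading loop. It is a simplification, not the exact logic in ooni.py; in particular it just gives up on the rest of a file when a bad record aborts the stream, and it assumes PyYAML was built with libyaml so that CSafeLoader is available:

import gzip
import yaml

def iter_yaml_documents(path):
    # Reports from api.ooni.io may be gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        try:
            # CSafeLoader is dramatically faster than the pure-Python loader.
            for doc in yaml.load_all(f, Loader=yaml.CSafeLoader):
                yield doc
        except (yaml.YAMLError, UnicodeError):
            # A record with invalid Unicode escapes (such as the lone UTF-16
            # surrogates mentioned above) can abort the whole stream; such
            # records are rare enough that skipping the remainder is tolerable.
            return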
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
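In a script, the simplest thing is to match against the whole set. A small sketch, assuming header is the dict parsed from a report's first YAML document:

HTTP_REQUESTS_TEST_NAMES = {
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
}

def is_http_requests_report(header):
    # The report header carries the test_name field.
    return header.get("test_name") in HTTP_REQUESTS_TEST_NAMES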
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
And in some requests, a special URL scheme "shttp" indicates a Tor
request; e.g. "shttp://example.com/". The ooni.py script fixes up
some of these issues, but only for the http_requests test. You'll have
to figure it out on your own for other tests.
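As an illustration, a normalization along these lines covers the variants we saw. It is only a sketch, assuming each entry in a record's "requests" list has a "request" dict carrying the "tor" field and the URL; it is not the exact code in ooni.py:

def request_uses_tor(entry):
    # entry is assumed to be one element of a record's "requests" list.
    request = entry.get("request", {})
    tor = request.get("tor")
    if tor is True:
        return True
    if isinstance(tor, dict) and tor.get("is_tor"):
        return True
    # Some early reports mark Tor requests with the fake "shttp" URL scheme.
    if request.get("url", "").startswith("shttp://"):
        return True
    return False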
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
Hi Arturo,
---
Currently for our data processing needs we have begun to bucket
reports by date (every date corresponds to when a certain report has
been submitted to the collector). What I would like to know is of the
two following options what would be most convenient to you for
accessing the data.
The options are:
OPTION A:
Have 1 JSON stream for every day of measurements (either gzipped or plain)
ex.
- https://ooni.torproject.org/reports/json/2016-01-01.json
- https://ooni.torproject.org/reports/json/2016-01-02.json
- https://ooni.torproject.org/reports/json/2016-01-03.json
etc.
OPTION B:
Have 1 JSON stream for every ooni-probe test run and publish them
inside of a directory with the timestamp of when it was collected
ex.
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-NL-AS3265-http_requests-v1-probe.json.gz
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-US-AS3265-dns_consistency-v1-probe.json.gz
etc.
Since we are internally using the daily batches for doing the
processing and analysis of reports unless there is an explicit request
to publish them on a test run basis we will probably end up going for
option A, so don’t be shy to reply :)
---
I agree with David in that it will be easier to access specific
ooni-probe test results using option (B) (i.e. the current solution).
What benefits did you identify when considering to switch to option (A)?
A few reasons to stick with option (B) include:
- Retaining the ability to run ooni-pipeline on a subset of reports
associated with a given time period by filtering by date prefix, and
substrings within key names;
- Retaining the ability to distribute small units of work easily among
subprocesses; and
- Retaining the idempotent nature of ooni-pipeline, and the luigi
framework - switching from lots of small files to a single large file
for a given day will invariably increase the time required to recover
from failures (i.e. if a small dnst-based test fails to normalise,
you'll have to renormalise everything as opposed to a single test);
- Developers will not have to download hundreds of megabytes of data
in order to access a traceroute test result that is only a few
kilobytes in size; and
- It's generally easier to work with smaller files than it is to work
with big files.
Cheers,
Tyler
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB
(tyler(a)tylerfisher.org)
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo
Hello Oonitarians!
In the past months we have been working on re-engineering the data processing pipeline for OONI. As a result you may have noticed that the publishing of reports via https://ooni.torproject.org/reports/ has been paused.
Do not fear: we have not been losing reports, and we will soon resume publishing them, but I would like to ask what would be the most convenient way to do so.
The major improvement is that the reports will from now on be published in JSON as opposed to YAML. The report format will change slightly to make the task of parsing them a bit easier. Each report will be a JSON stream (that is, a series of JSON documents separated by newlines) where every document also contains every key present in the report header. This adds a little bit of overhead to the file size, but allows you to store offsets into the files and not have to always seek back to the header to get all the common information related to that measurement.
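For consumers this should make parsing trivial; something along these lines would read a daily file one measurement at a time:

import json

def iter_measurements(path):
    # Each line of the stream is a self-contained JSON document.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)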
Running some benchmarks on a small sample of the reports collected in one day we can see that the performance increase is huge:
87M 2015-12-22.json
97M 2015-12-22.yaml
vanilla json: 1.37932395935
ultra json: 0.421966075897
simple json: 1.23581790924
pyyaml (without CLoader): 193.864903927
pyyaml (with CLoader): 4.40925312042
Currently for our data processing needs we have begun to bucket reports by date (every date corresponds to when a certain report has been submitted to the collector). What I would like to know is of the two following options what would be most convenient to you for accessing the data.
The options are:
OPTION A:
Have 1 JSON stream for every day of measurements (either gzipped or plain)
ex.
- https://ooni.torproject.org/reports/json/2016-01-01.json
- https://ooni.torproject.org/reports/json/2016-01-02.json
- https://ooni.torproject.org/reports/json/2016-01-03.json
etc.
OPTION B:
Have 1 JSON stream for every ooni-probe test run and publish them inside of a directory with the timestamp of when it was collected
ex.
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-NL-AS3…
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-US-AS3…
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-IT-AS3…
- https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-NL-AS3…
- https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-US-AS3…
- https://ooni.torproject.org/reports/json/2016-01-03/20160103T204732Z-IT-AS3…
etc.
Since we are internally using the daily batches for doing the processing and analysis of reports unless there is an explicit request to publish them on a test run basis we will probably end up going for option A, so don’t be shy to reply :)
~ Arturo
Hello,
I am working on normalisation for all of the DNS based tests right now
(i.e. dns_consistency, and dns_injection) and was wondering if any of
you had any suggestions with regards to how we should be normalising
these results.
So far, what I have come up with looks like this:
{'data_format_version': None,
'input': 'www.ignored.ch',
'options': ['-f', 'citizenlab-urls-global.txt', '-T',
'dns-server-ch.txt'],
'probe_asn': 'AS41715',
'probe_cc': 'CH',
'probe_ip': '127.0.0.1',
'report_filename': 's3://ooni-private/reports-raw/yaml/2016-01-01/dns_consistency-2015-12-31T220031Z-AS41715-probe.yamloo',
'report_id':
'bWEWmX6oEftSSJq9yEF5oH0VPOU5VZJooX06gQENo136sSoj9MzlTBk7EjhfH1Td',
'software_name': 'ooniprobe',
'software_version': '1.3.2',
'test_helpers': {'backend': '213.138.109.232:57004'},
'test_keys': {'annotations': None,
'backend_version': '1.1.4',
'control_resolver': '213.138.109.232:57004',
'errors': {'130.60.128.3': 'dns_lookup_error',
'130.60.128.5': 'dns_lookup_error',
'194.158.230.53': False,
'194.230.1.5': False,
'82.195.224.5': 'no_answer'},
'failed': {'130.60.128.3',
'130.60.128.5',
'82.195.224.5'},
'input_hashes':
['3f786850e387550fdab836ed7e6dc881de23001b'],
'queries': [{'failure': None,
'hostname': 'www.ignored.ch',
'query_type': 'A',
'resolver_hostname': '213.138.109.232',
'resolver_port': 57004},
{'failure': None,
'hostname': 'www.ignored.ch',
'query_type': 'A',
'resolver_hostname': '212.147.10.10',
'resolver_port': 53}],
'successful': {'194.158.230.53',
'194.230.1.5',
'195.186.1.111',
'81.221.252.10'}},
'test_name': 'dns_consistency',
'test_runtime': 32.54842686653137,
'test_start_time': 1451605073.0,
'test_version': '0.6'}
After looking into the source code for the DNS consistency test and
the dnst template, I was able to determine the subject of the DNS
query; however, I am not sure how to handle the addr section, which
changes depending on whether the associated DNS query has a type of
A/SOA/NS (see:
https://github.com/TheTorProject/ooni-probe/blob/master/ooni/templates/dnst.py#L153).
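One shape I have been toying with, purely as a sketch (the field names here are guesses on my part based on reading the dnst template, and may well not be what ooni-pipeline should emit):

def normalise_answers(query):
    # Guessed field names: keep each answer's record type plus either an
    # address (A) or a hostname (SOA/NS), since the meaning of "addr" in
    # the dnst template depends on the record type.
    normalised = []
    for answer in query.get("answers", []):
        answer_type = answer.get("answer_type")
        normalised.append({
            "answer_type": answer_type,
            "ipv4": answer.get("addr") if answer_type == "A" else None,
            "hostname": answer.get("addr") if answer_type in ("SOA", "NS") else None,
        })
    return normalised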
In case you have any suggestions on how to normalise dnst results,
I've linked to the raw and normalised reports below.
Gist: https://gist.github.com/TylerJFisher/7372f9c31c54b5207d2a
Normalisation routine:
https://gist.github.com/TylerJFisher/7372f9c31c54b5207d2a#file-normalise-py
---
Cheers,
Tyler Fisher
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB
(tyler(a)tylerfisher.org)
# What we did in December 2015
* Write deployment scripts for ooni-api
* More work on implementing the frontend to explore the ooni reports (ooni-api)
* Review and submit final version of OTF proposal
* Work on ETL pipeline for OONI reports (focus on implementing an MVP where the analysis and anomaly detection is done at the database layer)
* Configuration and setup of the server at Humboldt University
* Fixes to the lantern network test
* Add support for latest version of twisted
* Do a roadmapping/brainstorming meetup at CCC
* Do the following weekly dev gatherings:
http://meetbot.debian.net/ooni/2015/ooni.2015-12-07-16.59.log.html
http://meetbot.debian.net/ooni/2015/ooni.2015-12-14-16.59.log.html
http://meetbot.debian.net/ooni/2015/ooni.2015-12-21-17.00.log.html
# What we plan to do in January 2016
* Merge the work done on the ooni-pipeline into master
* Tag an alpha release of the ooni reports explorer (ooni-api)
* Set up a redundant copy of the pipeline on the Humboldt server
* Start review of the ooni data formats
* Start writing the specification for the new test, called “web_connectivity”, that should replace http_requests, dns_consistency and tcp_connect
~ Arturo
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo
Hello,
This is a reminder for today's weekly OONI meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis
Hello all,
This is a reminder that today there will be the weekly OONI meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CEST, 12:00 EST, 09:00 PST).
The meeting will take place at C-base (https://www.c-base.org/). For
all of you that happen to be around, you are more than welcome to join
us.
Everybody is welcome to join us and bring their questions and feedback.
~Vasilis