From October 24th to 26th the OONI team gathered in Berlin for a hackfest. About 20 people showed up. Most of them were seasoned Oonitarians, but some fairly new people also joined us, and I hope they will become part of the growing OONI community.
The hackfest focused on data analytics and visualization, with special attention to the Tor bridge reachability study we are currently running.
# Bridge reachability study
The goal of this study [1] is to answer some questions about the blocking of Tor bridges [2] and pluggable transport [3] enabled bridges in China, Iran, Russia and Ukraine (the test vantage points).
To establish a baseline and rule out the cases in which a bridge is marked as blocked when it is in fact just offline, we also measure from a control vantage point located in the Netherlands. For every test vantage point we perform two types of measurements:
* A Bridge reachability measurement [4][5] that attempts to build a tor circuit using the bridge in question
* A TCP connect measurement [6][7] that simply does a TCP connect to the bridge IP and port
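To make the second measurement concrete, here is a minimal sketch of a TCP connect check using only the Python standard library. It is not the ooniprobe implementation (see [6][7] for the real test); the function name `tcp_connect` and the failure labels are assumptions made for illustration.

```python
import socket


def tcp_connect(host, port, timeout=10.0):
    """Try to open a TCP connection to host:port.

    Returns (True, None) on success, or (False, reason) on failure.
    The reason strings are illustrative labels, not OONI report fields.
    """
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except ConnectionRefusedError:
        return False, "connection_refused"
    except ConnectionResetError:
        return False, "connection_reset"
    except socket.timeout:
        return False, "timeout"
    except OSError as exc:
        # Any other network error (unreachable host, DNS failure, ...).
        return False, "error: %s" % exc
```

Calling `tcp_connect(bridge_ip, bridge_port)` from each vantage point gives a result that can be compared with the result from the control.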
We run both measurements to help pin down why a bridge appears blocked: whether it is due to a TCP RST, direct IP blocking or a tor malfunction.
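One possible way to read the two measurements together with the control is sketched below. The function `classify` and its labels are hypothetical and only illustrate the reasoning; they are not taken from the study or from the pipeline code.

```python
def classify(control_ok, tcp_ok, bridge_ok):
    """Combine three boolean results into a coarse, illustrative label.

    control_ok: the bridge was reachable from the control vantage point
    tcp_ok:     the TCP connect from the test vantage point succeeded
    bridge_ok:  a tor circuit could be built through the bridge
    """
    if not control_ok:
        # The bridge may simply be offline, so nothing can be said
        # about blocking.
        return "inconclusive"
    if bridge_ok:
        return "reachable"
    if not tcp_ok:
        # Failure at the TCP level, e.g. a TCP RST or IP blocking.
        return "tcp_level_failure"
    # TCP works but no circuit could be built through the bridge.
    return "tor_level_failure"
```

For example, a bridge that works from the Netherlands but fails the TCP connect from a test vantage point would be labelled `tcp_level_failure` in this scheme.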
So far this study has been running for a little less than 1 month.
# OONI data pipeline
In order to produce the aggregate data needed to build visualizations we have built a data pipeline [8].
This consists of a series of operations performed on the raw reports to strip out sensitive information and place the collected data into a database.
The nice thing is that the data pipeline we have designed is not specific to this study. In the future it can and will be expanded to export the data needed to visualize the other types of measurements done by OONI as well.
The data pipeline consists of 3 steps (or states, depending on how you want to look at it). When data is submitted to an OONI collector it is synchronized with the aggregator. This is a central machine responsible for running all the data processing tasks, storing the collected data in a database and hosting a public interface to the sanitised reports. Since all the steps are independent of one another, they do not have to run on the same machine and could also be distributed.
Once the data is on the aggregator machine it is said to be in the RAW state. The sanitise task is then run on the RAW data to remove sensitive information and strip out some superfluous information. A RAW copy of every report is also stored in a private compressed archive for future reference. Once the data is sanitised it is said to be in the SANITISED state. At this point an import task is run on the data to place it inside a database. The SANITISED reports are then placed in a directory that is publicly exposed to the internet, so that people can also download a copy of the YAML reports.
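The RAW to SANITISED to database flow can be sketched roughly as follows. This is not the ooni-pipeline code: reports are modelled as plain dictionaries instead of YAML files, SQLite stands in for whatever database is used, and the field names (`probe_ip`, `bridge_address`, `country`, `transport`, `success`) and the `reports` table are assumptions made for the example.

```python
import copy
import sqlite3

# Hypothetical names of fields considered sensitive.
SENSITIVE_KEYS = {"probe_ip", "bridge_address"}


def sanitise(raw_report):
    """Return a sanitised copy of a report; the RAW input is left untouched."""
    clean = copy.deepcopy(raw_report)
    for key in SENSITIVE_KEYS:
        clean.pop(key, None)
    return clean


def import_report(conn, report):
    """Insert one sanitised report into the (hypothetical) reports table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports "
        "(country TEXT, transport TEXT, success INTEGER)"
    )
    conn.execute(
        "INSERT INTO reports VALUES (?, ?, ?)",
        (report["country"], report["transport"], int(report["success"])),
    )
    conn.commit()


if __name__ == "__main__":
    raw = {"probe_ip": "192.0.2.1", "bridge_address": "198.51.100.7:443",
           "country": "NL", "transport": "obfs3", "success": True}
    conn = sqlite3.connect(":memory:")
    import_report(conn, sanitise(raw))
    print(conn.execute("SELECT * FROM reports").fetchall())
```

Because `sanitise` works on a copy, the RAW report can still be archived unchanged after the sanitised version has been imported.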
At this point it is possible to run any export task that performs queries on the database and produces as output some documents to be used in the data visualizations (think JSON, CSV, etc.).
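As an illustration of what an export task could look like, the sketch below queries a database and emits a JSON document. The `reports` table, its columns and the function `export_success_rates` are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import json
import sqlite3


def export_success_rates(conn):
    """Return a JSON string with the success rate per country and transport."""
    rows = conn.execute(
        "SELECT country, transport, AVG(success), COUNT(*) "
        "FROM reports GROUP BY country, transport "
        "ORDER BY country, transport"
    ).fetchall()
    return json.dumps([
        {"country": country, "transport": transport,
         "success_rate": rate, "measurements": count}
        for country, transport, rate, count in rows
    ])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reports (country TEXT, transport TEXT, success INTEGER)"
    )
    conn.executemany(
        "INSERT INTO reports VALUES (?, ?, ?)",
        [("CN", "obfs3", 0), ("CN", "obfs3", 1), ("NL", "obfs3", 1)],
    )
    print(export_success_rates(conn))
```

Keeping the export as a pure query over the database means new visualizations only need a new export task, not changes to the earlier steps.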
# The OONI hackfest
The first day of the hackfest was spent going over the scope of the project we would be working on in the following days, as well as working in groups, each interested in tackling the design of one aspect of the problem.
Sticky notes were plentiful and helped us have a clear vision of what lay ahead of us.
By the end of the first day we had a clear picture of the tasks needed to achieve our goals and of which teams would be responsible for what.
The second day was almost entirely dedicated to hacking. Everybody had a task to complete, and some people even finished their initially assigned task before the end of the day and came back asking for more!
By the end of the second day we had a real data set to hand over to the visualization team, to start producing some pretty graphs based on real data.
We decided that the first visualization should be kept as simple as possible and be something that we could also use to debug the data we had collected. It should tell us which bridges were working and when, and it should present the information in a way that highlights the country involved and the pluggable transport type.
A prototype of it can be seen here:
http://reports.ooni.nu/analytics/bridge_reachability/timeline/
The code for this visualization can be found here: https://github.com/Shidash/OONI-Bridge-Reachability-Timeline
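The kind of data such a timeline needs can be sketched as a grouping of measurements by country and transport, then by bridge, ordered in time. This is not the code from the repository above; the record fields (`country`, `transport`, `bridge`, `timestamp`, `success`) and the function `build_timeline` are assumptions made for illustration.

```python
from collections import defaultdict


def build_timeline(measurements):
    """Group measurements as {(country, transport): {bridge: [(timestamp, success), ...]}}.

    Timestamps are assumed to be ISO 8601 strings, so sorting them as
    strings also sorts them chronologically.
    """
    timeline = defaultdict(lambda: defaultdict(list))
    for m in measurements:
        key = (m["country"], m["transport"])
        timeline[key][m["bridge"]].append((m["timestamp"], m["success"]))
    # Order every bridge's data points in time.
    for bridges in timeline.values():
        for points in bridges.values():
            points.sort()
    # Convert to plain dicts so the result is easy to serialise or inspect.
    return {key: dict(bridges) for key, bridges in timeline.items()}
```

Each `(country, transport)` key then corresponds to one group of rows in the timeline, with one row per bridge.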
# Next steps
* Write scripts for generating the bridge_db.json document based on the data that is given to us from the bridge db team https://trac.torproject.org/projects/tor/ticket/13570
* Align the dates in the visual timeline https://trac.torproject.org/projects/tor/ticket/13639
* Better tokenising for bridges, so that bridges that have the same fingerprint but different transports are grouped properly https://trac.torproject.org/projects/tor/ticket/13638
* Finish setting up the docker containers for the steps of the data pipeline https://trac.torproject.org/projects/tor/ticket/13568
* Set up disaster recovery procedure and backup https://trac.torproject.org/projects/tor/ticket/13584
* Set up monitoring of the probes https://trac.torproject.org/projects/tor/ticket/12549
* Add support for obfs4 https://trac.torproject.org/projects/tor/ticket/13597
* Set upper bound in comparison with the control in the bridge reachability timeline https://trac.torproject.org/projects/tor/ticket/13640
* Make sure that the control measurement is for the specific bridge measurement https://trac.torproject.org/projects/tor/ticket/13655
Questions and comments should be directed to the ooni-dev mailing list or to the #ooni channel on irc.oftc.net.
Have fun!
~ Arturo
[1] https://lists.torproject.org/pipermail/ooni-dev/2014-October/000184.html
[2] https://www.torproject.org/docs/bridges
[3] https://www.torproject.org/docs/pluggable-transports.html.en
[4] https://gitweb.torproject.org/ooni/spec.git/blob/HEAD:/test-specs/ts-011-bri...
[5] https://gitweb.torproject.org/ooni-probe.git/blob/HEAD:/ooni/nettests/blocki...
[6] https://gitweb.torproject.org/ooni/spec.git/blob/HEAD:/test-specs/ts-008-tcp...
[7] https://gitweb.torproject.org/ooni-probe.git/blob/HEAD:/ooni/nettests/blocki...
[8] https://github.com/TheTorProject/ooni-pipeline/blob/master/Readme.md#ooni-pi...