Hi all,

For the last couple of days I've been thinking about the visualization of the bridge reachability data and how it relates to the currently deployed ooni [7] system. Here are my conclusions:

== Variables ==

I think that the statistical variables for the bridge reachability reports are:

- Success of the nettest (yes, no, errors)
- The PT of the bridge (obfs3, obfs2, fte, vanilla)
- The pool from which the bridge was extracted (private, tbb, BridgeDB https, BridgeDB email)
- The country "of the ooni-probe"

With these variables I believe we can answer a lot of questions about how much censorship is taking place, where and how.

But there's something left: timing. George sent an email [0] in which he proposes a timeline of events [5] for every bridge, which would allow us to diagnose much more precisely how and why a bridge is being censored. To build that diagram we should first define the events that will be shown in the timeline. I think those events are the values of the pool variable and whether the bridge is blocked in a given country. With the events defined, I think we can define another variable:

- Time deltas between bridge events.

So, for example, this variable would answer: how many {days, hours...} does it take China to block a bridge that is published in BridgeDB? Is China blocking new bridges at the same speed as Iran? How many days does it take China to block a private bridge? There are some ambiguities related to the deltas: for example, if the bridge is sometimes blocked and sometimes not in a country, which delta should we compute?
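
To make the time-delta variable more concrete, here is a minimal Python sketch. The BridgeEvent type, the "pool:..." and "blocked:..." labels and the function names are all made up for illustration; none of them come from ooni. It also picks one possible answer to the ambiguity (the first time the bridge is seen blocked after it first appears in the pool), which is only an assumption, not a decision.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class BridgeEvent(NamedTuple):
    fingerprint: str      # bridge fingerprint
    timestamp: datetime   # when the event was observed
    kind: str             # hypothetical labels, e.g. "pool:BridgeDB https" or "blocked:CN"


def first_time(events: List[BridgeEvent], fingerprint: str, kind: str,
               not_before: Optional[datetime] = None) -> Optional[datetime]:
    """Earliest timestamp of one kind of event for one bridge, or None."""
    times = [e.timestamp for e in events
             if e.fingerprint == fingerprint and e.kind == kind
             and (not_before is None or e.timestamp >= not_before)]
    return min(times) if times else None


def time_to_block(events: List[BridgeEvent], fingerprint: str,
                  pool: str, country: str) -> Optional[timedelta]:
    """Delta between first appearance in a pool and the first block seen afterwards.

    Assumption: "first block after publication" is the delta we want.
    Returns None if the bridge was never in the pool or never seen blocked.
    """
    published = first_time(events, fingerprint, "pool:" + pool)
    if published is None:
        return None
    blocked = first_time(events, fingerprint, "blocked:" + country,
                         not_before=published)
    if blocked is None:
        return None
    return blocked - published
```

A bridge that flips between blocked and reachable would need a different rule (last block, longest blocked interval, ...); the sketch only shows that the choice has to be made explicitly.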

Finally, the etherpad [1] suggests tor's bootstrap as a variable, and I don't understand why. Is it meant to detect some form of censorship? Can anyone explain a little more?

== Data schema ==

In the last email Ruben, Laurier and Pascal "strongly recommended importing the reports into a database". I strongly agree. We should provide a service to query the values of the variables above, plus the timestamp of the nettest and the fingerprint of the bridge. With this database, the inconsistencies between the data formats of the reports would be erased and working with the data would be much easier. I think that we should also provide a way to export query results to csv/json so that other people can dig into the data. I also believe that we could use mongodb for one reason: we can distribute it very easily. I'll explain why in the Future section.
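
As a rough illustration of what one record and the csv/json export could look like, here is a small Python sketch using only the standard library. The Measurement type and its field names are my own guesses based on the variables listed in this email, not an existing ooni schema, and the sketch leaves out the database itself.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class Measurement:
    # Hypothetical field names, one per variable discussed in this email.
    fingerprint: str     # fingerprint of the bridge
    timestamp: str       # time of the nettest, e.g. ISO 8601 in UTC
    success: str         # "yes", "no" or "error"
    transport: str       # PT of the bridge: obfs3, obfs2, fte, vanilla
    pool: str            # private, tbb, BridgeDB https, BridgeDB email
    probe_country: str   # country of the ooni-probe


def to_json(rows: List[Measurement]) -> str:
    """Serialize query results as a JSON array of objects."""
    return json.dumps([asdict(r) for r in rows], indent=2)


def to_csv(rows: List[Measurement]) -> str:
    """Serialize query results as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(Measurement)])
    writer.writeheader()
    for r in rows:
        writer.writerow(asdict(r))
    return buf.getvalue()
```

Having a single normalized record like this is what would hide the differences between report formats from whoever queries or exports the data.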

== Biased data ==

Can a malicious ooni-probe bias the data? For example, if it runs some tests in bursts, the reports are going to be identical and the general picture could be biased. Any more ideas?

== Geo Data ==

In the etherpad [1] it's suggested that we increase the granularity of the geo data to detect geographical patterns, but it seems [2] that at least in China there are no such patterns, so maybe we should discard the idea altogether.

== Playing with data ==

So far I've talked about the data. Now I want to address how to present it. I think we should provide a way to play with the data to allow a more thoughtful and precise diagnosis of censorship. What I have in mind is to enhance the interactivity of the visualization by letting the user render the diagrams while she thinks about the data. The idea is to let the user go from more general to more concrete data patterns.

So imagine that the user loads the visualization's page. First she sees a global heat map of censorship as measured by the bridge reachability test. She is Chinese, so she clicks on her country and a histogram like [3] for China is stacked below the global map. She then clicks on obfs2 and a diagram like [4] is also stacked at the bottom, but showing only the success variable for the obfs2 PT. Then she clicks on the True value of the success variable and all the bridges that were reached by the nettest executions in China in that period of time are shown. Finally she selects one bridge and its timeline [5] plus its link to atlas [6] are provided.

This is only one particular scenario; the core idea is to give the user the ability to draw conclusions in as much depth as she desires. The user starts with the most general view of the data and applies restrictions to the datapoints to dig deeper. Going from general to specific, she can form hypotheses that she later discards or confirms with the additional information displayed in the next diagram. There are some usability problems with the selection of diagram+variable and with the diverse set of users that will use the system, but I'd be very glad to think about them if you like the idea.
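
On the data side, the drill-down in that scenario is just a chain of restrictions plus a count at each step. Here is a minimal Python sketch; the helper names, the field names and the four sample rows are made up for illustration and are not real measurements.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def restrict(rows: List[Row], **criteria: str) -> List[Row]:
    """Keep only the rows whose fields match every given value."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def count_by(rows: List[Row], key: str) -> Counter:
    """Count rows per value of one variable (the data behind one diagram)."""
    return Counter(r[key] for r in rows)


# Made-up rows with placeholder fingerprints, only to show the flow.
rows = [
    {"fingerprint": "AAAA", "probe_country": "CN", "transport": "obfs2", "success": "yes"},
    {"fingerprint": "BBBB", "probe_country": "CN", "transport": "obfs2", "success": "no"},
    {"fingerprint": "CCCC", "probe_country": "CN", "transport": "obfs3", "success": "yes"},
    {"fingerprint": "DDDD", "probe_country": "IR", "transport": "obfs2", "success": "yes"},
]

per_country = count_by(rows, "probe_country")       # global map
in_china = restrict(rows, probe_country="CN")       # click on China
per_transport = count_by(in_china, "transport")     # histogram like [3]
obfs2 = restrict(in_china, transport="obfs2")       # click on obfs2
per_success = count_by(obfs2, "success")            # diagram like [4]
reached = sorted({r["fingerprint"] for r in restrict(obfs2, success="yes")})
```

Each click adds one criterion, so the stacked diagrams could all be driven by the same two operations, whatever the backend ends up being.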

== Users ==

I think there are three sets of users:

1- Tor users who are interested in the censorship performed in their country and how to avoid it.
2- Journalists who want to write something about censorship but aren't that tech savvy.
3- Researchers who want updated and detailed data about censorship.

I believe we can provide a system that satisfies all three of them if we succeed with the previous point.

== Future ==

So, why do I think that we should index the data with mongodb? Because I think that this data repository should be provided as a new ooni-backend API related to the current collector. Right now a collector can store reports from any ooni-probe instance that chooses to submit them, and its API is completely separate from the bouncer API, which overall is a wise design decision because you can deploy ooni-backend to work only as a collector. So it's not unreasonable to think that we can have several collectors collecting different reports, because the backend is designed to do so; therefore we need the data repository to be distributed. And mongodb is good at this.

If we build the database for the bridge reachability nettests, I think that we should design it so that in the future it can index all nettest reports, and therefore generalize the design, implementation and deployment of all the work that we are going to do for bridge reachability. That way an analyst can query the distributed database with a proper client that connects to the data repository ooni-backend API.

So to sum up, I started talking about the bridge reachability visualization problem and finished with a much broader vision that intends to integrate the ongoing bridge reachability efforts to improve ooni as a whole. Hope the email is not too long.

ciao

[0] https://lists.torproject.org/pipermail/tor-dev/2014-October/007585.html
[1] https://pad.riseup.net/p/bridgereachability
[2] https://blog.torproject.org/blog/closer-look-great-firewall-china
[3] http://ooniviz.chokepointproject.net/transports.htm
[4] http://ooniviz.chokepointproject.net/successes.htm
[5] https://people.torproject.org/~asn/bridget_vis/tbb_blocked_timeline.jpg
[6] https://atlas.torproject.org
[7] https://ooni.torproject.org/