On 10/16/14, 7:59 PM, kudrom wrote:
Hi all,

For the last couple of days I've been thinking about the visualization of the bridge reachability data and how it relates to the currently deployed ooni [7] system. Here are the conclusions:
Hi Kudrom,
Thanks for taking the time to compose this very detailed email!
I will add a few comments here and there. I will also create a wiki page to contain all of this information so that it doesn't get lost and can be updated as we think more about it during the hackathon.
== Variables ==

I think that the statistical variables for the bridge reachability reports are:
- Success of the nettest (yes, no, errors)
- The PT of the bridge (obfs3, obfs2, fte, vanilla)
- The pool from where the bridge has been extracted (private, tbb, BridgeDB https, BridgeDB email)
- The country "of the ooni-probe"
I would also add:
- The time of the measurement
- How long it took for Tor to reach 100% bootstrapping
- The result of the corresponding tcp_connect test
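To make this concrete, here is a minimal sketch of what one measurement record could look like; the field names are hypothetical placeholders of mine, not the actual ooniprobe report keys:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class BridgeMeasurement:
        success: str           # "success", "failure", or an error label
        transport: str         # obfs3, obfs2, fte, vanilla
        pool: str              # private, tbb, bridgedb-https, bridgedb-email
        probe_cc: str          # country of the ooni-probe
        measured_at: datetime  # time of the measurement
        fingerprint: str       # bridge fingerprint, to join with the timeline
        bootstrap_seconds: Optional[float] = None  # time to 100% bootstrap, if reached
        tcp_connect_ok: Optional[bool] = None      # result of the tcp_connect test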
With these variables I believe we can answer a lot of questions about how much censorship is taking place, where, and how.
But there's something left: the timing. George sent an email [0] in which he proposes a timeline of events [5] for every bridge, which would allow us to diagnose with much more precision how and why a bridge is being censored. To build that diagram we should first define the events that will be shown in the timeline. I think those events are the values of the pool variable and whether the bridge is being blocked in a given country.
With the events defined, I think we can define another variable:
- Time deltas between bridge events.
So, for example, this variable will answer questions like: how many {days, hours...} does it take China to block a bridge that is published in BridgeDB? Is China blocking new bridges at the same speed as Iran? How many days does it take China to block a private bridge? There are some ambiguities related to the deltas; for example, if the bridge is sometimes blocked and sometimes not in a country, which delta should we compute?
To include deltas into the picture I think we first need to define what the base reference points are. Some things that come to mind are:
* Dates in which a new bridge was added/removed from TBB
* Dates in which a bridge was added to bridgeDB
Other?
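For instance, here is a minimal sketch of the delta computation, assuming hypothetical per-bridge records with a success field and a measured_at timestamp, and taking the first blocked observation as the endpoint (a bridge that flaps between blocked and unblocked would need a different policy, e.g. the first of N consecutive failures):

    from datetime import datetime

    def blocking_delta(published_at, measurements):
        """Days between a reference event (e.g. the bridge being added
        to BridgeDB or to TBB) and the first measurement that found the
        bridge blocked in a given country; None if never seen blocked."""
        blocked_times = [m["measured_at"] for m in measurements
                         if m["success"] == "failure"]
        if not blocked_times:
            return None
        return (min(blocked_times) - published_at).days

    # e.g.: how many days did China take to block a bridge published
    # on 2014-10-01?
    # blocking_delta(datetime(2014, 10, 1), measurements_for_bridge_in_cn)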
Finally, in the etherpad [1] Tor's bootstrap is suggested as a variable; I don't understand why. Is it to detect some form of censorship? Can anyone explain a little more?
I think this is more useful for debugging purposes, so it's probably not something that we want to be part of the core of the visualizations.
Generally when a bridge is blocked it will fail at ~50% with the progress tag of "Loading relay descriptors". If that does not happen there may be some other reason for it turning out as blocked (some bugs in ooniprobe, obfsproxy, txtorcon, tor).
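As an illustration (my own heuristic, not what ooniprobe actually does), that signature could be detected from Tor's bootstrap log lines like this:

    import re

    # Tor logs bootstrap progress as e.g.
    # "Bootstrapped 50%: Loading relay descriptors"
    BOOTSTRAP_RE = re.compile(r"Bootstrapped (\d+)%: (.+)")

    def stalled_loading_descriptors(log_lines):
        """True if the last bootstrap progress seen stayed below 100%
        in the 'Loading relay descriptors' phase -- the usual blocking
        signature; other failures may point at tooling bugs instead."""
        pct, phase = 0, ""
        for line in log_lines:
            m = BOOTSTRAP_RE.search(line)
            if m:
                pct, phase = int(m.group(1)), m.group(2)
        return pct < 100 and phase.startswith("Loading relay descriptors")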
== Data schema ==

In the last email Ruben, Laurier and Pascal "strongly recommended importing the reports into a database". I deeply believe the same. We should provide a service to query the values of the previous variables plus the timestamp of the nettest and the fingerprint of the bridge. With this database the inconsistencies between the data formats of the reports should be erased, and working with the data becomes much easier. I think that we should also provide a way to export the queries to csv/json to allow other people to dig into the data. I also believe that we could use mongodb for just one reason: we can distribute it very easily. But let me explain why in the Future section.
I will go into a bit more detail about this in the other part of the thread, but mongodb was picked because it seemed to have a nice Python interface, it is easy to set up, and it has the sorts of features that we need from a NoSQL database.
I should, though, point out that I don't have that much experience with or knowledge of so-called NoSQL databases, so I am very open to suggestions and to switching to another solution. The main feature that I want is to keep it schema-less (hence the NoSQL solution), because the ooniprobe report formats vary greatly depending on the type of test, and I don't want to have to migrate databases (or create a new table) every time a new test is added.
I have imported the bridge reachability data into mongodb, but it's quite easy to adapt the scripts used to import it into another similar database solution.
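To give an idea of what querying and exporting could look like with pymongo, here is a sketch; the database, collection and field names are my own placeholders:

    import csv
    from pymongo import MongoClient  # pip install pymongo

    reports = MongoClient()["ooni"]["bridge_reachability"]

    def query(country, transport):
        """All measurements for one country and pluggable transport."""
        return list(reports.find({"probe_cc": country,
                                  "transport": transport},
                                 {"_id": False}))

    def export_csv(rows, path):
        """Dump query results to CSV so others can dig into the data."""
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    # e.g. export every obfs2 measurement taken from China:
    # export_csv(query("CN", "obfs2"), "cn_obfs2.csv")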
== Biased data ==

Can a malicious ooni-probe bias the data? For example, if it executes some tests in bursts, the reports are going to be the same and the general picture could be biased. Any more ideas?
Yes this is something that we are well aware of and there is very little that we can do to prevent this.
In the specific bridge reachability test it's not much of a concern, because we are running the probes ourselves, but in general it is an issue.
The types of bad report data are pretty well specified here: https://github.com/TheTorProject/ooni-spec/wiki/Threats#bad-report-data
== Geo Data ==

In the etherpad [1] it's suggested to increase the granularity of the geo data to detect geographical patterns, but it seems [2] that at least in China there are no such patterns, so maybe we should discard the idea altogether.
I think that storing the ASN is more than sufficient for now. We only have 1 probe per country at the moment, so it doesn't even matter yet, but it may be useful in the future when we collect data from more network vantage points.
I don't think we need any more granularity.
== Playing with data ==

So until now I've talked about data. Now I want to address how to present the data. I think we should provide a way to play with the data to allow a more thoughtful and precise diagnosis of censorship. What I was thinking is to enhance the interactivity of the visualization by giving the user a way to render the diagrams while she thinks about the data. The idea is to allow the user to go from more general to more concrete data patterns.

So imagine that the user loads the visualization's page. First he sees a global heat map of censorship measured with the bridge reachability test. He is Chinese, so he clicks on his country and a histogram like [3] for China is stacked at the bottom of the global map. He then clicks on obfs2 and a diagram like [4] is also stacked at the bottom, but showing only the success variable for the obfs2 PT. Then he clicks on the True value of the success variable and all the bridges that have been reached by all the nettest executions in that period of time in China are shown. Finally he selects one bridge, and its timeline [5] plus its link to atlas [6] are provided.

This is only a particular scenario; the core idea is to give the user the enhanced capability to draw conclusions as far as she desires. The user started with the most general view of the data and applied restrictions to the datapoints to dig deeper into it. Going from general to specific, he can start making hypotheses that he later discards or confirms with more info displayed in the next diagram. There are some usability problems with the selection of diagram+variable and the diverse set of users that will use the system, but I'd be very glad to think about them if you like the idea.
Yes, this is a very cool idea, though I think we need to start off by doing the simplest thing we can, and that is a fixed set of static visualizations. I think we first need to make the base visualizations very solid and useful before complicating things further.
Once we have a feeling for how particular views of the data look, we can start making some of the variables pickable by the user.
== Users ==

I think there are three sets of users:

1- User of Tor who is interested in the censorship performed in their country and how to avoid it.
2- Journalist who wants to write something about censorship but isn't that tech savvy.
3- Researcher who wants updated and detailed data about censorship.
I believe we can provide a system that satisfies the three of them if we succeed in the previous bullet point.
I would also add a fourth user:
4- Policy maker that is interested in how censorship is being performed in some countries of their interest.
== Future ==

So, why do I think that we should index the data with mongodb? Because I think that this data repository should be provided as a new ooni-backend API related to the current collector. Right now the collectors can write down reports from any ooni-probe instance that chooses to do so, and the collector API is completely separated from the bouncer API, which overall is a wise design decision because you can deploy ooni-backend to work only as a collector. So it's not unreasonable to think that we can have several collectors collecting different reports, because the backend is designed to do so; therefore we need the data repository to be distributed. And mongodb is good at this. If we build the database for the bridge reachability nettests, I think that we should design it to index all nettest reports in the future, and therefore generalize the design, implementation and deployment of all the work that we are going to do for bridge reachability. That way an analyst can query the distributed database with a proper client that connects to the data repository ooni-backend API.
Yes, I think that we need something along these lines, though I think it is important to separate the process of "Publishing" from that of "Aggregation and indexing".
By publishing I mean how the *raw reports* are exposed to the public. Currently this works as follows: every collector is set up with an rsync process that syncs its archive dir with a central repository of reports. There is then a manual process of me logging into a machine and running a command that generates the static reports/0.1 directory and publishes it to ooni.torproject.org/reports/0.1/.
Ideally the process of rsyncing the reports would be integrated as part of oonibackend and there would be some sort of authentication mechanism with the canonical publisher that will accept reports only from authorized collectors.
I think that we should also separate the reports that are published from the main website, so that we can host it on a machine that has greater capacity and that can also do the aggregation and indexing.
By Aggregation and indexing I mean the process of putting all the reports inside a database and exposing the list of them to the user in a queryable manner.
I think that this step should be done, at least initially, in a centralized fashion, by having just 1 server that is responsible for doing this. We should however pick a technology that makes it possible to scale and decentralize (if we wish to do so in the future).
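To make the "queryable manner" concrete, here is a sketch of what such an endpoint on the aggregation server could look like; Flask, the route and the field names are assumptions of mine, not an existing ooni-backend API:

    from flask import Flask, jsonify, request  # pip install flask
    from pymongo import MongoClient

    app = Flask(__name__)
    reports = MongoClient()["ooni"]["bridge_reachability"]  # placeholder names

    ALLOWED = ("probe_cc", "transport", "pool", "fingerprint")

    @app.route("/api/measurements")
    def measurements():
        # e.g. GET /api/measurements?probe_cc=CN&transport=obfs2
        query = {k: v for k, v in request.args.items() if k in ALLOWED}
        rows = list(reports.find(query, {"_id": False}).limit(1000))
        return jsonify(results=rows)

    if __name__ == "__main__":
        app.run()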
So to sum up, I started talking about the bridge reachability visualization problem and finished with a much broader vision that intends to integrate the ongoing bridge reachability efforts to improve ooni as a whole. Hope the email is not too long.
Thank you very much for the time and thought you put into this email :).
~ Arturo
ciao
[0] https://lists.torproject.org/pipermail/tor-dev/2014-October/007585.html
[1] https://pad.riseup.net/p/bridgereachability
[2] https://blog.torproject.org/blog/closer-look-great-firewall-china
[3] http://ooniviz.chokepointproject.net/transports.htm
[4] http://ooniviz.chokepointproject.net/successes.htm
[5] https://people.torproject.org/~asn/bridget_vis/tbb_blocked_timeline.jpg
[6] https://atlas.torproject.org
[7] https://ooni.torproject.org/