Hi all,

For the last couple of days I've been thinking about the visualization of the bridge reachability data and how it relates to the currently deployed ooni [7] system. Here are the conclusions:
== Variables ==
I think that the statistical variables for the bridge reachability reports are:
- Success of the nettest (yes, no, errors)
- The PT of the bridge (obfs3, obfs2, fte, vanilla)
- The pool from which the bridge has been extracted (private, tbb, BridgeDB https, BridgeDB email)
- The country "of the ooni-probe"

With these variables I believe we can answer a lot of questions about how much censorship is taking place, where and how.
But there's something left: the timing. George sent an email [0] in which he proposes a timeline of events [5] for every bridge, which would allow us to diagnose with much more precision how and why a bridge is being censored. To build that diagram we should first define the events that will be shown in the timeline. I think those events are the values of the pool variable and whether the bridge is being blocked in a given country. With the events defined I think we can define another variable:
- Time deltas between bridge events.

So, for example, this variable will answer questions like: how many {days, hours...} does it take China to block a bridge that is published in BridgeDB? Is China blocking new bridges at the same speed as Iran? How many days does it take China to block a private bridge? There are some ambiguities related to the deltas. For example, if a bridge is sometimes blocked and sometimes not in a country, which delta should we compute?
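A minimal sketch of how such a delta could be computed, assuming a per-bridge event log of (timestamp, event) pairs. The event names and the sample data are hypothetical, and taking the first block after publication is only one possible way to resolve the ambiguity for bridges that flap between blocked and reachable:

```python
from datetime import datetime

# Hypothetical event log for one bridge: (timestamp, event) pairs.
# The event names are illustrative, not an ooni report format.
events = [
    (datetime(2014, 10, 1, 12, 0), "published:bridgedb_https"),
    (datetime(2014, 10, 2, 9, 0), "reachable:CN"),
    (datetime(2014, 10, 4, 18, 0), "blocked:CN"),
    (datetime(2014, 10, 5, 18, 0), "reachable:CN"),
    (datetime(2014, 10, 6, 6, 0), "blocked:CN"),
]

def time_to_first_block(events, pool, country):
    """Delta between publication in `pool` and the first block seen in `country`.

    Returns None if the bridge was never published in that pool or never
    observed as blocked afterwards.
    """
    published = min((t for t, e in events if e == f"published:{pool}"), default=None)
    if published is None:
        return None
    blocked = min(
        (t for t, e in events if e == f"blocked:{country}" and t >= published),
        default=None,
    )
    return None if blocked is None else blocked - published

delta = time_to_first_block(events, "bridgedb_https", "CN")
print(delta)  # 3 days, 6:00:00
```

Other choices (last block, or the fraction of time spent blocked) would need the same event log but a different reduction.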
Finally, in the etherpad [1] Tor's bootstrap is suggested as a variable, and I don't understand why. Is it meant to detect some form of censorship? Can anyone explain a little more?
== Data schema ==
In the last email Ruben, Laurier and Pascal "strongly recommended importing the reports into a database". I deeply believe the same. We should provide a service to query the values of the previous variables plus the timestamp of the nettest and the fingerprint of the bridge. With this database the inconsistencies between the data formats of the reports should disappear, and working with the data becomes much easier. I think that we should also provide a way to export the queries to csv/json to allow other people to dig into the data. I also believe that we could use mongodb for just one reason: we can distribute it very easily. But let me explain why in the Future section.
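As a rough sketch of the normalization and export step, assuming raw reports arrive as dictionaries with slightly different key names. All field names here (`start_time`, `probe_cc`, and so on) and the sample records are assumptions made for illustration, not the actual report format:

```python
import csv
import io
import json

# Common fields: the variables listed above plus timestamp and fingerprint.
FIELDS = ["timestamp", "fingerprint", "transport", "pool", "country", "success"]

def normalize(raw):
    """Map one raw report (hypothetical key names) onto the common fields."""
    return {
        "timestamp": raw.get("timestamp") or raw.get("start_time"),
        "fingerprint": raw["fingerprint"],
        "transport": raw.get("transport", "vanilla"),
        "pool": raw.get("pool", "unknown"),
        "country": (raw.get("country") or raw.get("probe_cc") or "ZZ").upper(),
        "success": raw.get("success"),  # True, False, or None for errors
    }

def to_csv(rows):
    """Serialize normalized rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """Serialize normalized rows as a JSON array."""
    return json.dumps(rows, indent=2)

# Two made-up reports that use different key names for the same information.
raw_reports = [
    {"start_time": "2014-10-16T19:59:00Z", "fingerprint": "bridge-a",
     "transport": "obfs3", "pool": "tbb", "probe_cc": "cn", "success": False},
    {"timestamp": "2014-10-16T20:05:00Z", "fingerprint": "bridge-b",
     "country": "IR", "success": True},
]
rows = [normalize(r) for r in raw_reports]
print(to_csv(rows))
```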
== Biased data ==
Can a malicious ooni-probe bias the data? For example, if it runs some tests in bursts, the reports are going to be the same and the general picture could be biased. Any more ideas?
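One possible mitigation for accidental or naive bursts is to collapse repeated reports from the same probe about the same bridge within a time window. The sketch below assumes each report carries a `probe_id`, a `fingerprint` and a `time`. These names are hypothetical, and a determined attacker could simply vary the probe identifier, so this does not address malicious probes in general:

```python
from datetime import datetime, timedelta

def collapse_bursts(reports, window=timedelta(hours=1)):
    """Keep at most one report per (probe, bridge) pair within each `window`."""
    last_kept = {}  # (probe_id, fingerprint) -> time of the last kept report
    kept = []
    for report in sorted(reports, key=lambda r: r["time"]):
        key = (report["probe_id"], report["fingerprint"])
        previous = last_kept.get(key)
        if previous is None or report["time"] - previous >= window:
            kept.append(report)
            last_kept[key] = report["time"]
    return kept

# Made-up reports: probe p1 sends a burst of three within 20 minutes.
t0 = datetime(2014, 10, 16, 12, 0)
reports = [
    {"probe_id": "p1", "fingerprint": "bridge-a", "time": t0},
    {"probe_id": "p1", "fingerprint": "bridge-a", "time": t0 + timedelta(minutes=10)},
    {"probe_id": "p1", "fingerprint": "bridge-a", "time": t0 + timedelta(minutes=20)},
    {"probe_id": "p2", "fingerprint": "bridge-a", "time": t0 + timedelta(minutes=5)},
    {"probe_id": "p1", "fingerprint": "bridge-a", "time": t0 + timedelta(hours=2)},
]
deduplicated = collapse_bursts(reports)
print(len(deduplicated))  # 3
```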
== Geo Data ==
In the etherpad [1] it's suggested to increase the granularity of the geo data to detect geographical patterns, but it seems [2] that at least in China there are no such patterns, so maybe we should discard the idea altogether.
== Playing with data ==
So far I've talked about the data itself. Now I want to address how to present it. I think we should provide a way to play with the data, to allow a more thoughtful and precise diagnosis of censorship. What I have in mind is to make the visualization more interactive by letting users render diagrams while they think about the data. The idea is to let the user go from more general to more concrete data patterns.

So imagine that a user loads the visualization page. First they see a global heat map of censorship as measured by the bridge reachability test. They are Chinese, so they click on their country, and a histogram like [3] for China is stacked below the global map. They then click on obfs2, and a diagram like [4] is also stacked at the bottom, showing only the success variable for the obfs2 PT. Next they click on the True value of the success variable, and all the bridges that were reached by the nettest executions in China during that period are shown. Finally they select one bridge, and its timeline [5] plus its link to atlas [6] are provided.

This is only one particular scenario. The core idea is to give the user the ability to draw conclusions in as much depth as they want. The user started with the most general view of the data and applied restrictions to the data points to dig deeper. Going from general to specific, they can form hypotheses and later discard or confirm them with the additional information displayed in the next diagram. There are some usability problems with the selection of diagram+variable and with the diverse set of users that will use the system, but I'd be very glad to think about them if you like the idea.
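The drill-down in this scenario amounts to alternating two operations: restrict the data points, then break the remainder down by one variable. A minimal sketch, with made-up measurements and hypothetical field names:

```python
from collections import Counter

# Hypothetical measurements; field names and values are illustrative only.
measurements = [
    {"country": "CN", "transport": "obfs2", "success": True, "fingerprint": "bridge-a"},
    {"country": "CN", "transport": "obfs2", "success": False, "fingerprint": "bridge-b"},
    {"country": "CN", "transport": "obfs3", "success": True, "fingerprint": "bridge-c"},
    {"country": "IR", "transport": "obfs2", "success": True, "fingerprint": "bridge-a"},
]

def restrict(rows, **criteria):
    """Keep only the rows whose fields match every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]

def breakdown(rows, field):
    """Count rows per value of `field`; this is the data behind one diagram."""
    return Counter(r[field] for r in rows)

by_country = breakdown(measurements, "country")   # global map
china = restrict(measurements, country="CN")      # user clicks on China
by_transport = breakdown(china, "transport")      # histogram like [3]
obfs2 = restrict(china, transport="obfs2")        # user clicks on obfs2
by_success = breakdown(obfs2, "success")          # diagram like [4]
reached = sorted({r["fingerprint"] for r in restrict(obfs2, success=True)})
print(reached)  # ['bridge-a']
```

Each stacked diagram corresponds to one `breakdown` call on the current restriction, so the selections a user has made so far fully describe the view.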
== Users ==
I think there are three sets of users:
1- Tor users who are interested in the censorship performed in their country and how to avoid it.
2- Journalists who want to write something about censorship but aren't that tech savvy.
3- Researchers who want updated and detailed data about censorship.
I believe we can provide a system that satisfies the three of them if we succeed in the previous bullet point.
== Future ==
So, why do I think that we should index the data with mongodb? Because I think that this data repository should be provided as a new ooni-backend API related to the current collector. Right now the collectors can write down reports from any ooni-probe instance that chooses to submit them, and the collector API is completely separate from the bouncer API, which overall is a wise design decision because you can deploy ooni-backend to work only as a collector. So it's not unreasonable to think that we can have several collectors collecting different reports, because the backend is designed to do so; therefore we need the data repository to be distributed. And mongodb is good at this. If we build the database for the bridge reachability nettests, I think we should design it so that in the future it can index all nettest reports, and thereby generalize the design, implementation and deployment work that we are going to do for bridge reachability. That way an analyst can query the distributed database with a proper client that connects to the data repository ooni-backend API.
So to sum up, I started talking about the bridge reachability visualization problem and finished with a much broader vision that intends to integrate the ongoing bridge reachability efforts to improve ooni as a whole. Hope the email is not too long.

ciao
[0] https://lists.torproject.org/pipermail/tor-dev/2014-October/007585.html
[1] https://pad.riseup.net/p/bridgereachability
[2] https://blog.torproject.org/blog/closer-look-great-firewall-china
[3] http://ooniviz.chokepointproject.net/transports.htm
[4] http://ooniviz.chokepointproject.net/successes.htm
[5] https://people.torproject.org/~asn/bridget_vis/tbb_blocked_timeline.jpg
[6] https://atlas.torproject.org
[7] https://ooni.torproject.org/
Hi.
I don't think the reasons you listed make Mongodb a good choice:
1. distributed - No, Mongodb is not distributed... at all. All writes must be performed on the master node... and then they are serialized before being replicated to slave nodes.
2. scales - No, Mongodb does not scale. (I shouldn't have to explain why because... see #1)
Furthermore, Mongodb has nested-hash-like column family schemas, which absolutely do not support arbitrarily complex queries that you did not conceive of at the time the schema was created. Therefore, when using Mongodb and other NoSQL datastores, you really have to be aware of the types of query patterns you are likely to use before creating the schema.
Sincerely,
David Stainton
On 18/10/14 07:14, David Stainton wrote:
Hi.
Hi
I don't think the reasons you listed make Mongodb a good choice:
1. distributed - No, Mongodb is not distributed... at all. All writes must be performed on the master node... and then they are serialized before being replicated to slave nodes.
2. scales - No, Mongodb does not scale. (I shouldn't have to explain why because... see #1)
To be honest, when I wrote the email I knew that mongoDB provided sharding, which should give horizontal scaling, but I didn't know how it worked because I hadn't had time to dig through the docs. Now, after learning a bit more about mongodb, I still don't know if I know :) but I agree with you, this is not distributed. So, I think we should index the reports to provide a query API; this still applies. But should we build a distributed datastore that fits every deployed collector? Or a central repository that grabs the reports from the collectors and indexes them? Should we care at all?
Furthermore, Mongodb has nested-hash-like column family schemas, which absolutely do not support arbitrarily complex queries that you did not conceive of at the time the schema was created. Therefore, when using Mongodb and other NoSQL datastores, you really have to be aware of the types of query patterns you are likely to use before creating the schema.
Sincerely,
David Stainton
ciao
To be honest, when I wrote the email I knew that mongoDB provided sharding, which should give horizontal scaling, but I didn't know how it worked because I hadn't had time to dig through the docs. Now, after learning a bit more about mongodb, I still don't know if I know :) but I agree with you, this is not distributed.
ah interesting mongodb has built in sharding: http://docs.mongodb.org/manual/core/sharding-introduction/
perhaps you are correct about mongo db in that it does seem like it would scale well.
however we have to carefully evaluate several more criteria before choosing a data store. for instance operational costs should always be evaluated:
Is it a pain to set up? (sharded mongo db seems heavyweight!)
Is it a pain to add a new replica to a replica set?
How are additional shards added?
Does balancing the cluster after adding additional shards kill performance and take a long time? (most likely yes)
So, I think we should index the reports to provide a query API; this still applies. But should we build a distributed datastore that fits every deployed collector? Or a central repository that grabs the reports from the collectors and indexes them? Should we care at all?
Yes... "indexed" reports sound much easier to work with than just the reports... however it is not yet clear that we really need the datastore to be distributed. High availability, or mostly high availability, might be a requirement for this project. That is much easier to accomplish!
OK... so if we go with one of these CF (column family) data stores... then we must keep in mind the types of queries we will need when creating the schema. Another possibility would be Redis. It supports a set-theoretic query language... Also I've heard good things about CouchDB. I think we should look at these different datastore possibilities and discuss potential schema and query design for our project. I suspect a discussion of schema and query patterns will be more useful than discussing operational properties especially if a centralized datastore is good enough.
cheers,
david
On 10/16/14, 7:59 PM, kudrom wrote:
Hi all,

For the last couple of days I've been thinking about the visualization of the bridge reachability data and how it relates to the currently deployed ooni [7] system. Here are the conclusions:
Hi Kudrom,
Thanks for taking the time to compose this very detailed email!
I will add a bit of comments here and there. I will also create a wiki page to contain all of this information so that it doesn't get lost and can be updated as we think more about it during the hackathon.
== Variables ==
I think that the statistical variables for the bridge reachability reports are:
- Success of the nettest (yes, no, errors)
- The PT of the bridge (obfs3, obfs2, fte, vanilla)
- The pool from which the bridge has been extracted (private, tbb, BridgeDB https, BridgeDB email)
- The country "of the ooni-probe"
I would also add:
- The time of the measurement
- How long it took for Tor to reach 100% bootstrapping
- The result of the corresponding tcp_connect test
With these variables I believe we can answer a lot of questions about how much censorship is taking place, where and how.
But there's something left: the timing. George sent an email [0] in which he proposes a timeline of events [5] for every bridge, which would allow us to diagnose with much more precision how and why a bridge is being censored. To build that diagram we should first define the events that will be shown in the timeline. I think those events are the values of the pool variable and whether the bridge is being blocked in a given country.
With the events defined i think we can define another variable:
- Time deltas between bridge events.
So, for example, this variable will answer questions like: how many {days, hours...} does it take China to block a bridge that is published in BridgeDB? Is China blocking new bridges at the same speed as Iran? How many days does it take China to block a private bridge? There are some ambiguities related to the deltas. For example, if a bridge is sometimes blocked and sometimes not in a country, which delta should we compute?
To include deltas into the picture I think we first need to define what the base reference points are. Some things that come to mind are:
* Dates in which a new bridge was added/removed from TBB
* Dates in which a bridge was added to bridgeDB
Other?
Finally, in the etherpad [1] Tor's bootstrap is suggested as a variable, and I don't understand why. Is it meant to detect some form of censorship? Can anyone explain a little more?
I think this is useful more for debugging purposes, so it's probably not something that we want to be part of the core of the visualizations.
Generally when a bridge is blocked it will fail at ~50% with the progress tag of "Loading relay descriptors". If that does not happen there may be some other reason for it turning out as blocked (some bugs in ooniprobe, obfsproxy, txtorcon, tor).
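A hedged sketch of how this rule of thumb could be applied when labelling results. The function name, its arguments, and the 45-55% band used to approximate "~50%" are assumptions for illustration and are not taken from ooniprobe:

```python
def classify(success, bootstrap_percent, progress_tag):
    """Label one bridge reachability result using the rule of thumb above.

    `bootstrap_percent` is the last bootstrap percentage reported by Tor and
    `progress_tag` the last progress message seen. The 45-55 band is an
    assumed approximation of "~50%".
    """
    if success:
        return "reachable"
    if 45 <= bootstrap_percent <= 55 and progress_tag == "Loading relay descriptors":
        return "probably blocked"
    # Failing elsewhere may point at a bug in ooniprobe, obfsproxy,
    # txtorcon or tor rather than at censorship.
    return "failed for another reason"

print(classify(False, 50, "Loading relay descriptors"))  # probably blocked
```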
== Data schema ==
In the last email Ruben, Laurier and Pascal "strongly recommended importing the reports into a database". I deeply believe the same. We should provide a service to query the values of the previous variables plus the timestamp of the nettest and the fingerprint of the bridge. With this database the inconsistencies between the data formats of the reports should disappear, and working with the data becomes much easier. I think that we should also provide a way to export the queries to csv/json to allow other people to dig into the data. I also believe that we could use mongodb for just one reason: we can distribute it very easily. But let me explain why in the Future section.
I will go into a bit more detail about this in the other part of the thread, but mongodb was picked because it seemed to have a nice python interface, it is easy to set up, and it has the sorts of features that we need from a NoSQL database.
I should, though, point out that I don't have that much experience with or knowledge of so-called NoSQL databases, so I am very open to suggestions and to switching to another solution. The main feature that I want is to keep it schema-less (hence the NoSQL solution), because the ooniprobe report formats vary greatly depending on the type of test, and I don't want to have to migrate databases (or create a new table) every time a new test is added.
I have imported the bridge reachability data into mongodb, but it's quite easy to adapt the scripts used to import it into another similar database solution.
== Biased data ==
Can a malicious ooni-probe bias the data? For example, if it runs some tests in bursts, the reports are going to be the same and the general picture could be biased. Any more ideas?
Yes this is something that we are well aware of and there is very little that we can do to prevent this.
In the specific bridge reachability test it's not much of a concern, because we are running the probes ourselves, but in general it is an issue.
The types of bad report data are pretty well specified here: https://github.com/TheTorProject/ooni-spec/wiki/Threats#bad-report-data
== Geo Data ==
In the etherpad [1] it's suggested to increase the granularity of the geo data to detect geographical patterns, but it seems [2] that at least in China there are no such patterns, so maybe we should discard the idea altogether.
I think that storing the ASN is more than sufficient for now. We currently only have 1 probe per country so that currently doesn't even matter, but it may be useful in the future when we collect data from more network vantage points.
I don't think we need any more granularity.
== Playing with data ==
So far I've talked about the data itself. Now I want to address how to present it. I think we should provide a way to play with the data, to allow a more thoughtful and precise diagnosis of censorship. What I have in mind is to make the visualization more interactive by letting users render diagrams while they think about the data. The idea is to let the user go from more general to more concrete data patterns.

So imagine that a user loads the visualization page. First they see a global heat map of censorship as measured by the bridge reachability test. They are Chinese, so they click on their country, and a histogram like [3] for China is stacked below the global map. They then click on obfs2, and a diagram like [4] is also stacked at the bottom, showing only the success variable for the obfs2 PT. Next they click on the True value of the success variable, and all the bridges that were reached by the nettest executions in China during that period are shown. Finally they select one bridge, and its timeline [5] plus its link to atlas [6] are provided.

This is only one particular scenario. The core idea is to give the user the ability to draw conclusions in as much depth as they want. The user started with the most general view of the data and applied restrictions to the data points to dig deeper. Going from general to specific, they can form hypotheses and later discard or confirm them with the additional information displayed in the next diagram. There are some usability problems with the selection of diagram+variable and with the diverse set of users that will use the system, but I'd be very glad to think about them if you like the idea.
Yes, this is a very cool idea, though I think we need to start off by doing the simplest thing we can, and that is a fixed set of static visualizations. I think we first need to make the base visualizations very solid and useful before complicating things further.
Once we have a feeling for what some particular views of the data look like, we can start making some of the variables pickable by the user.
== Users ==
I think there are three sets of users:
1- Tor users who are interested in the censorship performed in their country and how to avoid it.
2- Journalists who want to write something about censorship but aren't that tech savvy.
3- Researchers who want updated and detailed data about censorship.
I believe we can provide a system that satisfies the three of them if we succeed in the previous bullet point.
I would also add a fourth user:
4- Policy maker that is interested in how censorship is being performed in some countries of their interest.
== Future ==
So, why do I think that we should index the data with mongodb? Because I think that this data repository should be provided as a new ooni-backend API related to the current collector. Right now the collectors can write down reports from any ooni-probe instance that chooses to submit them, and the collector API is completely separate from the bouncer API, which overall is a wise design decision because you can deploy ooni-backend to work only as a collector. So it's not unreasonable to think that we can have several collectors collecting different reports, because the backend is designed to do so; therefore we need the data repository to be distributed. And mongodb is good at this. If we build the database for the bridge reachability nettests, I think we should design it so that in the future it can index all nettest reports, and thereby generalize the design, implementation and deployment work that we are going to do for bridge reachability. That way an analyst can query the distributed database with a proper client that connects to the data repository ooni-backend API.
Yes, I think that we need something along these lines, though I think it is important to separate the process of "Publishing" from that of "Aggregation and indexing".
By publishing I mean how the *raw reports* are exposed to the public. Currently, every collector is set up with an rsync process that syncs its archive dir with a central repository of reports. There is then a manual process of me logging into a machine and running a command that generates the static reports/0.1 directory and publishes it to ooni.torproject.org/reports/0.1/.
Ideally the process of rsyncing the reports would be integrated as part of oonibackend and there would be some sort of authentication mechanism with the canonical publisher that will accept reports only from authorized collectors.
I think that we should also separate the reports that are published from the main website, so that we can host it on a machine that has greater capacity and that can also do the aggregation and indexing.
By Aggregation and indexing I mean the process of putting all the reports inside a database and exposing the list of them to the user in a queryable manner.
I think that this step should be done, at least initially, in a centralized fashion, by having just 1 server that is responsible for doing this. We should however pick a technology that makes it possible to scale and decentralize (if we wish to do so in the future).
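To make the centralized "aggregation and indexing" step concrete, here is a minimal sketch that keeps each report as an opaque JSON document and indexes only a few common fields. SQLite is used purely as a self-contained stand-in so the example runs with the standard library; it is not one of the datastores proposed in this thread, and the field names (`test_name`, `probe_cc`, `start_time`) are assumptions:

```python
import json
import sqlite3

# In-memory database as a stand-in for the single aggregation server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE reports (
           id INTEGER PRIMARY KEY,
           test_name TEXT NOT NULL,
           country TEXT,
           start_time TEXT,
           body TEXT NOT NULL  -- full report as JSON, whatever its shape
       )"""
)
conn.execute("CREATE INDEX idx_reports ON reports (test_name, country, start_time)")

def index_report(conn, report):
    """Store the whole report and copy out the few fields we query on."""
    conn.execute(
        "INSERT INTO reports (test_name, country, start_time, body) VALUES (?, ?, ?, ?)",
        (report["test_name"], report.get("probe_cc"), report.get("start_time"),
         json.dumps(report)),
    )

def find(conn, test_name, country):
    """Return the full reports for one test and country, oldest first."""
    cur = conn.execute(
        "SELECT body FROM reports WHERE test_name = ? AND country = ? ORDER BY start_time",
        (test_name, country),
    )
    return [json.loads(body) for (body,) in cur]

# Two made-up reports with different test-specific fields.
index_report(conn, {"test_name": "bridge_reachability", "probe_cc": "CN",
                    "start_time": "2014-10-16T19:59:00Z", "success": False})
index_report(conn, {"test_name": "tcp_connect", "probe_cc": "CN",
                    "start_time": "2014-10-16T20:01:00Z", "errors": []})
results = find(conn, "bridge_reachability", "CN")
print(results)
```

Because test-specific fields live only in the JSON body, adding a new test type needs no migration, which is the property asked for earlier in the thread; the trade-off is that only the copied-out columns can be queried efficiently.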
So to sum up, I started talking about the bridge reachability visualization problem and finished with a much broader vision that intends to integrate the ongoing bridge reachability efforts to improve ooni as a whole. Hope the email is not too long.
Thank you very much for the time and thought you put into this email :).
~ Arturo
ciao
[0] https://lists.torproject.org/pipermail/tor-dev/2014-October/007585.html
[1] https://pad.riseup.net/p/bridgereachability
[2] https://blog.torproject.org/blog/closer-look-great-firewall-china
[3] http://ooniviz.chokepointproject.net/transports.htm
[4] http://ooniviz.chokepointproject.net/successes.htm
[5] https://people.torproject.org/~asn/bridget_vis/tbb_blocked_timeline.jpg
[6] https://atlas.torproject.org
[7] https://ooni.torproject.org/
_______________________________________________
ooni-dev mailing list
ooni-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/ooni-dev