== What is bridge reachability data? ==
By bridge reachability data I'm referring to information about which Tor bridges are censored in different parts of the world.
The OONI project has been developing a test that allows probes in censored countries to test which bridges are blocked and which are not. The test simply takes as input a list of bridges and tests whether they work. It's also able to test obfuscated bridges with various pluggable transports (PTs).
== Why do we care about this bridge reachability data? ==
A few different parties care about the results of the bridge reachability test [0]. Some examples:
Tor developers and censorship researchers can study the bridge reachability data to learn which PTs are currently useful around the world, by seeing which pluggable transports get blocked and where. We can also learn which bridge distribution mechanisms are busted and which are not.
Bridge operators, the press, funders and curious people can learn which countries conduct censorship and how advanced the technology they use is. They can also learn how long it takes jurisdictions to block public bridges. And in general, they can get a better understanding of how well Tor is doing in censorship circumvention around the world.
Finally, censored users and world travelers can use the data to learn which PTs are safe to use in a given jurisdiction.
== Visualizing bridge reachability data ==
So let's look at the data.
Currently, OONI bridge reachability reports look like this: https://ooni.torproject.org/reports/0.1/CN/bridge_reachability-2014-07-02T00... and you can retrieve them from this directory listing: https://ooni.torproject.org/reports/0.1/
That's nice, but I doubt that many people will be able to access (let alone understand) those reports. Hence, we need some kind of visualization (and better dir listing) to conveniently display the data to human beings.
However, a simple x-to-y graph will not suffice: our problem is multidimensional. There are many use cases for the data, and bridges have various characteristics (obfuscation method, distribution method, etc.), hence there is more than one useful way to visualize this dataset.
To give you an idea, I will show you two mockups of visualizations that I would find useful. Please don't pay attention to the data itself, I just made some things up while on a train.
Here is one that shows which PTs are blocked in which countries: https://people.torproject.org/~asn/bridget_vis/countries_pts.jpg The list would only include countries that are blocking at least a bridge. Green is "works", red is "blocked". Also, you can imagine the same visualization, but instead of PT names for columns it has distribution methods ("BridgeDB HTTP distributor", "BridgeDB mail distributor", "Private bridge", etc.).
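As a sketch of the data behind such a matrix (an illustrative structure, not OONI's actual report format), the per-country, per-transport grid could be built like this, keeping only the countries that block at least one bridge, as described above:

```python
# Build a country x transport reachability matrix from test results.
# Input format is illustrative: a list of (country, transport, reachable).
def blocking_matrix(results):
    matrix = {}
    for country, transport, ok in results:
        matrix.setdefault(country, {})[transport] = ok
    # Keep only countries where at least one transport is blocked,
    # matching the mockup's "only countries blocking at least a bridge".
    return {cc: row for cc, row in matrix.items()
            if not all(row.values())}
```

The same function works unchanged if the columns are distribution methods instead of PT names; only the second element of each input tuple changes.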
And here is another one that shows how fast jurisdictions block the default TBB bridges: https://people.torproject.org/~asn/bridget_vis/tbb_blocked_timeline.jpg
These visualizations could be helpful, but they are not the only ones.
What other use cases do you imagine using this dataset for?
What graphs or visualizations would you like to see?
[0]: Here are some use cases:
Tor developers / Researchers:
*** Which pluggable transports are blocked and where?
*** Do they do DPI? Or did they just block the TBB hardcoded bridges?
*** Which jurisdictions are most aggressive and what blocking technology do they use?
*** Do they block based on IP or on (IP && PORT)?
Users:
*** Which pluggable transport should I use in my jurisdiction?
Bridge operators / Press / Funders / Curious people:
*** Which jurisdictions conduct Tor censorship? (block pluggable transports/distribution methods)
*** How quickly do jurisdictions block bridges?
*** How many users/traffic (and which locations) did the blocked bridges serve?
**** Can be found out through extrainfo descriptors.
*** How well are Tor bridges doing in censorship circumvention?
George Kadianakis transcribed 4.1K bytes:
Currently, OONI bridge reachability reports look like this: https://ooni.torproject.org/reports/0.1/CN/bridge_reachability-2014-07-02T00... and you can retrieve them from this directory listing: https://ooni.torproject.org/reports/0.1/
A few concerns:
1. The tests have no control.
I am concerned that the test has no real control. One cannot say, "The experiment is testing if these bridges are reachable from China, and the control is whether or not they are reachable from the US." The problem with that is that there is absolutely no way to determine whether the act of measurement is affecting the data being measured. How do you know that the test isn't causing the bridges to get blocked?
2. This test is attempting to connect simultaneously to multiple bridges with multiple different PT protocols.
That is, this test is doing precisely what we all decided that Tor Browser should *not* do, because the Great Firewall probably can't ask for better filter training material. :(
3. That test still isn't able to reliably start some transports, e.g. fteproxy.
4. The fingerprint should always be in the bridge line; otherwise you've got no proof that you've actually connected to the bridge. :)
5. There is unnecessarily unsafe data in the report output.
BridgeDB sends the bridge descriptors to the Metrics backend, so that Metrics can process them, come up with all the rest of the graphs we have, and put the sanitised data in Onionoo. What if these reports were to contain only data which is public, such as the data which Onionoo currently has?
To play it safe, I would prefer not to have a bunch of bridge fingerprints and ip:ports lying around, on a thousand poorly maintained machines all over the planet. The generated reports could instead output:
* The hashed fingerprint (as is the case for bridges in Onionoo)
* The hashed ip:port
* The transport name
* [true|false|null] for whether the test was successful
This way, the data can be added to the rest of the bridge's data in Onionoo, and all the visualisation/metrics tools which use Onionoo (all of them, I believe) won't need to do anything different. BridgeDB could then get the data from Onionoo.
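The proposed sanitised entry can be sketched like this. The field names and the use of plain SHA-1 are illustrative assumptions, not a fixed schema; note also that the hashed ip:port idea is retracted later in this thread.

```python
# Sketch of one sanitised report entry, per the list above.
# Field names are illustrative; SHA-1 hex is an assumed encoding.
import hashlib

def sanitise_entry(fingerprint, ip, port, transport, success):
    return {
        "hashed_fingerprint": hashlib.sha1(fingerprint.encode()).hexdigest(),
        "hashed_address": hashlib.sha1(f"{ip}:{port}".encode()).hexdigest(),
        "transport": transport,
        # True = reachable, False = blocked, None = test did not run
        "success": success,
    }
```

The point of the exercise is that nothing in the output reveals the raw address or fingerprint directly, so the reports match what is already public in Onionoo.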
6. Your tests would give more accurate data if they didn't use "real" bridges.
I've mentioned this in #ooni on IRC, but for everyone else: To figure out if a PT protocol is blocked, you do not need to use "real" bridges from Tor Browser or BridgeDB. If you (ideally in an automated way) set up a couple of bridges for each protocol, this would:
* Reduce the number of test inputs, making test runs complete faster and use less memory.
* Eliminate the potential to get "real" bridges blocked through testing.
* Test both sides of the connection, thus reducing false negatives.
* Allow us to more accurately control variables while attempting to determine if a PT protocol is blocked by a certain country.
Here is one that shows which PTs are blocked in which countries: https://people.torproject.org/~asn/bridget_vis/countries_pts.jpg The list would only include countries that are blocking at least a bridge. Green is "works", red is "blocked". Also, you can imagine the same visualization, but instead of PT names for columns it has distribution methods ("BridgeDB HTTP distributor", "BridgeDB mail distributor", "Private bridge", etc.).
To be honest, I don't care which pool. Also, that data is already publicly available in Onionoo (or deducible via its lack of availability).
And here is another one that shows how fast jurisdictions block the default TBB bridges: https://people.torproject.org/~asn/bridget_vis/tbb_blocked_timeline.jpg
Neat idea!
These visualizations could be helpful, but they are not the only ones.
What other use cases do you imagine using this dataset for?
In order to better hand out bridges, it would be quite excellent if BridgeDB could someday have something like:
{
  hashed_bridge_address: SHA1('IP:PORT'),
  hashed_bridge_fingerprint: SHA1('FINGERPRINT'),
  pt_method: PT_METHOD|'vanilla',
  regions: {
    ...,
    BR: { reachable: false, since: TIMESTAMP_WHEN_IT_FIRST_BECAME_UNREACHABLE },
    ...,
    CA: { reachable: true, since: TIMESTAMP_WHEN_IT_FIRST_BECAME_REACHABLE },
    CN: { reachable: false, since: TIMESTAMP_WHEN_IT_FIRST_BECAME_UNREACHABLE },
    ...,
  },
},
...,
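A minimal sketch of building one such record from a test run, under the assumption (made up here for illustration) that a run delivers a per-region dict of reachability booleans. The schema and helper name are hypothetical, mirroring the structure above.

```python
# Sketch: assemble the per-region reachability record BridgeDB might keep.
import hashlib
import time

def make_record(ip_port, fingerprint, pt_method, region_results):
    """region_results: e.g. {'CN': False, 'CA': True} from one test run."""
    now = int(time.time())
    return {
        "hashed_bridge_address": hashlib.sha1(ip_port.encode()).hexdigest(),
        "hashed_bridge_fingerprint": hashlib.sha1(fingerprint.encode()).hexdigest(),
        "pt_method": pt_method or "vanilla",
        "regions": {cc: {"reachable": ok, "since": now}
                    for cc, ok in region_results.items()},
    }
```

In a real implementation the `since` timestamp would only be updated when the reachability value flips, not on every run; that bookkeeping is omitted here.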
isis transcribed 6.6K bytes:
- The hashed fingerprint (as is the case for bridges in onionoo)
- The hashed ip:port
Actually, my apologies, I was quite tired when I wrote this and totally completely wrong.
A hashed ip:port would be a terrible idea because IPv4 space is only 2^32 and ports are 2^16. In total that's a 2^48 message space. Hashing for a preimage to get the bridge addresses is quite feasible within those constraints, and the attack can also be precomputed offline.
We should come up with a different way to hide ip:ports.
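A back-of-the-envelope check of the 2^48 claim. The hash rate below is an assumed, conservative single-core figure, purely for illustration; real attackers with GPUs or precomputed tables do far better, and the effective search space is much smaller than 2^48 since bridges use routable addresses and a handful of common ports.

```python
# IPv4 space (2^32) times port space (2^16) = full ip:port message space.
space = 2**32 * 2**16
assert space == 2**48

# At an assumed 10 million hashes/second on a single core, exhausting
# the space takes under a year; the search parallelises trivially.
hashes_per_second = 10**7
seconds = space / hashes_per_second
years = seconds / (365 * 24 * 3600)
print(f"exhaustive search: {years:.1f} years on one core")
```

This is why an unkeyed hash offers essentially no protection here, whereas a keyed construction (as CollecTor uses, discussed below in the thread) or randomized encryption does.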
On 24/10/14 01:53, isis wrote:
isis transcribed 6.6K bytes:
- The hashed fingerprint (as is the case for bridges in onionoo)
- The hashed ip:port
Actually, my apologies, I was quite tired when I wrote this and totally completely wrong.
A hashed ip:port would be a terrible idea because IPv4 space is only 2^32 and ports are 2^16. In total that's a 2^48 message space. Hashing for a preimage to get the bridge addresses is quite feasible within those constraints, and the attack can also be precomputed offline.
We should come up with a different way to hide ip:ports.
I'm lacking context, but just in case this is even remotely relevant, here's how CollecTor sanitizes bridge IP addresses:
https://collector.torproject.org/formats.html#bridge-descriptors
All the best, Karsten
On Sat, Oct 25, 2014 at 01:01:52PM +0200, Karsten Loesing wrote:
On 24/10/14 01:53, isis wrote:
isis transcribed 6.6K bytes:
- The hashed fingerprint (as is the case for bridges in onionoo)
- The hashed ip:port
Actually, my apologies, I was quite tired when I wrote this and totally completely wrong.
A hashed ip:port would be a terrible idea because IPv4 space is only 2^32 and ports are 2^16. In total that's a 2^48 message space. Hashing for a preimage to get the bridge addresses is quite feasible within those constraints, and the attack can also be precomputed offline.
We should come up with a different way to hide ip:ports.
I'm lacking context, but just in case this is even remotely relevant, here's how CollecTor sanitizes bridge IP addresses:
https://collector.torproject.org/formats.html#bridge-descriptors
Hey Karsten,
Yes, this is very relevant, thanks! Currently our plan involves keying the JSON dataset using unsanitized "IP Address:port" internally and the sanitized public version will replace this key with H(H(fingerprint)). This seems like the easiest way to avoid the problem of leaking the IP address.
At this point, we don't think we need an IP address in the resulting dataset, so a unique, linkable fingerprint seems sufficient. If we find that IP addresses are useful, then CollecTor's algorithm seems like a good starting point.
- Matt
On Sat, Oct 25, 2014 at 11:26:50AM +0000, Matthew Finkel wrote:
On Sat, Oct 25, 2014 at 01:01:52PM +0200, Karsten Loesing wrote:
On 24/10/14 01:53, isis wrote:
isis transcribed 6.6K bytes:
- The hashed fingerprint (as is the case for bridges in onionoo)
- The hashed ip:port
Actually, my apologies, I was quite tired when I wrote this and totally completely wrong.
A hashed ip:port would be a terrible idea because IPv4 space is only 2^32 and ports are 2^16. In total that's a 2^48 message space. Hashing for a preimage to get the bridge addresses is quite feasible within those constraints, and the attack can also be precomputed offline.
We should come up with a different way to hide ip:ports.
I'm lacking context, but just in case this is even remotely relevant, here's how CollecTor sanitizes bridge IP addresses:
https://collector.torproject.org/formats.html#bridge-descriptors
Hey Karsten,
Yes, this is very relevant, thanks! Currently our plan involves keying the JSON dataset using unsanitized "IP Address:port" internally and the sanitized public version will replace this key with H(H(fingerprint)). This seems like the easiest way to avoid the problem of leaking the IP address.
Whoops, that should be H(fingerprint), nothing special. Sorry, I got a little hashing happy.
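Pulling the corrected scheme together, a minimal sketch: reports are keyed internally by the raw "IP:port", and the published version replaces that key with H(fingerprint). Plain hex SHA-1 is assumed here; Onionoo's exact hashed-fingerprint encoding may differ, and the helper name is hypothetical.

```python
# Sketch: replace internal 'IP:port' keys with H(fingerprint) on publish.
import hashlib

def publish(internal_reports, fingerprints):
    """internal_reports: {'1.2.3.4:443': {...}, ...};
    fingerprints maps the same address keys to bridge identity fingerprints."""
    return {
        hashlib.sha1(fingerprints[addr].encode()).hexdigest(): report
        for addr, report in internal_reports.items()
    }
```

The published dataset then carries no raw address at all, while anyone holding the fingerprint-to-address mapping (BridgeDB, Metrics) can still link each record back to its bridge.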
Matthew Finkel transcribed 1.6K bytes:
On Sat, Oct 25, 2014 at 01:01:52PM +0200, Karsten Loesing wrote:
On 24/10/14 01:53, isis wrote:
isis transcribed 6.6K bytes:
- The hashed fingerprint (as is the case for bridges in onionoo)
- The hashed ip:port
Actually, my apologies, I was quite tired when I wrote this and totally completely wrong.
A hashed ip:port would be a terrible idea because IPv4 space is only 2^32 and ports are 2^16. In total that's a 2^48 message space. Hashing for a preimage to get the bridge addresses is quite feasible within those constraints, and the attack can also be precomputed offline.
We should come up with a different way to hide ip:ports.
I'm lacking context, but just in case this is even remotely relevant, here's how CollecTor sanitizes bridge IP addresses:
https://collector.torproject.org/formats.html#bridge-descriptors
Yes, this is very relevant, thanks! Currently our plan involves keying the JSON dataset using unsanitized "IP Address:port" internally and the sanitized public version will replace this key with H(H(fingerprint)). This seems like the easiest way to avoid the problem of leaking the IP address.
At this point, we don't think we need an IP address in the resulting dataset, so a unique, linkable fingerprint seems sufficient. If we find that IP addresses are useful, then CollecTor's algorithm seems like a good starting point.
I agree that we could probably do without any IP:port information in the resulting reports. The hashed fingerprint is enough for BridgeDB to deduce a bridge's IP:ports; it should also be enough for Metrics to deduce which bridge a particular set of additional reachability information concerns, without needing to do any additional processing of either the IP:ports or the fingerprints.
With respect to CollecTor's algorithms for sanitising bridge IP:ports (should we decide to instead keep the bridge address information in OONI's bridge reachability reports and wish to sanitise those reports), Robert Ransom spoke with me on the 24th of October, and made the following points and suggestions:
Robert Ransom transcribed 1.0K bytes:
The Metrics system currently sanitizes bridge TCP addresses (IP+port) by HMACing them with a secret key stored on the server. That won't work for the reachability testing system for two reasons:
- The reachability-testing bridge clients should not know the key needed to obfuscate TCP (or UDP, or other) addresses deterministically. (A deterministic public-key encryption would be just as bad as a hash.)
- BridgeDB must be able to learn the address for which a bridge's reachability test was performed, so that it can decide whether the reachability-test results are valid for the bridge's current address.
I would suggest that the reachability-testing bridge client report a (randomized) public-key encryption of the address, where the decryption key is held by BridgeDB (so it can check whether the reachability test is relevant to the current ‘Bridge line’) and the Metrics sanitization server (so it can compute and publish a deterministically sanitized address, following the current sanitization procedure).
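For contrast, the existing Metrics-style sanitisation mentioned above can be sketched as a keyed HMAC over the address. This illustrates why it cannot be pushed out to probes: the token is deterministic only for holders of the secret key, so shipping the key to reachability-test clients would let any compromised probe de-anonymise addresses, while withholding it leaves probes unable to produce linkable tokens; hence the randomized public-key approach suggested here.

```python
# Sketch of deterministic keyed sanitisation (HMAC over the address),
# in the spirit of what the Metrics system does; key and encoding are
# illustrative assumptions, not CollecTor's actual parameters.
import hashlib
import hmac

def sanitise_address(secret_key: bytes, ip_port: str) -> str:
    """Same key + same address always yields the same token."""
    return hmac.new(secret_key, ip_port.encode(), hashlib.sha256).hexdigest()
```

Because the output is stable under a fixed key, the same bridge maps to the same sanitised token across reports, which is exactly the linkability Metrics needs and exactly the property a keyless probe cannot provide.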