Hi everybody,
we're discussing in #5684 whether we can stop sanitizing nicknames in the bridge descriptors that we publish here:
https://metrics.torproject.org/data.html#bridgedesc
The sanitizing process is described here:
https://metrics.torproject.org/formats.html#bridgedesc
When we started making sanitized bridge descriptors available on the metrics website we replaced all contained nicknames with "Unnamed". The reason was that "bridge nicknames might give hints on the location of the bridge if chosen without care; e.g. a bridge nickname might be very similar to the operators' relay nicknames which might be located on adjacent IP addresses."
This was an easy decision back then, because we didn't use the nickname for anything. This has changed with #5629 where we try to count EC2 bridges which all have a similar nickname. So, while we don't have that information, there'd now be a use for it. Another advantage of having bridge nicknames would be that they're easier to look up in a status website like Atlas (which doesn't support searching for bridges yet). We should re-consider whether it still makes sense to sanitize nicknames in bridge descriptors or not.
Regarding the reasoning above, couldn't an adversary just scan adjacent IP addresses of all known relays, not just the ones with similar nicknames? And are we giving away anything else with the nicknames?
It would be great to get some feedback here whether leaving nicknames in the sanitized descriptors is a terrible idea, and if so, why.
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Thanks! Karsten
Karsten Loesing, 02.05.2012 14:30:
we're discussing in #5684 whether we can stop sanitizing nicknames in the bridge descriptors that we publish
I don't mind, but...
The reason was that "bridge nicknames might give hints on the location of the bridge if chosen without care; e.g. a bridge nickname might be very similar to the operators' relay nicknames which might be located on adjacent IP addresses."
That doesn't seem to have changed. Anyone can set up a relay and a bridge on "adjacent IP addresses".
It's not recommended to set up a bridge and a relay on the same network, e.g. 1.2.3.4:443 and 1.2.3.4:444. Since most adversaries don't block 1.2.3.4 completely, I wonder how many problems clients would have connecting to such a bridge.
In the mentioned case (adjacent addresses like 1.2.3.4:443 and 1.2.3.5:443) it would be different. I assume it would be much better.
How much damage would be done if a certain percentage of bridges were (both at the same time) a) named similarly and b) actually located close together (based on IP)?
Do similar names actually mean that bridges are located where the relay is? (Apparently you've got the data to see these correlations)
At any point, and regardless of how small the damage might be, you (the Tor people) should raise awareness that a bridge should be named in a way that does not reveal an adjacent IP.
This was an easy decision back then, because we didn't use the nickname for anything.
"We don't need it, so better remove it." I really like that.
This has changed with #5629 where we try to count EC2 bridges which all have a similar nickname. So, while we don't have that information, there'd now be a use for it.
As #5684 mentions there would be another way.
- Use the extra platform string. (Don't know how good that would work)
Another advantage of having bridge nicknames would be that they're easier to look up in a status website like Atlas (which doesn't support searching for bridges yet).
That's of course something that's not possible with the platform string.
For me it would be nice to have, but I could live without it if it's not safe enough. If there's any doubt about its safety, I would stay away from changing it.
Regarding the reasoning above, couldn't an adversary just scan adjacent IP addresses of all known relays, not just the ones with similar nicknames?
As an adversary I would look up all known relays and scan for useful services. When 1.2.3.4:80 hosts a website I whitelist it and block all other ports on 1.2.3.4, especially the ORPort. If enough resources are available I'd scan "adjacent addresses".
And are we giving away anything else with the nicknames?
Maybe it's location ;)
When I read "hints on the location" for the first time, I thought it would mean that "TowerBridge" or "BridgeofLondon" would be bad since it could hint at London.
It would be great to get some feedback here whether leaving nicknames in the sanitized descriptors is a terrible idea, and if so, why.
Probably there are many unnamed bridges. Should someone find their bridge named like their relay, they can still rename it.
There's a goal (or some), so it would have a purpose.
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames.
I don't object. As you have the tarballs available (in a safe and secure place, I hope) you can see if there are relays and bridges which give away anything, especially "adjacent addresses".
Since your message is older than five hours and I'm the only one that replies to it so far (which makes me feel like I shouldn't have), I assume that there is not so much to object.
I like that you wait....
Could it make sense to ask the same question on the tor-relays list? Here you (the Tor people) have more data again and know who is subscribed to both lists. I for one assume that relay and bridge operators who could object, because it's their naming scheme that could reveal something, are more likely to be subscribed to tor-relays.
And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Probably two weeks later, since unpacking, processing and re-packing takes some time :) I know the sanitized ones are large when they are unpacked. Windows needs some time to delete the extracted files.
Thanks! Karsten
Thanks for your work.
Regards, bastik_tor
[Cc'ing tor-relays, because this discussion might be relevant for relay/bridge operators, too. Please keep the discussion on tor-dev. See https://lists.torproject.org/pipermail/tor-dev/2012-May/003489.html for the whole thread.]
Hi Sebastian,
On 5/2/12 9:35 PM, Sebastian G. <bastik.tor> wrote:
[...]
Do similar names actually mean that bridges are located where the relay is? (Apparently you've got the data to see these correlations)
A fine question.
How do we define "similar" and "located where the relay is"? I can see how a relay "bastik1" and a bridge "bastik2" have similar nicknames, but would we also teach a program that "bastikrelay" and "bastikbridge" are similar? And are two IP addresses in the same, say, /30 located nearby, or is the same /28 or even /24 okay, too?
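One way to make both questions concrete is sketched below in Python. It is only an illustration, not an analysis anyone in this thread has run: `nickname_similarity` uses the standard library's `difflib.SequenceMatcher` as one possible (assumed) similarity measure, and `same_prefix` tests whether two addresses share a /30, /28 or /24. Both function names are hypothetical.

```python
import difflib
import ipaddress


def nickname_similarity(a, b):
    """Return a score in [0, 1]; higher means the nicknames look more alike.

    SequenceMatcher is just one assumed choice of measure; an adversary
    could use something entirely different.
    """
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_prefix(addr_a, addr_b, prefix_len):
    """Return True if both IPv4 addresses fall into the same /prefix_len."""
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network


if __name__ == "__main__":
    for relay, bridge in [("bastik1", "bastik2"), ("bastikrelay", "bastikbridge")]:
        print(relay, bridge, round(nickname_similarity(relay, bridge), 2))
    for prefix_len in (30, 28, 24):
        print(prefix_len, same_prefix("1.2.3.4", "1.2.3.9", prefix_len))
```

Any threshold on the similarity score, and any choice of prefix length, would be a judgment call; the sketch only shows that both definitions have to be fixed before the question can be answered from data.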
So, while we have the data to see these correlations, I think that whatever similarity algorithm we come up with, somebody else might come up with something smarter. If we do the analysis you suggest and learn that it's safe to include nicknames, that doesn't say very much. Just because we have the data to confirm how well our attack would work doesn't automatically mean we're in a good position to design the attack.
If you want to run this analysis with the 2008 tarball (assuming there won't be general objections within the next two weeks), I'm happy to take your list of likely bridge IP addresses and tell you how accurate your algorithm is.
"We don't need it, so better remove it." I really like that.
I think we're really conservative with giving out bridge data, and that's good.
At the same time there's a value in giving out information about bridges, so that "remove everything" is not a good answer. For example, I think if we give bridge operators better feedback how their bridge is doing, we'll suddenly have a lot more bridges. Making it easy for bridge operators to use Atlas would be a good step into that direction. The same applies to funders who realize from our statistics how successful the Tor Cloud project is and who then want to fund it more to make it more usable, support more cloud providers, etc.
And are we giving away anything else with the nicknames?
Maybe it's location ;)
When I read "hints on the location" for the first time, I thought it would mean that "TowerBridge" or "BridgeofLondon" would be bad since it could hint at London.
Well, in that case you'd learn that there's a (Tor) bridge in London. But that wouldn't help you very much, would it?
Could it make sense to ask the same question on the tor-relays list? Here you (the Tor people) have more data again and know who is subscribed to both lists. I for one assume that relay and bridge operators who could object, because it's their naming scheme that could reveal something, are more likely to be subscribed to tor-relays.
Good idea. I added tor-relays to the Cc to let relay/bridge operators know. Let's keep this discussion on tor-dev though.
And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Probably two weeks later, since unpacking, processing and re-packing takes some time :) I know the sanitized ones are large when they are unpacked. Windows needs some time to delete the extracted files.
Right. :) I'll probably start sanitizing all bridge descriptors at once in two weeks, starting with the 2008 ones, and provide only the 2008 tarball then. It's going to keep my CPU and disks busy for a while.
Thanks for your input!
Best, Karsten
Karsten Loesing, 03.05.2012 13:32:
Hi Sebastian,
Hi Karsten,
Do similar names actually mean that bridges are located where the relay is? (Apparently you've got the data to see these correlations)
A fine question.
How do we define "similar"
That's the same problem an attacker would have. I truly understand the point. One can come up with a naming scheme that's not in "the pattern", but where it's still easy to correlate the names, e.g. relay "Earth" and bridge "Moon", relay "Jupiter" and bridge "Callisto".
and "located where the relay is?"
I guessed that would be defined already.
The safest way is to ensure that bridge and relay operators are aware of the fact that their naming scheme should avoid correlations, wherever both are actually located. The question here is on how to ensure it?!
So, while we have the data to see these correlations, I think that whatever similarity algorithm we come up with, somebody else might come up with something smarter. If we do the analysis you suggest and learn that it's safe to include nicknames, that doesn't say very much. Just because we have the data to confirm how well our attack would work doesn't automatically mean we're in a good position to design the attack.
If I remember correctly Bruce Schneier "once" said that it's easy to build/invent your own cipher which you are unable to break, but that you can't be sure that no one else can.
If you want to run this analysis with the 2008 tarball (assuming there won't be general objections within the next two weeks), I'm happy to take your list of likely bridge IP addresses and tell you how accurate your algorithm is.
I could flip a coin and it would be much more accurate than any algorithm I could come up with.
I'm not able to use any mathematical function on the data. And I have no "skill" to do that in a batch. I as an adversary would "crowdsource" the similarity since humans might have a better understanding what might belong together.
All I could do is look through the list manually and compare them with the list of relays. I don't think I'm going to do this as I don't believe that I'm going to find anything.
"We don't need it, so better remove it." I really like that.
I think we're really conservative with giving out bridge data, and that's good.
I agree.
At the same time there's a value in giving out information about bridges, so that "remove everything" is not a good answer. For example, I think if we give bridge operators better feedback how their bridge is doing, we'll suddenly have a lot more bridges. Making it easy for bridge operators to use Atlas would be a good step into that direction.
I agree again. At least for interested operators it's very nice to have.
The same applies to funders who realize from our statistics how successful the Tor Cloud project is and who then want to fund it more to make it more usable, support more cloud providers, etc.
When it's helpful.
And are we giving away anything else with the nicknames?
Maybe it's location ;)
When I read "hints on the location" for the first time, I thought it would mean that "TowerBridge" or "BridgeofLondon" would be bad since it could hint at London.
Well, in that case you'd learn that there's a (Tor) bridge in London. But that wouldn't help you very much, would it?
That was my first misinterpretation of "hints on the location" as I read it "back then" (some time ago). I tried and failed to understand what one would gain from knowing where a bridge is located (geographically). I think it does not hurt.
Thanks for your input!
Probably it wasn't necessary at all. I don't think it was too useful either.
Best, Karsten
Same to you! Sebastian
On 5/3/12 7:22 PM, Sebastian G. <bastik.tor> wrote:
The safest way is to ensure that bridge and relay operators are aware of the fact that their naming scheme should avoid correlations, wherever both are actually located. The question here is on how to ensure it?!
This is a usability question. Telling bridge operators that they should use a very different nickname for their bridge than what they used for their relays could be useful. But it's yet one more thing to tell them. We should also tell them not to run their bridge on the same IP address where they ran a relay before. Or they shouldn't re-use their relay identity key for running a bridge. And we could even test these cases automatically. But my sense is that we'd only confuse potential bridge operators, either by telling them these things in a howto or by notifying them when they do one of these things. We'd probably overload poor Runa who has to answer the support questions coming out of this. Probably not worth it.
So, while we have the data to see these correlations, I think that whatever similarity algorithm we come up with, somebody else might come up with something smarter. If we do the analysis you suggest and learn that it's safe to include nicknames, that doesn't say very much. Just because we have the data to confirm how well our attack would work doesn't automatically mean we're in a good position to design the attack.
If I remember correctly Bruce Schneier "once" said that it's easy to build/invent your own cipher which you are unable to break, but that you can't be sure that no one else can.
I fully agree. That's why I want to avoid doing the analysis and telling people everything's good.
I'm not able to use any mathematical function on the data. And I have no "skill" to do that in a batch. I as an adversary would "crowdsource" the similarity since humans might have a better understanding what might belong together.
All I could do is look through the list manually and compare them with the list of relays. I don't think I'm going to do this as I don't believe that I'm going to find anything.
Sounds like a fine approach. Want to do it (when the 2008 tarball is available)? It would be interesting to see a) what fraction of bridges you think you can derive IP addresses for and b) how accurate your guesses are.
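The two numbers asked for in a) and b) could be computed roughly as follows. This is a minimal sketch under assumptions: guesses and ground truth are taken to be simple mappings from a bridge nickname to one IP address, a guess only counts as correct on an exact match, and the name `evaluate_guesses` is hypothetical. A real evaluation might instead count a guess as correct when it lands in the right /24 or similar.

```python
def evaluate_guesses(guesses, truth):
    """Compare guessed bridge addresses against the real ones.

    guesses: dict mapping bridge nickname -> guessed IP address (string)
    truth:   dict mapping bridge nickname -> actual IP address (string)

    Returns (coverage, accuracy):
      coverage = fraction of all bridges for which a guess was made
      accuracy = fraction of those guesses that were exactly right
    """
    attempted = [bridge for bridge in guesses if bridge in truth]
    correct = [bridge for bridge in attempted if guesses[bridge] == truth[bridge]]
    coverage = len(attempted) / len(truth) if truth else 0.0
    accuracy = len(correct) / len(attempted) if attempted else 0.0
    return coverage, accuracy


if __name__ == "__main__":
    # Made-up example data using documentation addresses (192.0.2.0/24).
    truth = {"a": "192.0.2.1", "b": "192.0.2.2", "c": "192.0.2.3", "d": "192.0.2.4"}
    guesses = {"a": "192.0.2.1", "b": "192.0.2.9"}
    print(evaluate_guesses(guesses, truth))
```

Reporting the two values separately keeps apart an attacker who guesses rarely but well and one who guesses for every bridge but is mostly wrong.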
Best, Karsten
Karsten Loesing, 04.05.2012 12:31:
On 5/3/12 7:22 PM, Sebastian G. <bastik.tor> wrote:
The safest way is to ensure that bridge and relay operators are aware of the fact that their naming scheme should avoid correlations, wherever both are actually located. The question here is on how to ensure it?!
This is a usability question. Telling bridge operators that they should use a very different nickname for their bridge than what they used for their relays could be useful. But it's yet one more thing to tell them. We should also tell them not to run their bridge on the same IP address where they ran a relay before. Or they shouldn't re-use their relay identity key for running a bridge. And we could even test these cases automatically. But my sense is that we'd only confuse potential bridge operators, either by telling them these things in a howto or by notifying them when they do one of these things. We'd probably overload poor Runa who has to answer the support questions coming out of this. Probably not worth it.
I agree, that it's already enough information an operator would have to consider.
[...]
All I could do is look through the list manually and compare them with the list of relays. I don't think I'm going to do this as I don't believe that I'm going to find anything.
Sounds like a fine approach. Want to do it (when the 2008 tarball is available)? It would be interesting to see a) what fraction of bridges you think you can derive IP addresses for and b) how accurate your guesses are.
Since it will be released in two weeks and the next wave is released in two weeks after that I think there's enough time in which I can do that.
When you think it's useful I'm at least going to try. We should take this "off list" and then can post the results on it.
I encourage anyone to try the same. It might be interesting to see different results (What's similar). In the case that's useful.
Best, Karsten
Regards, Sebastian
On 5/4/12 3:52 PM, Sebastian G. <bastik.tor> wrote:
Karsten Loesing, 04.05.2012 12:31:
Sounds like a fine approach. Want to do it (when the 2008 tarball is available)? It would be interesting to see a) what fraction of bridges you think you can derive IP addresses for and b) how accurate your guesses are.
Since it will be released in two weeks and the next wave is released in two weeks after that I think there's enough time in which I can do that.
Cool!
When you think it's useful I'm at least going to try. We should take this "off list" and then can post the results on it.
Let's only discuss things off-list to reduce the potential noise for others, but let's post all results directly to the list. If you come up with guesses which bridges in 2008 were located nearby which relays, that's entirely based on (then) publicly available information. My response will be of the sort "you found x% of the bridges" which is fine to post to the list, too. This is why we're trying this with the 2008 bridges first before making the more recent tarballs available. And even then, it's good to find issues before we have 50,000 bridges in the network.
I encourage anyone to try the same. It might be interesting to see different results (What's similar). In the case that's useful.
Sure!
I encourage anyone to find issues in the sanitized bridge descriptor tarballs at any time, not restricted to this specific discussion. Knowing that there's a potential problem is great, because then we can fix it. So far I was the only one finding and fixing problems. Maybe others did find them, too, but didn't tell anyone.
Thanks, Karsten
On 5/3/12, Karsten Loesing karsten@torproject.org wrote:
How do we define "similar" and "located where the relay is"? I can see how a relay "bastik1" and a bridge "bastik2" have similar nicknames, but would we also teach a program that "bastikrelay" and "bastikbridge" are similar? And are two IP addresses in the same, say, /30 located nearby, or is the same /28 or even /24 okay, too?
http://freehaven.net/anonbib/papers/pets2011/p1-perito.pdf
Robert Ransom
On 5/3/12 8:06 PM, Robert Ransom wrote:
On 5/3/12, Karsten Loesing karsten@torproject.org wrote:
How do we define "similar" and "located where the relay is"? I can see how a relay "bastik1" and a bridge "bastik2" have similar nicknames, but would we also teach a program that "bastikrelay" and "bastikbridge" are similar? And are two IP addresses in the same, say, /30 located nearby, or is the same /28 or even /24 okay, too?
Interesting.
I'd think there's some similarity between usernames and nicknames, but they're not the same. People make different decisions when picking a username for themselves than when finding nicknames for their relays or bridges. For example, my three relays burning spare bandwidth are named "ene", "mene", and "miste", and my bridge is named "mu"; all names are part of a German variant of "eenie meenie miney mo". I picked these names, both because I lacked creativity to pick better nicknames, but also because I can easily memorize that they belong together. I wouldn't pick usernames this different when registering in online communities.
Sounds like a fun research project to apply their paper to relay/bridge nicknames.
Best, Karsten
On 05/03/2012 01:32 PM, Karsten Loesing wrote:
On 5/2/12 9:35 PM, Sebastian G. <bastik.tor> wrote:
[...] "We don't need it, so better remove it." I really like that.
I think we're really conservative with giving out bridge data, and that's good.
At the same time there's a value in giving out information about bridges, so that "remove everything" is not a good answer. For example, I think if we give bridge operators better feedback how their bridge is doing, we'll suddenly have a lot more bridges. Making it easy for bridge operators to use Atlas would be a good step into that direction. The same applies to funders who realize from our statistics how successful the Tor Cloud project is and who then want to fund it more to make it more usable, support more cloud providers, etc.
I would suggest looking at homomorphic hash [1] and Shamir's discrete logarithm hash function [2]. (Those also work well with linear network coding [3], but not sure if it could be useful here.)
For example, encoding FQDN, IP or nick can be done by splitting the argument-to-encode by fields or characters. The parts can be then used as input to the hash function.
The function allows checking whether a nick/FQDN/IP has specific part, or two have identical part, but does not disclose "plaintext" of the part.
Obviously, there are statistical attacks possible: e.g. for FQDNs, the attacker could guess which component maps to 'com', as it is the most common TLD. Similarly, splitting up into characters can be attacked by using frequency tables. There are other things that could apply here (thinking about attacks on "plaintext RSA" without padding).
Nevertheless, I think it's still better than publishing plaintext data if we are not sure what they might give away. Implementation using gmp/gmpy/numpy should be fairly easy.
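A rough Python sketch of the splitting idea might look as follows. It is an illustration under loud assumptions, not the scheme from [1] or [2]: the modulus 65537 and base 3 are toy parameters chosen only so the example runs instantly (3 is a primitive root modulo the Fermat prime 65537, so distinct two-character chunks get distinct values), the chunk size of two characters is arbitrary, and all function names are hypothetical. Parameters this small offer no protection at all.

```python
# Toy parameters (assumptions for illustration only, far too small to be secure).
P = 65537   # a Fermat prime
G = 3       # a primitive root modulo P
CHUNK = 2   # characters per chunk


def chunks(text, size=CHUNK):
    """Split text into consecutive pieces of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def hash_chunk(chunk):
    """Discrete-exponentiation hash of one chunk: G**m mod P."""
    m = int.from_bytes(chunk.encode("ascii"), "big")
    return pow(G, m, P)


def sanitize(nickname):
    """Replace a nickname by the list of its per-chunk hash values."""
    return [hash_chunk(c) for c in chunks(nickname)]


def has_prefix(sanitized, prefix):
    """Test a prefix against a sanitized nickname.

    Only whole chunks can be compared, so a prefix of odd length is
    silently shortened by one character.
    """
    usable = prefix[:len(prefix) - len(prefix) % CHUNK]
    wanted = [hash_chunk(c) for c in chunks(usable)]
    return sanitized[:len(wanted)] == wanted


if __name__ == "__main__":
    published = sanitize("ec2bridgeb268f2ae6")
    print(has_prefix(published, "ec2bridge"))        # compares "ec2bridg" only
    print(has_prefix(published, "rackspacebridge"))
```

The sketch also makes the statistical weakness visible: because each chunk is hashed on its own with public parameters, anybody can build a table of all possible chunk values.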
Ondrej
[1] On-the-Fly Verification of Rateless Erasure Codes for Efficient Content Distribution, http://pdos.csail.mit.edu/papers/otfvec/paper.pdf (see section IV)
[2] http://www.senderek.com/SDLH/
[3] https://en.wikipedia.org/wiki/Network_coding
On 5/4/12 2:21 AM, Ondrej Mikle wrote:
On 05/03/2012 01:32 PM, Karsten Loesing wrote:
On 5/2/12 9:35 PM, Sebastian G. <bastik.tor> wrote:
[...] "We don't need it, so better remove it." I really like that.
I think we're really conservative with giving out bridge data, and that's good.
At the same time there's a value in giving out information about bridges, so that "remove everything" is not a good answer. For example, I think if we give bridge operators better feedback how their bridge is doing, we'll suddenly have a lot more bridges. Making it easy for bridge operators to use Atlas would be a good step into that direction. The same applies to funders who realize from our statistics how successful the Tor Cloud project is and who then want to fund it more to make it more usable, support more cloud providers, etc.
I would suggest looking at homomorphic hash [1] and Shamir's discrete logarithm hash function [2]. (Those also work well with linear network coding [3], but not sure if it could be useful here.)
For example, encoding FQDN, IP or nick can be done by splitting the argument-to-encode by fields or characters. The parts can be then used as input to the hash function.
The function allows checking whether a nick/FQDN/IP has specific part, or two have identical part, but does not disclose "plaintext" of the part.
Obviously, there are statistical attacks possible: e.g. for FQDNs, the attacker could guess which component maps to 'com', as it is the most common TLD. Similarly, splitting up into characters can be attacked by using frequency tables. There are other things that could apply here (thinking about attacks on "plaintext RSA" without padding).
Nevertheless, I think it's still better than publishing plaintext data if we are not sure what they might give away. Implementation using gmp/gmpy/numpy should be fairly easy.
Interesting approach. So, the idea would be to split a nickname like "ec2bridgeb268f2ae6" into its characters (or pairs of 2 or more characters?), run it through the hash function, and then be able to check if the nickname starts with "ec2bridge"? Plus, the approach would still work if we later decide we want to find all bridges with nicknames starting with "rackspacebridge"?
My first concern is that there's not enough entropy in nicknames for the hash function to provide sufficient protection. I could imagine it's not hard to throw variants of all known relay nicknames into that hash function and learn 50%, if not 75%, of all used bridge nicknames.
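To illustrate this concern, here is a minimal Python sketch of such a dictionary attack. Everything in it is an assumption made for the example: SHA-256 stands in for whatever public, deterministic hash would be used, the variant rules (appending "bridge", replacing "relay" with "bridge", trying nearby trailing numbers) are invented, and the function names are hypothetical. The 50% or 75% figures above are guesses, and this sketch does not measure them.

```python
import hashlib
import re


def h(nickname):
    """Stand-in for any public, deterministic hash of a nickname."""
    return hashlib.sha256(nickname.encode("ascii")).hexdigest()


def variants(relay_nickname):
    """Derive a few candidate bridge nicknames from a known relay nickname."""
    guesses = {relay_nickname, relay_nickname + "bridge"}
    guesses.add(relay_nickname.replace("relay", "bridge"))
    match = re.fullmatch(r"(.*?)(\d+)", relay_nickname)
    if match:
        stem, number = match.group(1), int(match.group(2))
        # Try trailing numbers close to the relay's own number.
        for n in range(max(0, number - 3), number + 4):
            guesses.add(f"{stem}{n}")
    return guesses


def recover(hashed_bridge_nicknames, relay_nicknames):
    """Map each published hash to a plaintext nickname where a guess matches."""
    table = {h(g): g for relay in relay_nicknames for g in variants(relay)}
    return {d: table[d] for d in hashed_bridge_nicknames if d in table}


if __name__ == "__main__":
    published = [h("bastik2"), h("bastikbridge"), h("mu")]
    known_relays = ["bastik1", "bastikrelay", "ene"]
    print(sorted(recover(published, known_relays).values()))
```

In this made-up example the two nicknames that follow the relay's naming pattern are recovered, while "mu" is not, because no variant rule produces it from "ene".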
My second concern is that this approach would only solve the problem of counting EC2 bridges, but wouldn't make sites like Atlas more usable for bridge operators.
Best, Karsten
On 05/04/2012 01:04 PM, Karsten Loesing wrote:
On 5/4/12 2:21 AM, Ondrej Mikle wrote:
On 05/03/2012 01:32 PM, Karsten Loesing wrote:
On 5/2/12 9:35 PM, Sebastian G. <bastik.tor> wrote:
[...] "We don't need it, so better remove it." I really like that.
I think we're really conservative with giving out bridge data, and that's good.
At the same time there's a value in giving out information about bridges, so that "remove everything" is not a good answer. For example, I think if we give bridge operators better feedback how their bridge is doing, we'll suddenly have a lot more bridges. Making it easy for bridge operators to use Atlas would be a good step into that direction. The same applies to funders who realize from our statistics how successful the Tor Cloud project is and who then want to fund it more to make it more usable, support more cloud providers, etc.
I would suggest looking at homomorphic hash [1] and Shamir's discrete logarithm hash function [2]. (Those also work well with linear network coding [3], but not sure if it could be useful here.)
For example, encoding FQDN, IP or nick can be done by splitting the argument-to-encode by fields or characters. The parts can be then used as input to the hash function.
The function allows checking whether a nick/FQDN/IP has specific part, or two have identical part, but does not disclose "plaintext" of the part.
Obviously, there are statistical attacks possible: e.g. for FQDNs, the attacker could guess which component maps to 'com', as it is the most common TLD. Similarly, splitting up into characters can be attacked by using frequency tables. There are other things that could apply here (thinking about attacks on "plaintext RSA" without padding).
Nevertheless, I think it's still better than publishing plaintext data if we are not sure what they might give away. Implementation using gmp/gmpy/numpy should be fairly easy.
Interesting approach. So, the idea would be to split a nickname like "ec2bridgeb268f2ae6" into its characters (or pairs of 2 or more characters?), run it through the hash function, and then be able to check if the nickname starts with "ec2bridge"? Plus, the approach would still work if we later decide we want to find all bridges with nicknames starting with "rackspacebridge"?
Basically, yes (you'd need to check for "rackspacebridg", since its length is a multiple of two; it's possible to check in the middle as well as at the end). Though as you write below, the low entropy makes it not very usable for groups of 2-3 characters.
Thus the idea is probably over-engineering in this case. I just note below some design facets I had in mind:
My first concern is that there's not enough entropy in nicknames for the hash function to provide sufficient protection. I could imagine it's not hard to throw variants of all known relay nicknames into that hash function and learn 50%, if not 75%, of all used bridge nicknames.
One trick I had in mind was to create a "secret hash function" (take the following with a grain of salt, algebra is not my "primary thing"):
- you keep generators g_i secret, which turns the problem from discrete-log into a problem of finding n-th root in finite field (n is the value of the digraph, trigraph or few characters, e.g. encoded value of 'ec2bridge', possibly "blinded" by another multiplication with secret constant c_i)
- in general, computing n-th root is quite slow [1], but there are many special cases like square root (quadratic residue)
- the above properties would make it slow for attacker to brute-force all possible values, i.e. attacker can't just compute the result values of such homomorphic hash, because he doesn't know the function parameters (e.g. without computing the generators), but everyone can use the "homomorphic property" for testing parts
My second concern is that this approach would only solve the problem of counting EC2 bridges, but wouldn't make sites like Atlas more usable for bridge operators.
Yes, the scheme severely restricts what you can do with the data.
Ondrej
[1] http://www.ma.utexas.edu/users/voloch/Preprints/roots.pdf
On Sat, May 05, 2012 at 01:47:45AM +0200, Ondrej Mikle wrote:
One trick I had in mind was to create a "secret hash function" (take the following with a grain of salt, algebra is not my "primary thing"):
- you keep generators g_i secret, which turns the problem from discrete-log into a problem of finding n-th root in finite field (n is the value of the digraph, trigraph or few characters, e.g. encoded value of 'ec2bridge', possibly "blinded" by another multiplication with secret constant c_i)
- in general, computing n-th root is quite slow [1], but there are many special cases like square root (quadratic residue)
- the above properties would make it slow for attacker to brute-force all possible values, i.e. attacker can't just compute the result values of such homomorphic hash, because he doesn't know the function parameters (e.g. without computing the generators), but everyone can use the "homomorphic property" for testing parts
It sounds like you're talking about the homomorphic hashing paper you linked to in your last email. But there, the exponentiations are in Z_p, and taking n-th roots in Z_p is totally trivial.
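To make the "trivial" part concrete, here is a small Python sketch of the simplest case, under the assumption that gcd(n, p - 1) = 1: the n-th root of y modulo a prime p is then y raised to the inverse of n modulo p - 1. The function name and the toy prime 65537 are assumptions for illustration; in a prime-order subgroup the same idea would apply with the subgroup order in place of p - 1, and the case gcd(n, p - 1) > 1 needs other (still efficient) algorithms that this sketch does not cover.

```python
from math import gcd


def nth_root_mod_p(y, n, p):
    """Return x with x**n == y (mod p), assuming p is prime and gcd(n, p-1) == 1.

    Since x**(p-1) == 1 (mod p) for x not divisible by p, exponents can be
    inverted modulo p-1: with d = n**-1 mod (p-1), y**d == x**(n*d) == x.
    """
    if gcd(n, p - 1) != 1:
        raise ValueError("this sketch only handles gcd(n, p - 1) == 1")
    d = pow(n, -1, p - 1)  # modular inverse, Python 3.8+
    return pow(y, d, p)


if __name__ == "__main__":
    p = 65537                                   # toy prime (assumption)
    secret_g = 3                                # a "secret" generator
    n = int.from_bytes(b"ec", "big")            # guessed chunk value, here odd
    y = pow(secret_g, n, p)                     # published hash value
    print(nth_root_mod_p(y, n, p) == secret_g)  # the generator is recovered
```

So once an attacker can guess the plaintext of a single chunk, keeping the generator secret would not help in this setting.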
I seem to have lost the thread of why we're doing this. Is it just to count how many bridges are running our ec2 / rackspace / etc. bridge images? Can't we just report that out of band? Is the EC2 IP space not known? I must be missing something.
- Ian
On 05/05/2012 03:05 PM, Ian Goldberg wrote:
On Sat, May 05, 2012 at 01:47:45AM +0200, Ondrej Mikle wrote:
One trick I had in mind was to create a "secret hash function" (take the following with a grain of salt, algebra is not my "primary thing"):
- you keep generators g_i secret, which turns the problem from discrete-log into a problem of finding n-th root in finite field (n is the value of the digraph, trigraph or few characters, e.g. encoded value of 'ec2bridge', possibly "blinded" by another multiplication with secret constant c_i)
- in general, computing n-th root is quite slow [1], but there are many special cases like square root (quadratic residue)
- the above properties would make it slow for attacker to brute-force all possible values, i.e. attacker can't just compute the result values of such homomorphic hash, because he doesn't know the function parameters (e.g. without computing the generators), but everyone can use the "homomorphic property" for testing parts
It sounds like you're talking about the homomorphic hashing paper you linked to in your last email. But there, the exponentiations are in Z_p, and taking n-th roots in Z_p is totally trivial.
I seem to have lost the thread of why we're doing this. Is it just to count how many bridges are running our ec2 / rackspace / etc. bridge images? Can't we just report that out of band? Is the EC2 IP space not known? I must be missing something.
As I wrote before, the idea is over-engineered (I got a bit carried away; the EC2 IP space is known, it was an attempt to hide nicknames not on EC2 or similar cloud services potentially leaking information).
Ondrej
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
Best, Karsten
Karsten Loesing, 16.05.2012 08:47:
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
Here we go with the similarities of bridge and relay nicknames. While some are sharing names, some seem to share a naming scheme and others could be run by the same person on an adjacent IP address.
Attached you'll see "findings.txt", which contains a bridge line, followed by a relay line, followed by a comment line.
Each of these should be on its own line. Windows and Linux users should see the same thing, as both CR and LF (line feed) are included. That might be a problem for other files I might release, as some contain LF only; that is the Linux convention, but a proper text editor will understand it even on Windows. If no one objects, I'm going to release them somewhere else, so it's possible to check my way of doing things.
I'm not good at time tracking, but I spent approx. 4-5 hours on processing the tarballs to get the lists of names for both relays and bridges.
Shortly after I agreed to do this, I downloaded the relay tarball and started to figure out how I would get only what I needed. After I did that I waited for the bridge tarball and did the same to it.
Now that I know the steps, I assume it could be done more quickly. With other tools it might be faster, and I assume it would be much faster if processing the tarballs were done by a script. Unfortunately I'm not capable of writing one.
The comparison was done manually, but I'm sure an algorithm would have found most of the similarities anyway. I took the list of bridge names and compared it to the list of relay names. I looked for and included exact matches and close similarities. After that there wasn't much left to look for on the bridge list. I looked at each bridge name and guessed what it could be. I searched the Internet for them. I may have found something that isn't based on a naming scheme, but can be linked in other ways.
The worst part was putting the lines together. I underestimated the time the plain comparison would take, but copying the lines into findings.txt took longer than I had imagined. I'm not sure whether the way I put the lines together is the best way to present the data.
The nicest part, even if it was time-consuming, was looking for other things that could link relay and bridge names together. It wasn't as successful as I hoped, but it was fun. Maybe that is because I like the universe and mythology, which may have influenced the findings more than is useful.
The last part could be done by a script as well. It would have to look for a bridge name in a list "moons" and when it finds one, it has to compare the list "moons" against the relay list to find a relay that's in the "moons" list. Of course with many lists like "freedom activists".
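A rough sketch of such a script, assuming the themed lists are maintained by hand. The function name, the sample "moons" list and the sample nicknames are all made up for illustration.

```python
def match_by_category(bridge_names, relay_names, categories):
    """Pair bridges and relays whose nicknames appear in the same themed word list."""
    pairs = []
    for label, words in categories.items():
        words_lower = {w.lower() for w in words}
        # Keep only the nicknames that are members of this themed list.
        bridges = [b for b in bridge_names if b.lower() in words_lower]
        relays = [r for r in relay_names if r.lower() in words_lower]
        for b in bridges:
            for r in relays:
                pairs.append((label, b, r))
    return pairs


# Hypothetical input data.
categories = {"moons": ["Io", "Europa", "Ganymede", "Callisto", "Titan"]}
bridges = ["europa", "Unnamed", "bridge42"]
relays = ["Ganymede", "titan", "fastrelay"]
pairs = match_by_category(bridges, relays, categories)
```

Each resulting pair is only a candidate link; sharing a theme says nothing by itself about whether the same operator runs both.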
The comparison, including copying the lines together, took approx. 8-10 hours. I really expected much less when I saw the relay names.
The total time spent was 12-15 hours. I assume that this can be reduced by using scripts and/or tools which are designed to do parts of the job.
This mail is getting far too long, so all I'm going to say is that I'm looking forward to the results.
Best regards, bastik_tor
On 5/19/12 11:41 AM, Sebastian G. <bastik.tor> wrote:
Karsten Loesing, 16.05.2012 08:47:
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
Here we go with the similarities of bridge and relay nicknames.
Thanks for spending this much time on the analysis!
Here's what I did with your findings.txt:
- extract unique fingerprint pairs of relays and bridges that you found as having similar nicknames,
- look through descriptor archives to see if relay and bridge were running in the same /24 at any time in May 2008, and
- determine the absolute and relative number of bridges in a given network status that could have been located via nickname similarity.
Results are that 24 of your 81 guesses (30%) were correct in the sense that a bridge was at least once running in the same /24 as the relay with similar nickname. At any time in May 2008, you'd have located between 1 and 6 bridges (2.5% to 18%) with 3 bridges (10%) in the mean via nickname similarity.
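The /24 check described above can be sketched roughly as follows. This is not the script that was actually used; the function names, the fingerprint-to-address mappings and the sample addresses (taken from documentation ranges) are assumptions for illustration.

```python
import ipaddress


def same_slash24(addr_a: str, addr_b: str) -> bool:
    """True if both IPv4 addresses fall into the same /24 network."""
    net_a = ipaddress.ip_network(f"{addr_a}/24", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


def confirmed_pairs(pairs, bridge_addrs, relay_addrs):
    """Keep the (bridge, relay) guesses that shared a /24 at least once."""
    confirmed = []
    for bridge_fp, relay_fp in pairs:
        if any(
            same_slash24(b, r)
            for b in bridge_addrs.get(bridge_fp, ())
            for r in relay_addrs.get(relay_fp, ())
        ):
            confirmed.append((bridge_fp, relay_fp))
    return confirmed


# Hypothetical data: fingerprint -> set of addresses seen during the month.
guesses = [("B1", "R1"), ("B2", "R2")]
bridge_addrs = {"B1": {"192.0.2.10"}, "B2": {"198.51.100.7"}}
relay_addrs = {"R1": {"192.0.2.200"}, "R2": {"203.0.113.7"}}
hits = confirmed_pairs(guesses, bridge_addrs, relay_addrs)
accuracy = len(hits) / len(guesses)
```

This sketch ignores whether the two addresses were in use at the same moment, so it corresponds to the "at any time in the month" reading.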
I think it's acceptable to publish more recent bridge descriptors with nicknames in a week from now. Results may look quite different with 1000 bridges instead of 30.
Again, thanks for running this analysis! Maybe you're interested in automating your comparison and re-running it for a 2012 tarball?
Thanks, Karsten
Karsten Loesing, 21.05.2012 11:05:
Here we go with the similarities of bridge and relay nicknames.
Thanks for spending this much time on the analysis!
I could have done far worse, but also a lot better in terms of the time spent on extracting the data that I wanted, or at least considered potentially useful.
Sometimes I'm just slow at things, e.g. writing this reply.
Here's what I did with your findings.txt:
- extract unique fingerprint pairs of relays and bridges that you found
as having similar nicknames,
- look through descriptor archives to see if relay and bridge were
running in the same /24 at any time in May 2008, and
- determine the absolute and relative number of bridges in a given
network status that could have been located via nickname similarity.
Results are that 24 of your 81 guesses (30%) were correct in the sense that a bridge was at least once running in the same /24 as the relay with similar nickname. At any time in May 2008, you'd have located between 1 and 6 bridges (2.5% to 18%) with 3 bridges (10%) in the mean via nickname similarity.
Not too bad.
I think it's acceptable to publish more recent bridge descriptors with nicknames in a week from now. Results may look quite different with 1000 bridges instead of 30.
May 2008 was the first month with bridges. I expected lots of relay operators to have tested a bridge with the same name. Things may have changed over time. I assume that further comparisons won't have such a "high" hit ratio.
Again, thanks for running this analysis! Maybe you're interested in automating your comparison and re-running it for a 2012 tarball?
My claim was you got the data, so you can check. (Not with May 2008)
To be honest, my first impression was that I wouldn't do anything useful and did not intend to do that. I guessed it wouldn't turn out that it doesn't hurt since at least 2011, so I wouldn't find anything good.
Then you asked and I agreed, but already thought "I couldn't keep my mouth shut!". I mean I replied to this topic. I surely could have said no there. I didn't.
While and after doing what I did, I would have said no to the question of whether I'd do this again. That was valid up to Sunday night. Today I'm agreeing again.
That's a pretty long way to say: Yes!
Thank you, it's a 2012 tarball. The number of bridges is scary.
I'm going to upload some files somewhere and explain what I did. Step by step (somewhat around that). So anyone can check and reproduce what I did. It would be nice to hear feedback and ways to improve the way I did what I did.
Maybe you can tell me if the findings.txt was alright.
Unless someone objects or you disagree, I'm going to upload the files I created and explain how, and maybe even why.
I created a blog, just because I wanted one at some point in the past, though I found it silly. That's the channel I planned to use. Maybe it's OK to put it on a Tor list as well, but maybe it would be considered noise.
Thanks, Karsten
Thank you, Sebastian
On 5/16/12 8:47 AM, Karsten Loesing wrote:
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
And now, two weeks later, here are the sanitized bridge descriptors containing nicknames:
https://metrics.torproject.org/data.html#bridgedesc
Best, Karsten
Karsten Loesing:
On 5/16/12 8:47 AM, Karsten Loesing wrote:
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
And now, two weeks later, here are the sanitized bridge descriptors containing nicknames:
https://metrics.torproject.org/data.html#bridgedesc
Best, Karsten
Here are my findings for the tarballs of March 2012. I could pick freely from any 2012 tarball. I picked March 2012 because it contained the "bridge peak" and the relays seemed stable.
I changed the format of the file to save some bandwidth: for instance, there are no comments besides the first two, and there are no dates or IP addresses.
Karsten does not require a human-readable format and suggested that I give him a two-column file with fingerprints only. I'd be fine with that. However, I wanted to include the nicknames I found, to make it easier to read for others.
The format is now:
1234maxTOR JaEXbhlA3DGQObgodXayY1G90LA
1234maxTOR w7Fd36WDWPMwDeIyg90iFl8Q5Zg
where the first line is the bridge.
My reasoning to include the nicknames is to show how many of them share the same name or how close they are named. I'm surprised how many I found that shared a name.
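A file in this format could be read with something like the sketch below. It assumes that every record is a pair of lines (bridge first, then relay), each holding a nickname and a fingerprint separated by whitespace. Treating lines that start with "#" as comments is a guess, since the comment syntax is not stated, and the function and field names are made up.

```python
def parse_findings(text):
    """Parse bridge/relay line pairs into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and "#" comment lines carry no record data.
        if not line or line.startswith("#"):
            continue
        nickname, fingerprint = line.split()
        rows.append((nickname, fingerprint))
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of data lines")
    records = []
    # Even-indexed rows are bridges, odd-indexed rows the matching relays.
    for bridge, relay in zip(rows[0::2], rows[1::2]):
        records.append({
            "bridge_nickname": bridge[0],
            "bridge_fingerprint": bridge[1],
            "relay_nickname": relay[0],
            "relay_fingerprint": relay[1],
        })
    return records


sample = """# hypothetical comment line
1234maxTOR JaEXbhlA3DGQObgodXayY1G90LA
1234maxTOR w7Fd36WDWPMwDeIyg90iFl8Q5Zg
"""
records = parse_findings(sample)
```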
About processing the data: Karsten pointed me to some tools that helped me to obtain the data I wanted and needed. This took about 30 minutes and includes manual download, manual unpacking and semi-manual extraction of the nickname lines with grep. The rest was done by a semi-automated "batch" I created. I'm going to update the wiki page at some point in the future so that others can recreate this and, of course, improve it.
The comparison was done manually. I took list a) (the bridge-names) and compared it to list b) (the relay names). To improve the accuracy of the time spent I used a stopwatch.
For the simplest part, finding directly matching or pretty close names, I took 8 hours and 45 minutes. That includes copying the lines together. I then tried other ways to find similarities, e.g. a reversed name. I spent 1 hour and 10 minutes on that. I didn't find anything to include.
In total I spent 9 hours and 55 minutes comparing the names. I could have spent more time on the second part, but as I was not able to find anything that I could include, I stopped at that point. I could have stopped after 8 hours and 45 minutes, with exactly the same results. However, you never know.
The time spent was spread over several days. I assume that my first run with May 2008 did not take as long as I told you later. I may have overestimated the time I spent back then.
My approach was a manual comparison. However I'm still convinced that this can be automated with a tool or script that uses an algorithm or even prints out exact matches only, because there are many of them that share a name. I'm willing to assist, help or something along those lines to create a method with reproducible results.
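One possible starting point for such a tool, sketched with Python's standard difflib module. The function name, the 0.8 similarity threshold, skipping the default "Unnamed" nickname and the sample nicknames are all assumptions for illustration, not a tested method.

```python
from difflib import SequenceMatcher


def similar_nicknames(bridge_names, relay_names, threshold=0.8):
    """Return (bridge, relay, kind) for exact, reversed or close nickname matches."""
    matches = []
    for bridge in bridge_names:
        b = bridge.lower()
        if b == "unnamed":
            continue  # assumption: the default nickname carries no information
        for relay in relay_names:
            r = relay.lower()
            if r == "unnamed":
                continue
            if b == r:
                matches.append((bridge, relay, "exact"))
            elif b == r[::-1]:
                matches.append((bridge, relay, "reversed"))
            elif SequenceMatcher(None, b, r).ratio() >= threshold:
                matches.append((bridge, relay, "similar"))
    return matches


# Hypothetical nickname lists.
bridges = ["1234maxTOR", "torbridge1", "nomis", "Unnamed"]
relays = ["1234maxTOR", "torbridge2", "simon"]
matches = similar_nicknames(bridges, relays)
```

The pairwise loop is quadratic in the number of nicknames, and the threshold would need tuning against a manually checked list to keep false positives down.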
I'm entirely open to the results.
Best Regards, Sebastian (aka bastik_tor)
On 6/4/12 7:43 PM, Sebastian G. <bastik.tor> wrote:
Karsten Loesing:
On 5/16/12 8:47 AM, Karsten Loesing wrote:
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
And now, two weeks later, here are the sanitized bridge descriptors containing nicknames:
https://metrics.torproject.org/data.html#bridgedesc
Best, Karsten
Here are my findings for the tarballs of March 2012. I could pick freely from any 2012 tarball. I picked March 2012 because it contained the "bridge peak" and the relays seemed stable.
Results are that 205 of your 308 guesses (66%) were correct in the sense that a bridge was at least once running in the same /24 as the relay with similar nickname. At any time in March 2012, you'd have located between 26 and 46 bridges (1.7% to 3.3%) with 37 bridges (2.5%) in the mean via nickname similarity.
Your accuracy went up from 30% in your May 2008 analysis to 66%, but the overall fraction of bridges you'd have located went down from 10% to 2.5% in the mean.
I think we can live with an adversary being able to locate 1 out of 40 bridges if the operator assigns a similar nickname and runs it on a nearby IP address.
If you think you can come up with a vastly improved rate of located bridges of, say, 5% or more, I can look at another findings.txt of yours for a different month than March 2012.
If not, let's conclude this analysis and assume that publishing bridge nicknames is safe enough---until somebody shows us that we're wrong.
Again, thanks for running this analysis!
Thanks, Karsten
Karsten Loesing:
On 6/4/12 7:43 PM, Sebastian G. <bastik.tor> wrote:
Karsten Loesing:
On 5/16/12 8:47 AM, Karsten Loesing wrote:
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
And now, two weeks later, here are the sanitized bridge descriptors containing nicknames:
https://metrics.torproject.org/data.html#bridgedesc
Best, Karsten
Here are my findings for the tarballs of March 2012. I could pick freely from any 2012 tarball. I picked March 2012 because it contained the "bridge peak" and the relays seemed stable.
Results are that 205 of your 308 guesses (66%) were correct in the sense that a bridge was at least once running in the same /24 as the relay with similar nickname. At any time in March 2012, you'd have located between 26 and 46 bridges (1.7% to 3.3%) with 37 bridges (2.5%) in the mean via nickname similarity.
That sounds good from my point of view as an attacker. It's not too bad.
Your accuracy went up from 30% in your May 2008 analysis to 66%, but the overall fraction of bridges you'd have located went down from 10% to 2.5% in the mean.
From my point of view as a user it's good that the overall fraction
decreased.
I think we can live with an adversary being able to locate 1 out of 40 bridges if the operator assigns a similar nickname and runs it on a nearby IP address.
You should get more people to run bridges with names of already existing relays that are not their own. That would give a higher false-positive rate. (True, it would but I'm just kidding)
More bridges overall would be wonderful.
If you think you can come up with a vastly improved rate of located bridges of, say, 5% or more, I can look at another findings.txt of yours for a different month than March 2012.
Unless I can come up with an idea to exclude false-positives that's unlikely. Well I went up from 30% to 66%, but I don't know why that was the case.
If not, let's conclude this analysis and assume that publishing bridge nicknames is safe enough---until somebody shows us that we're wrong.
I consider that publishing bridge nicknames is safe enough for achieving the goals (counting EC2, searching them via Atlas), unless somebody (myself not excluded) shows us that we're wrong.
Again, thanks for running this analysis!
Thank you for your work. I did this because, a) I had the idea to look at the data b) you told me it's useful c) I wanted to know how many can be located.
Finally I can say that it was a fine experience and I learned something (at least about processing the data).
Thanks, Karsten
Once again, thank you.
Best, Sebastian