Forwarding my original answer to Sebastian here.
-------- Original Message --------
Subject: Re: [tor-dev] Can we stop sanitizing nicknames in bridge descriptors?
Date: Mon, 21 May 2012 19:56:34 +0200
From: Karsten Loesing karsten@torproject.org
To: Sebastian G. <bastik.tor> bastik.tor@googlemail.com
Hi Sebastian,
On 5/21/12 7:08 PM, Sebastian G. <bastik.tor> wrote:
(Did you intend to send this mail only to me, not to tor-dev? Feel free to move the discussion back to tor-dev if you want.)
Karsten Loesing, 21.05.2012 11:05:
Here we go with the similarities of bridge and relay nicknames.
Thanks for spending this much time on the analysis!
I could have done far worse, but also a lot better, in terms of time spent on extracting the data that I wanted, or at least thought might be useful.
Sometimes I'm just slow at things, e.g. writing this reply.
Here's what I did with your findings.txt:
- extract unique fingerprint pairs of relays and bridges that you found
as having similar nicknames,
- look through descriptor archives to see if relay and bridge were
running in the same /24 at any time in May 2008, and
- determine the absolute and relative number of bridges in a given
network status that could have been located via nickname similarity.
The result is that 24 of your 81 guesses (30%) were correct in the sense that the bridge was at least once running in the same /24 as the relay with the similar nickname. At any time in May 2008, you'd have located between 1 and 6 bridges (2.5% to 18%) via nickname similarity, with a mean of 3 bridges (10%).
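The check described above can be sketched in a few lines of Python. This is only an illustration, not the script that was actually used: the function names, the input dictionaries (fingerprint mapped to the set of addresses seen during the month), and the sample fingerprints and addresses are all made up for the example.

```python
import ipaddress


def same_slash24(addr_a: str, addr_b: str) -> bool:
    """Return True if two IPv4 addresses fall into the same /24 network."""
    net_a = ipaddress.ip_network(f"{addr_a}/24", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


def confirmed_pairs(pairs, relay_addrs, bridge_addrs):
    """Keep the (relay, bridge) fingerprint pairs that ever shared a /24.

    relay_addrs and bridge_addrs map a fingerprint to the set of addresses
    seen for it during the month (a hypothetical input structure).
    """
    hits = []
    for relay_fp, bridge_fp in sorted(set(pairs)):  # de-duplicate guesses
        if any(same_slash24(r, b)
               for r in relay_addrs.get(relay_fp, ())
               for b in bridge_addrs.get(bridge_fp, ())):
            hits.append((relay_fp, bridge_fp))
    return hits


# Made-up example data using documentation address ranges.
pairs = [("RELAY1", "BRIDGE1"), ("RELAY2", "BRIDGE2"), ("RELAY1", "BRIDGE1")]
relay_addrs = {"RELAY1": {"192.0.2.10"}, "RELAY2": {"198.51.100.7"}}
bridge_addrs = {"BRIDGE1": {"192.0.2.200"}, "BRIDGE2": {"203.0.113.5"}}

hits = confirmed_pairs(pairs, relay_addrs, bridge_addrs)
share = len(hits) / len(set(pairs))
print(hits, f"{share:.0%}")
```

The same ratio applied to the numbers in this thread gives 24 / 81, which is about 0.296 and rounds to the 30% quoted above.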
Not too bad.
I agree. :)
I think it's acceptable to publish more recent bridge descriptors with nicknames a week from now. Results may look quite different with 1000 bridges instead of 30.
May 2008 was the first month with bridges. I expected lots of relay operators who tested a bridge with the same name. Things may have changed over time. I assume that further comparisons won't have such a "high" hit ratio.
That would be my guess, too. In May 2008, only a few early adopters were running bridges, and most of those probably ran relays at the same time, too. Plus, they were enthusiastic and put some energy into finding cool nicknames. It might be that this has changed since then. To be honest, I haven't looked at 2012 tarballs yet.
Again, thanks for running this analysis! Maybe you're interested in automating your comparison and re-running it for a 2012 tarball?
My point was that you have the data, so you can check. (Not for May 2008.)
To be honest, my first impression was that I wouldn't do anything useful and did not intend to do that. I guessed it wouldn't turn out that it doesn't hurt since at least 2011, so I wouldn't find anything good.
Then you asked and I agreed, but already thought "I couldn't keep my mouth shut!". I mean I replied to this topic. I surely could have said no there. I didn't.
While I was doing it, and afterwards, I would have said no if asked whether I was going to do this again. That was true up to Sunday night. Today I'm agreeing again.
That's a pretty long way to say: Yes!
Hah, great! :)
I'm going to make the 2012 tarballs available next Wednesday (May 30), assuming that my poor Linux box doesn't run out of $resource. I'll let you know.
Thank you, it's a 2012 tarball. The number of bridges is scary.
I'm going to upload some files somewhere and explain what I did, step by step (more or less), so anyone can check and reproduce it. It would be nice to hear feedback and ways to improve my approach.
Maybe you can tell me if the findings.txt was alright.
Yes, the file format was fine.
Unless someone objects or you disagree, I'm going to upload the files I created and explain how, and maybe even why.
No objections at all. Open discussion is good.
I created a blog, just because I had wanted one at some point in the past but found it silly. That's the channel I planned to use. Maybe it's OK to put it on a Tor list as well, but it might be considered noise.
I wonder if the Tor wiki would be a better place to collect ideas for reversing the bridge descriptor sanitizing process. Feel free to grab a new page in doc/ and start describing what you did.
Best, Karsten
Karsten Loesing, 22.05.2012 09:24:
Forwarding my original answer to Sebastian here.
To fix this, I have to reply again.
Hi Sebastian,
On 5/21/12 7:08 PM, Sebastian G. <bastik.tor> wrote:
(Did you intend to send this mail only to me, not to tor-dev? Feel free to move the discussion back to tor-dev if you want.)
That was my fault and not intended.
I'm going to make the 2012 tarballs available next Wednesday (May 30), assuming that my poor Linux box doesn't run out of $resource. I'll let you know.
It doesn't matter when they will be available.
Unless someone objects or you disagree, I'm going to upload the files I created and explain how, and maybe even why.
No objections at all. Open discussion is good.
I created a blog, just because I had wanted one at some point in the past but found it silly. That's the channel I planned to use. Maybe it's OK to put it on a Tor list as well, but it might be considered noise.
I wonder if the Tor wiki would be a better place to collect ideas for reversing the bridge descriptor sanitizing process. Feel free to grab a new page in doc/ and start describing what you did.
That sounds like a reasonable idea.
(now everything should be back on the list).
Best, Sebastian
Karsten Loesing, 22.05.2012 09:24:
Unless someone objects or you disagree, I'm going to upload the files I created and explain how, and maybe even why.
No objections at all. Open discussion is good.
I created a blog, just because I had wanted one at some point in the past but found it silly. That's the channel I planned to use. Maybe it's OK to put it on a Tor list as well, but it might be considered noise.
I wonder if the Tor wiki would be a better place to collect ideas for reversing the bridge descriptor sanitizing process. Feel free to grab a new page in doc/ and start describing what you did.
I did just that.
https://trac.torproject.org/projects/tor/wiki/doc/DataExtractionForCompariso...
I created it three days ago and haven't touched it since, mainly because I can't make it much prettier.
If you don't have an account on the wiki, you can comment on it using the cypherpunks account.
Alternatively, you can leave a comment on the blog:
https://roastedonion.wordpress.com/2012/05/23/how-i-extracted-the-data/
https://roastedonion.wordpress.com/2012/05/23/the-method-of-extraction/
For now you don't need to enter an email address. I may change that if there's trouble, e.g. spam.
Regards, bastik_tor
On 5/26/12 9:30 AM, Sebastian G. <bastik.tor> wrote:
Karsten Loesing, 22.05.2012 09:24:
Unless someone objects or you disagree, I'm going to upload the files I created and explain how, and maybe even why.
No objections at all. Open discussion is good.
I created a blog, just because I had wanted one at some point in the past but found it silly. That's the channel I planned to use. Maybe it's OK to put it on a Tor list as well, but it might be considered noise.
I wonder if the Tor wiki would be a better place to collect ideas for reversing the bridge descriptor sanitizing process. Feel free to grab a new page in doc/ and start describing what you did.
I did just that.
https://trac.torproject.org/projects/tor/wiki/doc/DataExtractionForCompariso...
Thanks for creating that page. Looks like a fine start, though you'll want to automate more things when looking at 2012 tarballs.
grep and friends are fine tools to process Tor descriptors. If you can, find a Unix/Linux-like environment for Windows (Cygwin?) and combine the powers of grep with sort, uniq, and maybe sed or awk. These tools are friggin' fast!
If you're comfortable with Java and want to do more fancy stuff with Tor descriptors, take a look at metrics-lib:
https://gitweb.torproject.org/metrics-lib.git
If you're a Python person, you'll like stem, even though it only implements parsing of a subset of Tor descriptors. More to come soon:
https://gitweb.torproject.org/stem.git
Best, Karsten
Karsten Loesing, 29.05.2012 19:43:
I did just that.
https://trac.torproject.org/projects/tor/wiki/doc/DataExtractionForCompariso...
Thanks for creating that page. Looks like a fine start, though you'll want to automate more things when looking at 2012 tarballs.
Well, without grep I'd still be copying out nicknames.
grep and friends are fine tools to process Tor descriptors. If you can, find a Unix/Linux-like environment for Windows (Cygwin?) and combine the powers of grep with sort, uniq, and maybe sed or awk. These tools are friggin' fast!
Cygwin might not be the right solution. I would have to compile the tools from source. Lucky me, those tools are available for Windows. Thanks to the people providing the binaries and the docs.
I have to check whether they are equivalent to the tools you mentioned. awk is named gawk. All of them are command-line tools, and I have to learn how to use them.
And I need to figure out how to strip the "r" or copy only the nickname.
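One possible way to copy only the nickname, sketched in Python rather than with the command-line tools discussed here: it assumes the input is a network status in which every "r" line starts with the keyword "r" followed by the nickname as the second whitespace-separated field. The function name and the sample lines are invented for illustration, and the counting step mimics what `sort | uniq -c` would give.

```python
from collections import Counter


def nicknames_from_status(lines):
    """Yield the nickname from every "r" line of a network status."""
    for line in lines:
        parts = line.split()
        # An "r" line starts with the keyword "r" followed by the nickname.
        if len(parts) >= 2 and parts[0] == "r":
            yield parts[1]


# Invented sample input; identities and digests are placeholders.
sample = """\
r ExampleRelay AAAAAAAAAAAAAAAAAAAAAAAAAAA BBBBBBBBBBBBBBBBBBBBBBBBBBB 2008-05-01 12:00:00 192.0.2.10 9001 0
s Running Valid
r Unnamed CCCCCCCCCCCCCCCCCCCCCCCCCCC DDDDDDDDDDDDDDDDDDDDDDDDDDD 2008-05-01 12:05:00 198.51.100.7 443 0
s Running
r ExampleRelay AAAAAAAAAAAAAAAAAAAAAAAAAAA BBBBBBBBBBBBBBBBBBBBBBBBBBB 2008-05-01 13:00:00 192.0.2.10 9001 0
""".splitlines()

# Count how often each nickname occurs and print them in sorted order.
counts = Counter(nicknames_from_status(sample))
for name in sorted(counts):
    print(counts[name], name)
```

To run it on a real file, the sample lines could be replaced by an open file object, since the function only needs an iterable of lines.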
If you're comfortable with Java and want to do more fancy stuff with Tor descriptors, take a look at metrics-lib:
https://gitweb.torproject.org/metrics-lib.git
If you're a Python person, you'll like stem, even though it only implements parsing of a subset of Tor descriptors. More to come soon:
Thanks to everyone coding on them or otherwise maintaining them.
I can't compile stuff, read or write code. That includes scripting. There's a reason why I'm on Windows.
Regards, Sebastian