Karsten Loesing:
On 5/16/12 8:47 AM, Karsten Loesing wrote:
On 5/2/12 2:30 PM, Karsten Loesing wrote:
If nobody objects within the next, say, two weeks, I'm going to make an old tarball from 2008 available with original nicknames. And if nobody screams, I'll provide the remaining tarballs containing original nicknames another two weeks later.
Here we go. These are the sanitized bridge descriptors from May 2008 including original bridge nicknames:
http://freehaven.net/~karsten/volatile/bridges-2008-05-nicknames.tar.bz2
And now, two weeks later, here are the sanitized bridge descriptors containing nicknames:
https://metrics.torproject.org/data.html#bridgedesc
Best, Karsten
Here are my findings for the tarballs of March 2012. I could pick freely from any 2012 tarball. I picked March 2012 because it contained the "bridge peak" and the relays seemed stable.
I changed the format of the file to save some bandwidth. For instance, there are no comments besides the first two, and there are no dates or IP addresses.
Karsten does not require a human-readable format and suggested that I give him a two-column file with fingerprints only. I'd be fine with that. However, I wanted to include the nicknames I found to make it easier for others to read.
The format is now:

  1234maxTOR JaEXbhlA3DGQObgodXayY1G90LA
  1234maxTOR w7Fd36WDWPMwDeIyg90iFl8Q5Zg

where the first line is the bridge.
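For anyone who wants to read the file with a script, here is a minimal sketch of a parser for this two-line format. It assumes that the comment lines start with "#" and that records simply follow each other as bridge line, then relay line. Neither is specified above, and the names Entry, Match and parse_pairs are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    nickname: str
    fingerprint: str


class Match(NamedTuple):
    bridge: Entry  # first line of a record
    relay: Entry   # second line of a record


def parse_pairs(lines: Iterable[str]) -> List[Match]:
    """Parse 'nickname fingerprint' lines into (bridge, relay) records."""
    entries = []
    for line in lines:
        line = line.strip()
        # Assumption: comment lines start with '#'; blank lines are ignored.
        if not line or line.startswith("#"):
            continue
        nickname, fingerprint = line.split()
        entries.append(Entry(nickname, fingerprint))
    if len(entries) % 2:
        raise ValueError("expected an even number of nickname lines")
    # Consecutive lines form one record: bridge first, relay second.
    return [Match(entries[i], entries[i + 1]) for i in range(0, len(entries), 2)]
```

Calling parse_pairs(open(path)) on the file would give one Match per bridge/relay pair.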
My reason for including the nicknames is to show how many of them share the same name or are named very similarly. I'm surprised how many I found that shared a name.
About processing the data: Karsten pointed me to some tools that helped me obtain the data I needed. This took about 30 minutes, including manual download, manual unpacking and semi-manual extraction of the nickname lines with grep. The rest was done by a semi-automated "batch" I created. I'm going to update the wiki page sometime in the future so that this can be recreated and, of course, improved.
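The grep step could probably be replaced by something like the following sketch. It assumes that the nickname lines are the "r" lines of network status documents, where the nickname is the second field and the base64 identity is the third. I did not state above which lines I grepped, so treat that as a guess. The function name extract_nicknames is made up.

```python
from typing import Iterable, Iterator, Tuple


def extract_nicknames(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (nickname, identity) from status 'r' lines, dropping the rest."""
    for line in lines:
        # Assumption: entries look like
        # 'r <nickname> <identity> <digest> <date> <time> <IP> <ORPort> <DirPort>'
        if line.startswith("r "):
            parts = line.split()
            if len(parts) >= 3:
                yield parts[1], parts[2]
```

Dates and IP addresses are dropped simply by keeping only the first two fields after the "r".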
The comparison was done manually. I took list a) (the bridge names) and compared it to list b) (the relay names). To measure the time spent accurately, I used a stopwatch.
The simplest part, finding directly matching or very similar names, took me 8 hours and 45 minutes. That includes copying the lines together. I then tried other ways to find similarities, e.g. a reversed name. I spent 1 hour and 10 minutes on that and did not find anything to include.
In total I spent 9 hours and 55 minutes comparing the names. I could have spent more time on the second part, but as I was not able to find anything that I could include, I stopped at that point. I could have stopped after 8 hours and 45 minutes with exactly the same results. However, you never know.
The time spent was spread over several days. I assume that my first run with May 2008 did not take as long as I told you later. I may have overestimated the time I spent back then.
My approach was a manual comparison. However, I'm still convinced that this can be automated with a tool or script that uses an algorithm or even prints out exact matches only, because many of them share a name. I'm willing to assist in creating a method with reproducible results.
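To make the idea a bit more concrete, here is a rough sketch of how such a script might look. It is not what I used (my comparison was manual) and it is only one possible approach: exact matches ignoring case, reversed names, and "close" names by way of Python's difflib. The threshold of 0.85 and the decision to skip Tor's default nickname "Unnamed" are arbitrary assumptions, and find_matches is a made-up name.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (nickname, fingerprint)


def find_matches(bridges: Iterable[Pair], relays: Iterable[Pair],
                 threshold: float = 0.85,
                 ignore: Tuple[str, ...] = ("unnamed",)) -> List[tuple]:
    """Return (kind, bridge_nick, bridge_fp, relay_nick, relay_fp) tuples."""
    # Index relays by lower-cased nickname so names are compared only once.
    relays_by_name = defaultdict(list)
    for nick, fp in relays:
        if nick.lower() not in ignore:
            relays_by_name[nick.lower()].append((nick, fp))

    results = []
    for b_nick, b_fp in bridges:
        key = b_nick.lower()
        if key in ignore:
            continue
        for r_key, entries in relays_by_name.items():
            if key == r_key:
                kind = "exact"
            elif key[::-1] == r_key:
                kind = "reversed"
            elif SequenceMatcher(None, key, r_key).ratio() >= threshold:
                kind = "close"
            else:
                continue
            for r_nick, r_fp in entries:
                results.append((kind, b_nick, b_fp, r_nick, r_fp))
    return results
```

Printing only the results whose kind is "exact" would give the exact-matches-only list. The "close" matches would still need a human to look at them, because a similarity ratio cannot tell whether two similar names really belong to the same operator.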
I'm entirely open to the results.
Best Regards, Sebastian (aka bastik_tor)