It is often helpful to determine the similarity between relay descriptors. For example, to detect Sybil attacks, or to find partners in crime once we found a malicious relay. I recently added code to sybilhunter that can automate this task.
Now to the underlying theory. The algorithm makes use of a vantage point tree (vp-tree), which enables O(log n) search for nearest neighbours. We then take a file as input -- for example a consensus document -- and add every relay to the vp-tree. For node comparison, vp-trees require a distance metric in the mathematical sense, i.e., the metric has to satisfy the triangle inequality. For now, I use the Levenshtein distance. It takes as input two strings and returns the number of character edit operations (replace, insert, remove) that are necessary to turn descriptor A into descriptor B. While easy to understand and use, the Levenshtein distance cannot easily incorporate our domain-specific knowledge. I hope to find something more suitable.
Before comparing relay descriptor, I preprocess them. For network status documents, I consider nickname, ports, publication time, flags, version, and bandwidth. I believe this is a good start, but there is much room for improvement and feedback is very welcome.
Now here's how you can give it a shot. The commands below find the 10 closest relays to the relay we run at Karlstad University:
$ go get git.torproject.org/user/phw/sybilhunter.git $ wget https://www.nymity.ch/~phw/sample-consensus -O /tmp/consensus $ sybilhunter -data /tmp/consensus -neighbours 10 -rootrelay 9B94CD0B7B8057EAF21BA7F023B7A1C8CA9CE645
On my laptop, it takes approximately 10 seconds to generate the vp-tree. The output then consists of 10 blocks that represent the most similar relay descriptors. The first two lines of each block show the input to the Levenshtein distance. The third line shows the distance and a link to the relay's Atlas page.
Cheers, Philipp