Quantifying the similarity between Tor relays - tor-dev

22 May 2015


      It is often helpful to determine the similarity between relay
descriptors.  For example, to detect Sybil attacks, or to find partners
in crime once we found a malicious relay.  I recently added code to
sybilhunter that can automate this task.
Now to the underlying theory.  The algorithm makes use of a vantage
point tree (vp-tree), which enables O(log n) search for nearest
neighbours.  We then take a file as input -- for example a consensus
document -- and add every relay to the vp-tree.  For node comparison,
vp-trees require a distance metric in the mathematical sense, i.e., the
metric has to satisfy the triangle inequality.  For now, I use the
Levenshtein distance.  It takes as input two strings and returns the
number of character edit operations (replace, insert, remove) that are
necessary to turn descriptor A into descriptor B.  While easy to
understand and use, the Levenshtein distance cannot easily incorporate
our domain-specific knowledge.  I hope to find something more suitable.
Before comparing relay descriptor, I preprocess them.  For network
status documents, I consider nickname, ports, publication time, flags,
version, and bandwidth.  I believe this is a good start, but there is
much room for improvement and feedback is very welcome.
Now here's how you can give it a shot.  The commands below find the 10
closest relays to the relay we run at Karlstad University:
$ go get git.torproject.org/user/phw/sybilhunter.git
  $ wget https://www.nymity.ch/~phw/sample-consensus -O /tmp/consensus
  $ sybilhunter -data /tmp/consensus -neighbours 10 -rootrelay 9B94CD0B7B8057EAF21BA7F023B7A1C8CA9CE645
On my laptop, it takes approximately 10 seconds to generate the vp-tree.
The output then consists of 10 blocks that represent the most similar
relay descriptors.  The first two lines of each block show the input to
the Levenshtein distance.  The third line shows the distance and a link
to the relay's Atlas page.
Cheers,
Philipp