[GSoC] Consensus Diffs - First report - tor-dev

8 Jun 2014


      Hello everyone,
This is my first status report regarding my GSoC project, Implementing
consensus diffs. We weekly meetings with Nick and Sebastian, my mentors,
every Wednesday at 14h UTC, if you want to take a closer look at the
project's progress or want to participate.
In the application timeline I specified that I would spend the first
weeks coding the server side part, which is mostly just the generation
of diffs. And that's what I've been doing thus far. I'm a bit behind on
schedule due to exams, but I'm not too far off.
I first implemented a dynamic algorithm, quadratic in both time and
space, which seemed to work fine. Of course it didn't work for
consensuses, since that would mean an allocation of nearly 3GB. So the
next thing I did was switching to Hirschberg's algorithm, which still
takes quadratic time but only takes linear space due to a
divide-and-conquer approach.
But that still leaves us at a horrible 16 seconds of runtime to generate
a regular microdescriptor consensus diff, whereas 'diff -d' only takes a
fifth of a second. That is largely due to the fact that we're not taking
advantage of the consensus properties yet.
So what I will do next is improve the algorithm, which should make it
very close to linear in time. The idea is to navigate the two
consensuses, looking at the digests that will already be sorted for us.
That's part of the 'r' line for descriptor consensuses, and the 'm' line
for microdescriptor consensuses. This should take linear time.
Now, the part of Hirschberg's algorithm that makes it quadratic in time
is finding 'k', the column at which the problem can be split in half
wihtout losing precision. This is well explained here:
http://wordaligned.org/articles/longest-common-subsequence
But once we are navigating the nodes ordered by digests, we can avoid
having to calculate 'k'. Given that we are at any pair of matching
digests, we can assume that this is already a point 'k' because all the
nodes that are before the digest on the first consensus can only be
found before the digest on the second consensus, and the same goes for
later digests.
So the final algorithm should travel both consensuses in linear time,
and use our quadratic time algorithm on the small chunks between these
matching digests. The quadratic time algorithm would at worst be used
for maybe a couple hundred lines, a size at which it still performs
fairly well.
You may be wondering why I went with an O(MN) algorithm when I could
have gone with the more common O(ND) algorithm, which I understand is
the one used in the GNU diffutils. The main reason is the use-case.
O(ND) performs worse than O(MN) when the differences are large, because
then D=2N. In our case, when we are running the algorithm it's because
nodes disappeared, appeared or simply to find changes in nodes that are
still there.
So it will be very common that we find diffs D that are very close in
size to N or M. Plus, as far as I understand, the divide-and-conquer
algorithm is pretty clean and easy to understand while the O(ND)
algorithm by Myers would probably be harder to implement in a clean way.
I'm in no way an expert in diff/lcs algorithms, so suggestions and
comments are very welcome.
-- 
Daniel Martí - mvdan@mvdan.cc - http://mvdan.cc/
PGP: A9DA 13CD F7A1 4ACD D3DE  E530 F4CA FFDB 4348 041C