Onionoo protocol/implementation nuances / Onionoo-connected metrics project stuff - tor-dev

29 Jul 2013


      Hi Karsten,
(not sure whom to CC and whom not to, I have a couple of fairly specific
technical questions / comments (which (again) I should have delved into
earlier), but then again, maybe the scope of the tor-dev mailing list
includes such cases..)
@tor-dev: This is in regard to the Searchable metrics archive project,
intersected with Onionoo stuff.
I originally wanted to ask two questions, but by the time I reached the
middle of the email, I wasn't anymore sure if they were questions, so I
think these are simply my comments / observations about a couple of
specific points, and I just want to make sure I'm making general sense. :)
..So it turns out that it is very much possible to avoid Postgres
sequential scans - to construct such queries/cases which are best executed
using indexes (the query planner thinks so, and it makes sense as far as I
can tell) and whatever efficient (hm, < O(n)?) search algorithms are deemed
best - for the metrics search backend/database; the two things potentially
problematic to me seem to be:
- when doing ORDER BY, making sure that the resulting SELECT covers /
would potentially return a relatively small amount of rows (from what I've
been reading / trying out, whenever a SELECT happens which may cover >
~10-15% of all the table's rows, sequential scan is preferred, as using
indexes would result in even more disk i/o, whereas the seq.scan can read
in massive(r) chunks of data from disk because it's, well, sequential).
This means that doing "ORDER BY nickname" (or fingerprint, or any other
non-unique but covering-a-relatively-small-part-of-all-the-table column in
the network status table) is much faster than doing e.g. "ORDER BY
validafter" (where validafter refers to the consensus document's "valid
after" field) - which makes sense, of course, when you think about it
(ordering a massive amount of rows, even with proper LIMIT etc. is insane.)
We construct separate indexes (e.g. even if 'fingerprint' is part of a
composite (validafter + fingerprint) primary key), and all seems to be well.
I've been looking into the Onionoo implementation, particularly into
ResourceServlet.java [1], to see how ordering etc. is done there. I'm new
to that code, but am I right in saying that as of now, the results (at
least for /summary and /details) are generally fairly unsorted (except for
the possibility of "order by consensus_weight"), with "valid_after" fields
appearing in an unordered manner? This of course makes sense in this case,
as Onionoo is able to return *all* the results for given search criteria
(or, if none given, all available results) at once.
The obvious problem with the archival metrics search project (argh, we are
in need of a decent name I daresay :) maybe I'll think of something.. no
matter) is then, of course, the fact that we can't return all results at
once. I've been so far assuming that it would therefore make sense to
return them, whenever possible (and if not requested otherwise, if "order"
parameter is later implemented), in "valid_after" descending order. I
suppose this makes sense? This would be ideal, methinks. So far, it seems
that we can do that, as long as we have a WHERE clause that is
restricting-enough.
select * from (select distinct on (fingerprint) fingerprint, validafter,
...
nickname from statusentry where nickname like 'moria1' order by
fingerprint, validafter desc) as subq order by validafter desc limit 100;
, for example, works out nicely, in terms of efficiency / query plan,
postgres tells me. (The double ORDER BY is needed, as postgres' DISTINCT
needs an ORDER BY, and that ORDER BY's leftmost criterion has to match
DISTINCT ON (x))
Now, you mentioned / we talked about what is the (large, archival metrics)
backend to do about limiting/cutoff? Especially if there are no search
criteria specified, for example. Ideally, it may return a top-most list of
statuses (which in /details include info from server descriptors), sorted
by "last seen" / "valid after"? Thing is, querying a database for 100 (or
1000) of items with no ORDER BY Is really cheap; introducing ORDER BYs
which would still produce tons of results is considerably less so. I'm now
looking into this (and you did tell me this, i.e. that, I now think, a
large part of DB/backend robustness gets tested at these particular points;
this should have been more obvious to me).
But in any case, I have the question: what *should* the backend return
(when no search parameters/filters specified, or very loose ones (nickname
LIKE "a") are?
What would be cheap / doable: ORDER BY fingerprint (or digest (== hashed
descriptor), etc.) It *might* (well, should) work to order by fingerprint,
limit results, and *then* reorder by validafter - with no guarantee that
the topmost results would be with highest absolute validafters. I mean,
Onionoo is doing this kind of reordering / limiting itself, but it makes
sense as it can handle all the data at once (or I'm missing something, i.e.
the file I linked to already interacts with a subset of data; granted, I
haven't thoroughly read through.) In our/this case, it makes sense to try
and outsource all relational logic to ORM (but of course if it turns out we
can cheaply get 100 arbitrary results and more easily juggle with them
ourselves / on the (direct) python side, then sure.)
But would such arbitrary returned results make sense? It would look just
like Onionoo results, but - a (small) subset of them.
Ramble #2: Onionoo is doing more or less direct (well, compiled, so
efficient) regexps on fingerprint, nickname, etc. By default, "LIKE
%given_nickname_part%" is again (sequentially) expensive; Postgres does
offer full text search extensions (would need to build additional tables,
etc.), and I think this makes sense to be done; it would cover all our "can
supply a subset of :param" bases. I'll see into this. (It's also possible
to construct functional(istic) indexes, e.g. with regexp, but need to know
the template - I wonder if using Onionoo's regexpressions would work / make
sense - will see.) For now, LIKE/= exact_string is very nice, but of course
the whole idea is that it'd be possible to supply substrings. Of note is
the fact that in this case, LIKE %substring% is O(n) in the sense that
query time correlates with row count, afaict. As of now, full text search
extensions would solve the problem I think, even if they may look a bit
like an overkill at first.
End of an email without a proper direction/point. :)
[1]:
https://gitweb.torproject.org/onionoo.git/blob/HEAD:/src/org/torproject/onio...