On 8/13/13 3:17 PM, Kostas Jakeliunas wrote:
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing <karsten@torproject.org> wrote:
I suggest putting pg_prewarm on the future work list. I sense there's a lot of unused potential in stock PostgreSQL. Tweaking the database at this point has the word "premature optimization" written on it in big letters for me.
Also, to be very clear here, a tool that requires custom tweaks to PostgreSQL has minimal chances of running on torproject.org machines in the future. The current plan is that we'll have a dedicated database machine operated by our sysadmins that not even the service operator will have shell access to.
Oh, understood then, OK, no extensions (at least) for now.
Apropos: to my current (limited) understanding, it might be difficult to support, for example, nickname sub-string searches without a (supported, official) extension. One such extension is pg_trgm [1], which is in the contrib/ directory in 9.1 and is just one make install away. But for now, I'll assume this is not possible / that we should avoid it.
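To illustrate what I mean (only a sketch, assuming a descriptor table with a nickname column; the index name and nickname here are made up):

    -- pg_trgm lives in contrib/ in 9.1; once installed:
    CREATE EXTENSION pg_trgm;
    CREATE INDEX descriptor_nickname_trgm
        ON descriptor USING gin (lower(nickname) gin_trgm_ops);
    -- the trigram index can then speed up sub-string searches such as
    SELECT * FROM descriptor WHERE lower(nickname) LIKE '%moria%';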
Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more likely that we can run something requiring this extension on a torproject.org machine. Still, requiring extensions should be the last resort if no other solution can be found. Leaving out searches for nickname substrings is a valid solution for now.
So, why do you join descriptors and network statuses in the search process? At the Munich dev meeting I suggested joining the tables already in the import process. What do you think about that idea?
Yes, I made a half-hearted attempt to normalize the two tables some time ago, for a small number of descriptors and status entries. I'll try this scheme in full after I try something else; I'll need to re-import a major part of the data (which I didn't do then) to see whether it scales well.
Okay.
(Namely: a third table of unique fingerprints and (non-unique) nicknames, for truly quick fingerprint lookup and nickname search; the statusentry table currently holds only ~170K unique fingerprints vs. ~67M rows in total. I did experiment with this as well, but again only on a small subset of the overall data; I think I can do a better job now.)
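In rough SQL terms, the idea is something like this (a sketch only; the names and types are placeholders, not the actual schema):

    -- One row per unique fingerprint (~170K rows instead of ~67M), with the
    -- (non-unique) nickname kept alongside for nickname search.
    CREATE TABLE fingerprint (
        id          SERIAL PRIMARY KEY,
        fingerprint CHARACTER(40) NOT NULL UNIQUE,
        nickname    CHARACTER VARYING(19)  -- relay nicknames are at most 19 chars
    );
    CREATE INDEX fingerprint_lower_nickname ON fingerprint (lower(nickname));
    -- statusentry rows would then reference fingerprint(id) instead of
    -- repeating the full fingerprint string in every row.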
It had seemed to me that the bottleneck was having to sort too many rows, but I now understand the 'explain analyze' output a bit better and can see that the 'Nested Loop' node, which is what performs the join in the query discussed, is expensive and is part of the bottleneck, so to speak. So I'll look into that after properly benchmarking things with the third table.
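(For reference, this is roughly how I look at the plan; the query below is only an approximation of the one in misc/nested_join.sql, and the column names and join condition are assumptions:)

    -- Approximate shape of the join query, run through EXPLAIN to see where
    -- the time goes (the ANALYZE/BUFFERS options exist since PostgreSQL 9.0):
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT s.validafter, d.nickname
    FROM statusentry s
    JOIN descriptor d
      ON substr(s.fingerprint, 0, 12) = substr(d.fingerprint, 0, 12)
    WHERE lower(d.nickname) = 'moria1'
    ORDER BY s.validafter DESC;
    -- The 'Nested Loop' node in the output is the join itself; its actual
    -- time and row counts show how much of the total cost it accounts for.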
Sounds like another fine thing to test, I agree.
(By the way, for future reference: we do have to test different ideas on a substantial subset of the overall data, as things don't scale linearly, so to say.) :)
Good point. Yes, we should keep this in mind for the future.
https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
We use the following indexes while executing that query:
- lower(nickname) on descriptor
- (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
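In SQL terms they look roughly like this (a sketch; the index names here are made up):

    CREATE INDEX descriptor_lower_nickname
        ON descriptor (lower(nickname));
    -- only a prefix of the fingerprint and digest is indexed, to keep the
    -- index small enough to fit into memory whole:
    CREATE INDEX statusentry_fingerprint_digest_prefix
        ON statusentry (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12));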
Using only the first 12 characters sounds like a fine approach to speed things up. But why 12? Why not 10 or 14? This is probably something you should annotate as a parameter whose best value can be determined later in the process. (I'm not saying that 12 is a bad number. It's perfectly fine for now, but it might not be the best number.)
Yes, this is as unscientific as it gets. As of now, we're using a raw SQL query, but I'll be encapsulating the queries properly soon (so we can easily attach different WHERE clauses, etc.), at which point I'll make the prefix length a parameter. I did do some tests, but nothing extensive; I just made sure the indexes can fit into memory whole, which was the main constraint. Will do some tests.
Also, would it keep indexes smaller if you took something else than base16 encoding for fingerprints? What about base64? Or is there a binary type in PostgreSQL that works fine for indexes?
Re: the latter, there's no binary type for B-Trees (the default index type in pgsql) as far as I can see. But it's a good idea / approach, so I'll look into it, thanks! On the whole, though, as long as all the indexes occupy only a subset of pgsql's internal buffers, there shouldn't be a problem / that's not the problem, afaik. But if we're aiming for a well-researched ORM/database design, I should look into it.
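(One thing I want to check is whether an expression index over the decoded bytes would work, which would avoid changing the column type; only a sketch, and the index name is made up:)

    -- decode(text, 'hex') yields bytea; 20 raw bytes take half the space of
    -- 40 hex characters, so the index should shrink accordingly.
    CREATE INDEX statusentry_fingerprint_bin
        ON statusentry (decode(fingerprint, 'hex'));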
Do you have a list of searches you're planning to support?
These are the ones that should *really* be supported:
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [done some limited testing, currently not focusing on this]
The lookup parameter is basically the same as search=fingerprint with the additional requirement that fingerprint must be 40 characters long. So, this is the current search parameter.
I agree, these would be good to support.
You might also add another parameter ?address=address for ExoneraTor. That should, in theory, be just a subset of the search parameter.
- ?running=<boolean>
This one is tricky. So far, Onionoo looks only at the very latest consensus or bridge status to decide if a relay or bridge is running or not.
But now you're adding archives to Onionoo, so that people can search for a certain consensus or bridge status in the past, or for a time interval of consensuses or bridge statuses. How do you define whether a relay or bridge is running, or, more importantly, whether it should be included as not running?
- ?flag=flag [any clause that further narrows down the query is no bad thing; the current db model supports all the flags that Stem does, and each flag has its own column]
I'd say leave this one out until there's an actual use case.
- ?first_seen_days=range
- ?last_seen_days=range
As per the plan, the db should be able to return a list of status entries / valid-after ranges (which can be used for {first,last}_seen_days) given some fingerprint.
Oh, I think there's a misunderstanding of these two fields. These fields are only there to search for relays or bridges that have first appeared or were last seen on a given day.
You'll need two new parameters, say, from=datetime and to=datetime (or start=datetime and end=datetime) to define a valid-after range for your search.
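On the database side that would translate into something like this (only a sketch; the parameter and column names are made up):

    -- restrict results to statuses in a given valid-after range, with :from
    -- and :to coming from the from=/to= (or start=/end=) request parameters
    SELECT fingerprint, nickname, validafter
    FROM statusentry
    WHERE validafter >= :from
      AND validafter <= :to
    ORDER BY validafter DESC;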
Hope this makes sense.
All the best, Karsten