On 8/13/13 3:17 PM, Kostas Jakeliunas wrote:
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing <karsten@torproject.org> wrote:
I suggest putting pg_prewarm on the future work list. I sense there's a lot of unused potential in stock PostgreSQL. Tweaking the database at this point has the word "premature optimization" written on it in big letters for me.
Also, to be very clear here, a tool that requires custom tweaks to PostgreSQL has minimal chances of running on torproject.org machines in the future. The current plan is that we'll have a dedicated database machine operated by our sysadmins that not even the service operator will have shell access to.
Oh, understood then, OK, no extensions (at least) for now.
Apropos: to my current (limited) understanding, it might be difficult to support, for example, nickname sub-string searches without a (supported, official) extension. One such extension is pg_trgm [1], which is in the contrib/ directory in 9.1 and is just one make install away. But for now, I'll assume this is not possible / that we should avoid it.
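To illustrate what I mean (only a sketch, assuming a descriptor table with a nickname column; the index name and nickname here are made up):

    -- pg_trgm lives in contrib/ in 9.1; once installed:
    CREATE EXTENSION pg_trgm;
    CREATE INDEX descriptor_nickname_trgm
        ON descriptor USING gin (lower(nickname) gin_trgm_ops);
    -- the trigram index can then speed up sub-string searches such as
    SELECT * FROM descriptor WHERE lower(nickname) LIKE '%moria%';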
Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more likely that we can run something requiring this extension on a torproject.org machine. Still, requiring extensions should be the last resort if no other solution can be found. Leaving out searches for nickname substrings is a valid solution for now.
So, why do you join descriptors and network statuses in the search process? At the Munich dev meeting I suggested joining the tables already in the import process. What do you think about that idea?
Yes, I made a half-hearted attempt to normalize the two tables some time ago, for a small number of descriptors and status entries. I'll try this scheme in full after I try something else; I'll need to re-import a major part of the data (which I didn't do then) to see whether it scales well.
Okay.
(Namely: a third table of unique fingerprints and (non-unique) nicknames, for truly quick fingerprint lookup and nickname search; the statusentry table currently holds only ~170K unique fingerprints vs. ~67M rows in total. I did experiment with this as well, but again only on a small subset of the overall data; I think I can do a better job now.)
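In rough SQL terms, the idea is something like this (a sketch only; the names and types are placeholders, not the actual schema):

    -- One row per unique fingerprint (~170K rows instead of ~67M), with the
    -- (non-unique) nickname kept alongside for nickname search.
    CREATE TABLE fingerprint (
        id          SERIAL PRIMARY KEY,
        fingerprint CHARACTER(40) NOT NULL UNIQUE,
        nickname    CHARACTER VARYING(19)  -- relay nicknames are at most 19 chars
    );
    CREATE INDEX fingerprint_lower_nickname ON fingerprint (lower(nickname));
    -- statusentry rows would then reference fingerprint(id) instead of
    -- repeating the full fingerprint string in every row.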
It had seemed to me that the bottleneck was having to sort too many rows, but I now understand the 'explain analyze' output a bit better and can see that the 'Nested Loop' node, which is what performs the join in the query discussed, is expensive and is part of the bottleneck, so to speak. So I'll look into that after properly benchmarking things with the third table.
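(For reference, this is roughly how I look at the plan; the query below is only an approximation of the one in misc/nested_join.sql, and the column names and join condition are assumptions:)

    -- Approximate shape of the join query, run through EXPLAIN to see where
    -- the time goes (the ANALYZE/BUFFERS options exist since PostgreSQL 9.0):
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT s.validafter, d.nickname
    FROM statusentry s
    JOIN descriptor d
      ON substr(s.fingerprint, 0, 12) = substr(d.fingerprint, 0, 12)
    WHERE lower(d.nickname) = 'moria1'
    ORDER BY s.validafter DESC;
    -- The 'Nested Loop' node in the output is the join itself; its actual
    -- time and row counts show how much of the total cost it accounts for.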
Sounds like another fine thing to test, I agree.
(By the way, for future reference: we do have to test different ideas on a substantial subset of the overall data, as things don't scale linearly, so to say.) :)
Good point. Yes, we should keep this in mind for the future.
https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
We use the following indexes while executing that query:
- lower(nickname) on descriptor
- (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
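In SQL terms they look roughly like this (a sketch; the index names here are made up):

    CREATE INDEX descriptor_lower_nickname
        ON descriptor (lower(nickname));
    -- only a prefix of the fingerprint and digest is indexed, to keep the
    -- index small enough to fit into memory whole:
    CREATE INDEX statusentry_fingerprint_digest_prefix
        ON statusentry (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12));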
Using only the first 12 characters sounds like a fine approach to speed things up. But why 12? Why not 10 or 14? This is probably something you should annotate as a parameter whose best value can be determined later in the process. (I'm not saying that 12 is a bad number. It's perfectly fine for now, but it might not be the best number.)
Yes, this is as unscientific as it gets. As of now, we're using a raw SQL query, but I'll be encapsulating the queries properly soon (so we can easily attach different WHERE clauses, etc.), at which point I'll make the prefix length a parameter. I did do some tests, but nothing extensive; I just made sure the indexes can fit into memory whole, which was the main constraint. Will do some tests.
Also, would it keep indexes smaller if you took something else than base16 encoding for fingerprints? What about base64? Or is there a binary type in PostgreSQL that works fine for indexes?
Re: the latter, there's no binary type for B-Trees (the default index type in pgsql) as far as I can see. But it's a good idea / approach, so I'll look into it, thanks! On the whole, though, as long as all the indexes occupy only a subset of pgsql's internal buffers, there shouldn't be a problem / that's not the problem, afaik. But if we're aiming for a well-researched ORM/database design, I should look into it.
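(One thing I want to check is whether an expression index over the decoded bytes would work, which would avoid changing the column type; only a sketch, and the index name is made up:)

    -- decode(text, 'hex') yields bytea; 20 raw bytes take half the space of
    -- 40 hex characters, so the index should shrink accordingly.
    CREATE INDEX statusentry_fingerprint_bin
        ON statusentry (decode(fingerprint, 'hex'));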
Do you have a list of searches you're planning to support?
These are the ones that should *really* be supported:
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [done some limited testing, currently not focusing on this]
The lookup parameter is basically the same as search=fingerprint with the additional requirement that fingerprint must be 40 characters long. So, this is the current search parameter.
I agree, these would be good to support.
You might also add another parameter ?address=address for ExoneraTor. That should, in theory, be just a subset of the search parameter.
- ?running=<boolean>
This one is tricky. So far, Onionoo looks only at the very latest consensus or bridge status to decide if a relay or bridge is running or not.
But now you're adding archives to Onionoo, so that people can search for a certain consensus or bridge status in the past, or for a time interval of consensuses or bridge statuses. How do you define whether a relay or bridge is running, or, more importantly, whether it should be included as not running?
- ?flag=flag [any clause that further narrows down the query is no bad thing; the current db model supports all the flags that Stem does, and each flag has its own column]
I'd say leave this one out until there's an actual use case.
- ?first_seen_days=range
- ?last_seen_days=range
As per the plan, the db should be able to return a list of status entries / valid-after ranges (which can be used for {first,last}_seen_days) given some fingerprint.
Oh, I think there's a misunderstanding of these two fields. These fields are only there to search for relays or bridges that have first appeared or were last seen on a given day.
You'll need two new parameters, say, from=datetime and to=datetime (or start=datetime and end=datetime) to define a valid-after range for your search.
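On the database side that would translate into something like this (only a sketch; the parameter and column names are made up):

    -- restrict results to statuses in a given valid-after range, with :from
    -- and :to coming from the from=/to= (or start=/end=) request parameters
    SELECT fingerprint, nickname, validafter
    FROM statusentry
    WHERE validafter >= :from
      AND validafter <= :to
    ORDER BY validafter DESC;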
Hope this makes sense.
All the best, Karsten