Hello,
another busy benchmarking + profiling period for database querying, but this time more rigorous and awesome.
* wrote a generic query analyzer which logs query statements and their EXPLAIN ANALYZE output, and spots and reports queries that yield inefficient query plans;
* wrote a very simple but rather exhaustive profiler (using Python's cProfile) which logs query times, function calls, etc.; its output is used to see which parts of, e.g., the backend are slow during API calls, and it can easily be used to construct a general query 'profile' for a particular database; [1]
* benchmarked lots of different queries using these tools, recorded query times, and was able to observe deviations/discrepancies;
* uploaded the whole database and briefly benchmarked it on an Amazon EC2 m2.2xlarge instance;
* concluded that, provided there is enough memory to cache *and hold* the indexes in cache, query times are good;
* in particular, tested the following query scheme extensively: [2] (see the comments there as well if curious); concluded that it runs well;
* opted for testing raw SQL queries (from within Flask/Python) - so far, translating them into ORM queries (while being careful) resulted in degraded performance; if we have to end up using raw SQL, I will create a way to encapsulate the queries nicely;
* made sure data importing is not slowed down and remains a quick-enough procedure;
* researched PostgreSQL internals, especially its two-layer caching; I now have an understanding of the way pgsql caches things in memory, how statistics on index usage are gathered and used for maintaining the buffer cache, etc. The searchable metrics archive would work best when all of its indexes are kept in memory.
* to this end, looked into buffer cache hibernation [3], etc.; I think pg_prewarm [4, 5] would serve our purpose well. (Apparently many business/etc. solutions do find cache prewarming relevant - a pity it's not supported in stock PostgreSQL.)
The latter means that
* I don't think we can avoid using certain PostgreSQL extensions (if only one) - which means that deploying will always take more than apt-get && pip install, but I believe it is needed;
* next on my agenda is testing pg_prewarm on EC2 and, hopefully, putting our beloved database bottleneck problem to rest.
I planned to expose the EC2 instance for public tor-dev inquiry (and ended up delaying the status report yet again), but I'll have to do that separately. It is still possible, though. Sorry for the delayed report.
##
More generally,
I'm happy with my queer queries [2] now; the two constraints/goals of
* being able to run Onionoo-like queries on the whole descriptor / status entry database
* being able to get a list of status entries for a particular relay
will hopefully be put to rest very soon. The former is done, provided I have no trouble setting up a database index precaching system (which will ensure that all queries of the same syntax/scheme run quickly enough).
Overall, I'm spending a bit too much time on a specific problem, but at least I have a more intimate lower-level knowledge of PostgreSQL, which turns out to be very relevant to this project. I hope to be able to soon move to extending Onionoo support and providing a clean API for getting lists of consensuses in which a particular relay was present. And maybe start with the frontend. :)
Kostas.
[1]: https://github.com/wfn/torsearch/commit/8e6f16a07c40f7806e98e9c71c1ce0f8e384...
[2]: https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
[3]: http://postgresql.1045698.n5.nabble.com/patch-for-new-feature-Buffer-Cache-H...
[4]: http://www.postgresql.org/message-id/CA+TgmobRrRxCO+t6gcQrw_dJw+Uf9ZEdwf9beJ...
[5]: http://raghavt.blogspot.com/2012/04/caching-in-postgresql.html
On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
- I don't think we can avoid using certain postgresql extensions (if only one) - which means that deploying will always take more than apt-get && pip install, but I believe it is needed;
Can you give an example of a query that won't be executed efficiently without this extension and just fine with it? Maybe we can tweak that query somehow so it works fine on a vanilla PostgreSQL. Happy to give that some thoughts.
I'd really want to avoid using stuff that is not in Debian. Or rather, if we really need to add non-standard extensions, we need more than thinking and believing that it's unavoidable. :)
Overall, I'm spending a bit too much time on a specific problem, but at least I have a more intimate lower-level knowledge of PostgreSQL, which turns out to be very relevant to this project. I hope to be able to soon move to extending Onionoo support and providing a clean API for getting lists of consensuses in which a particular relay was present. And maybe start with the frontend. :)
Sounds like a plan, except for the frontend part. Scratch that, I'd say, and focus on the API. Making it shiny can come once everything works as expected.
Best, Karsten
Karsten,
this won't be a very short email, but I honestly swear I did revise it a couple of times. :) This is not urgent by any measure, so whenever you find time to reply will be fine. ctrl+f to "observe:" for some precise data / support for my plan re: using the pg_prewarm extension.
On Mon, Aug 12, 2013 at 2:16 PM, Karsten Loesing karsten@torproject.org wrote:
On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
- I don't think we can avoid using certain postgresql extensions (if only one) - which means that deploying will always take more than apt-get && pip install, but I believe it is needed;
Can you give an example of a query that won't be executed efficiently without this extension and just fine with it? Maybe we can tweak that query somehow so it works fine on a vanilla PostgreSQL. Happy to give that some thoughts.
I'd really want to avoid using stuff that is not in Debian. Or rather, if we really need to add non-standard extensions, we need more than thinking and believing that it's unavoidable. :)
First off, the general idea. I know this might not sound convincing (see below re: this), but any query that uses an index will take significantly longer to execute if it needs to load parts of the index from disk. More precisely, query time deviation and max(query_time) correlate inversely with the percentage of the index in question that is in memory. The larger the index, the more difficult it is to 'prep' it into cache, and the more unpredictable query execution time gets.
Take a look at the query used to join descriptors and network statuses given some nickname (could be any other criterion, e.g. fingerprint or IP address):
https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
We use the following indexes while executing that query:
* lower(nickname) on descriptor
* (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry (this one is used to efficiently join the descriptor table with statusentry: a (fingerprint, descriptor) pair is completely unique in the descriptor table, and it is fairly unique in the statusentry table (whereas a particular fingerprint usually has lots and lots of rows in statusentry)); this index uses only substrings because otherwise it would hog memory on my remote development machine (not EC2), leaving not much for other indexes; even so, this composite substring index still takes ~2.5GB for status entries (only) in the range [2010-01; 2013-05] as of now
* validafter on statusentry (the latter *must* stay in memory, as we use it elsewhere as well; for example, when not given a particular search criterion, we want to return a list of status entries (with distinct fingerprints) sorted by consensus validafter in descending order)
We also want to keep a fingerprint index on the descriptor table because we want to be able to search / look up by fingerprint.
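For concreteness, these roughly correspond to definitions along the following lines (the index names are mine, and the exact column names may differ slightly from the actual schema):

    CREATE INDEX descriptor_lower_nickname_idx
        ON descriptor (lower(nickname));

    -- composite substring index used to join descriptor with statusentry:
    CREATE INDEX statusentry_substr_fp_digest_idx
        ON statusentry (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12));

    CREATE INDEX statusentry_validafter_idx
        ON statusentry (validafter);

    -- for direct fingerprint lookup / search:
    CREATE INDEX descriptor_fingerprint_idx
        ON descriptor (fingerprint);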
I'm thinking of a way to demonstrate the efficiency of having the whole index in memory. For now, let me summarize what I have observed, restricted to what is relevant now: running the aforementioned query on some nickname that we haven't queried for since the last restart of postgresql, it might take, on average, about 1.5 to 3 seconds to execute on EC2, and considerably longer on my development db if it is a truly popular nickname (otherwise, more or less the same amount of time); sometimes a bit longer - up to ~4s (ideally it should be rather uniform since the indexes are *balanced* trees, but... - and autovacuum is enabled).
Running that same query later on (after we've run other queries after that first one), it will take <= 160ms to execute and return results (this is a conservative number; usually it's much faster - see below). Running EXPLAIN (ANALYZE, BUFFERS) shows that there was no "[disk] read" next to the index operations - only "buffer hit". This means that there was no need to read from disk during all the sorting; only when we knew which rows to return did we need to actually read them from disk. (There are some nuances, but at least this will be true for PostgreSQL >= 9.2 [1], which I haven't tried yet - there might be some pleasant surprises re: query time. Last I checked, the Debian stable (wheezy) repository contains postgresql 9.1.9.)
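To make the "buffer hit" vs. "read" distinction concrete, this is the kind of check I run (the query here is a simplified stand-in, not the full nested join):

    -- report buffer usage alongside the plan and timings:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT fingerprint, nickname
    FROM descriptor
    WHERE lower(nickname) = 'moria2';

    -- in the output, "Buffers: shared hit=N" means the needed index/heap pages
    -- were already in shared_buffers; "read=N" means they had to be fetched
    -- from the OS cache or disk, which is where the large and unpredictable
    -- latencies come from.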
Observe:
1a. Run that query looking for 'moria2' for the first time since postgresql restart - relay is an old one, only one distinct fingerprint, relatively few status entries: http://sprunge.us/cEGh
1b. Run that same query later on: http://sprunge.us/jiPg (notice: no reads, only hits; notice query time)
2a. Run query on 'gabelmoo' (a ton of status entries) for the first time (development machine, query time is rather insane indeed): http://sprunge.us/fQEK
2b. Run that same query on 'gabelmoo' later on: http://sprunge.us/fDDV
PostgreSQL is rather clever: it will keep the more often used parts of indexes in cache. What pg_prewarm does is simply:
* load all the indexes (or the ones critical for us) into memory, and load them whole, which is possible given a large enough cache;
* load them into shared_buffers, which is pgsql's internal cache. pgsql uses two layers of cache - its internal one (the memory for which is preallocated on daemon start) and the OS cache; the latter is faster, but pgsql has not much weight in determining what gets kept there and what gets flushed out.
If the shared_buffers value is large enough (say, 8GB) and all our relevant indexes fit into it nicely, postgresql does not flush them out from its internal cache - there's no need for it.
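Concretely, what I have in mind is something like the following (a sketch only; it assumes the patched extension exposes a pg_prewarm(relation, mode) function as described in [2], and the index names are the hypothetical ones from above):

    -- one-time setup after installing the patched extension:
    CREATE EXTENSION pg_prewarm;

    -- after a restart, load the critical indexes whole into PostgreSQL's own
    -- shared_buffers (not just the OS page cache); this assumes shared_buffers
    -- is set large enough in postgresql.conf to hold them all:
    SELECT pg_prewarm('statusentry_validafter_idx', 'buffer');
    SELECT pg_prewarm('statusentry_substr_fp_digest_idx', 'buffer');
    SELECT pg_prewarm('descriptor_lower_nickname_idx', 'buffer');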
The cached indexes make a lot of sense when we make very non-discriminatory queries. What I mean is, if we query for status entries WHERE lower(nickname) = 'default' or 'unnamed', lots of index scanning will take place. EC2 executes such queries very nicely if the part of lower(nickname) index for, e.g., 'default' is loaded into cache. For that, we need enough memory and we need the indexes to be loaded into it. It doesn't happen automatically.
It might be possible to manually prepare the indexes by extracting, for example, the most common nicknames, and then simply running queries on them, thus pushing those parts of indexes into memory. As far as I understand, however, there is no guarantee that those parts will be put into the internal shared_buffers cache, which means that they might get pushed out of cache by the underlying OS depending on $some_uncontrollable_factor. This is my present understanding, but I can look into this more if there is a need for it.
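For example (hypothetical, and with the caveat above that nothing guarantees the pages end up in shared_buffers rather than just the OS cache):

    -- touch the parts of the lower(nickname) index covering the most common
    -- nicknames, so they get read into cache:
    SELECT count(*) FROM descriptor WHERE lower(nickname) = 'default';
    SELECT count(*) FROM descriptor WHERE lower(nickname) = 'unnamed';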
Regarding the particular query at hand, I will agree it does look hairy. See the comments there - I can try to rephrase them and format them into a more readable form, maybe? Looking up a particular nickname in the statusentry table is an expensive operation because, well, there are so many rows featuring it (usually). The descriptor table is much smaller. Removing the inner DISTINCT ON (descriptor.fingerprint) will result in a less hairy-looking EXPLAIN, but will take considerably longer to execute. So first, we narrow down the fingerprints using the descriptor table and extract a list of unique fingerprints (this particular operation is not expensive), then get a list of status entries featuring those (fingerprint, descriptor) pairs, and again extract unique fingerprints (there are multiple status entries for a particular (fp, descriptor) pair, of course), sorting by validafter desc.
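In outline, the shape of the query is something like this (a simplified sketch of the idea, not the literal contents of nested_join.sql; table and column names are approximate):

    SELECT *
    FROM (
        SELECT DISTINCT ON (substr(s.fingerprint, 0, 12))
               s.validafter, s.nickname, s.fingerprint
        FROM statusentry s
        JOIN (
            -- step 1: use the small descriptor table to narrow things down to
            -- a handful of unique (fingerprint, descriptor digest) pairs:
            SELECT DISTINCT ON (d.fingerprint)
                   substr(d.fingerprint, 0, 12)   AS fp12,
                   substr(lower(d.digest), 0, 12) AS digest12
            FROM descriptor d
            WHERE lower(d.nickname) = 'gabelmoo'
        ) uniq
          -- step 2: pull in the matching status entries via the composite
          -- substring index on statusentry:
          ON substr(s.fingerprint, 0, 12) = uniq.fp12
         AND substr(lower(s.digest), 0, 12) = uniq.digest12
        -- step 3: collapse the many status entries per relay down to one row
        -- per fingerprint:
        ORDER BY substr(s.fingerprint, 0, 12), s.validafter DESC
    ) latest
    ORDER BY validafter DESC;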
Regarding installing the extension in question / preparing a Debian instance for this whole task: from a sysadmin perspective, the whole preparation can be reduced to a single chef script / makefile / whatnot; it will be a single command. But I understand that using additional things means a larger potential edge-case and maintenance surface, so to speak. Note that the pg_prewarm patch in question [2] is a rather clean extension in the sense that it doesn't patch 'into' any postgresql core files; it creates new files and adds itself to an extension listing - standard stuff for extensions. Still, if this sounds too risky (re: future integration/support/maintenance), I completely understand.
I plan to run some stuff on EC2 again soon, and install pg_prewarm there, and record query times, EXPLAIN BUFFERS, etc. Let me know what I should do to further prove/disprove my point.
Overall, I'm spending a bit too much time on a specific problem, but at
least I have a more intimate lower-level knowledge of PostgreSQL, which turns out to be very relevant to this project. I hope to be able to soon move to extending Onionoo support and providing a clean API for getting lists of consensuses in which a particular relay was present. And maybe start with the frontend. :)
Sounds like a plan, except for the frontend part. Scratch that, I'd say, and focus on the API. Making it shiny can come once everything works as expected.
Understood and makes sense, OK. I won't start thinking about the frontend until there's a jolly awesome backend running.
Cheers Kostas.
[1]: http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.2#Index-only_sc...
[2]: http://www.postgresql.org/message-id/attachment/12593/pg_prewarm_v1.patch
On 8/12/13 2:58 PM, Kostas Jakeliunas wrote:
Karsten,
this won't be a very short email, but I honestly swear I did revise it a couple of times. :)
Okay. I'm putting in feedback below where I think it makes most sense.
This is not urgent by any measure, so whenever you find time to reply will be fine. ctrl+f to "observe:" for some precise data / support for my plan re: using the pg_prewarm extension.
On Mon, Aug 12, 2013 at 2:16 PM, Karsten Loesing karsten@torproject.org wrote:
On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
- I don't think we can avoid using certain postgresql extensions (if only one) - which means that deploying will always take more than apt-get && pip install, but I believe it is needed;
Can you give an example of a query that won't be executed efficiently without this extension and just fine with it? Maybe we can tweak that query somehow so it works fine on a vanilla PostgreSQL. Happy to give that some thoughts.
I'd really want to avoid using stuff that is not in Debian. Or rather, if we really need to add non-standard extensions, we need more than thinking and believing that it's unavoidable. :)
First off, the general idea. I know this might not sound convincing (see below re: this), but any query that uses an index will take significantly longer to execute if it needs to load parts of the index from disk. More precisely, query time deviation and max(query_time) inversely correlates with the percentage of the index in question in memory. The larger the index, the more difficult it is to 'prep' it into cache, the more unpredictable query exec time gets.
Take a look at the query used to join descriptors and network statuses given some nickname (could be any other criterion, e.g. fingerprint or IP address):
So, why do you join descriptors and network statuses in the search process? At the Munich dev meeting I suggested joining the tables already in the import process. What do you think about that idea?
https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
We use the following indexes while executing that query:
lower(nickname) on descriptor
(substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
Using only the first 12 characters sounds like a fine approach to speed up things. But why 12? Why not 10 or 14? This is probably something you should annotate as parameter to find a good value for later in the process. (I'm not saying that 12 is a bad number. It's perfectly fine for now, but it might not be the best number.)
Also, would it keep indexes smaller if you took something else than base16 encoding for fingerprints? What about base64? Or is there a binary type in PostgreSQL that works fine for indexes?
(this one is used to efficiently join descriptor table with statusentry: (fingerprint, descriptor) pair is completely unique in the descriptor table, and it is fairly unique in the statusentry table (whereas a particular fingerprint usually has lots and lots of rows in statusentry)); this index uses only substrings because otherwise, it will hog memory on my remote development machine (not EC2), leaving not much for other indexes; this composite substring index still takes ~2.5GB for status entries (only) in the range between [2010-01; 2013-05] as of now
- validafter on statusentry (the latter *must* stay in memory, as we use
it elsewhere as well; for example, when not given a particular search criterion, we want to return a list of status entries (with distinct fingerprints) sorted by consensus validafter in descending order)
We also want to keep a fingerprint index on the descriptor table because we want to be able to search / look up by fingerprint.
Do you have a list of searches you're planning to support?
I'm thinking of a way to demonstrate the efficiency of having the whole index in memory. For now, let me summarize what I have observed, intersect with what is relevant now: running the aforementioned query on some nickname that we haven't queried for since the last restart of postgresql, it might take, on average, about 1.5 to 3 seconds to execute on EC2, and considerably longer on my development db if it is a truly popular nickname (otherwise, more or less the same amount of time); sometimes a bit longer - up to ~4s (ideally it should be rather uniform since the indexes are *balanced* trees, but.. and autovacuum is enabled.)
Running that same query later on (after we've run other queries after that first one), it will take <= 160ms to execute and return results (this is a conservative number, usually it's much faster (see below)). Running EXPLAIN (ANALYZE, BUFFERS) shows that what happened was that there was no "[disk] read" next to index operations - only "buffer hit". This means that there was no need to read from disk during all the sorting - only when we knew which rows to return did we need to actually read them from disk. (There are some nuances, but at least this will be true for PostgreSQL >= 9.2 [1], which I haven't tried yet - there might be some pleasant surprises re: query time. Last I checked, Debian 9.0 repository contains postgresql 9.1.9.)
I doubt that PostgreSQL >= 9.2 will magically make things faster that were slow in 9.1.9. Please stick with what's in Wheezy.
Observe:
1a. Run that query looking for 'moria2' for the first time since postgresql restart - relay is an old one, only one distinct fingerprint, relatively few status entries: http://sprunge.us/cEGh
1b. Run that same query later on: http://sprunge.us/jiPg (notice: no reads, only hits; notice query time)
2a. Run query on 'gabelmoo' (a ton of status entries) for the first time (development machine, query time is rather insane indeed): http://sprunge.us/fQEK
2b. Run that same query on 'gabelmoo' later on: http://sprunge.us/fDDV
PostgreSQL is rather clever: it will keep the parts of indexes more often used in cache. What pg_prewarm simply does is:
- load all (or critical for us) indexes to memory (and load them whole),
which is possible given a large enough cache;
- load them to shared_buffers, which is an internal pgsql cache. pgsql
uses two types of cache - its internal one (the memory for which is preallocated on daemon start), and OS cache - the latter is faster, but pgsql has not much weight in determining what gets kept there, and what gets flushed out.
If the shared_buffers value is large enough (say, 8GB) and all our relevant indexes fit into it nicely, postgresql does not flush them out from its internal cache - there's no need for it.
The cached indexes make a lot of sense when we make very non-discriminatory queries. What I mean is, if we query for status entries WHERE lower(nickname) = 'default' or 'unnamed', lots of index scanning will take place. EC2 executes such queries very nicely if the part of lower(nickname) index for, e.g., 'default' is loaded into cache. For that, we need enough memory and we need the indexes to be loaded into it. It doesn't happen automatically.
It might be possible to manually prepare the indexes by extracting, for example, the most common nicknames, and then simply running queries on them, thus pushing those parts of indexes into memory. As far as I understand, however, there is no guarantee that those parts will be put into the internal shared_buffers cache, which means that they might get pushed out of cache by the underlying OS depending on $some_uncontrollable_factor. This is my present understanding, but I can look into this more if there is a need for it.
Regarding the particular query at hand, I will agree it does look hairy. See the comments there - I can try to rephrase them and format them into a more readable form maybe? Looking up a particular nickname on the statusentry table is an expensive operation because, well, there are so many rows featuring it (usually). The descriptor table is much smaller. Removing the inner DISTINCT ON (descriptor.fingerprint) will result in a less hairy-looking EXPLAIN, but will take considerably longer to execute. So first, we narrow down the fingerprints using the descriptor table, extract a list of unique fingerprints (this particular operation is not expensive), then get a list of status entries featuring (fingerprint, descriptor) pairs, and again extract unique fingerprints (there multiple status entries to a particular (fp, descriptor) pair of course), sorting by validafter desc.
Regarding installing the extension in question / preparing a debian instance for this whole task, from a sysadmin perspective, the whole preparation can be reduced to a single chef script / makefile / whatnot; it will be a single command. But I understand that using additional things means an increased potential edge case and maintenance surface, so to speak. Note that the patch for pg_prewarm in question [2] is rather clean extension in the sense that it doesn't patch 'into' any postgresql core files; it creates new files and adds itself into an extension listing; standard stuff for extensions. Still, if this is something that sounds too risky (re: future integration/support/maintenance), I completely understand.
I plan to run some stuff on EC2 again soon, and install pg_prewarm there, and record query times, EXPLAIN BUFFERS, etc. Let me know what I should do to further prove/disprove my point.
I suggest putting pg_prewarm on the future work list. I sense there's a lot of unused potential in stock PostgreSQL. Tweaking the database at this point has the word "premature optimization" written on it in big letters for me.
Also, to be very clear here, a tool that requires custom tweaks to PostgreSQL has minimal chances of running on torproject.org machines in the future. The current plan is that we'll have a dedicated database machine operated by our sysadmins that not even the service operator will have shell access to.
Overall, I'm spending a bit too much time on a specific problem, but at
least I have a more intimate lower-level knowledge of PostgreSQL, which turns out to be very relevant to this project. I hope to be able to soon move to extending Onionoo support and providing a clean API for getting lists of consensuses in which a particular relay was present. And maybe start with the frontend. :)
Sounds like a plan, except for the frontend part. Scratch that, I'd say, and focus on the API. Making it shiny can come once everything works as expected.
Understood and makes sense, OK. I won't start thinking about the frontend until there's a jolly awesome backend running.
Sounds good.
All the best, Karsten
Cheers Kostas.
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing karsten@torproject.org wrote:
I suggest putting pg_prewarm on the future work list. I sense there's a lot of unused potential in stock PostgreSQL. Tweaking the database at this point has the word "premature optimization" written on it in big letters for me.
Also, to be very clear here, a tool that requires custom tweaks to PostgreSQL has minimal chances of running on torproject.org machines in the future. The current plan is that we'll have a dedicated database machine operated by our sysadmins that not even the service operator will have shell access to.
Oh, understood then, OK, no extensions (at least) for now.
Apropos: to my current (limited) understanding, it might be difficult to support, for example, nickname substring searches without a (supported, official) extension. One such extension is pg_trgm [1], which is in the contrib/ directory in 9.1, and is just one make install away. But for now, I'll assume this is not possible / that we should avoid it.
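(For future reference, the pg_trgm route would look roughly like this; the index name is made up:)

    -- needs postgresql-contrib; then:
    CREATE EXTENSION pg_trgm;

    -- trigram index over lowercased nicknames:
    CREATE INDEX descriptor_nickname_trgm_idx
        ON descriptor USING gin (lower(nickname) gin_trgm_ops);

    -- ...which would let substring searches like this one use the index:
    SELECT fingerprint, nickname
    FROM descriptor
    WHERE lower(nickname) LIKE '%moria%';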
So, why do you join descriptors and network statuses in the search
process? At the Munich dev meeting I suggested joining the tables already in the import process. What do you think about that idea?
Yes, I had made a half-hearted attempt to normalize the two tables some time ago, for a small number of descriptors and status entries; I'll be trying out this scheme in full (I will need to re-import a major part of the data (which I didn't do then) to be able to see if it scales well) after I try something else. (Namely, using a third table of unique fingerprints (the statusentry table currently holds ~170K unique fingerprints vs. ~67M rows in total) and (non-unique) nicknames for truly quick fingerprint lookup and nickname search; I did experiment with this as well, but I worked with a small subset of the overall data in that case, too, and I think I can do a better job now.)
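In sketch form, that third table would be something like the following (nothing about the exact schema is final; the column types are my guesses):

    -- one row per relay ever seen (~170K rows), instead of ~67M status entries:
    CREATE TABLE fingerprints (
        fingerprint  character(40) PRIMARY KEY,
        nickname     character varying(19)
    );

    CREATE INDEX fingerprints_lower_nickname_idx
        ON fingerprints (lower(nickname));

    -- a nickname search / fingerprint lookup then hits this tiny table first,
    -- and only afterwards goes to statusentry for the actual entries:
    SELECT fingerprint FROM fingerprints WHERE lower(nickname) = 'gabelmoo';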
It had seemed to me that the bottleneck was having to sort too large a number of rows, but now I understand the 'explain analyze' output a bit better and can see that the 'Nested Loop' node, which is what does the join in the query discussed, is expensive and is part of the bottleneck, so to speak. So I'll look into that after properly benchmarking things with the third table. (By the way, for future reference, we do have to test different ideas on a substantial subset of the overall data, as performance does not, so to say, scale linearly.) :)
https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
We use the following indexes while executing that query:
lower(nickname) on descriptor
(substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
Using only the first 12 characters sounds like a fine approach to speed up things. But why 12? Why not 10 or 14? This is probably something you should annotate as parameter to find a good value for later in the process. (I'm not saying that 12 is a bad number. It's perfectly fine for now, but it might not be the best number.)
Yes, this is as unscientific as it gets. As of now, we're using a raw SQL query, but I'll be encapsulating the queries properly soon (so we can easily attach different WHERE clauses, etc.), at which point I'll make the substring length a parameter. I did do some tests, but nothing extensive; I just made sure the indexes can fit into memory whole, which was the main constraint. I will do more tests.
Also, would it keep indexes smaller if you took something else than base16 encoding for fingerprints? What about base64? Or is there a binary type in PostgreSQL that works fine for indexes?
Re: latter, no binary type for B-Trees (which is the default index type in pgsql) as far as I can see. But it's a good idea / approach, so I'll look into it, thanks! On the whole though, as long as all the indexes occupy only a subset of pgsql's internal buffers, there shouldn't be a problem / that's not the problem, afaik. But, if we're making a well-researched ORM/database design, I should look into it.
Do you have a list of searches you're planning to support?
These are the ones that should *really* be supported:
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [done some limited testing, currently not focusing on this]
- ?running=<boolean>
- ?flag=flag [every kind of clause which further narrows down the query is not bad; the current db model supports all the flags that Stem does, and each flag has its own column]
- ?first_seen_days=range
- ?last_seen_days=range
As per the plan, the db should be able to return a list of status entries / validafter ranges (which can be used in {first,last}_seen_days) given some fingerprint.
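On the database side, most of these should reduce to straightforward WHERE clauses; for example (the flag column name is just my current guess, and moria1's fingerprint, if I recall it correctly, is used as a stand-in value):

    -- ?lookup=<fingerprint>:
    SELECT validafter, nickname, fingerprint
    FROM statusentry
    WHERE fingerprint = '9695DFC35FFEB861329B9F1AB04C46397020CE31'
    ORDER BY validafter DESC;

    -- ?flag=Exit (each flag having its own boolean column):
    SELECT validafter, nickname, fingerprint
    FROM statusentry
    WHERE flag_exit IS TRUE
    ORDER BY validafter DESC
    LIMIT 50;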
Thanks for your feedback and reply!
Kostas.
On 8/13/13 3:17 PM, Kostas Jakeliunas wrote:
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing karsten@torproject.org wrote:
I suggest putting pg_prewarm on the future work list. I sense there's a lot of unused potential in stock PostgreSQL. Tweaking the database at this point has the word "premature optimization" written on it in big letters for me.
Also, to be very clear here, a tool that requires custom tweaks to PostgreSQL has minimal chances of running on torproject.org machines in the future. The current plan is that we'll have a dedicated database machine operated by our sysadmins that not even the service operator will have shell access to.
Oh, understood then, OK, no extensions (at least) for now.
Apropos: as of my current (limited) understanding, it might be difficult to support, for example, nickname sub-string searches without a (supported, official) extension. One such extension is pg_trgm [1], which is in the contrib/ directory in 9.1, and is just one make install away. But for now, I'll assume this is not possible / we should avoid this.
Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more likely that we can run something requiring this extension on a torproject.org machine. Still, requiring extensions should be the last resort if no other solution can be found. Leaving out searches for nickname substrings is a valid solution for now.
So, why do you join descriptors and network statuses in the search process? At the Munich dev meeting I suggested joining the tables already in the import process. What do you think about that idea?
Yes, I had made a half-hearted attempt to normalize the two tables some time ago, for a small amount of descriptors and status entries; I'll be trying out this scheme in full (will need to re-import a major part of the data (which I didn't do then) to be able to see if it scales well) after I try something else.
Okay.
(Namely, using a third table of unique fingerprints (the statusentry table currently holds ~170K unique fingerprints vs. ~67M rows in total) and (non-unique) nicknames for truly quick fingerprint lookup and nickname search; I did experiment with this as well, but I worked with a small subset of overall data in that case, too; and I think I can do a better job now.)
It had seemed to me that the bottleneck was in having to sort a too large number of rows, but now I understand (if only just a bit) more about the 'explain analyze' output to see that the 'Nested Loop' procedure, which is what does the join in the join query discussed, is expensive and is part of the bottleneck so to speak. So I'll look into that after properly benchmarking stuff with the third table.
Sounds like another fine thing to test, I agree.
(By the way, for future reference, we do have to test out different ideas on a substantial subset of overall data, as the scale function is not, so to say, linear.) :)
Good point. Yes, we should keep this in mind for the future.
https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
We use the following indexes while executing that query:
lower(nickname) on descriptor
(substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
Using only the first 12 characters sounds like a fine approach to speed up things. But why 12? Why not 10 or 14? This is probably something you should annotate as parameter to find a good value for later in the process. (I'm not saying that 12 is a bad number. It's perfectly fine for now, but it might not be the best number.)
Yes, this is as unscientific as it gets. As of now, we're using a raw SQL query, but I'll be encapsulating them properly soon (so we can easily attach different WHERE clauses, etc.), at which point I'll make it into a parameter. I did do some tests, but nothing extensive; just made sure the indexes can fit into memory whole, which was the main constraint. Will do some tests.
Also, would it keep indexes smaller if you took something else than base16 encoding for fingerprints? What about base64? Or is there a binary type in PostgreSQL that works fine for indexes?
Re: latter, no binary type for B-Trees (which is the default index type in pgsql) as far as I can see. But it's a good idea / approach, so I'll look into it, thanks! On the whole though, as long as all the indexes occupy only a subset of pgsql's internal buffers, there shouldn't be a problem / that's not the problem, afaik. But, if we're making a well-researched ORM/database design, I should look into it.
Do you have a list of searches you're planning to support?
These are the ones that should *really* be supported:
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [done some limited testing, currently not focusing on this]
The lookup parameter is basically the same as search=fingerprint with the additional requirement that fingerprint must be 40 characters long. So, this is the current search parameter.
I agree, these would be good to support.
You might also add another parameter ?address=address for ExoneraTor. That should, in theory, be just a subset of the search parameter.
- ?running=<boolean>
This one is tricky. So far, Onionoo looks only at the very latest consensus or bridge status to decide if a relay or bridge is running or not.
But now you're adding archives to Onionoo, so that people can search for a certain consensus or certain bridge status in the past, or they can search for a time interval of consensuses or bridge statuses. How do you define that a relay or bridge is running, or more importantly included as not running?
- ?flag=flag [every kind of clause which further narrows down the query
is not bad; the current db model supports all the flags that Stem does, and each flag has its own column]
I'd say leave this one out until there's an actual use case.
- ?first_seen_days=range
- ?last_seen_days=range
As per the plan, the db should be able to return a list of status entries / validafter ranges (which can be used in {first,last}_seen_days) given some fingerprint.
Oh, I think there's a misunderstanding of these two fields. These fields are only there to search for relays or bridges that have first appeared or were last seen on a given day.
You'll need two new parameters, say, from=datetime and to=datetime (or start=datetime and end=datetime) to define a valid-after range for your search.
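In database terms that should be nothing more than a valid-after range condition, something like:

    SELECT validafter, nickname, fingerprint
    FROM statusentry
    WHERE validafter >= '2013-01-01 00:00:00'
      AND validafter <= '2013-01-31 23:59:59'
    ORDER BY validafter DESC;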
Hope this makes sense.
All the best, Karsten
On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing karsten@torproject.org wrote:
Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more likely that we can run something requiring this extension on a torproject.org machine. Still, requiring extensions should be the last resort if no other solution can be found. Leaving out searches for nickname substrings is a valid solution for now.
Got it.
Do you have a list of searches you're planning to support?
These are the ones that should *really* be supported:
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [done some limited testing, currently not focusing on this]
The lookup parameter is basically the same as search=fingerprint with the additional requirement that fingerprint must be 40 characters long. So, this is the current search parameter.
I agree, these would be good to support.
You might also add another parameter ?address=address for ExoneraTor. That should, in theory, be just a subset of the search parameter.
Oh yes, makes a lot of sense, OK.
By the way: I considered having the last consensus (all the data for at least the /summary document, or /details as well) stored in memory - this is possible - probably as a hashtable where key = fingerprint and value = all the fields we'd need to return, so that when the backend is queried without any search criteria, it would be possible to avoid hitting the database (which is always nice) and just dump the last consensus. (There's also caching, of course, which we could discuss at a (probably quite a bit) later point.)
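(For reference, 'the last consensus' here is just the set of status entries with the most recent valid-after value, i.e. something along the lines of:

    SELECT fingerprint, nickname, validafter
    FROM statusentry
    WHERE validafter = (SELECT max(validafter) FROM statusentry);

with whatever extra fields the /summary or /details documents need.)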
- ?running=<boolean>
This one is tricky. So far, Onionoo looks only at the very latest consensus or bridge status to decide if a relay or bridge is running or not.
But now you're adding archives to Onionoo, so that people can search for a certain consensus or certain bridge status in the past, or they can search for a time interval of consensuses or bridge statuses. How do you define that a relay or bridge is running, or more importantly included as not running?
Agree, this is not clear. (And whatever ends up being done, this should be well documented and clearly articulated (of course.))
For me at least, 'running' means whether a given relay/bridge is running *right now*, i.e. whether it is present in the very latest consensus. (Here's where that hashtable (with fingerprints as keys) in memory might be able to help: no need to run a separate query / do an inner join / whatnot; it would depend on whether there's a LIMIT involved, though, etc.)
I'm not sure which one is more useful (intuitively, for me, "whether it is running *right now*" is more useful). Do you mean that it might make sense to have a field (or have "running" be it) indicating whether a given relay/bridge was present in the last consensus in the specified date range? If that is what you meant, then the "return all that are/were not running" clause would indeed be kind of... peculiar (semantically - it wouldn't be very obvious what it's doing).
Maybe it'd be simpler to first answer, what would be the most useful case?
How do you define that a relay or bridge [should be] included as not
running?
Could you rephrase maybe? Do you mean that it might be difficult to construct sane queries to check for this condition? Or that the situation where
- a "from..to" date range is specified - ?running=false is specified
would be rather confusing ('exclude those nodes which are running *right now*', where 'now' possibly has nothing to do with the date range)?
- ?flag=flag [every kind of clause which further narrows down the query is not bad; the current db model supports all the flags that Stem does, and each flag has its own column]
I'd say leave this one out until there's an actual use case.
Ok, I won't focus on these now; just wanted to say that these should be possible without much ado/problems.
- ?first_seen_days=range
- ?last_seen_days=range
As per the plan, the db should be able to return a list of status entries / validafter ranges (which can be used in {first,last}_seen_days) given some fingerprint.
Oh, I think there's a misunderstanding of these two fields. These fields are only there to search for relays or bridges that have first appeared or were last seen on a given day.
You'll need two new parameters, say, from=datetime and to=datetime (or start=datetime and end=datetime) to define a valid-after range for your search.
Ah! I wasn't paying attention here. :) Ok, all good.
Thanks as always! Regards Kostas.
Hi Kostas,
On 8/15/13 9:50 PM, Kostas Jakeliunas wrote:
On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing karsten@torproject.org wrote:
Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more likely that we can run something requiring this extension on a torproject.org machine. Still, requiring extensions should be the last resort if no other solution can be found. Leaving out searches for nickname substrings is a valid solution for now.
Got it.
Do you have a list of searches you're planning to support?
These are the ones that should *really* be supported:
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [done some limited testing, currently not focusing on this]
The lookup parameter is basically the same as search=fingerprint with the additional requirement that fingerprint must be 40 characters long. So, this is the current search parameter.
I agree, these would be good to support.
You might also add another parameter ?address=address for ExoneraTor. That should, in theory, be just a subset of the search parameter.
Oh yes, makes a lot of sense, OK.
By the way: I considered having the last consensus (all the data for at least the /summary document, or /details as well) be stored in memory (this is possible) (probably as a hashtable where key = fingerprint, value = all the fields we'd need to return) so that when the backend is queried without any search criteria, it would be possible to avoid hitting the database (which is always nice), and just dump the last consensus. (There's also caching of course, which we could discuss at a (probably quite a bit) later point.)
Okay.
- ?running=<boolean>
This one is tricky. So far, Onionoo looks only at the very latest consensus or bridge status to decide if a relay or bridge is running or not.
But now you're adding archives to Onionoo, so that people can search for a certain consensus or certain bridge status in the past, or they can search for a time interval of consensuses or bridge statuses. How do you define that a relay or bridge is running, or more importantly included as not running?
Agree, this is not clear. (And whatever ends up being done, this should be well documented and clearly articulated (of course.))
For me at least, 'running' implies the clause whether a given relay/bridge is running *right now*, i.e. whether it is present in the very last consensus. (Here's where that hashtable (with fingerprints as keys) in memory might be able to help: no need to run a separate query / do an inner join / whatnot; it would depend on whether there's a LIMIT involved though, etc.)
I'm not sure which one is more useful (intuitively for me, the "whether it is running *right now*" is more useful.) Do you mean that it might make sense to have a field (or have "running" be it) indicating whether a given relay/bridge was present in the last consensus in the specified date range? If this is what you meant, then the "return all that are/were not running" clause would indeed be kind of..peculiar (semantically - it wouldn't be very obvious what's it doing.)
Maybe it'd be simpler to first answer, what would be the most useful case?
How do you define that a relay or bridge [should be] included as not
running?
Could you rephrase maybe? Do you mean that it might be difficult to construct sane queries to check for this condition? Or that the situation where
- a "from..to" date range is specified
- ?running=false is specified
would be rather confusing ('exclude those nodes which are running *right now* ('now' possibly having nothing to do with the date range)?
I was referring to the situation you describe. But yes, I agree that your definition of whether a relay or bridge is running *right now* can work here. So, never mind my question/concern, this looks fine!
- ?flag=flag [every kind of clause which further narrows down the query is not bad; the current db model supports all the flags that Stem does, and each flag has its own column]
I'd say leave this one out until there's an actual use case.
Ok, I won't focus on these now; just wanted to say that these should be possible without much ado/problems.
Okay.
- ?first_seen_days=range
- ?last_seen_days=range
As per the plan, the db should be able to return a list of status entries / validafter ranges (which can be used in {first,last}_seen_days) given some fingerprint.
Oh, I think there's a misunderstanding of these two fields. These fields are only there to search for relays or bridges that have first appeared or were last seen on a given day.
You'll need two new parameters, say, from=datetime and to=datetime (or start=datetime and end=datetime) to define a valid-after range for your search.
Ah! I wasn't paying attention here. :) Ok, all good.
Okay.
I wonder, is there a document describing the new API somewhere? If not, do you mind creating one?
All the best, Karsten
Hi Karsten,
On Mon, Aug 19, 2013 at 3:49 PM, Karsten Loesing karsten@torproject.org wrote:
I wonder, is there a document describing the new API somewhere? If not, do you mind creating one?
I have an outdated draft that covers more than the API. Looks like it's high time I updated and published it (at least the API part), so, will do this!
Kostas.