Hello,
Another busy benchmarking + profiling period for database querying, but this time more rigorous and awesome.
* wrote a generic query analyzer which logs query statements and their EXPLAIN ANALYZE output, and flags queries that yield inefficient query plans (a rough sketch of that kind of check follows after this list);
* wrote a very simple but rather exhaustive profiler (using Python's cProfile) which logs query times, function calls, etc.; its output shows which parts of, e.g., the backend are slow during API calls, and can also be used to construct a general query 'profile' for a particular database [1] (a minimal profiling sketch also follows below);
* benchmarked many different queries using these tools, recorded query times, and was able to observe deviations/discrepancies;
* uploaded the whole database and briefly benchmarked it on an Amazon EC2 m2.2xlarge instance;
* concluded that, provided there is enough memory to load *and keep* the indexes in cache, query times are good;
* in particular, tested the query scheme in [2] extensively (see the comments there as well if curious) and concluded that it runs well;
* opted for testing raw SQL queries (from within Flask/Python) - so far, translating them into ORM queries (even when done carefully) resulted in degraded performance; if we end up having to use raw SQL, I will create a way to encapsulate the statements nicely (one possible approach is sketched below);
* made sure data importing is not slowed down and remains a quick-enough procedure;
* researched PostgreSQL internals, especially its two-layer caching (its own shared buffers plus the OS page cache); I now have an understanding of how pgsql caches relations in memory, and how statistics on index usage are gathered and used for maintaining the buffer cache.
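Regarding the query analyzer: the kind of check it performs looks roughly like the sketch below (a simplified illustration only, not the actual tool; the DSN, threshold, table name and example statement are placeholders):

    import psycopg2

    ANALYZE_THRESHOLD_MS = 500  # arbitrary example threshold

    def analyze_query(conn, statement, params=None):
        """Log a statement's execution plan and flag obviously bad plans.

        Note that EXPLAIN ANALYZE actually executes the statement.
        """
        with conn.cursor() as cur:
            cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + statement, params)
            plan_lines = [row[0] for row in cur.fetchall()]
        print("\n".join(plan_lines))
        if any("Seq Scan" in line for line in plan_lines):
            print("warning: plan contains a sequential scan")
        for line in plan_lines:
            # the last plan line reports the total runtime in ms
            if line.lower().startswith(("total runtime", "execution time")):
                runtime_ms = float(line.split(":")[1].split()[0])
                if runtime_ms > ANALYZE_THRESHOLD_MS:
                    print("warning: query took %.1f ms" % runtime_ms)

    conn = psycopg2.connect("dbname=tordir")  # placeholder DSN
    analyze_query(conn, "SELECT * FROM statusentry WHERE fingerprint = %s",
                  ("<fingerprint>",))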
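Regarding the profiler [1]: at its core it boils down to something like this (minimal sketch; the actual profiler logs and aggregates more, and the wrapped function name here is hypothetical):

    import cProfile
    import pstats

    def profile_call(func, *args, **kwargs):
        """Run func under cProfile and print the slowest calls."""
        profiler = cProfile.Profile()
        result = profiler.runcall(func, *args, **kwargs)
        stats = pstats.Stats(profiler)
        stats.sort_stats("cumulative").print_stats(20)  # top 20 entries
        return result

    # e.g. wrap a backend function that runs one of the benchmarked queries:
    # profile_call(get_status_entries, some_fingerprint)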
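Regarding encapsulating raw SQL: one possible way (just a sketch of the idea, not existing code; table and column names are illustrative) would be to keep named, parameterized statements in one place and run them through a tiny helper, so that the Flask views never build SQL strings themselves:

    QUERIES = {
        "status_entries_for_relay": """
            SELECT validafter
            FROM statusentry
            WHERE fingerprint = %(fingerprint)s
            ORDER BY validafter DESC
            LIMIT %(limit)s
        """,
    }

    def run_named_query(conn, name, **params):
        """Execute one of the statements above with bound parameters."""
        with conn.cursor() as cur:
            cur.execute(QUERIES[name], params)
            return cur.fetchall()

    # rows = run_named_query(conn, "status_entries_for_relay",
    #                        fingerprint=some_fingerprint, limit=50)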
The searchable metrics archive would work best when all of its indexes are kept in memory.
* to this end, looked into buffer cache hibernation [3] and related approaches; I think pg_prewarm [4, 5] would serve our purpose well (a sketch of how we might use it follows below). (Apparently many commercial and other production setups do find cache prewarming relevant - a pity it's not supported in stock PostgreSQL.)
Relying on an extension like pg_prewarm means that
* I don't think we can avoid using at least one PostgreSQL extension - which means that deployment will always take more than apt-get && pip install, but I believe it is needed;
* next on my agenda is testing pg_prewarm on EC2 and, hopefully, putting our beloved database bottleneck problem to rest.
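For reference, prewarming from Python could look roughly like the sketch below. This assumes the pg_prewarm extension from [4, 5] is installed and provides a pg_prewarm(relation) function along the lines described there; index names are placeholders. The pg_statio_user_indexes view can then be used to watch how often index blocks are served from PostgreSQL's own buffer cache (it says nothing about the OS page cache, the second caching layer mentioned above):

    import psycopg2

    INDEXES_TO_PREWARM = ["statusentry_fingerprint_idx"]  # placeholder names

    def prewarm_indexes(conn):
        """Pull the listed indexes into PostgreSQL's buffer cache."""
        with conn.cursor() as cur:
            for index_name in INDEXES_TO_PREWARM:
                cur.execute("SELECT pg_prewarm(%s::regclass)", (index_name,))
                print("prewarmed %s (%s blocks)" % (index_name, cur.fetchone()[0]))

    def index_hit_ratios(conn):
        """Fraction of index block reads served from the buffer cache, per index."""
        with conn.cursor() as cur:
            cur.execute("""
                SELECT indexrelname,
                       idx_blks_hit::float
                           / NULLIF(idx_blks_hit + idx_blks_read, 0)
                FROM pg_statio_user_indexes
                ORDER BY 2
            """)
            return cur.fetchall()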
I had planned to expose the EC2 instance for public tor-dev inquiry (and ended up delaying this status report yet again in the process), but I'll have to do that separately; it is still possible, however. Sorry for the delayed report.
##
More generally,
I'm happy with my queer queries [2] now;
the two constraints/goals of
* being able to run Onionoo-like queries on the whole descriptor / status entry database
* being able to get a list of status entries for a particular relay
will hopefully both be satisfied very soon. The former is done, provided I have no trouble setting up a database index precaching system (which will ensure that all queries of the same syntax/scheme run quickly enough).
Overall, I'm spending a bit too much time on this one specific problem, but at least I now have a more intimate, lower-level knowledge of PostgreSQL, which turns out to be very relevant to this project. I hope to soon move on to extending Onionoo support and providing a clean API for getting lists of consensuses in which a particular relay was present. And maybe start on the frontend. :)
Kostas.