Hi everyone,

Some clean-ish working code is finally available online [1] (the old PoC code has been moved to [2]); I'll be adding more soon, but this part does what it's supposed to do, i.e.:

  - archival data import (downloading, mapping to the ORM via Stem, efficiently avoiding re-import of existing data via Stem's persistence path, etc.); what's left for this part is a simple rsync+cron setup to continuously download and import new data via the Metrics archive's 'recent' (see the import sketch after this list)
  - data models and the Stem <-> ORM <-> database mapping for descriptors, consensuses and the network statuses contained in consensuses (a model sketch follows the list)
  - models can easily be queried through SQLAlchemy's ORM; Karsten suggested that an additional 'query layer' / internal API is not needed until there's an actual need for it. (My original plan was to provide a query API abstracted away from the ORM, which is itself built on top of database/SQL/Python classes, and to build the backend on top of that API, as a neat client of it as it were; I had some simple and ugly PoCs for this that are now pushed out of the priority queue until needed, if ever.)
  - one example of how this querying (directly atop the ORM) works is provided: a simple (very partial) Onionoo protocol implementation for /summary and /details, including the ?search, ?limit and ?offset parameters. Querying takes place over all NetworkStatuses. This is new in the sense that it uses the ORM directly; if there is a need to formulate SQL queries more directly, we'll do that as well. (A query sketch follows this list.)
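
To illustrate the re-import avoidance from the first point: a minimal sketch of how Stem's DescriptorReader can be used with a persistence path (the paths and the save_to_db() helper here are made up; the actual import code is in [1]):

  from stem.descriptor.reader import DescriptorReader

  # 'processed_files' records which archive files were already read,
  # so re-running the import skips data that's already in the database.
  with DescriptorReader(['data/server-descriptors'],
                        persistence_path='data/processed_files') as reader:
      for descriptor in reader:
          save_to_db(descriptor)  # hypothetical helper: map to ORM, insert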
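
Roughly, the Stem <-> ORM mapping from the second point looks something like this (a simplified sketch only; the field names are illustrative rather than the real schema, see [1] for the actual models):

  from sqlalchemy import Column, DateTime, Integer, String
  from sqlalchemy.ext.declarative import declarative_base

  Base = declarative_base()

  class Descriptor(Base):
      __tablename__ = 'descriptor'
      id = Column(Integer, primary_key=True)
      fingerprint = Column(String(40), index=True)
      nickname = Column(String(19))
      published = Column(DateTime)

      @classmethod
      def from_stem(cls, desc):
          # map a Stem server descriptor object onto an ORM object
          return cls(fingerprint=desc.fingerprint,
                     nickname=desc.nickname,
                     published=desc.published)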
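
And a rough idea of what the /summary-style querying over the ORM looks like (again just a sketch; the NetworkStatus column names and the session handling here are assumptions, not the actual code in [1]):

  def summary_query(session, search=None, limit=50, offset=0):
      # query over all network status entries, mirroring Onionoo's
      # ?search, ?limit and ?offset parameters
      query = session.query(NetworkStatus)
      if search:
          # assumed here: searching by nickname prefix only
          query = query.filter(NetworkStatus.nickname.like(search + '%'))
      return (query.order_by(NetworkStatus.validafter.desc())
                   .offset(offset).limit(limit).all())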

During the Tor developer meetings in Munich, I talked over the existing and proposed parts of the system with Karsten. I will be focusing on making sure the Onionoo-like backend (which is being extended) is stable and efficient. I'm still looking into database optimization (with Karsten's advice); an efficient backend covering the majority of the available archival data would be a great deliverable in itself, and hopefully we can achieve at least that. I should also document the database iterations and development, as a lot of the thinking currently resides in a kind of 'black box' of database schema decisions that isn't visible in the code itself.

The large PostgreSQL datasets reside on a server I'm managing; I'm working on exposing the Onionoo-like API for public queries, and I'm currently doing some simple higher-level benchmarking (simulating multiple clients requesting different data at once; a sketch of the idea follows below). I might need to move the datasets to yet another server (again), but maybe not; it's easy to blame things on limited CPU/memory resources. :)
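
The benchmarking itself is nothing fancy; a minimal sketch of the idea (the endpoints, the client count and the use of the requests library are assumptions; the actual URLs depend on how the API ends up being exposed):

  import time
  from multiprocessing.dummy import Pool  # thread-based pool
  import requests  # assumed to be available

  # hypothetical endpoints exposed by the backend
  URLS = ['http://localhost:5000/summary?search=moria',
          'http://localhost:5000/details?limit=100&offset=200'] * 4

  def timed_get(url):
      start = time.time()
      requests.get(url)
      return url, time.time() - start

  pool = Pool(8)  # eight concurrent 'clients'
  for url, elapsed in pool.map(timed_get, URLS):
      print('%s: %.3f s' % (url, elapsed))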

Kostas.

[1]: https://github.com/wfn/torsearch
[2]: https://github.com/wfn/torsearch-poc