On Mon, Sep 2, 2013 at 2:20 PM, Karsten Loesing <karsten@torproject.org> wrote:
> On 8/23/13 3:12 PM, Kostas Jakeliunas wrote:
> [snip]
>
> Hi Kostas,
>
> I finally managed to test your service and take a look at the specification document.
Hey Karsten!
Awesome, thanks a bunch!
> The few tests I tried ran pretty fast! I didn't hammer the service, so maybe there are still bottlenecks that I didn't find. But AFAICS, you did a great job there!
Thanks for doing some poking! There is probably room for quite a bit more concurrent benchmarking, but at least in principle (and from what I've observed / benchmarked so far), if a single query runs in good time, it's rather safe to assume that scaling to multiple simultaneous queries will not be a big problem. There is always a limit, of course, which I haven't observed yet (and which I should ideally try to find). Parallel query scaling is one of PostgreSQL's strengths in any case. That said, since the queries are more or less always disk-I/O-bound, there could still be sneaky hidden bottlenecks, that is true for sure.
> Thanks for writing down the specification.
>
> So, would it be accurate to say that you're mostly not touching the summary, details, bandwidth, and weights resources, but that you're adding a new fifth resource, statuses?
>
> In other words, does the attached diagram visualize what you're going to add to Onionoo? Some explanations:
>
> - summary and details documents contain only the last known information about a relay or bridge, but those are on a pretty high detail level (at least for details documents). In contrast to the current Onionoo, your service returns summary and details documents for relays that didn't run in the last week, so basically since 2007. However, you're not going to provide summary or details for arbitrary points in time, right? (Which is okay, I'm just asking if I understood this correctly.)
(Nice diagram, very useful!) Responding to particular points / nuances:
> summary and details documents contain only the last known information about a relay or bridge, but those are on a pretty high detail level (at least for details documents)
This is true: the summary/details documents (just like in Onionoo proper) deal with the *last* known info about relays. That is how it works now, anyway.
As per our subsequent IRC chat, we will now assume this is how it is intended to be. The way I see it, from the perspective of my original project goals, the summary and details (+ bandwidth and weights) documents are meant for Onionoo {near-, full-}compatibility; they must stay Onionoo-like. The new network status document is the "browse the old archives and extract info" part: it is one of the ways of exposing an interface to the whole database (after all, we do store all the flags, nicknames, and IP addresses for *all* the network statuses.)
> However, you're not going to provide summary or details for arbitrary points in time, right? (Which is okay, I'm just asking if I understood this correctly.)
There is no reason why this wouldn't be possible. (I have experimented with new search parameters, but haven't pushed them to master or changed the backend instance that is currently running.)

A query involving date ranges could, for example, be something akin to: "get a listing of details documents for relays which match this $nickname / $address / $fingerprint, and which have run (been listed in consensuses dated) from $startDate to $endDate." (This would use the new ?from=.., ?to=.. parameters, which you mentioned / clarified earlier.)

As per our IRC chat, I will add these parameters / query options not only to the network status documents, but also to the summary and details documents.
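For illustration, such queries might look roughly like this (hypothetical URLs; the exact endpoint and parameter spellings are still tentative):

    GET /details?search=$nickname&from=2010-01-01&to=2010-06-30
    GET /statuses?search=$address&from=2008-01-01&to=2008-12-31

i.e., "details documents (or status entries) for relays matching the search term which were listed in consensuses within the given date range."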
> - bandwidth and weights documents always contain information covering the whole lifetime of a relay or bridge, where recent events have higher detail level. Again, you're not going to change anything here besides providing these documents for relays and bridges that are offline for more than a week.
>
> - statuses have the same level of detail for any time in the past. These documents are new. They're designed for the relay search service and for a simplified version of ExoneraTor (which doesn't care about exit policies and doesn't provide original descriptor contents). There are no statuses documents for bridges, right?
Yes & yes. No documents for bridges, for now. I'm not sure about the priority of including bridges; it would certainly be awesome to have them as well. For now, I assume that everything else (the protocol, the final scalable database schema/setup, etc.) should be finished before embarking on this point.

The status entry API endpoint is indeed about getting info from the whole archives, at the same detail level for any portion of the archives.

(I should have articulated this / put it into a design doc before, but this important nuance is still fresh in my mind. It seems that it's all finally coming into place (including my mind.))
> [The new network status documents are] designed for the relay search service and for a simplified version of ExoneraTor (which doesn't care about exit policies and doesn't provide original descriptor contents).
By the way, just as a general note: it is always possible, in principle, to reconstruct any descriptor and any network status entry. I point this out because I recall Damian mentioning that it would be nice if the torsearch system could be used as part of other apps; it would be able to reconstruct the original Stem instances/objects for any descriptor or network status entry in question. (The focus for now, though, is on Onionoo and the database, of course.)
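To sketch what I mean (a minimal, hypothetical example: the row fields below are placeholders for whatever columns the torsearch schema ends up using, and only the stored fields get reconstructed):

    import base64
    import binascii

    from stem.descriptor.router_status_entry import RouterStatusEntryV3

    def rebuild_status_entry(row):
        # Consensuses carry the identity fingerprint and descriptor digest
        # base64-encoded with the trailing '=' padding stripped.
        identity = base64.b64encode(binascii.unhexlify(row['fingerprint'])).rstrip('=')
        digest = base64.b64encode(binascii.unhexlify(row['digest'])).rstrip('=')
        # row['published'] is assumed to be "YYYY-MM-DD HH:MM:SS", which
        # conveniently expands into the two date/time fields of the "r" line.
        r_line = 'r %s %s %s %s %s %d %d' % (
            row['nickname'], identity, digest, row['published'],
            row['address'], row['or_port'], row['dir_port'])
        s_line = 's %s' % ' '.join(row['flags'])
        return RouterStatusEntryV3('%s\n%s\n' % (r_line, s_line), validate=False)

The result is a regular Stem object, so another app could use it exactly as if it had parsed the consensus itself.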
> If this is correct (and please tell me if it's not), this seems like a plausible extension of Onionoo.
Thanks for taking a close look at the protocol description, and thanks for the feedback; everything is correct as far as I can see!
> A few ideas on statuses documents: how about you change the format of statuses so that there's no longer one document per relay and valid-after time, but exactly one document per relay? That document could then contain an array of status objects saying when the relay was contained in the network status, together with information about its addresses.
This makes a lot of sense. (I've been juggling these ideas as well, but at the end of the day I wasn't sure, so I will do this instead.)

The nickname of a given relay (identified by a fingerprint) can change over time as well. So each status object would ideally include the date of containment in a network status / consensus, the addresses, and the nickname. (This is where a listing of flags would go as well, I suppose.) I think that would make sense?
Since we know that there will only be one relay document, its fields could be made top-level, so not

    {"relays": [{"fingerprint": "$fingerprint", ...,
                 "entries": [{ ... }, { ... }, ...]}]}

but rather (hopefully with the indentation surviving this time):

    {
      "fingerprint": "$fingerprint",
      ...,                            # first_seen, last_seen, for example
      "entries": [{ ... }, { ... }, ...]
    }
> It might be useful to group consecutive valid-after times when all addresses and other relevant information about a relay stayed the same. So, rather than adding "valid_after", put in "valid_after_from" and "valid_after_to".
Yes, I have thought about this as well! This would be ideal. It would indeed, I think, require that we
> [...] could even generate these statuses documents in advance once per hour and store them as JSON documents in the database, similar to what's the plan for the other document types? That might reduce database load a lot, though you'll still need most of your database foo for the search part.
Some kind of caching at some level will inevitably be needed, for sure. Preprocessing/preparing JSON documents (the way Onionoo does it, I suppose) makes sense.

I'm not sure about the scale, however. Ideally torsearch would be able to keep track of outdated JSON documents, i.e. which ones need regenerating. Again, there are already around 170K unique fingerprints in the current online database as of now.

I'll think about this. Lots of things can be done at the Postgres level (you're probably thinking about this as well.)
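To make the grouped format concrete, an object inside "entries" might end up looking something like this (field names tentative):

    {
      "nickname": "$nickname",
      "addresses": ["$address"],
      "flags": ["Fast", "Running", "Valid"],
      "valid_after_from": "2013-08-01 00:00:00",
      "valid_after_to": "2013-08-14 23:00:00"
    }

i.e., one object per maximal run of consecutive consensuses in which nothing relevant about the relay changed.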
Also:
If it were OK (it might be a bit queer, maybe) to involve result pagination at this level as well, the API could be told to, say, "group the last min(limit, UPPER_LIMIT) [e.g. 500] status entries for this fingerprint into status objects / valid-after range summaries", producing status entry objects, each featuring addresses, nickname, valid_after_from, and valid_after_to.

As a rule of thumb, the number of status objects returned would of course be (much) smaller than (say) 500. A client would then append the parameters ?offset=500[&limit=500] (or whatnot) to get a status entry summary (a summary in the sense that it does not reduce the amount of actual useful information returned) for the next 500 network statuses of this relay.
It would be great if this kind of approach to querying made sense for the protocol. But if it's a bit strange or suboptimal (from the perspective of a client querying the DB), let me know.
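For example (hypothetical URLs; parameter spellings may change):

    GET /statuses?lookup=$fingerprint&limit=500
    GET /statuses?lookup=$fingerprint&offset=500&limit=500

The first call would group the 500 most recent status entries for the relay; the second would do the same for the 500 entries before those. Each response would contain some smaller number of grouped status objects covering that slice.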
> And maybe you can compress information even more by putting all relevant IP addresses in a list and referring to them by list index. Compare this to bandwidth and weights documents, which are optimized for size, too.
Yeah, this would be great, actually. I'll think about all of these ideas and about the practical caching / JSON document generation options. I'm unsure of the feasibility (it's definitely doable in the end, but I'm not sure about the scope), but I hope to be able to accomplish all of this. I might follow up later on / tomorrow, etc.
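As a sketch of the list-index idea (tentative again): the document would carry one de-duplicated address list, and each entry would point into it by index:

    {
      "fingerprint": "$fingerprint",
      "addresses": ["$address1", "$address2"],
      "entries": [
        {"address": 0, "valid_after_from": "...", "valid_after_to": "..."},
        {"address": 1, "valid_after_from": "...", "valid_after_to": "..."}
      ]
    }

Since relays rarely change addresses, the list should stay short while every entry gets noticeably smaller.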
Happy to chat more about these ideas on IRC.
>> Please report any inconsistencies / errors / time-outs / anything that takes a few seconds or more to execute. I'm logging the queries (together with IP addresses for now - for shame!), so will be able to later correlate activity with database load, which will hopefully provide some realistic semi-benchmark-like data.
> I could imagine that you'll get more testers if you provide instructions for using your service as a relay search or ExoneraTor replacement. Maybe you could write down the five most common searches that people could perform to search for a relay or find out whether an IP address was a Tor relay at a given time? If you want, I can link to such a page from the relay search and the ExoneraTor pages.
Indeed, I was lately thinking that it should be made more explicit that, for example, the present system already encompasses the ExoneraTor use cases, and so on. I was of course planning to eventually write something of the kind up (with lots of examples and clearly articulated use cases), but maybe I should do this sooner. OK.
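Off the top of my head, the most common searches would probably be along these lines (hypothetical URLs, to be pinned down properly in the write-up):

    GET /details?search=$nickname                  # relays by (partial) nickname
    GET /details?lookup=$fingerprint               # one relay by fingerprint
    GET /details?search=$address                   # relays by IP address
    GET /statuses?search=$address&from=2013-05-01&to=2013-05-02
                                                   # "was this IP a Tor relay then?" (the ExoneraTor case)
    GET /summary?search=$nickname&from=2008-01-01&to=2008-12-31
                                                   # historical search over the full archives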
I also already have a way of keeping the database constantly updated (using cron -> rsync & torsearch import), but it's still a bit of a hack. Hopefully soon I will ramp up the DB to actually have the latest consensuses in Reality(tm).

Once I have the latter running nicely,
> If you want, I can link to such a page from the relay search and the ExoneraTor pages.
we can think of doing this!
> All in all, great work! Nice!
>
> Thanks,
> Karsten
Thanks, as always, for your great feedback, Karsten :)
Kostas.