On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing <karsten@torproject.org> wrote:

Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
likely that we can run something requiring this extension on a
torproject.org machine.  Still, requiring extensions should be the last
resort if no other solution can be found.  Leaving out searches for
nickname substrings is a valid solution for now.

Got it. 

>> Do you have a list of searches you're planning to support?
>
>
> These are the ones that should *really* be supported:
>
>    - ?search=nickname
>    - ?search=fingerprint
>    - ?lookup=fingerprint
>    - ?search=address [done some limited testing, currently not focusing on
>    this]

The lookup parameter is basically the same as search=fingerprint with
the additional requirement that fingerprint must be 40 characters long.
 So, this is the current search parameter.

I agree, these would be good to support.

You might also add another parameter ?address=address for ExoneraTor.
That should, in theory, be just a subset of the search parameter.

Oh yes, makes a lot of sense, OK.

By the way: I considered having the last consensus (all the data for at least the /summary document, or /details as well) be stored in memory (this is possible) (probably as a hashtable where key = fingerprint, value = all the fields we'd need to return) so that when the backend is queried without any search criteria, it would be possible to avoid hitting the database (which is always nice), and just dump the last consensus. (There's also caching of course, which we could discuss at a (probably quite a bit) later point.)
 
>    - ?running=<boolean>

This one is tricky.  So far, Onionoo looks only at the very latest
consensus or bridge status to decide if a relay or bridge is running or not.

But now you're adding archives to Onionoo, so that people can search for
a certain consensus or certain bridge status in the past, or they can
search for a time interval of consensuses or bridge statuses.  How do
you define that a relay or bridge is running, or more importantly
included as not running?

Agree, this is not clear. (And whatever ends up being done, this should be well documented and clearly articulated (of course.))

For me at least, 'running' implies the clause whether a given relay/bridge is running *right now*, i.e. whether it is present in the very last consensus. (Here's where that hashtable (with fingerprints as keys) in memory might be able to help: no need to run a separate query / do an inner join / whatnot; it would depend on whether there's a LIMIT involved though, etc.)

I'm not sure which one is more useful (intuitively for me, the "whether it is running *right now*" is more useful.) Do you mean that it might make sense to have a field (or have "running" be it) indicating whether a given relay/bridge was present in the last consensus in the specified date range? If this is what you meant, then the "return all that are/were not running" clause would indeed be kind of..peculiar (semantically - it wouldn't be very obvious what's it doing.)

Maybe it'd be simpler to first answer, what would be the most useful case?
 
> How do you define that a relay or bridge [should be] included as not running?

Could you rephrase maybe? Do you mean that it might be difficult to construct sane queries to check for this condition? Or that the situation where
  • a "from..to" date range is specified
  • ?running=false is specified
would be rather confusing ('exclude those nodes which are running *right now* ('now' possibly having nothing to do with the date range)?

>    - ?flag=flag [every kind of clause which further narrows down the query
>    is not bad; the current db model supports all the flags that Stem does, and
>    each flag has its own column]

I'd say leave this one out until there's an actual use case.

Ok, I won't focus on these now; just wanted to say that these should be possible without much ado/problems.
 
>    - ?first_seen_days=range
>    - ?last_seen_days=range
>
> As per the plan, the db should be able to return a list of status entries /
> validafter ranges (which can be used in {first,last}_seen_days) given some
> fingerprint.

Oh, I think there's a misunderstanding of these two fields.  These
fields are only there to search for relays or bridges that have first
appeared or were last seen on a given day.

You'll need two new parameters, say, from=datetime and to=datetime (or
start=datetime and end=datetime) to define a valid-after range for your
search.

Ah! I wasn't paying attention here. :) Ok, all good.
 
Thanks as always!
Regards
Kostas.