Hi Karsten. This is diving into enough detail that we might as well move this over to tor-dev@. For the list's benefit, Karsten and I are discussing a Python rewrite of ExoneraTor...
https://exonerator.torproject.org/ https://gitweb.torproject.org/exonerator.git
First I think I need to take a step back to figure out exactly what we're after. From a quick peek at ExoneraTor it looks like it behaves as follows...
a. User enters an address (IPv4 or IPv6) and a date (either for a day or an hour).
b. ExoneraTor lists router status entries for all relays that match the criteria. These entries link to the consensus they came from and server descriptors they reference.
c. The user can then enter a destination address and port to search exit policies in TorDNSEL entries.
Steps 'a' and 'b' make sense to me. Step 'c', however, I'm having a little difficulty grokking. Ignoring TorDNSEL entries for a moment, we already have all the ingredients to provide the user with three fields to start with...
* Source Address (required)
* Timestamp (required)
* Destination Address and/or Port (optional)
The source address and timestamp come from the consensus, and an optional 'can it exit to destination X' consults the server descriptor's exit policy.
So what is TorDNSEL providing us and why is it a separate search on the page? As I understand it the value of TorDNSEL is that we can't trust the address in the router status entries. If that's the case then our present search fields don't make sense to me...
* Our initial search consults consensus information for the address and timestamp but not the exit policy. This is weird both because the address it has may be faulty, and because we already have the exit policy, so we could trivially include it in our search criteria.
* Our second search gives the impression that we're using the earlier consensus results to query exit criteria from TorDNSEL. As I understand it though that's not what it's doing. TorDNSEL is completely independent from the consensus information.
I could understand a search that just consults consensus information (ignoring address accuracy, it has everything we need). I could also understand a search that just consults TorDNSEL information (ignoring its inconsistent poll rate, it has everything we need).
However, this hybrid approach and how it's presented really confuses me. Unless I'm mistaken about something above, what I'd expect from ExoneraTor is...
* The three search fields mentioned above.
* It shows results based on the consensus information like we presently do.
* If we have TorDNSEL entries indicating either that a relay we're presenting had a different external address, or that another relay had the address we're searching for, then note that.
That is to say, the base search is based on consensus information (using server descriptor exit policies if we want to filter by that), and the TorDNSEL results are just appended notes since we can't rely on its poll rate.
Thoughts?
Cheers! -Damian
PS. Congratulations on getting me invested. I just spent the last three hours in front of a whiteboard trying to puzzle out why ExoneraTor works the way it presently does. ;)
PPS. Stem's ExitPolicy class has a can_exit_to() method that would be really handy for this...
https://stem.torproject.org/api/exit_policy.html#stem.exit_policy.ExitPolicy...
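For instance, a minimal sketch (the policy lines below are made up; in ExoneraTor they'd come from the archived server descriptor):

# Checking an exit policy with Stem. The policy lines are examples;
# ExoneraTor would build the policy from the archived server descriptor.
from stem.exit_policy import ExitPolicy

policy = ExitPolicy('accept *:80', 'accept *:443', 'reject *:*')

print(policy.can_exit_to('4.3.2.1', 443))  # True
print(policy.can_exit_to('4.3.2.1', 25))   # False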
PPPS. I'm still hesitant about actually tackling this project. Arm is midway through being rewritten and, considering its sudden uptick in usage, probably the most important project on my plate right now.
That said, I'm happy to discuss this. Even if we don't implement it right now this thread will be useful so we know where we're going with ticket #8260.
Concerning the earlier discussion of 'work with Karsten on a python project' I have a personal bias toward collaborating when the project has few unknowns for me, but working alone when *I'm* learning something. That is to say, I'd love to work with you on a straightforward Stem project and I'd also like to discuss ExoneraTor's design. But when it comes to coding, this has enough unknowns that if I take it on I'd prefer to experiment alone for a while - at least until I know enough about the APIs involved that I can avoid embarrassing myself. :)
On Sun, Jun 8, 2014 at 2:56 AM, Karsten Loesing karsten@torproject.org wrote:
On 08/06/14 06:27, Damian Johnson wrote:
Here's a quick overview of the codebase to facilitate reading through it:
Ahhh, very useful - thanks.
Hmmm. Just took a quick peek at the ExoneraTor codebase and, unless I'm mistaken, it doesn't actually use metrics-lib, does it?
You're right, looks like it doesn't.
Honestly, looking over the code is making me a little hesitant to take this on after all. I was anticipating a small, quick project of DocTor's scope, but I've never touched SQLAlchemy or Postgres before.
I don't think we'll even have to touch the Postgres part when moving from Java to Python. The Python code would simply do SQL calls via its SQL library just like the Java code does.
I just copied all SQL statements that the Python part would have to prepare and execute:
CALL insert_descriptor(?, ?);
CALL insert_statusentry(?, ?, ?, ?, ?, ?, ?);
CALL insert_consensus(?, ?);
CALL insert_exitlistentry(?, ?, ?, ?, ?);
SELECT MIN(validafter) AS first, MAX(validafter) AS last FROM consensus;
SELECT validafter FROM consensus WHERE validafter >= ? AND validafter <= ?;
CALL search_statusentries_by_address_date(?, ?);
CALL search_addresses_in_same_24 (?, ?);
CALL search_addresses_in_same_48 (?, ?);
SELECT rawdescriptor FROM descriptor WHERE descriptor = ?;
SELECT descriptor, rawdescriptor FROM descriptor WHERE descriptor LIKE ?;
SELECT rawconsensus FROM consensus WHERE validafter = ?;
That's it. No further knowledge about Postgres required.
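For instance, the lookup part could look like this in Python (a rough sketch, assuming psycopg2 as the driver; it uses %s placeholders where JDBC uses ?, and the example values are made up):

# Rough sketch, assuming psycopg2. Stored procedures become
# cur.callproc() calls; plain statements translate almost verbatim.
import psycopg2

conn = psycopg2.connect('dbname=exonerator')
with conn:
    with conn.cursor() as cur:
        # One of the search procedures listed above:
        cur.callproc('search_statusentries_by_address_date',
                     ('86.59.21.38', '2014-06-08'))
        for row in cur.fetchall():
            print(row)

        # A plain statement, with %s in place of ?:
        cur.execute('SELECT MIN(validafter) AS first, '
                    'MAX(validafter) AS last FROM consensus')
        first, last = cur.fetchone()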
Once I wrote this I realized I'm being a damn hypocrite. Here I was saying "Karsten, learn Python so we can leverage each other's codebases!" but then I hightail it once the project delves into areas new to me. New arm users are showing up almost daily on irc and I'm anxious to give them a new release... but then this is exactly the issue, isn't it? Deliverables you'd like to focus on crowding out time to learn new things.
So TL;DR I'm gonna eat my own words and suggest we focus on our separate domains for now. I really would like to work on some small metrics projects with you. Each month I eyeball your status reports asking myself "Is there anything here I can work with Karsten on to draw our spaces closer together?" so please let me know if you run across anything in Metrics we can collaborate on.
(Replying below, first replying to the DynamoDB part.)
Your hypocritical friend, ~Damian
PS. When we next meet I'd like to discuss ExoneraTor's design a bit. First thought I had when looking at the code was 'huh... I wonder if this would be a good use case for DynamoDB'.
I'm wary about moving to another database, especially NoSQL ones and/or cloud-based ones. They don't magically make things faster, and Postgres is something I understand quite well by now. And again, I think that we keep the Postgres part entirely unchanged when moving to Python. Not saying that DynamoDB can't be the better choice, but switching the database is not a priority for me.
So, regarding the rewrite: rather than canceling the project before it starts, how about we find a role for you that you're more comfortable with?
For example, I'd want to try rewriting it step by step, based on your suggestions for frameworks/libraries and with some code review from you.
If you're interested: which framework should I use for the new Python ExoneraTor? It's supposed to do the following tasks:
- Provide a simple web site with a web form, backed by the PostgreSQL database.
- Maybe offer a simple RESTful API for lookups that the web form could use to compose responses, but that could also be used by other applications directly.
- Return documents from the database by identifier, so without providing a search functionality.
- Run a scheduled task once per hour that fetches data from CollecTor and puts it in a database.
Bonus points if the result is as easy to deploy on Debian Wheezy as possible. Like, install these few Debian packages, run the setup script, done.
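To make the first two tasks concrete, here's a minimal sketch with Flask, which is just one candidate framework (the route and helper names are made up):

# Minimal sketch of the web form / RESTful lookup tasks using Flask
# (one possible framework; route and helper names are made up).
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_statusentries(address, date):
    # Placeholder for the database lookup sketched above.
    return []

@app.route('/lookup')
def lookup():
    address = request.args.get('address')
    date = request.args.get('date')
    if not address or not date:
        return jsonify(error='address and date are required'), 400
    return jsonify(results=query_statusentries(address, date))

if __name__ == '__main__':
    app.run()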
Of course, if you'd prefer to focus on other things and not discuss ExoneraTor stuff, that's perfectly fine, too. :)
All the best, Karsten
On 09/06/14 01:26, Damian Johnson wrote:
Oh, and another quick thought - you once mentioned that a descriptor search service would make ExoneraTor obsolete, and in looking it over I agree. The search functionality ExoneraTor provides is trivial. The only reason it requires such a huge database is because it's storing a copy of every descriptor ever made.
I suspect the actual right solution isn't to rewrite ExoneraTor at all, but rather develop a new service that can be queried for this descriptor data. That would make for a *much* more worthwhile project.
ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
I agree, that was the idea behind Kostas' GSoC project last year. And I still think it's a good idea. It's just not trivial to get right.
Regarding your comment about storing a copy of every descriptor ever made, I believe that users trust ExoneraTor's results more if they see the actual descriptors that lead to results. Of course, I'm saying that without knowing what ExoneraTor users actually want. But let's not drop descriptor copies from the database easily.
And, heh, when you say that the search functionality ExoneraTor provides is trivial, a little part of me is dying. It's the part that spent a few weeks on getting the search functionality fast enough for production. That was not at all trivial. The oraddress24, oraddress48, and exitaddress24 fields as well as the indexes are the result of me running lots and lots of sample queries and wondering about Postgres' EXPLAIN ANALYZE results. Just saying that it's not going to be trivial to generalize the search functionality towards other fields than IP addresses and dates.
If others want to follow, here's the SQL code I'm talking about:
https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
So, I'm happy to talk about writing a searchable descriptor archive. It could _start_ with ExoneraTor's functionality (minus the target address and port thing discussed in that other email), and then we could consider adding more searches.
Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so let me make one remark about optimizing Postgres defaults: I wrote quite a few database queries in the past, and some of them perform horribly (relay search) whereas others perform really well (ExoneraTor). I believe that the majority of performance gains can be achieved by designing good tables, indexes, and queries. Only as a last resort should we consider optimizing the Postgres defaults.
You realize that a searchable descriptor archive focuses much more on database optimization than the ExoneraTor rewrite from Java to Python (which would leave the database untouched)?
All the best, Karsten
On Mon, Jun 9, 2014 at 3:22 AM, Karsten Loesing karsten@torproject.org wrote:
let me make one remark about optimizing Postgres defaults: I wrote quite a few database queries in the past, and some of them perform horribly (relay search) whereas others perform really well (ExoneraTor). I believe that the majority of performance gains can be achieved by designing good tables, indexes, and queries. Only as a last resort should we consider optimizing the Postgres defaults.
You realize that a searchable descriptor archive focuses much more on database optimization than the ExoneraTor rewrite from Java to Python (which would leave the database untouched)?
Are other datastore models such as Splunk or MongoDB useful? [Splunk has a free yet proprietary limited binary... those having historical woes and takebacks; mentioned just as an example here.]
let me make one remark about optimizing Postgres defaults: I wrote quite a few database queries in the past, and some of them perform horribly (relay search) whereas others perform really well (ExoneraTor). I believe that the majority of performance gains can be achieved by designing good tables, indexes, and queries. Only as a last resort we should consider optimizing the Postgres defaults.
You realize that a searchable descriptor archives focuses much more on database optimization than the ExoneraTor rewrite from Java to Python (which would leave the database untouched)?
Are other datastore models such as splunk or MongoDB useful? [splunk has a free yet proprietary limited binary... those having historical woes and takebacks, mentioned just for example here.]
Earlier I mentioned the idea of Dynamo. Unless I'm mistaken this lends itself pretty naturally to addresses as a hash key, and descriptor dates as the range key. Lookups would then be O(log(n)) where n is the total number of descriptors an address has published (... that is to say very, very quick).
This would be a fun project to give Boto a try. *sigh*... there really should be more hours in the day...
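Something like this with boto's dynamodb2 layer, as a sketch (table and attribute names are made up):

# Sketch of the hash/range key idea with boto's dynamodb2 layer
# (table and attribute names are made up).
from boto.dynamodb2.fields import HashKey, RangeKey
from boto.dynamodb2.table import Table

table = Table.create('statusentries', schema=[
    HashKey('address'),
    RangeKey('published'),
])

# Everything an address published within a date range:
results = table.query_2(
    address__eq='128.31.0.34',
    published__between=('2014-06-08', '2014-06-09'))
for item in results:
    print(item['published'])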
Cheers! -Damian
Quoting my reply to Damian to a similar question earlier in the thread:
I'm wary about moving to another database, especially NoSQL ones and/or cloud-based ones. They don't magically make things faster, and Postgres is something I understand quite well by now. [...] Not saying that DynamoDB can't be the better choice, but switching the database is not a priority for me.
If somebody wants to give, say, MongoDB a try, I'd be interested in seeing the performance comparison to the current Postgres schema. When you do, please consider all three search_* functions that the current schema offers, including searches for other IPv4 addresses in the same /24 and other IPv6 addresses in the same /48.
All the best, Karsten
On 10/06/14 18:14, Damian Johnson wrote:
... including searches for other IPv4 addresses in the same /24 and other IPv6 addresses in the same /48.
Ahhh. *That* would indeed make this a lot more of a pita. ExoneraTor gives no indication that it accepts /24 or /48 ranges. Is that capability even used by visitors?
ExoneraTor doesn't indicate that, but if a search for a certain IP address returns no results, it looks up nearby addresses.
Example: search for 37.130.227.132 (.133 is TorLand1)
Result:
""" We did not find IP address 37.130.227.132 in any of the relay or exit lists that were published between 2014-06-09 and 2014-06-11.
The following other IP addresses of Tor relays in the same /24 network were found in relay and/or exit lists around the time that could be related to IP address 37.130.227.132:
37.130.227.133
37.130.227.134
"""
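(Deriving the /24 and /48 lookup keys is straightforward; here's a sketch with Python's ipaddress module, assuming we store one normalized network prefix per entry, which is not literally how the current schema stores them:)

# Sketch: deriving a /24 (IPv4) or /48 (IPv6) lookup key for an
# address. Assumption: one normalized prefix is stored per entry.
import ipaddress

def network_key(address):
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    net = ipaddress.ip_network('%s/%d' % (address, prefix), strict=False)
    return str(net.network_address)

print(network_key('37.130.227.132'))  # 37.130.227.0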
Now, I can't say whether users would expect that or do anything with those "nearby addresses". I found it possibly useful when I wrote the service. But I'm not at all a usability expert. ;)
All the best, Karsten
Personally, the only NoSQL thing I've come across (and have had some really good experiences with in the past) was Redis, which is a kind of in-memory key-value store with some nice simple data structures (like sets, and operations on sets). So if you can reduce your problem to, e.g., sets and set operations, Redis might be a good fit.
(I think that isis is actually experimenting with Redis right now, to do prop 226-bridgedb-database-improvements.txt)
If the things that you store in Redis can't be made to fit into memory, you'll probably have a bad time.
So to generalize, if some relational data which needs to be searchable can be made to fit into memory ("we can guarantee it wouldn't exceed x GB [for t time]"), offloading that part onto some key-value (or some more elaborate) system *might* make sense.
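For example, a toy sketch with redis-py (the key naming is made up): one set of relay fingerprints per address and per date, intersected on lookup:

# Toy sketch with redis-py; key naming is made up.
import redis

r = redis.StrictRedis()
r.sadd('relays:addr:128.31.0.34', 'FP1', 'FP2')
r.sadd('relays:date:2014-06-08', 'FP2', 'FP3')

# Fingerprints seen on that address on that date:
print(r.sinter('relays:addr:128.31.0.34', 'relays:date:2014-06-08'))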
Also, I mixed up the link in footnote [2]. It should have linked to this diagnostic postgres query:
https://github.com/wfn/torsearch/blob/master/misc/list_indexes_in_memory.sql
--
regards Kostas
Hi all!
On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing karsten@torproject.org wrote:
I agree, that was the idea behind Kostas' GSoC project last year. And I still think it's a good idea. It's just not trivial to get right.
Indeed, not trivial at all!
I'll use this space to mention the running metrics archive backend as it relates to ExoneraTor stuff / whatever could be sorta-relevant here.
fwiw, the onionoo-like backend is still running at an obscure address:port: http://ts.mkj.lt:5555/
TL;DR "what can I do with that" is: look at:
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
In particular, regarding ExoneraTor-like queries (incl. arbitrary subnet / part-of-ip lookups):
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exon...
Not sure if it's worth discussing all the weaknesses of this archive backend in this thread, but the short relevant version is that the ExoneraTor-like functionality does mostly work, but I would need to look into it to see how reliable the results are ("is this relay ip address field really the one we should be using?", etc.)
But what's nice is that it is possible to do arbitrary queries on all consensuses since ~2008, with no date specified (if you don't want one). (Which is to say, "it's possible", not necessarily "this is the right way to solve the problems in this thread".)
So e.g. this is the ip address where moria runs, and we want to see what relays have ever run on it:
http://ts.mkj.lt:5555/details?search=128.31.0.34
Take the fingerprint of the one that is currently running (moria1), and look up its last 500 statuses (in a kind of condensed/summary form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE...
"from", "to" date ranges can be specified as e.g. 2009, 2009-02, 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc. specified here: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
(Descriptors/digests aren't currently included (I think they used to be), but they can be, etc.)
The point is probably mostly about "this is some evidence that it can be done." ("But there are nuances, things are imperfect, time is needed, etc.")
The real question, I suppose, is the actual scope of this rewrite.
I'd probably agree with Karsten that just doing a port of the ExoneraTor functionality as it currently is on exonerator.torproject.org would be the safe bet. See how that goes, venture into more exotic lands later on maybe, etc. (That doesn't mean that I wouldn't be excited to put the current backend to good use, and/or use the knowledge I gained to help you folks in some way!)
Regarding your comment about storing a copy of every descriptor ever made, I believe that users trust ExoneraTor's results more if they see the actual descriptors that lead to results. Of course, I'm saying that without knowing what ExoneraTor users actually want. But let's not drop descriptor copies from the database easily.
And, heh, when you say that the search functionality ExoneraTor provides is trivial, a little part of me is dying. It's the part that spent a few weeks on getting the search functionality fast enough for production. That was not at all trivial. The oraddress24, oraddress48, and exitaddress24 fields as well as the indexes are the result of me running lots and lots of sample queries and wondering about Postgres' EXPLAIN ANALYZE results. Just saying that it's not going to be trivial to generalize the search functionality towards other fields than IP addresses and dates.
Hear hear, I can only imagine! These things and ExoneraTor-style queries are not easy to do in a way that provides **consistently** good/great performance.
I spent some days last summer also looking at EXPLAIN ANALYZE results (it was a great feeling to start to understand what they mean and how I could make them better), and eventually things start making sense. (And when they do, I also get that same feeling that NoSQL stuff doesn't magically solve things.)
So, I'm happy to talk about writing a searchable descriptor archive. It could _start_ with ExoneraTor's functionality (minus the target address and port thing discussed in that other email), and then we could consider adding more searches.
fwiw, this sounds like a sane plan to me. (Of course it would also be possible to work on the onionoo-like archive backend (or fork it, or smash it into parts and steal some of them, etc.), but I can see why this might yield unclear deliverables, etc.) (So a short document of "what is wanted" would help, yeah.)
Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so let me make one remark about optimizing Postgres defaults: I wrote quite a few database queries in the past, and some of them perform horribly (relay search) whereas others perform really well (ExoneraTor). I believe that the majority of performance gains can be achieved by designing good tables, indexes, and queries. Only as a last resort should we consider optimizing the Postgres defaults.
Ha, at this point I probably have a sort of "premature optimizer" label in your mind, Karsten. :) (And I kinda deserved it by at one point focusing on very-low-level postgres caching mechanisms last summer, etc etc.)
I've actually come to really appreciate good schema and query design[1] and the wonders that they do. That being said, I'd actually be curious to know how large the indexes of relay-search and current exonerator are.[2] I (still) bet increasing postgres' shared_buffers and effective_cache_size (totally normal practice!) might help! (Oh, is this one of those vim-vs-emacs things? If it is, sorry.)
But the point is that (1) there is no free lunch (to invoke a cliché), and (2) PostgreSQL can really do wonders and scale well when used right.
You realize that a searchable descriptor archive focuses much more on database optimization than the ExoneraTor rewrite from Java to Python (which would leave the database untouched)?
"leaving database untouched" probably implies (very) significantly less work, so it would be a nice/clear starting point. (caveat, i may be lacking context, etc.)
[1]: also, fun things like "sometimes indexes won't be used because a sequential read will be faster: if the parts of the indexes to be used are scattered across the disk (not all of them in memory), then random seek + read a bit into memory + repeat is slower than 'just read a lot of contiguous data into memory'", etc.
[2]: if you're feeling adventurous, you can run this on each of the postgres databases to see how large the indexes (among all other things) are, and which parts of them are loaded into memory: https://github.com/wfn/torsearch/blob/master/misc/buffercache.sql
--
Kostas.
0x0e5dce45 @ pgp.mit.edu
Hi Kostas,
On 11/06/14 04:48, Kostas Jakeliunas wrote:
fwiw, the onionoo-like backend is still running at an obscure address:port: http://ts.mkj.lt:5555/
Would you want to put the summary you wrote here at that link?
And would you want me to add a sentence or two about your service together with a link to the CollecTor page?
https://collector.torproject.org/#references
What would I write?
TL;DR "what can I do with that" is: look at:
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
In particular, regarding ExoneraTor-like queries (incl. arbitrary subnet / part-of-ip lookups):
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exon...
Not sure if it's worth discussing all the weaknesses of this archive backend in this thread, but the short relevant version is that the ExoneraTor-like functionality does mostly work, but I would need to look into it so see how reliable the results are ("is this relay ip address field really the one we should be using?", etc.)
But what's nice is that it is possible to do arbitrary queries on all consensuses since ~2008, with no date specified (if you don't want to.) (Which is to say, "it's possible", not necessarily "this is the right way to do the solution for the problems in this thread")
So e.g. this is the ip address where moria runs, and we want to see what relays have ever run on it:
http://ts.mkj.lt:5555/details?search=128.31.0.34
Take the fingerprint of the one that is currently running (moria1), and look up its last 500 statuses (in a kind of condensed/summary form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE...
"from", "to" date ranges can be specified as e.g. 2009, 2009-02, 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc. specified here: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
(Descriptors/digests aren't currently included (I think they used to), but they can be, etc.)
The point is probably mostly about "this is some evidence that it can be done." ("But there are nuances, things are imperfect, time is needed, etc.")
The question really is regarding the actual scope of this rewrite, I suppose.
I'd probably agree with Karsten that just doing a port of the ExoneraTor functionality as it currently is on exonerator.torproject.org would be the safe bet. See how that goes, venture into more exotic lands later on maybe, etc. (That doesn't mean that I wouldn't be excited to put the current backend to good use, and/or use the knowledge I gained to help you folks in some way!)
Regarding your comment about storing a copy of every descriptor ever made, I believe that users trust ExoneraTor's results more if they see the actual descriptors that lead to results. Of course, I'm saying that without knowing what ExoneraTor users actually want. But let's not drop descriptor copies from the database easily.
And, heh, when you say that the search functionality ExoneraTor provides is trivial, a little part of me is dying. It's the part that spent a few weeks on getting the search functionality fast enough for production. That was not at all trivial. The oraddress24, oraddress48, and exitaddress24 fields as well as the indexes are the result of me running lots and lots of sample queries and wondering about Postgres' EXPLAIN ANALYZE results. Just saying that it's not going to be trivial to generalize the search functionality towards other fields than IP addresses and dates.
Hear hear, I can only imagine! These things and exonerator stuff is not easy to be done in a way that would provide **consistently** good/great performance.
I spent some days of the last summer also looking at EXPLAIN ANALYZE results (it was a great feeling to start to understand what they mean and how I can make them better), but eventually things start making sense. (And when they do, I also get that same feeling that NoSQL stuff doesn't magically solve things.)
If others want to follow, here's the SQL code I'm talking about:
https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
So, I'm happy to talk about writing a searchable descriptor archive. It could _start_ with ExoneraTor's functionality (minus the target address and port thing discussed in that other email), and then we could consider adding more searches.
fwiw, imho this sounds like a sane plan to me. (Of course it could also be possible to work on the onionoo-like archive backend (or fork it, or smash it into parts and steal some of them, etc., but I can see why this might yield unclear deliverables, etc.) (So a short document of "what is wanted" would help, yeah.)
Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so let me make one remark about optimizing Postgres defaults: I wrote quite a few database queries in the past, and some of them perform horribly (relay search) whereas others perform really well (ExoneraTor). I believe that the majority of performance gains can be achieved by designing good tables, indexes, and queries. Only as a last resort we should consider optimizing the Postgres defaults.
Ha, at this point I probably have a sort of "premature optimizer" label in your mind, Karsten. :) (And I kinda deserved it by at one point focusing on very-low-level postgres caching mechanisms last summer, etc etc.)
I've actually come to really appreciate good schema and query design[1] and the wonders that they do. That being said, I'd actually be curious to know how large the indexes of relay-search and current exonerator are.[2] I (still) bet increasing postgres' shared_buffers and effective_cache_size (totally normal practice!) might help! (Oh, is this one of those vim-vs-emacs things? If it is, sorry.)
I just deleted most of the database contents behind the relay-search service a few days ago. But I might even have agreed there that some PostgreSQL tweaking would have helped. It was a bad database design, mostly because it was built for a different purpose (data aggregation for metrics website), so it's a bad example.
But let me give you some numbers on the current ExoneraTor (I manually deleted parts of the output that we don't care about here):
exonerator=> \dt+
     Name      |  Size
---------------+--------
 consensus     | 16 GB
 descriptor    | 31 GB
 exitlistentry | 558 MB
 statusentry   | 50 GB
(4 rows)
exonerator=> \di+
                  Name                   |     Table     |  Size
-----------------------------------------+---------------+---------
 consensus_pkey                          | consensus     | 1280 kB
 descriptor_pkey                         | descriptor    | 1930 MB
 exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
 exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
 exitlistentry_pkey                      | exitlistentry | 173 MB
 statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
 statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
 statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
 statusentry_pkey                        | statusentry   | 10 GB
(9 rows)
Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to run.
If we're going to optimize the ExoneraTor database, should we move this discussion to a ticket?
All the best, Karsten
Hi Karsten,
On Tue, Jun 17, 2014 at 10:13 AM, Karsten Loesing karsten@torproject.org wrote:
Would you want to put the summary you wrote here at that link?
Moved the whole setup to work on port 80 (via uWSGI, with nginx as the reverse proxy) ("ts.mkj.lt:5555/some/request" now transparently perma-redirects to "ts.mkj.lt/some/request"), and put a simple very short summary on the index:
http://ts.mkj.lt/ (have you heard of this new edgy font, "Times New Roman"?) Let me know if something is too confusing or reads funny, etc. I can elaborate more in the beginning or after the examples, too.
And would you want me to add a sentence or two about your service together with a link to the CollecTor page?
Ok!
What would I write?
something like? --
The Searchable Metrics Archive backend allows users to search and explore relay metrics data (consensuses and descriptors), present and past. It covers the years 2008-now and provides an Onionoo-like API.
does that make sense?
TL;DR "what can I do with that" is: look at:
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
In particular, regarding ExoneraTor-like queries (incl. arbitrary subnet / part-of-ip lookups):
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exon...
Not sure if it's worth discussing all the weaknesses of this archive backend in this thread, but the short relevant version is that the ExoneraTor-like functionality does mostly work, but I would need to look into it so see how reliable the results are ("is this relay ip address field really the one we should be using?", etc.)
But what's nice is that it is possible to do arbitrary queries on all consensuses since ~2008, with no date specified (if you don't want to.) (Which is to say, "it's possible", not necessarily "this is the right way to do the solution for the problems in this thread")
So e.g. this is the ip address where moria runs, and we want to see what relays have ever run on it:
http://ts.mkj.lt:5555/details?search=128.31.0.34
Take the fingerprint of the one that is currently running (moria1), and look up its last 500 statuses (in a kind of condensed/summary form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE...
"from", "to" date ranges can be specified as e.g. 2009, 2009-02, 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc. specified here: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
(Descriptors/digests aren't currently included (I think they used to), but they can be, etc.)
The point is probably mostly about "this is some evidence that it can be done." ("But there are nuances, things are imperfect, time is needed, etc.")
The question really is regarding the actual scope of this rewrite, I suppose.
I'd probably agree with Karsten that just doing a port of the ExoneraTor functionality as it currently is on exonerator.torproject.org would be the safe bet. See how that goes, venture into more exotic lands later on maybe, etc. (That doesn't mean that I wouldn't be excited to put the current backend to good use, and/or use the knowledge I gained to help you folks in some way!)
Regarding your comment about storing a copy of every descriptor ever made, I believe that users trust ExoneraTor's results more if they see the actual descriptors that lead to results. Of course, I'm saying that without knowing what ExoneraTor users actually want. But let's not drop descriptor copies from the database easily.
And, heh, when you say that the search functionality ExoneraTor provides is trivial, a little part of me is dying. It's the part that spent a few weeks on getting the search functionality fast enough for production. That was not at all trivial. The oraddress24, oraddress48, and exitaddress24 fields as well as the indexes are the result of me running lots and lots of sample queries and wondering about Postgres' EXPLAIN ANALYZE results. Just saying that it's not going to be trivial to generalize the search functionality towards other fields than IP addresses and dates.
Hear hear, I can only imagine! These things and exonerator stuff is not easy to be done in a way that would provide **consistently** good/great performance.
I spent some days of the last summer also looking at EXPLAIN ANALYZE results (it was a great feeling to start to understand what they mean and how I can make them better), but eventually things start making sense. (And when they do, I also get that same feeling that NoSQL stuff doesn't magically solve things.)
If others want to follow, here's the SQL code I'm talking about:
https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
So, I'm happy to talk about writing a searchable descriptor archive. It could _start_ with ExoneraTor's functionality (minus the target address and port thing discussed in that other email), and then we could consider adding more searches.
fwiw, imho this sounds like a sane plan to me. (Of course it could also be possible to work on the onionoo-like archive backend (or fork it, or smash it into parts and steal some of them, etc., but I can see why this might yield unclear deliverables, etc.) (So a short document of "what is wanted" would help, yeah.)
Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so let me make one remark about optimizing Postgres defaults: I wrote quite a few database queries in the past, and some of them perform horribly (relay search) whereas others perform really well (ExoneraTor). I believe that the majority of performance gains can be achieved by designing good tables, indexes, and queries. Only as a last resort we should consider optimizing the Postgres defaults.
Ha, at this point I probably have a sort of "premature optimizer" label in your mind, Karsten. :) (And I kinda deserved it by at one point focusing on very-low-level postgres caching mechanisms last summer, etc etc.)
I've actually come to really appreciate good schema and query design[1] and the wonders that they do. That being said, I'd actually be curious to know how large the indexes of relay-search and current exonerator are.[2] I (still) bet increasing postgres' shared_buffers and effective_cache_size (totally normal practice!) might help! (Oh, is this one of those vim-vs-emacs things? If it is, sorry.)
I just deleted most of the database contents behind the relay-search service a few days ago. But I might even have agreed there that some PostgreSQL tweaking would have helped. It was a bad database design, mostly because it was built for a different purpose (data aggregation for metrics website), so it's a bad example.
But let me give you some numbers on the current ExoneraTor (I manually deleted parts of the output that we don't care about here):
exonerator=> \dt+
     Name      |  Size
---------------+--------
 consensus     | 16 GB
 descriptor    | 31 GB
 exitlistentry | 558 MB
 statusentry   | 50 GB
(4 rows)
exonerator=> \di+
                  Name                   |     Table     |  Size
-----------------------------------------+---------------+---------
 consensus_pkey                          | consensus     | 1280 kB
 descriptor_pkey                         | descriptor    | 1930 MB
 exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
 exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
 exitlistentry_pkey                      | exitlistentry | 173 MB
 statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
 statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
 statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
 statusentry_pkey                        | statusentry   | 10 GB
(9 rows)
Looks nice! :) Thanks! (Just for fun, the largest index on my side is "statusentry_substr_validafter_idx", an index on two columns (a SUBSTR() of the relay nickname and the consensus valid-after, DESC), and it's currently at 7004 MB.) Anyway, "these sizes make sense" is all I can think of right now!
Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to run.
Okay, maybe I'll think of something some time, and if I do, I can either open a ticket or create a new email thread, unless this is kind-of-OK for this thread.
(Regarding "what part of $something is in memory", I remember the "disk read" (or was it "buffer read") words in EXPLAIN ANALYZE being useful. Also, sometimes postgres really mis-assumes on how much it'll have to read, and how much it ends up reading (it's all there in the results iirc, but you probably know all that.) In which case a VACUUM should help (of course), etc.)
If we're going to optimize the ExoneraTor database, should we move this discussion to a ticket?
Derailment with technicalities is always a looming danger I guess, but at this point I'm not even sure what you and Damian (and possibly others) are planning to do with the current ExoneraTor. I assume current ExoneraTor performance is good as it currently stands, so this part of the thread/thoughtspace can be closed for the time being as far as I can see. (And I could open a ticket if I think of something interesting to do regarding diagnosing/optimizing the ExoneraTor database.)
I suppose there's still no consensus on whether a python-exonerator should aim to replicate the current ExoneraTor's functionality (and, say, use the current database), or whether it should do more(tm). (Happy to participate in some form of discussion at the dev meeting, if my input can be useful!)
best wishes Kostas
On 24/06/14 18:53, Kostas Jakeliunas wrote:
Hi Karsten,
Hi Kostas,
On Tue, Jun 17, 2014 at 10:13 AM, Karsten Loesing karsten@torproject.org wrote:
Hi Kostas,
On 11/06/14 04:48, Kostas Jakeliunas wrote:
Hi all!
On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing karsten@torproject.org wrote:
On 09/06/14 01:26, Damian Johnson wrote:
Oh, and another quick thought - you once mentioned that a descriptor search service would make ExoneraTor obsolete, and in looking it over I agree. The search functionality ExoneraTor provides is trivial. The only reason it requires such a huge database is because it's storing a copy of every descriptor ever made.
I suspect the actual right solution isn't to rewrite ExoneraTor at all, but rather develop a new service that can be queried for this descriptor data. That would make for a *much* more worthwhile project.
ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
I agree, that was the idea behind Kostas' GSoC project last year. And I still think it's a good idea. It's just not trivial to get right.
Indeed, not trivial at all!
I'll use this space to mention the running metrics archive backend, or at least the parts of it that could be sorta-relevant to the ExoneraTor stuff here.
fwiw, the onionoo-like backend is still running at an obscure address:port: http://ts.mkj.lt:5555/
Would you want to put the summary you wrote here to that link?
Moved the whole setup to work on port 80 (via uWSGI, with nginx as the reverse proxy) ("ts.mkj.lt:5555/some/request" now transparently perma-redirects to "ts.mkj.lt/some/request"), and put a simple very short summary on the index:
http://ts.mkj.lt/ (have you heard of this new edgy font, "Times New Roman"?) Let me know if something is too confusing or reads funny, etc. I can elaborate more in the beginning or after the examples, too.
Looks good to me!
And would you want me to add a sentence or two about your service together with a link to the CollecTor page?
Ok!
What would I write?
Something like this? --
The Searchable Metrics Archive backend allows users to search and explore relay metrics data (consensuses and descriptors), present and past. It covers the years 2008-now and provides an Onionoo-like API.
does that make sense?
It does! Tweaked a tiny bit and put online.
TL;DR "what can I do with that" is: look at:
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
In particular, regarding ExoneraTor-like queries (incl. arbitrary subnet / part-of-ip lookups):
https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exon...
Not sure if it's worth discussing all the weaknesses of this archive backend in this thread, but the short relevant version is that the ExoneraTor-like functionality does mostly work; I would just need to look into it to see how reliable the results are ("is this relay ip address field really the one we should be using?", etc.)
But what's nice is that it is possible to do arbitrary queries on all consensuses since ~2008, with no date specified (if you don't want one). (Which is to say, "it's possible", not necessarily "this is the right way to solve the problems in this thread".)
So e.g. this is the IP address where moria runs, and we want to see which relays have ever run on it:
http://ts.mkj.lt:5555/details?search=128.31.0.34
Take the fingerprint of the one that is currently running (moria1), and look up its last 500 statuses (in a kind of condensed/summary form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE...
"from", "to" date ranges can be specified as e.g. 2009, 2009-02, 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc. specified here: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
(Descriptors/digests aren't currently included (I think they used to be), but they can be, etc.)
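To make the API shape concrete, here's a tiny client sketch using Python's requests library. The endpoints and parameters are the ones documented above; the full moria1 fingerprint is filled in from memory (the link above truncates it), so treat it as an assumption and double-check:

    import requests

    BASE = 'http://ts.mkj.lt'

    # All relays that ever ran on moria's IP address:
    details = requests.get(BASE + '/details',
                           params={'search': '128.31.0.34'})
    print(details.json())

    # Statuses for one fingerprint, restricted to a date range:
    statuses = requests.get(BASE + '/statuses', params={
        # moria1's fingerprint, assumed from the truncated link above
        'lookup': '9695DFC35FFEB861329B9F1AB04C46397020CE31',
        'from': '2009-02',
        'to': '2009-03',
        'limit': 50,
    })
    print(statuses.json())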
The point is probably mostly about "this is some evidence that it can be done." ("But there are nuances, things are imperfect, time is needed, etc.")
The question really is regarding the actual scope of this rewrite, I suppose.
I'd probably agree with Karsten that just doing a port of the ExoneraTor functionality as it currently is on exonerator.torproject.org would be the safe bet. See how that goes, venture into more exotic lands later on maybe, etc. (That doesn't mean that I wouldn't be excited to put the current backend to good use, and/or use the knowledge I gained to help you folks in some way!)
Regarding your comment about storing a copy of every descriptor ever made, I believe that users trust ExoneraTor's results more if they see the actual descriptors that lead to results. Of course, I'm saying that without knowing what ExoneraTor users actually want. But let's not drop descriptor copies from the database easily.
And, heh, when you say that the search functionality ExoneraTor provides is trivial, a little part of me is dying. It's the part that spent a few weeks on getting the search functionality fast enough for production. That was not at all trivial. The oraddress24, oraddress48, and exitaddress24 fields as well as the indexes are the result of me running lots and lots of sample queries and wondering about Postgres' EXPLAIN ANALYZE results. Just saying that it's not going to be trivial to generalize the search functionality towards other fields than IP addresses and dates.
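To illustrate what those fields buy, a sketch of the general technique only (the actual encoding in exonerator.sql may differ, and the helper below is hypothetical):

    # Store the /24 prefix in its own column so that a subnet search
    # becomes an indexed equality lookup instead of a sequential scan.
    def oraddress24(ipv4_address):
        """First three octets of a dotted quad:
        '128.31.0.34' -> '128.31.0'."""
        return '.'.join(ipv4_address.split('.')[:3])

    print(oraddress24('128.31.0.34'))  # 128.31.0

    # Paired with an index such as
    #   CREATE INDEX ON statusentry (oraddress24, validafterdate);
    # a query for "all status entries in 128.31.0.0/24 on a date"
    # turns into a cheap index range scan.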
Hear hear, I can only imagine! These things, and exonerator stuff in general, are not easy to do in a way that provides **consistently** good/great performance.
I spent some days of the last summer also looking at EXPLAIN ANALYZE results (it was a great feeling to start to understand what they mean and how I can make them better), but eventually things start making sense. (And when they do, I also get that same feeling that NoSQL stuff doesn't magically solve things.)
If others want to follow, here's the SQL code I'm talking about:
https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
So, I'm happy to talk about writing a searchable descriptor archive. It could _start_ with ExoneraTor's functionality (minus the target address and port thing discussed in that other email), and then we could consider adding more searches.
fwiw, imho this sounds like a sane plan to me. (Of course it could also be possible to work on the onionoo-like archive backend (or fork it, or smash it into parts and steal some of them, etc.), but I can see why this might yield unclear deliverables, etc.) (So a short document of "what is wanted" would help, yeah.)
Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so let me make one remark about optimizing Postgres defaults: I wrote quite a few database queries in the past, and some of them perform horribly (relay search) whereas others perform really well (ExoneraTor). I believe that the majority of performance gains can be achieved by designing good tables, indexes, and queries. Only as a last resort should we consider optimizing the Postgres defaults.
Ha, at this point I probably have a sort of "premature optimizer" label in your mind, Karsten. :) (And I kinda deserved it by at one point focusing on very-low-level postgres caching mechanisms last summer, etc etc.)
I've actually come to really appreciate good schema and query design[1] and the wonders that they do. That being said, I'd actually be curious to know how large the indexes of relay-search and current exonerator are.[2] I (still) bet increasing postgres' shared_buffers and effective_cache_size (totally normal practice!) might help! (Oh, is this one of those vim-vs-emacs things? If it is, sorry.)
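(For anyone following along, checking what a server actually runs with is a one-liner per setting; a sketch with psycopg2, where the DSN is a placeholder:)

    import psycopg2

    conn = psycopg2.connect('dbname=exonerator')  # placeholder DSN
    cur = conn.cursor()
    for setting in ('shared_buffers', 'effective_cache_size'):
        cur.execute('SHOW ' + setting)
        print('%s = %s' % (setting, cur.fetchone()[0]))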
I just deleted most of the database contents behind the relay-search service a few days ago. But I might even have agreed there that some PostgreSQL tweaking would have helped. It was a bad database design, mostly because it was built for a different purpose (data aggregation for metrics website), so it's a bad example.
But let me give you some numbers on current ExoneraTor (manually deleted part of the output which we don't care about here):
exonerator=> \dt+
     Name      |  Size
---------------+--------
 consensus     | 16 GB
 descriptor    | 31 GB
 exitlistentry | 558 MB
 statusentry   | 50 GB
(4 rows)
exonerator=> \di+
                  Name                   |     Table     |  Size
-----------------------------------------+---------------+---------
 consensus_pkey                          | consensus     | 1280 kB
 descriptor_pkey                         | descriptor    | 1930 MB
 exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
 exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
 exitlistentry_pkey                      | exitlistentry | 173 MB
 statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
 statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
 statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
 statusentry_pkey                        | statusentry   | 10 GB
(9 rows)
Looks nice! :) thanks! (just for fun, the largest index on my side is one "statusentry_substr_validafter_idx", which is an index on two columns (a (SUBSTR() of) relay nickname and the consensus valid after (DESC)), and it's currently at 7004 MB.) Anyway, "these sizes make sense" is all I can think of right now!
Good to hear.
Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to run.
okay, maybe I'll think of something some time, and if I do, I can either open a ticket, or create a new email thread, unless this is kind-of-ok for this thread.
(Regarding "what part of $something is in memory", I remember the "disk read" (or was it "buffer read") words in EXPLAIN ANALYZE being useful. Also, sometimes postgres really mis-assumes on how much it'll have to read, and how much it ends up reading (it's all there in the results iirc, but you probably know all that.) In which case a VACUUM should help (of course), etc.)
Feel free to start a new thread or create a ticket for this. To be honest, I didn't run EXPLAIN ANALYZE on this database for quite a while. I just assume everything works fine.
If we're going to optimize the ExoneraTor database, should we move this discussion to a ticket?
Derailment with technicalities is always a looming danger I guess, but at this point I'm not even sure what you and Damian (and possibly others) are planning to do with the current ExoneraTor. I assume current ExoneraTor performance is good as it currently stands, so this part of the thread/thoughtspace can be closed for the time being as far as I can see. (And I could open a ticket if I think of something interesting to do regarding diagnosing/optimizing the ExoneraTor database.)
I suppose there's still no consensus whether a python-exonerator should aim to replicate current ExoneraTor's functionality (and, say, use the current database), or whether it should do more(tm). (Happy to participate in some form of discussion at the dev meeting, if my input can be useful!)
Damian won't be in Paris, AFAIK. But sure, happy to discuss more next week.
All the best, Karsten
On 09/06/14 00:03, Damian Johnson wrote:
Hi Karsten. This is diving into enough detail that we might as well move this over to tor-dev@. For the list's benefit, Karsten and I are discussing a Python rewrite of ExoneraTor...
https://exonerator.torproject.org/ https://gitweb.torproject.org/exonerator.git
Hi Damian. Sure, moving this to the list is a fine idea.
First I think I need to take a step back to figure out exactly what we're after. From a quick peek at ExoneraTor it looks like it behaves as follows...
a. User enters an address (IPv4 or IPv6) and a date (either for a day or an hour).
b. ExoneraTor lists router status entries for all relays that match the criteria. These entries link to the consensus they came from and server descriptors they reference.
c. The user can then enter a destination address and port to search exit policies in TorDNSEL entries.
Steps 'a' and 'b' make sense to me. Step 'c', however, I'm having a little difficulty grokking. Ignoring TorDNSEL entries for a moment, we already have all the ingredients to provide the user with three fields to start with...
- Source Address (required)
- Timestamp (required)
- Destination Address and/or Port (optional)
The source address and timestamp come from the consensus, and an optional 'can it exit to destination X' consults the server descriptor's exit policy.
So what is TorDNSEL providing us and why is it a separate search on the page? As I understand it the value of TorDNSEL is that we can't trust the address in the router status entries. If that's the case then our present search fields don't make sense to me...
Our initial search consults consensus information for the address and timestamp but not the exit policy. This is weird both because the address it has is faulty, and because we have the exit policy, so we could trivially include it in our search criteria.
Our second search gives the impression that we're using the earlier consensus results to query exit criteria from TorDNSEL. As I understand it though that's not what it's doing. TorDNSEL is completely independent from the consensus information.
I could understand a search that just consults consensus information (ignoring address accuracy, it has everything we need). I could also understand a search that just consults TorDNSEL information (ignoring its inconsistent poll rate, it has everything we need).
However, this hybrid approach and how it's presented really confuses me. Unless I'm mistaken with something above what I'd expect from ExoneraTor is...
The three search fields mentioned above.
It shows results based on the consensus information like we presently do.
If we have TorDNSEL entries that either indicate that a relay we're presenting had a different external address or another relay had the address we're searching for then note that.
That is to say, the base search is based on consensus information (using server descriptor exit policies if we want to filter by that), and the TorDNSEL results are just appended notes since we can't rely on its poll rate.
Thoughts?
Your description above is not entirely correct. Here's how ExoneraTor works with regard to consensuses and TorDNSEL exit lists:
- In step 1 it tells you whether an IP address was contained in a consensus _or_ TorDNSEL exit list on a given date or at a given time on that date.
- In step 2 it can tell you whether the relay with that IP address could have exited to another IP address and port.
That being said, I think people care a lot more about step 1 than about step 2. Most people probably don't know exactly what step 2 does. And with IPv6, relays don't even include full exit policies in their descriptors, so we wouldn't have the data to say whether a relay could have exited to a given IP address and port.
So, I'm open to discuss whether we should leave out step 2 in the rewrite and only focus on step 1.
Cheers! -Damian
PS. Congratulations on getting me invested. I just spent the last three hours in front of a whiteboard trying to puzzle out why ExoneraTor works the way it presently does. ;)
Hooray! :)
PPS. Stem's ExitPolicy class has a can_exit_to() method that would be really handy for this...
https://stem.torproject.org/api/exit_policy.html#stem.exit_policy.ExitPolicy...
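A quick sketch of what that looks like in use (the policy rules here are made up):

    from stem.exit_policy import ExitPolicy

    policy = ExitPolicy('accept *:80', 'accept *:443', 'reject *:*')
    print(policy.can_exit_to('208.77.188.166', 80))  # True
    print(policy.can_exit_to('208.77.188.166', 25))  # False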
See above. Maybe we don't need this at all.
PPPS. I'm still hesitant about actually tackling this project. Arm is midway through being rewritten, and considering its sudden uptick in usage probably the most important project on my plate right now.
That said, I'm happy to discuss this. Even if we don't implement it right now this thread will be useful so we know where we're going with ticket #8260.
Concerning the earlier discussion of 'work with Karsten on a python project' I have a personal bias toward collaborating when the project has few unknowns for me, but working alone when *I'm* learning something. That is to say, I'd love to work with you on a straightforward Stem project and I'd also like to discuss ExoneraTor's design. But when it comes to coding, this has enough unknowns that if I take it on I'd prefer to experiment alone for a while - at least until I know enough about the APIs involved that I can avoid embarrassing myself. :)
How about we work on an ExoneraTor design document, to start with? The step-2 thing above is not the only open design question. Another open question is how accurately users can provide the date or datetime. Are they aware that all data are in UTC? Should we err on the safe side and include network statuses published _after_ the timestamp the user tells us (#3232)?
But, first replying to your other email, because a searchable descriptor archive is indeed useful, too.
All the best, Karsten
On Sun, Jun 8, 2014 at 2:56 AM, Karsten Loesing karsten@torproject.org wrote:
On 08/06/14 06:27, Damian Johnson wrote:
Here's a quick overview of the codebase to facilitate reading through it:
Ahhh, very useful - thanks.
Hmmm. Just took a quick peek at the ExoneraTor codebase and, unless I'm mistaken, it doesn't actually use metrics-lib, does it?
You're right, looks like it doesn't.
Honestly looking over the code is making me a little hesitant to take this on after all. I was anticipating a small, quick project of DocTor's scope but I've never touched SQLAlchemy or Postgres before.
I don't think we'll even have to touch the Postgres for moving from Java to Python. The Python code would simply do SQL calls via its SQL library just like Java does.
I just copied all SQL statements that the Python part would have to prepare and execute:
CALL insert_descriptor(?, ?);
CALL insert_statusentry(?, ?, ?, ?, ?, ?, ?);
CALL insert_consensus(?, ?);
CALL insert_exitlistentry(?, ?, ?, ?, ?);
SELECT MIN(validafter) AS first, MAX(validafter) AS last FROM consensus;
SELECT validafter FROM consensus WHERE validafter >= ? AND validafter <= ?;
CALL search_statusentries_by_address_date(?, ?);
CALL search_addresses_in_same_24 (?, ?);
CALL search_addresses_in_same_48 (?, ?);
SELECT rawdescriptor FROM descriptor WHERE descriptor = ?;
SELECT descriptor, rawdescriptor FROM descriptor WHERE descriptor LIKE ?;
SELECT rawconsensus FROM consensus WHERE validafter = ?;
That's it. No further knowledge about Postgres required.
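For example, with psycopg2 the translation is roughly the following sketch: JDBC's '?' placeholders become '%s', and the CALL statements would map to callproc(), which runs SELECT * FROM function(...) under the hood. The DSN and sample arguments are placeholders:

    import psycopg2

    conn = psycopg2.connect('dbname=exonerator')  # placeholder DSN
    cur = conn.cursor()

    cur.execute('SELECT MIN(validafter) AS first, '
                'MAX(validafter) AS last FROM consensus')
    first, last = cur.fetchone()
    print('consensuses from %s to %s' % (first, last))

    cur.callproc('search_statusentries_by_address_date',
                 ('128.31.0.34', '2014-06-01'))
    for row in cur.fetchall():
        print(row)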
Once I wrote this I realized I'm being a damn hypocrite. Here I was saying "Karsten, learn Python so we can leverage each other's codebases!" but then I hightail it once the project delves into areas new to me. New arm users are showing up almost daily on irc and I'm anxious to give them a new release... but then this is exactly the issue, isn't it? Deliverables you'd like to focus on crowding out time to learn new things.
So TL;DR I'm gonna eat my own words and suggest we focus on our separate domains for now. I really would like to work on some small metrics projects with you. Each month I eyeball your status reports asking myself "Is there anything here I can work with Karsten on to draw our spaces closer together?" so please let me know if you run across anything in Metrics we can collaborate on.
(Replying below, first replying to the DynamoDB part.)
Your hypocritical friend, ~Damian
PS. When we next meet I'd like to discuss ExoneraTor's design a bit. First thought I had when looking at the code was 'huh... I wonder if this would be a good use case for DynamoDB'.
I'm wary about moving to another database, especially NoSQL ones and/or cloud-based ones. They don't magically make things faster, and Postgres is something I understand quite well by now. And again, I think we'd keep the Postgres part entirely unchanged when moving to Python. Not saying that DynamoDB can't be the better choice, but switching the database is not a priority for me.
So, regarding the rewrite: rather than canceling the project before it starts, how about we find a role for you that you're more comfortable with?
For example, I'd want to try rewriting it step by step based on your suggestion of frameworks/libraries and with some code review of yours.
If you're interested, which framework would I use for the new Python ExoneraTor? It's supposed to do the following tasks:
- Provide a simple web site with a web form, backed by the PostgreSQL database.
- Maybe offer a simple RESTful API for lookups that the web form could use to compose responses, but that could also be used by other applications directly.
- Return documents from the database by identifier, so without providing a search functionality.
- Run a scheduled task once per hour that fetches data from CollecTor and puts it in a database.
Bonus points if the result is as easy to deploy on Debian Wheezy as possible. Like, install these few Debian packages, run the setup script, done.
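To make the shape of that concrete, a minimal sketch assuming Flask and psycopg2 (the route names, form fields, and DSN below are all made up):

    from flask import Flask, jsonify, request
    import psycopg2

    app = Flask(__name__)

    def lookup(address, timestamp):
        # One connection per request keeps the sketch simple; real
        # code would reuse connections.
        conn = psycopg2.connect('dbname=exonerator')  # placeholder
        cur = conn.cursor()
        cur.callproc('search_statusentries_by_address_date',
                     (address, timestamp))
        rows = cur.fetchall()
        conn.close()
        return rows

    @app.route('/')
    def index():
        # Simple web form that submits back to this page via GET.
        address = request.args.get('ip', '')
        timestamp = request.args.get('timestamp', '')
        rows = lookup(address, timestamp) if address and timestamp else []
        form = ('<form>IP: <input name="ip" value="%s"> '
                'Timestamp: <input name="timestamp" value="%s"> '
                '<input type="submit"></form>' % (address, timestamp))
        return form + '<br>'.join(str(row) for row in rows)

    @app.route('/lookup')
    def lookup_json():
        # The RESTful part: same lookup, JSON instead of HTML.
        rows = lookup(request.args['ip'], request.args['timestamp'])
        return jsonify(results=[[str(v) for v in row] for row in rows])

    if __name__ == '__main__':
        app.run()

The hourly CollecTor fetch could then just be a cron job around a small import script, and deployment would be roughly apt-get install python-flask python-psycopg2 plus a setup script (I believe both packages are in Wheezy).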
Of course, if you'd prefer to focus on other things and not discuss ExoneraTor stuff, that's perfectly fine, too. :)
All the best, Karsten