On 09/06/14 00:03, Damian Johnson wrote:
Hi Karsten. This is diving into enough detail that we might as well move this over to tor-dev@. For the list's benefit, Karsten and I are discussing a Python rewrite of ExoneraTor...
https://exonerator.torproject.org/ https://gitweb.torproject.org/exonerator.git
Hi Damian. Sure, moving this to the list is a fine idea.
First I think I need to take a step back to figure out exactly what we're after. From a quick peek at ExoneraTor it looks like it behaves as follows...
a. User enters an address (IPv4 or IPv6) and a date (either for a day or an hour).
b. ExoneraTor lists router status entries for all relays that match the criteria. These entries link to the consensus they came from and server descriptors they reference.
c. The user can then enter a destination address and port to search exit policies in TorDNSEL entries.
Steps 'a' and 'b' make sense to me. Step 'c' however I'm having a little difficulty grokking. Ignoring TorDNSEL entries for a moment, we already have all the ingredients to provide the user with three fields to start with...
- Source Address (required)
- Timestamp (required)
- Destination Address and/or Port (optional)
The source address and timestamp come from the consensus, and an optional 'can it exit to destination X' consults the server descriptor's exit policy.
So what is TorDNSEL providing us and why is it a separate search on the page? As I understand it the value of TorDNSEL is that we can't trust the address in the router status entries. If that's the case then our present search fields don't make sense to me...
Our initial search consults consensus information for the address and timestamp but not the exit policy. This is weird both because the address it relies on can't be trusted, and because we have the exit policy, so we could trivially include that in our search criteria.
Our second search gives the impression that we're using the earlier consensus results to query exit criteria from TorDNSEL. As I understand it though that's not what it's doing. TorDNSEL is completely independent from the consensus information.
I could understand a search that just consults consensus information (ignoring address accuracy, it has everything we need). I could also understand a search that just consults TorDNSEL information (ignoring its inconsistent poll rate, it has everything we need).
However, this hybrid approach and how it's presented really confuses me. Unless I'm mistaken with something above what I'd expect from ExoneraTor is...
The three search fields mentioned above.
It shows results based on the consensus information like we presently do.
If we have TorDNSEL entries that either indicate that a relay we're presenting had a different external address or another relay had the address we're searching for then note that.
That is to say, the base search is based on consensus information (using server descriptor exit policies if we want to filter by that), and the TorDNSEL results are just appended notes since we can't rely on its poll rate.
Thoughts?
Your description above is not entirely correct. Here's how ExoneraTor works with regard to consensuses and TorDNSEL exit lists:
- In step 1 it tells you whether an IP address was contained in a consensus _or_ TorDNSEL exit list on a given date or at a given time on that date.
- In step 2 it can tell you whether the relay with that IP address could have exited to another IP address and port.
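For the list's benefit, step 1 can be sketched in a few lines of Python. This is only an illustration of the "consensus _or_ exit list" check described above, not ExoneraTor's actual code: the `seen_on` function, the tuple layout, and the sample addresses are all hypothetical, and the real data would come from the database rather than from in-memory lists.

```python
from datetime import date, datetime, timedelta, timezone

UTC = timezone.utc

# Hypothetical sample records: (timestamp, address) pairs.
# Status entries carry the consensus valid-after time,
# exit list entries carry the time of the TorDNSEL scan.
STATUS_ENTRIES = [(datetime(2014, 6, 8, 12, 0, tzinfo=UTC), "192.0.2.1")]
EXIT_LIST_ENTRIES = [(datetime(2014, 6, 8, 3, 15, tzinfo=UTC), "198.51.100.7")]


def seen_on(address, day, status_entries, exit_list_entries):
    """Return True if address shows up in either source during the UTC day."""
    start = datetime(day.year, day.month, day.day, tzinfo=UTC)
    end = start + timedelta(days=1)

    def listed(entries):
        return any(a == address and start <= t < end for t, a in entries)

    # Step 1 is an OR over both sources: a hit in either one counts.
    return listed(status_entries) or listed(exit_list_entries)


found = seen_on("198.51.100.7", date(2014, 6, 8), STATUS_ENTRIES, EXIT_LIST_ENTRIES)
```

The second address only appears in the exit list, so a consensus-only search would miss it.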
That being said, I think that step 1 is what people care a lot more about than step 2. Most people probably don't know exactly what step 2 does. And with IPv6, relays don't even include full exit policies in their descriptors that we could use to say whether a relay could have exited to a given IP address and port.
So, I'm open to discuss whether we should leave out step 2 in the rewrite and only focus on step 1.
Cheers! -Damian
PS. Congratulations on getting me invested. I just spent the last three hours in front of a whiteboard trying to puzzle out why ExoneraTor works the way it presently does. ;)
Hooray! :)
PPS. Stem's ExitPolicy class has a can_exit_to() method that would be really handy for this...
https://stem.torproject.org/api/exit_policy.html#stem.exit_policy.ExitPolicy...
See above. Maybe we don't need this at all.
PPPS. I'm still hesitant about actually tackling this project. Arm is midway through being rewritten, and considering its sudden uptick in usage probably the most important project on my plate right now.
That said, I'm happy to discuss this. Even if we don't implement it right now this thread will be useful so we know where we're going with ticket #8260.
Concerning the earlier discussion of 'work with Karsten on a python project' I have a personal bias toward collaborating when the project has few unknowns for me, but working alone when *I'm* learning something. That is to say, I'd love to work with you on a straightforward Stem project and I'd also like to discuss ExoneraTor's design. But when it comes to coding, this has enough unknowns that if I take it on I'd prefer to experiment alone for a while - at least until I know enough about the APIs involved that I can avoid embarrassing myself. :)
How about we work on an ExoneraTor design document, to start with? The step-2 thing above is not the only open design question. Another open question is how accurately users can provide the date or datetime. Are they aware that all data are in UTC? Should we err on the safe side and include network statuses published _after_ the timestamp the user tells us (#3232)?
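To make the timestamp question concrete, here is a rough sketch of one possible answer, not a decision: treat whatever the user enters as UTC and widen the search window past the end of the requested period. The `search_window` name, the two accepted input formats, and the one-hour default slack are all assumptions for illustration; what #3232 should actually settle on is still open.

```python
from datetime import datetime, timedelta, timezone


def search_window(user_input, slack_hours=1):
    """Interpret 'YYYY-MM-DD' or 'YYYY-MM-DD HH:MM' as UTC.

    Returns (start, end), where end is pushed slack_hours past the
    requested day or hour so that statuses published shortly after
    the user's timestamp are still included.
    """
    try:
        parsed = datetime.strptime(user_input, "%Y-%m-%d %H:%M")
        # Hourly lookups start at the top of the hour the user named.
        start = parsed.replace(minute=0)
        span = timedelta(hours=1)
    except ValueError:
        start = datetime.strptime(user_input, "%Y-%m-%d")
        span = timedelta(days=1)

    # The input carries no zone information, so label it as UTC explicitly.
    start = start.replace(tzinfo=timezone.utc)
    return start, start + span + timedelta(hours=slack_hours)


day_start, day_end = search_window("2014-06-08")
```

Whether the slack should be a fixed hour, a full consensus lifetime, or something the user can toggle is exactly the kind of thing the design document would need to pin down.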
But, first replying to your other email, because a searchable descriptor archive is indeed useful, too.
All the best, Karsten
On Sun, Jun 8, 2014 at 2:56 AM, Karsten Loesing <karsten@torproject.org> wrote:
On 08/06/14 06:27, Damian Johnson wrote:
Here's a quick overview of the codebase to facilitate reading through it:
Ahhh, very useful - thanks.
Hmmm. Just took a quick peek at the ExoneraTor codebase and, unless I'm mistaken, it doesn't actually use metrics-lib, does it?
You're right, looks like it doesn't.
Honestly looking over the code is making me a little hesitant to take this on after all. I was anticipating a small, quick project of DocTor's scope but I've never touched SQLAlchemy or Postgres before.
I don't think we'll even have to touch the Postgres for moving from Java to Python. The Python code would simply do SQL calls via its SQL library just like Java does.
I just copied all SQL statements that the Python part would have to prepare and execute:
CALL insert_descriptor(?, ?);
CALL insert_statusentry(?, ?, ?, ?, ?, ?, ?);
CALL insert_consensus(?, ?);
CALL insert_exitlistentry(?, ?, ?, ?, ?);
SELECT MIN(validafter) AS first, MAX(validafter) AS last FROM consensus;
SELECT validafter FROM consensus WHERE validafter >= ? AND validafter <= ?;
CALL search_statusentries_by_address_date(?, ?);
CALL search_addresses_in_same_24 (?, ?);
CALL search_addresses_in_same_48 (?, ?);
SELECT rawdescriptor FROM descriptor WHERE descriptor = ?;
SELECT descriptor, rawdescriptor FROM descriptor WHERE descriptor LIKE ?;
SELECT rawconsensus FROM consensus WHERE validafter = ?;
That's it. No further knowledge about Postgres required.
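As a rough illustration of what "SQL calls via its SQL library" could look like in Python, here is a sketch using the standard DB-API. Two caveats: it uses the stdlib `sqlite3` module with an in-memory table purely as a stand-in so that it runs anywhere, and the table layout is guessed from the column names in the statements above. Against the real Postgres database one would use a PostgreSQL driver instead (psycopg2, for example, uses `%s` placeholders rather than `?`), and the stored procedures would have to be called in whatever way that driver supports.

```python
import sqlite3

# Stand-in database; the schema below is an assumption, not ExoneraTor's.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consensus (validafter TEXT PRIMARY KEY, rawconsensus TEXT)")
conn.executemany(
    "INSERT INTO consensus (validafter, rawconsensus) VALUES (?, ?)",
    [
        ("2014-06-08 00:00:00", "network-status-version 3 (sample A)"),
        ("2014-06-08 01:00:00", "network-status-version 3 (sample B)"),
        ("2014-06-08 02:00:00", "network-status-version 3 (sample C)"),
    ],
)


def consensuses_between(conn, start, end):
    """Run the valid-after range query with bound parameters."""
    cur = conn.execute(
        "SELECT validafter FROM consensus WHERE validafter >= ? AND validafter <= ?",
        (start, end),
    )
    return [row[0] for row in cur.fetchall()]


def raw_consensus(conn, validafter):
    """Fetch one raw consensus by its valid-after time, or None."""
    cur = conn.execute(
        "SELECT rawconsensus FROM consensus WHERE validafter = ?", (validafter,)
    )
    row = cur.fetchone()
    return row[0] if row else None


hits = consensuses_between(conn, "2014-06-08 00:30:00", "2014-06-08 02:00:00")
```

The point is only that the Python side prepares a statement, binds parameters, and reads rows, much as the Java side does today.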
Once I wrote this I realized I'm being a damn hypocrite. Here I was saying "Karsten, learn Python so we can leverage each other's codebases!" but then I hightail it once the project delves into areas new to me. New arm users are showing up almost daily on irc and I'm anxious to give them a new release... but then this is exactly the issue, isn't it? Deliverables you'd like to focus on crowding out time to learn new things.
So TL;DR I'm gonna eat my own words and suggest we focus on our separate domains for now. I really would like to work on some small metrics projects with you. Each month I eyeball your status reports asking myself "Is there anything here I can work with Karsten on to draw our spaces closer together?" so please let me know if you run across anything in Metrics we can collaborate on.
(Replying below, first replying to the DynamoDB part.)
Your hypocritical friend, ~Damian
PS. When we next meet I'd like to discuss ExoneraTor's design a bit. First thought I had when looking at the code was 'huh... I wonder if this would be a good use case for DynamoDB'.
I'm wary about moving to another database, especially NoSQL ones and/or cloud-based ones. They don't magically make things faster, and Postgres is something I understand quite well by now. And again, I think that we keep the Postgres part entirely unchanged when moving to Python. Not saying that DynamoDB can't be the better choice, but switching the database is not a priority for me.
So, regarding the rewrite: rather than canceling the project before it starts, how about we find a role for you that you're more comfortable with?
For example, I'd want to try rewriting it step by step based on your suggestion of frameworks/libraries and with some code review of yours.
If you're interested, which framework would I use for the new Python ExoneraTor? It's supposed to do the following tasks:
- Provide a simple web site with a web form, backed by the PostgreSQL database.
- Maybe offer a simple RESTful API for lookups that the web form could use to compose responses, but that could also be used by other applications directly.
- Return documents from the database by identifier, so without providing a search functionality.
- Run a scheduled task once per hour that fetches data from CollecTor and puts it in a database.
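To give the "return documents by identifier" task a concrete shape without prejudging the framework question, here is a framework-neutral sketch written as a plain WSGI callable, which most Python web frameworks and servers can host. Everything in it is hypothetical: the `id` query parameter, the JSON response layout, and the `DOCUMENTS` dict, which merely stands in for a database lookup.

```python
import json
from urllib.parse import parse_qs

# Hypothetical stand-in for "SELECT rawdescriptor FROM descriptor WHERE descriptor = ?".
DOCUMENTS = {"abc123": "router example 192.0.2.1 9001 0 0"}


def app(environ, start_response):
    """Minimal WSGI lookup endpoint: /?id=<identifier> returns JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    doc_id = query.get("id", [""])[0]
    document = DOCUMENTS.get(doc_id)

    if document is None:
        status, body = "404 Not Found", {"error": "unknown identifier"}
    else:
        status, body = "200 OK", {"id": doc_id, "document": document}

    payload = json.dumps(body).encode("utf-8")
    start_response(
        status,
        [("Content-Type", "application/json"), ("Content-Length", str(len(payload)))],
    )
    return [payload]
```

For local experiments this can be served with `wsgiref.simple_server.make_server` from the standard library; a web form could then compose its page from the same JSON that other applications would consume.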
Bonus points if the result is as easy to deploy on Debian Wheezy as possible. Like, install these few Debian packages, run the setup script, done.
Of course, if you'd prefer to focus on other things and not discuss ExoneraTor stuff, that's perfectly fine, too. :)
All the best, Karsten