Christian, Lukas, everyone,
I learned today that we should have something working in a week or two. That's why I started hacking on this today and produced some code:
https://github.com/kloesing/challenger
Here are a few things I could use help with:
- Anybody want to help turning this script into a web app, possibly using Flask? See the first next step in README.md.
- Lukas, you announced OnionPy on tor-dev@ the other day. Want to look into the "Add local cache for ..." bullet points under "Next steps"? Is this something OnionPy could support? Want to write the glue code?
- Christian, want to help write the graphing code that visualizes the `combined-*.json` files produced by that tool? The README.md suggests a few possible graphs.
Thanks in advance! You're all helping grow the Tor network!
Also replying to Christian's mail inline.
On 28/03/14 09:07, Christian wrote:
On 27.03.2014 16:25, Karsten Loesing wrote:
On 27/03/14 11:57, Roger Dingledine wrote:
Hi Christian, other tor relay fans,
I'm looking for some volunteers, hopefully including Christian, to work on metrics and visualization of impact from new relays.
We're working with EFF to do another "Tor relay challenge" [*], to both help raise awareness of the value of Tor, and encourage many people to run relays -- probably non-exit relays for the most part, since that's the easiest for normal volunteers to step up and do.
You can read about the first round from several years ago here: https://www.eff.org/torchallenge
To make it succeed, the challenge for us here is to figure out what to measure to track progress, and then measure it and graph it for everybody.
I'm figuring that like last time, EFF will collect a list of fingerprints of relays that signed up "because of the challenge".
One of the main pushes we're aiming for this year is longevity: it's easy to sign up a relay for two weeks and then stop. We want to emphasize consistency and encourage having the relays up for many months.
Do you want the challenge application to simply provide some graphs or give some sort of interactive dashboard (clientside JavaScript)?
You asked Roger, and I'm not Roger, but I'd say let's start with some graphs. We can always make it more interactive later. Though I doubt it will be necessary.
Before going through your list of things we'd want to track below, let's first talk about our options to turn a list of fingerprints into fancy graphs:
- Write a new metrics-web module and put graphs on the metrics
website. This means parsing relay descriptors and storing certain per-relay statistics for all relays. That gives us maximum flexibility in the kinds of statistics, but is also most expensive in terms of developer hours. I don't want to do this.
- Extend Globe to show details pages for multiple relays. This
requires us to move to the server-based Globe-node, because the poor browser shouldn't download graph data for all relays, but the server should return a single graph for all relays. It's also unclear if the new graphs will be of general interest for Globe users, and if the rest of the Globe details will be confusing to people interested in the relay challenge. Probably not a great idea, but I'm not sure.
I agree that Globe isn't the best place to display the challenge graphs. Currently the only focus for Globe is to provide data for single relays and bridges. Imo it would be better if the challenge participants list adds links to atlas, blutmagie and globe.
Agreed!
- Extend Onionoo to return aggregate graph data for a given set of
fingerprints. Seems useful. But has the big disadvantage that Onionoo would suddenly have to create responses dynamically. I'm worried about creating a new performance bottleneck there, and this is certainly not possible with poor overloaded yatei.
- Write a new little tool that fetches Onionoo documents once (or
twice) per day for all relays participating in the relay challenge and that produces graph data. That new tool could probably re-use some Compass code for the backend and some Globe code for the frontend. Graphs could be integrated directly into EFF's website. This is currently my favorite approach.
I like this idea.
Glad to hear! I slightly moved away from the "fetches once or twice per day" idea to a more elaborate approach. But the general idea is still the same.
Note for 2--4: Onionoo currently only gives out data for relays that have been running in the past 7 days. I'd have to extend it to give out all data for a list of fingerprints, regardless of when relays were running the last time. That's 2--3 days of coding and testing for me. It's also potentially creating a bottleneck, so we should first have a replacement for yatei.
So what are the things we'd want to track?
- Number of relays signed up that are Running, over time.
We can do something here with Onionoo's new uptime documents.
- Total bandwidth history of these running relays, over time.
We can sum up data from bandwidth documents for this.
- Maybe a graph showing the total number of bytes ever contributed by these relays? That would impress people perhaps.
Sure, same data as above.
- Total consensus weight of these running relays, over time.
We only have total consensus weight *fraction*, but yes.
- Something emphasizing duration -- e.g. the total consensus weight of the subset of the relays that have been in the consensus for 90% of the past month, 2 months, 6 months, etc. Are there better ideas here I hope? We'll want to be cognizant that if we're in the first week of the challenge, the 2 month graph will be empty and thus look sad.
Not sure what the 90% part is for, but yes, graphs with total consensus weight fraction are doable.
Regarding the sad-looking 2 month graph, we can easily define the data when the challenge starts and not show graphs until they make sense. Note that the current intervals for most data are 1 week, 1 month, 3 months, 1 year, and 5 years.
- Something comparing the above numbers to the total numbers. Given how huge some of the relays are lately, it would be easy to visualize the new contribution as a tiny irrelevant fraction, which could be disheartening to new relay operators even if their relays will actually become a big deal with some patience. What are some strategies for making this work right? E.g. a layer graph showing y layered on top of x where y is the new contribution, rather than a percentage-of-total graph that shows approximately 0%.
Absolute contributions to consensus weight are not available, just relative fractions.
We could also imagine more niche categories. For example, if we're hoping to get people to sign up relays at universities, we could imagine that the folks running the challenge give us a list of fingerprints of relays that self-identify as being at universities, and then we do up the same set of graphs with that subset of relays.
Sure, that's doable.
So, Christian, others, how much of this is possible as-is or with some limited tweaking, with Globe and related scripts? is most of it. :) I also cc Karsten because a lot of this overlaps with the metrics scripts, but I am expecting Karsten to push back against the idea of integrating these measurements more with the metrics project.
Right, adding this to the metrics website is not a good idea, because then we'd have to parse raw relay descriptors.
Somebody else to include here is Sreenatha who has done a pretty good job processing Onionoo data for the t-shirt yes/no ticket #9889.
Any other ideas for what to measure to help people know whether their contribution is being worthwhile?
Not yet, but new ideas may arise when we start working on the code.
[*] Please don't take this mail as any official announcement, or timeline, or any of that. At this point we need to collect people to help make this happen, not collect news stories.
What's the timeline for this? This requires some non-trivial coding time, and I'm not sure how to prioritize this over existing things on my todo list.
All the best, Karsten
(Found nothing else to comment on.)
Thanks!
All the best, Karsten
Hello everyone (reply all ftw),
On 04/04/2014 07:13 PM, Karsten Loesing wrote:
Christian, Lukas, everyone,
I learned today that we should have something working in a week or two. That's why I started hacking on this today and produced some code:
https://github.com/kloesing/challenger
Here are a few things I could use help with:
- Anybody want to help turning this script into a web app, possibly
using Flask? See the first next step in README.md.
I might be able to do that, but currently I don't have enough free time to make a commitment.
- Lukas, you announced OnionPy on tor-dev@ the other day. Want to look
into the "Add local cache for ..." bullet points under "Next steps"? Is this something OnionPy could support? Want to write the glue code?
onion-py already supports transparent caching using memcached. I use a (hopefully) unique serialisation of the query as the key (see serializer functions here: https://github.com/duk3luk3/onion-py/blob/master/onion_py/manager.py#L7) and have a bit of spaghetti code to check for available cached data and the 304 response status from onionoo (https://github.com/duk3luk3/onion-py/blob/master/onion_py/manager.py#L97).
I don't really understand what the code does. What is meant by "combining" documents? What exactly are we trying to measure? Once I know that and have thought of a sensible way to integrate it into onion-py I'm confident I can in fact write that glue code :)
Cutting off the rest of the quote tree here (is that a polite thing to do on mailing lists? Sorry if not.), I just have two more comments towards Roger's thoughts:
1. Groups of relays taking the challenge together could just form relay families and we could count relay families in aggregate. (I'm already thinking about relay families a lot because gamambel wants me to overhaul the torservers exit-funding scripts to use relay families.)
2. If you want to do something with consensus weight, why not compare against all other new relays based on the first_seen property? ("new" can be adjusted until sufficiently pretty graphs emerge; and we'd need to periodically (every 4 or 12 or 24 hours?) fetch the consensus_weights from onionoo)
Cheers, Luke
PS: If you'd like me to support different backends for the caching in onion-py, I'm open to integrating anything that has a python 3 library.
On 04/04/14 21:24, Lukas Erlacher wrote:
Hello everyone (reply all ftw),
Hi Lukas,
On 04/04/2014 07:13 PM, Karsten Loesing wrote:
Christian, Lukas, everyone,
I learned today that we should have something working in a week or two. That's why I started hacking on this today and produced some code:
https://github.com/kloesing/challenger
Here are a few things I could use help with:
- Anybody want to help turning this script into a web app,
possibly using Flask? See the first next step in README.md.
I might be able to do that, but currently I don't have enough free time to make a commitment.
Okay. Maybe I'll give it a try by stealing heavily from Sathya's Compass code. Unless somebody else wants to give this a try?
- Lukas, you announced OnionPy on tor-dev@ the other day. Want to
look into the "Add local cache for ..." bullet points under "Next steps"? Is this something OnionPy could support? Want to write the glue code?
onion-py already supports transparent caching using memcached. I use a (hopefully) unique serialisation of the query as the key (see serializer functions here: https://github.com/duk3luk3/onion-py/blob/master/onion_py/manager.py#L7) and have a bit of spaghetti code to check for available cached data and the 304 response status from onionoo (https://github.com/duk3luk3/onion-py/blob/master/onion_py/manager.py#L97).
On second thought, and after sleeping over this, I'm less convinced that we should use an external library for the caching. We should rather start with a simple dict in memory and flush it based on some simple rules. That would allow us to tweak the caching specifically for our use case. And it would mean avoiding a dependency.
We can think about moving to onion-py at a later point. That gives you the opportunity to unspaghettize your code, and once that is done we'll have a better idea what caching needs we have for the challenger tool to decide whether to move to onion-py or not.
Would you still want to help write the simple caching code for challenger?
I don't really understand what the code does. What is meant by "combining" documents? What exactly are we trying to measure? Once I know that and have thought of a sensible way to integrate it into onion-py I'm confident I can in fact write that glue code :)
Right now, the script sums up all graphs contained in Onionoo's bandwidth, clients, uptime, and weights documents. It also limits the range of the new graphs to max(first) to max(last) of given input graphs.
For example, assume we want to know the total bandwidth provided by the following 2 relays participating in the relay challenge:
datetime:  0, 1, 2, 3, 4, 5, ...
relay 1:     [5, 4, 5, 6]
relay 2:  [4, 3, 5, 4]
combined:    [8, 9, 9, 6]
This is not perfect for various reasons, but it's the best I came up with yesterday. Also, as we all know, perfect is the enemy of good.
(If you're curious, reason #1: the graph goes down at the end, and we can't say whether it's because relay 2 disappeared or did not report data yet; reason #2: we're weighting both relays' B/s equally, though relay 1 might have been online 24/7 and relay 2 only long enough that Onionoo doesn't put in null; there may be more reasons.)
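The combining step described above can be sketched roughly as follows. This is only an illustration, not the code in the challenger repository: the function name `combine_histories` and the `(first, values)` representation are made up for this sketch, and plain integer indices stand in for the timestamps and intervals that Onionoo documents actually use. It assumes relay 2's first data point is at datetime 0 and relay 1's at datetime 1, which is the alignment that reproduces the combined values in the example.

```python
def combine_histories(histories):
    """Sum histories over the window max(first) .. max(last).

    Each history is a hypothetical (first, values) pair: `first` is the
    index of the first data point and `values` holds one value per interval.
    """
    if not histories:
        return None
    start = max(first for first, _ in histories)
    end = max(first + len(values) - 1 for first, values in histories)
    combined = []
    for t in range(start, end + 1):
        total = 0
        for first, values in histories:
            offset = t - first
            # A relay only contributes where it has a data point; missing
            # or null (None) values are treated as zero in this sketch.
            if 0 <= offset < len(values) and values[offset] is not None:
                total += values[offset]
        combined.append(total)
    return start, combined


relay_1 = (1, [5, 4, 5, 6])  # assumed to start at datetime 1
relay_2 = (0, [4, 3, 5, 4])  # assumed to start at datetime 0
print(combine_histories([relay_1, relay_2]))  # (1, [8, 9, 9, 6])
```

The last combined value is 6 because only relay 1 has a data point at datetime 4, which is the "graph goes down at the end" effect mentioned as reason #1.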
Cutting off the rest of the quote tree here (is that a polite thing to do on mailing lists? Sorry if not.), I just have two more comments towards Roger's thoughts:
- Groups of relays taking the challenge together could just form
relay families and we could count relay families in aggregate. (I'm already thinking about relay families a lot because gamambel wants me to overhaul the torservers exit-funding scripts to use relay families.)
Relay families are a difficult topic. I remember spending a day or two figuring out how to group by family in Compass a while back. There must be some notes or thoughts on Trac if you're curious.
Regarding these graphs, I'm not sure what we would gain from grouping new relays by family. My current plan is to provide only graphs that have a single graph line for all relays and bridges participating in the challenge. So, "total bytes read", "total bytes written", "total number of new relays and bridges", "total consensus weight fraction added", "total advertised bandwidth added", etc. I don't think we should add categories by family or any other criteria. KISS.
- If you want to do something with consensus weight, why
not compare against all other new relays based on the first_seen property? ("new" can be adjusted until sufficiently pretty graphs emerge; and we'd need to periodically (every 4 or 12 or 24 hours?) fetch the consensus_weights from onionoo)
I'm not sure what you mean. We do have consensus weight fractions in (combined) weights documents. I'm also planning to add absolute consensus weights to those documents in the future.
By "fetching something periodically from Onionoo", do you mean keeping a local state other than the latest cached Onionoo documents? I'm explicitly trying to avoid that. Keeping a state means you need to back it up and restore it, and most importantly, fix it whenever there's a bug. I'm already feeling that pain with Onionoo, so I'd want to keep all state in Onionoo and not make the new tool any more complex than required.
PS: If you'd like me to support different backends for the caching in onion-py, I'm open to integrating anything that has a python 3 library.
See above. Happy to discuss caching more when we know what caching needs we have.
I'm also not sure about Python 3. Whatever we write needs to run on Debian Wheezy with whatever libraries are present there. If they're all Python 3, great. If not, can't do.
Thanks for your feedback!
All the best, Karsten
Hi Karsten,
On 04/05/2014 09:58 AM, Karsten Loesing wrote:
On second thought, and after sleeping over this, I'm less convinced that we should use an external library for the caching. We should rather start with a simple dict in memory and flush it based on some simple rules. That would allow us to tweak the caching specifically for our use case. And it would mean avoiding a dependency. We can think about moving to onion-py at a later point. That gives you the opportunity to unspaghettize your code, and once that is done we'll have a better idea what caching needs we have for the challenger tool to decide whether to move to onion-py or not. Would you still want to help write the simple caching code for challenger?
I cleaned up the caching code and added a simple in-memory dict caching provider that has no further dependencies to onion-py. (it also has no provisions for eviction/flushing at all, but I will add that next. Right now everything is cached forever, but of course a new response from OnionOO replaces an old one.)
I can write the OnionOO API code and caching code for challenger, if I can use Python 3 and the requests library. (See below) Of course I'd really like to actually have a user for onion-py, since it would help getting the necessary feedback and polish to push the library to version 1.0, but I understand if that isn't appropriate for this project.
I don't really understand what the code does. What is meant by "combining" documents? What exactly are we trying to measure? Once I know that and have thought of a sensible way to integrate it into onion-py I'm confident I can in fact write that glue code :)
Right now, the script sums up all graphs contained in Onionoo's bandwidth, clients, uptime, and weights documents. It also limits the range of the new graphs to max(first) to max(last) of given input graphs.
For example, assume we want to know the total bandwidth provided by the following 2 relays participating in the relay challenge:
datetime:  0, 1, 2, 3, 4, 5, ...
relay 1:     [5, 4, 5, 6]
relay 2:  [4, 3, 5, 4]
combined:    [8, 9, 9, 6]
This is not perfect for various reasons, but it's the best I came up with yesterday. Also, as we all know, perfect is the enemy of good.
(If you're curious, reason #1: the graph goes down at the end, and we can't say whether it's because relay 2 disappeared or did not report data yet; reason #2: we're weighting both relays' B/s equally, though relay 1 might have been online 24/7 and relay 2 only long enough that Onionoo doesn't put in null; there may be more reasons.)
Ah, I see! :) So for scalar attributes of relays (such as consensus_weight_fraction) it's just a sum, and for histories it's the graphs combined as you just outlined. That makes sense, thank you!
I'm also not sure about Python 3. Whatever we write needs to run on Debian Wheezy with whatever libraries are present there. If they're all Python 3, great. If not, can't do.
I would strongly prefer to use Python 3. I understand wanting to use debian stable (I use it myself), but Python 3 is 6 years old and Python 2 is completely dead and its use for new projects is not recommended. The only mandatory dependency for onion-py, and for me, is requests (I really dislike using urllib* directly - if you want to know why, check https://gist.github.com/kennethreitz/973705), and the python3-requests package in Wheezy is from 2012, and there is no python3-flask. :-(
Is there anything standing against using pip (python3-pip package) to install requests and flask from pypi?
Thanks for your feedback!
All the best, Karsten
Cheers, Luke
On 05/04/14 12:19, Lukas Erlacher wrote:
Hi Karsten,
On 04/05/2014 09:58 AM, Karsten Loesing wrote:
On second thought, and after sleeping over this, I'm less convinced that we should use an external library for the caching. We should rather start with a simple dict in memory and flush it based on some simple rules. That would allow us to tweak the caching specifically for our use case. And it would mean avoiding a dependency. We can think about moving to onion-py at a later point. That gives you the opportunity to unspaghettize your code, and once that is done we'll have a better idea what caching needs we have for the challenger tool to decide whether to move to onion-py or not. Would you still want to help write the simple caching code for challenger?
I cleaned up the caching code and added a simple in-memory dict caching provider that has no further dependencies to onion-py. (it also has no provisions for eviction/flushing at all, but I will add that next. Right now everything is cached forever, but of course a new response from OnionOO replaces an old one.)
Yeah, I think we'll want to define a maximum lifetime of cache entries, or the poor cache will explode pretty soon.
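A simple in-memory dict cache with a maximum entry lifetime could look something like the sketch below. The class name `ExpiringCache`, its methods, and the `clock` parameter are all hypothetical; this is neither onion-py's caching provider nor anything in challenger, just one way to express the "simple dict, flushed by simple rules" idea.

```python
import time


class ExpiringCache:
    """In-memory dict cache whose entries expire after max_age seconds."""

    def __init__(self, max_age, clock=time.time):
        self.max_age = max_age
        self.clock = clock   # injectable so time can be faked in tests
        self.entries = {}    # key -> (stored_at, value)

    def put(self, key, value):
        # A newer response simply replaces the older one.
        self.entries[key] = (self.clock(), value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.max_age:
            # Too old: drop the entry and report a miss.
            del self.entries[key]
            return None
        return value

    def flush_expired(self):
        """Remove all expired entries and return how many were dropped."""
        now = self.clock()
        expired = [key for key, (stored_at, _) in self.entries.items()
                   if now - stored_at > self.max_age]
        for key in expired:
            del self.entries[key]
        return len(expired)
```

Calling `flush_expired()` every now and then keeps entries that are never requested again from piling up, which is the "cache will explode" concern.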
I can write the OnionOO API code and caching code for challenger, if I can use Python 3 and the requests library. (See below)
Great, your help would be much appreciated! Want to send me a pull request whenever you have something to merge?
See my response regarding Python 3 below.
Of course I'd really like to actually have a user for onion-py, since it would help getting the necessary feedback and polish to push the library to version 1.0, but I understand if that isn't appropriate for this project.
My hope with challenger is that it's written quickly, working quietly for a year, and then disappearing without anybody noticing. I'd rather not want to maintain yet another thing. So, maybe Weather is a better candidate for using onion-py than challenger.
I don't really understand what the code does. What is meant by "combining" documents? What exactly are we trying to measure? Once I know that and have thought of a sensible way to integrate it into onion-py I'm confident I can in fact write that glue code :)
Right now, the script sums up all graphs contained in Onionoo's bandwidth, clients, uptime, and weights documents. It also limits the range of the new graphs to max(first) to max(last) of given input graphs.
For example, assume we want to know the total bandwidth provided by the following 2 relays participating in the relay challenge:
datetime:  0, 1, 2, 3, 4, 5, ...
relay 1:     [5, 4, 5, 6]
relay 2:  [4, 3, 5, 4]
combined:    [8, 9, 9, 6]
This is not perfect for various reasons, but it's the best I came up with yesterday. Also, as we all know, perfect is the enemy of good.
(If you're curious, reason #1: the graph goes down at the end, and we can't say whether it's because relay 2 disappeared or did not report data yet; reason #2: we're weighting both relays' B/s equally, though relay 1 might have been online 24/7 and relay 2 only long enough that Onionoo doesn't put in null; there may be more reasons.)
Ah, I see! :) So for scalar attributes of relays (such as consensus_weight_fraction) it's just a sum, and for histories it's the graphs combined as you just outlined. That makes sense, thank you!
Right. Though details documents are not included, so just graphs, no scalar attributes.
I'm also not sure about Python 3. Whatever we write needs to run on Debian Wheezy with whatever libraries are present there. If they're all Python 3, great. If not, can't do.
I would strongly prefer to use Python 3. I understand wanting to use debian stable (I use it myself), but Python 3 is 6 years old and Python 2 is completely dead and its use for new projects is not recommended. The only mandatory dependency for onion-py, and for me, is requests (I really dislike using urllib* directly - if you want to know why, check https://gist.github.com/kennethreitz/973705), and the python3-requests package in Wheezy is from 2012, and there is no python3-flask. :-(
Is there anything standing against using pip (python3-pip package) to install requests and flask from pypi?
If there's a way to build it only with packages coming out of Wheezy's apt-get, our sysadmins will like us more, and that's a good thing.
Installing packages using Python-specific package managers is going to make our sysadmins sad, so we should have a very good reason for wanting such a package. In general, we don't need the latest and greatest package. Unless we do.
All the best, Karsten
On Sat, Apr 5, 2014 at 3:58 PM, Karsten Loesing karsten@torproject.org wrote:
Installing packages using Python-specific package managers is going to make our sysadmins sad, so we should have a very good reason for wanting such a package. In general, we don't need the latest and greatest package. Unless we do.
What about virtualenv? Part of the premise behind it is that you can configure appropriate packages as a developer / operator without having to bother sysadmins and making them worried about system-wide effects.
- Nikita
Hello Nikita, Karsten,
On 04/05/2014 05:03 PM, Nikita Borisov wrote:
On Sat, Apr 5, 2014 at 3:58 PM, Karsten Loesing karsten@torproject.org wrote:
Installing packages using Python-specific package managers is going to make our sysadmins sad, so we should have a very good reason for wanting such a package. In general, we don't need the latest and greatest package. Unless we do.
What about virtualenv? Part of the premise behind it is that you can configure appropriate packages as a developer / operator without having to bother sysadmins and making them worried about system-wide effects.
- Nikita
I was going to mention virtualenv as well, but I have to admit that I find it weird and scary, especially since I haven't found good documentation for it. If there is somebody who is familiar with virtualenv that would probably be the best solution.
On 04/05/2014 04:58 PM, Karsten Loesing wrote:
My hope with challenger is that it's written quickly, working quietly for a year, and then disappearing without anybody noticing. I'd rather not want to maintain yet another thing. So, maybe Weather is a better candidate for using onion-py than challenger.
Yes, I understand.
Yeah, I think we'll want to define a maximum lifetime of cache entries, or the poor cache will explode pretty soon.
What usage patterns do we have to expect? Do we want to hit onionoo to check if the cache is still valid for every request, or should we do "hard caching" for several minutes? The best UX solution would be to have a background task that keeps the cache current so user requests can be delivered without hitting onionoo at all. In other words, unless we do something intelligent with the cache, the cache is not actually going to be very useful.
Great, your help would be much appreciated! Want to send me a pull request whenever you have something to merge?
Will do.
Cheers, Luke
On 05/04/14 17:46, Lukas Erlacher wrote:
Hello Nikita, Karsten,
On 04/05/2014 05:03 PM, Nikita Borisov wrote:
On Sat, Apr 5, 2014 at 3:58 PM, Karsten Loesing karsten@torproject.org wrote:
Installing packages using Python-specific package managers is going to make our sysadmins sad, so we should have a very good reason for wanting such a package. In general, we don't need the latest and greatest package. Unless we do.
What about virtualenv? Part of the premise behind it is that you can configure appropriate packages as a developer / operator without having to bother sysadmins and making them worried about system-wide effects.
- Nikita
I was going to mention virtualenv as well, but I have to admit that I find it weird and scary, especially since I haven't found good documentation for it. If there is somebody who is familiar with virtualenv that would probably be the best solution.
I'm afraid I don't know enough about Python or virtualenv. So far, it was almost zero effort for our sysadmins to install a package from the repositories and keep that up-to-date. I'd like to stick with the apt-get approach and save the virtualenv approach for situations when we really need a package that is not contained in the repositories.
Thanks for the suggestion, though!
On 04/05/2014 04:58 PM, Karsten Loesing wrote:
My hope with challenger is that it's written quickly, working quietly for a year, and then disappearing without anybody noticing. I'd rather not want to maintain yet another thing. So, maybe Weather is a better candidate for using onion-py than challenger.
Yes, I understand.
Yeah, I think we'll want to define a maximum lifetime of cache entries, or the poor cache will explode pretty soon.
What usage patterns do we have to expect? Do we want to hit onionoo to check if the cache is still valid for every request, or should we do "hard caching" for several minutes? The best UX solution would be to have a background task that keeps the cache current so user requests can be delivered without hitting onionoo at all.
That's a fine question. I can see various caching approaches here. But I just realize that this is premature optimization. Let's first build the thing and download whatever we need and whenever we need it. And once we know what caching needs we have, let's build the cache.
In other words, unless we do something intelligent with the cache, the cache is not actually going to be very useful.
Valid point. :)
Great, your help would be much appreciated! Want to send me a pull request whenever you have something to merge?
Will do.
Great. Thanks!
All the best, Karsten
On Tue, Apr 8, 2014 at 12:59 PM, Karsten Loesing karsten@torproject.org wrote:
On 05/04/14 17:46, Lukas Erlacher wrote:
Hello Nikita, Karsten,
On 04/05/2014 05:03 PM, Nikita Borisov wrote:
On Sat, Apr 5, 2014 at 3:58 PM, Karsten Loesing karsten@torproject.org wrote:
Installing packages using Python-specific package managers is going to make our sysadmins sad, so we should have a very good reason for wanting such a package. In general, we don't need the latest and greatest package. Unless we do.
What about virtualenv? Part of the premise behind it is that you can configure appropriate packages as a developer / operator without having to bother sysadmins and making them worried about system-wide effects.
- Nikita
I was going to mention virtualenv as well, but I have to admit that I find it weird and scary, especially since I haven't found good documentation for it. If there is somebody who is familiar with virtualenv that would probably be the best solution.
I'm afraid I don't know enough about Python or virtualenv. So far, it was almost zero effort for our sysadmins to install a package from the repositories and keep that up-to-date. I'd like to stick with the apt-get approach and save the virtualenv approach for situations when we really need a package that is not contained in the repositories.
Thanks for the suggestion, though!
On 04/05/2014 04:58 PM, Karsten Loesing wrote:
My hope with challenger is that it's written quickly, working quietly for a year, and then disappearing without anybody noticing. I'd rather not want to maintain yet another thing. So, maybe Weather is a better candidate for using onion-py than challenger.
Yes, I understand.
Yeah, I think we'll want to define a maximum lifetime of cache entries, or the poor cache will explode pretty soon.
What usage patterns do we have to expect? Do we want to hit onionoo to check if the cache is still valid for every request, or should we do "hard caching" for several minutes? The best UX solution would be to have a background task that keeps the cache current so user requests can be delivered without hitting onionoo at all.
That's a fine question. I can see various caching approaches here. But I just realize that this is premature optimization. Let's first build the thing and download whatever we need and whenever we need it. And once we know what caching needs we have, let's build the cache.
In other words, unless we do something intelligent with the cache, the cache is not actually going to be very useful.
Valid point. :)
Great, your help would be much appreciated! Want to send me a pull request whenever you have something to merge?
Will do.
Great. Thanks!
Hi Karsten and others,
I got to run the challenger script by chance[1], and spotted a small mistake that was preventing Lukas' onion.py downloader code from working. Ended up forking and creating a separate branch:
https://github.com/wfn/challenger/commits/wfn_fix_luk3s_download
Relevant commits:
- 38d88bcb1136f97881f81152d3d883c4e9480188[2] (enables downloader)
- 39c800643c040474402fc62d2a2db75c25889dfc[3] (this is the one with the small thingie-fix)
(It was a very small thing with the way the 'requests' module handles/provides json documents.)
I was doing this to be able to give Roger the 'combined-*.json' files for currently vulnerable (re: openssl) relays (he wanted to see which part of the combined weight fraction they comprise, etc.)
Fingerprints for those relays are here, fwiw: http://ravinesmp.com/volatile/challenger-stuff/vuln_fingerprints.txt (the original link that Roger gave me was http://fpaste.org/92688/ ) (count: 1024.)
If you download these fingerprints, you can just run `python challenge.py -f vuln_fingerprints.txt`
(for anyone using virtualenv, you might need to `pip install requests`, and then things should work. For anyone who's just cloned the thing, everything should probably work after simply installing the 'requests' python module, if it's not there. I see that 'python-requests' is available in the repos.)
I guess the code hasn't been tested for those amounts of fingerprints before. Good news: it works (where 'works' means 'i opened the resulting files and they contained all those fingerprints, and/or they contained lots of numbers.') Kinda-bad news: Onionoo doesn't seem to share the enthusiasm, and hiccups, and spits 502 Proxy Error some time after the lookups for the first document (combined bandwidth) are made.
My cheap quick hack was to insert time.sleep() here and there:
- 7425ef6fc00dedf3b2b7f2649e832fb4c93909ae[4]
(cheap hack is cheap, but it worked. Note: takes time to download everything. Didn't time it yet - sorry.)
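For illustration, here is one way the "sleep here and there" idea could be structured. It is a sketch and not the code from the commit above: `fetch_all`, the caller-supplied `fetch_one` function, and the `pause` and `retries` parameters are all hypothetical names, and a failed request (such as a 502 Proxy Error) is assumed to surface as an `IOError`.

```python
import time


def fetch_all(fingerprints, fetch_one, pause=1.0, retries=3, sleep=time.sleep):
    """Fetch one document per fingerprint, pausing between requests.

    fetch_one(fingerprint) is a caller-supplied (hypothetical) function that
    returns the parsed document or raises IOError when the request fails.
    Fingerprints that still fail after `retries` attempts are left out of
    the result.
    """
    results = {}
    for fingerprint in fingerprints:
        for attempt in range(retries):
            try:
                results[fingerprint] = fetch_one(fingerprint)
                break
            except IOError:
                # Wait a little longer after each failed attempt.
                sleep(pause * (attempt + 2))
        # Pause between relays so the server is not hit in a tight loop.
        sleep(pause)
    return results
```

The `sleep` argument is only there so the pacing can be checked without actually waiting; normal use would leave it at `time.sleep`.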
For anyone interested, these are the resulting 'combined-*.json' files from all those fingerprints:
- http://ravinesmp.com/volatile/challenger-stuff/vuln1024-combined-bandwidth.j...
- http://ravinesmp.com/volatile/challenger-stuff/vuln1024-combined-weights.jso...
- http://ravinesmp.com/volatile/challenger-stuff/vuln1024-combined-clients.jso... [oh, this one's empty. Why is it empty? Didn't look into it.]
- http://ravinesmp.com/volatile/challenger-stuff/vuln1024-combined-uptime.json
I haven't much looked into them, at least not yet.
Roger wants to get some information about those vulnerable relays, and he thinks this challenger stuff can help with that. Those combined-* documents seem useful. I made a separate ML thread for this:
https://lists.torproject.org/pipermail/tor-relays/2014-April/004262.html [w..., i should switch to plaintext email probably..]
[1]: where 'by chance' means 'fell under arma's irc-Jedi spells somehow' / didn't plan to / i might be wrong about things or the things i did to the script, so beware
[2]: https://github.com/wfn/challenger/commit/38d88bcb1136f97881f81152d3d883c4e94...
[3]: https://github.com/wfn/challenger/commit/39c800643c040474402fc62d2a2db75c258...
[4]: https://github.com/wfn/challenger/commit/7425ef6fc00dedf3b2b7f2649e832fb4c93...
take it easy
--
kostas / wfn
0x0e5dce45 @ pgp.mit.edu
Hi Kostas,
Right now, we're coding challenger against what exists in Debian wheezy, which means version 0.12.1 of the requests lib from the python-requests package you mentioned. In that version, response.json (an attribute) is the correct way to get JSON content from the response, not response.json().
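A script that has to run against both the old packaged 'requests' and a newer one installed via pip could paper over the difference with a small helper. This is a sketch under the assumption that the only difference is attribute versus method; `response_json` is a made-up name and not part of challenger or of 'requests'.

```python
def response_json(response):
    """Return the decoded JSON body for both old and new 'requests'."""
    body = response.json
    # Newer releases expose json() as a method; older ones as a plain
    # attribute that already holds the decoded document.
    return body() if callable(body) else body
```

Calling `response_json(r)` instead of `r.json` or `r.json()` would then work in either environment.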
I'd recommend that if you want to make your own "grab stuff from onionoo" script suite, to work with onion-py[1] . It's very new, very spiffy and uses python 3 and the newest requests lib. (full disclosure: It's my baby and I'm desperately looking for testers/users, but that should be obvious to anyone who read this thread.) Alternatively, convince the right people (presumably Karsten and arma) that challenger should switch to a more sustainable runtime than "what we can get from wheezy's repositories". ;-)
Cheers, Luke
On Wed, Apr 9, 2014 at 4:06 AM, Lukas Erlacher tor@lerlacher.de wrote:
Hi Kostas,
Right now, we're coding challenger against what exists in Debian wheezy, which means version 0.12.1 of the requests lib from the python-requests package you mentioned. In that version, response.json (an attribute) is the correct way to get JSON content from the response, not response.json().
I'd recommend that if you want to make your own "grab stuff from onionoo" script suite, to work with onion-py[1] . It's very new, very spiffy and uses python 3 and the newest requests lib. (full disclosure: It's my baby and I'm desperately looking for testers/users, but that should be obvious to anyone who read this thread.) Alternatively, convince the right people (presumably Karsten and arma) that challenger should switch to a more sustainable runtime than "what we can get from wheezy's repositories". ;-)
A-ha! :) That makes sense. (fwiw, i used pip under virtualenv in wheezy; requests lib version ancient indeed; such is life. fwiw, convincing wheezy cavepeople to use what you suggest makes sense. It's a false dichotomy between 'ensures dependencies vs. breaks dependencies.')
So
- the timeout stuff might be useful to everyone involved; it's rough
- the 'fix' might be useful for people using old 'requests'
- your onion-py sounds nice
g'day
Cheers, Luke
On Wed, Apr 9, 2014 at 4:18 AM, Kostas Jakeliunas kostas@jakeliunas.com wrote:
On Wed, Apr 9, 2014 at 4:06 AM, Lukas Erlacher tor@lerlacher.de wrote:
Hi Kostas,
Right now, we're coding challenger against what exists in Debian wheezy, which means version 0.12.1 of the requests lib from the python-requests package you mentioned. In that version, response.json (an attribute) is the correct way to get JSON content from the response, not response.json().
I'd recommend that if you want to make your own "grab stuff from onionoo" script suite, to work with onion-py[1] . It's very new, very spiffy and uses python 3 and the newest requests lib. (full disclosure: It's my baby and I'm desperately looking for testers/users, but that should be obvious to anyone who read this thread.) Alternatively, convince the right people (presumably Karsten and arma) that challenger should switch to a more sustainable runtime than "what we can get from wheezy's repositories". ;-)
A-ha! :) That makes sense. (fwiw, i used pip under virtualenv in wheezy; requests lib version ancient indeed; such is life. fwiw, convincing wheezy cavepeople to use what you suggest makes sense. It's a false dichotomy between 'ensures dependencies vs. breaks dependencies.')
So
- the timeout stuff might be useful to everyone involved; it's rough
- the 'fix' might be useful for people using old 'requests'
Actually, I might have that one kind of backwards. So timeout stuff for everyone (who wants to use things from the 'luk3duk3-onionoo-integration'[2] branch), the 'fix' for *certain* people (for example, for those using pip.)
- your onion-py sounds nice
g'day
Cheers, Luke
[2]: https://github.com/kloesing/challenger/commits/luk3duk3-onionoo-integration
On 04/05/2014 04:58 PM, Karsten Loesing wrote:
Great, your help would be much appreciated! Want to send me a pull request whenever you have something to merge?
Alright, so I wrote a few lines and sent you a pull request. Could you please check if that downloads the data you expect? And when we know what exactly we want to cache and how, I'll add the logic for that.
Cheers, Luke
On Sat, Apr 5, 2014 at 8:58 AM, Karsten Loesing karsten@torproject.org wrote:
Right now, the script sums up all graphs contained in Onionoo's bandwidth, clients, uptime, and weights documents. It also limits the range of the new graphs to max(first) to max(last) of given input graphs.
For example, assume we want to know the total bandwidth provided by the following 2 relays participating in the relay challenge:
datetime:  0, 1, 2, 3, 4, 5, ...
relay 1:     [5, 4, 5, 6]
relay 2:  [4, 3, 5, 4]
combined:    [8, 9, 9, 6]
This is not perfect for various reasons, but it's the best I came up with yesterday. Also, as we all know, perfect is the enemy of good.
(If you're curious, reason #1: the graph goes down at the end, and we can't say whether it's because relay 2 disappeared or did not report data yet; reason #2: we're weighting both relays' B/s equally, though relay 1 might have been online 24/7 and relay 2 only long enough that Onionoo doesn't put in null; there may be more reasons.)
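The summing described above can be sketched in a few lines. This is not the code from challenge.py: time slots are simplified to plain integers, `combine` is a made-up name, and the offsets are an assumption (relay 1 covering slots 1 to 4 and relay 2 covering slots 0 to 3), chosen because that alignment reproduces the combined values in the example.

```python
def combine(graphs):
    """Sum graphs over the range max(first) to max(last).

    Each graph is (first, values): `first` is the integer time slot of the
    first value, and None marks a slot without data.
    """
    new_first = max(first for first, _ in graphs)
    new_last = max(first + len(values) - 1 for first, values in graphs)
    combined = []
    for slot in range(new_first, new_last + 1):
        total = 0
        for first, values in graphs:
            index = slot - first
            # A relay contributes nothing outside its own range.
            if 0 <= index < len(values) and values[index] is not None:
                total += values[index]
        combined.append(total)
    return combined


relay_1 = (1, [5, 4, 5, 6])   # assumed to cover slots 1..4
relay_2 = (0, [4, 3, 5, 4])   # assumed to cover slots 0..3
print(combine([relay_1, relay_2]))  # [8, 9, 9, 6]
```

The final 6 shows reason #1: relay 2 has no value in the last slot, so the combined graph drops even though relay 1 did not change much.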
For the relay challenge, wouldn't you want to include the entire period that data is available for (i.e., min(first) to max(last))? Otherwise, if you are looking at a month's worth of data and a new relay arrives on the last day, your graph would only contain that day.
Also, I think you would want to do datetime.strptime(max(first), ...) here: https://github.com/kloesing/challenger/blob/master/challenge.py#L177-L178 Otherwise you're just taking the last relay's first and last values as the new_first and new_last.
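To make the second point concrete, here is a sketch of taking the maximum over all graphs rather than whatever the last loop iteration left behind. It is not the code at the linked lines: `combined_range` is a made-up name, and both the `first`/`last` keys and the timestamp format are assumptions about the input documents.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def combined_range(graphs):
    """Return (new_first, new_last) as max(first) and max(last) over all
    graphs, independent of the order in which the graphs are given."""
    firsts = [datetime.strptime(graph["first"], FORMAT) for graph in graphs]
    lasts = [datetime.strptime(graph["last"], FORMAT) for graph in graphs]
    return max(firsts), max(lasts)
```

If the list happens to end with a relay whose graph starts early, a loop that simply overwrites new_first on every pass would return that relay's value, while `max()` does not depend on the order.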
Cheers, - Nikita
On 05/04/14 16:42, Nikita Borisov wrote:
On Sat, Apr 5, 2014 at 8:58 AM, Karsten Loesing karsten@torproject.org wrote:
Right now, the script sums up all graphs contained in Onionoo's bandwidth, clients, uptime, and weights documents. It also limits the range of the new graphs to max(first) to max(last) of given input graphs.
For example, assume we want to know the total bandwidth provided by the following 2 relays participating in the relay challenge:
datetime:  0, 1, 2, 3, 4, 5, ...
relay 1:     [5, 4, 5, 6]
relay 2:  [4, 3, 5, 4]
combined:    [8, 9, 9, 6]
This is not perfect for various reasons, but it's the best I came up with yesterday. Also, as we all know, perfect is the enemy of good.
(If you're curious, reason #1: the graph goes down at the end, and we can't say whether it's because relay 2 disappeared or did not report data yet; reason #2: we're weighting both relays' B/s equally, though relay 1 might have been online 24/7 and relay 2 only long enough that Onionoo doesn't put in null; there may be more reasons.)
For the relay challenge, wouldn't you want to include the entire period that data is available for (i.e., min(first) to max(last))? Otherwise, if you are looking at a month's worth of data and a new relay arrives on the last day, your graph would only contain that day.
Very good point!
The reason why I didn't include everything from min(first) to max(last) is that any graph covers the last $time_period of the relay or bridge being online and reporting data. So, the "3_days" graph of a specific relay could show a 3-day period weeks ago, and we wouldn't want to merge that with other 3-day periods which are more recent. Of course, you're right that a new relay covering only a few hours in their "3_days" graph would reduce our combined graph to just that. Oops.
So, I guess what we want to do is include everything from $(now - 3 days) to $now in the combined graph. Will fix.
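A fixed window from $(now - 3 days) to $now could be sketched like this. It is an illustration and not the eventual fix: the function names, the one-hour `interval`, and the (first, values) graph layout are assumptions, and dividing one timedelta by another needs Python 3.

```python
from datetime import datetime, timedelta


def window_slots(now, period=timedelta(days=3), interval=timedelta(hours=1)):
    """Return the slot times from (now - period) up to and including now."""
    start = now - period
    count = period // interval
    return [start + i * interval for i in range(count + 1)]


def combine_in_window(graphs, now, period=timedelta(days=3),
                      interval=timedelta(hours=1)):
    """Sum graphs slot by slot over a fixed window ending at `now`.

    Each graph is (first, values), with `first` a datetime aligned to the
    interval; None marks a missing value.
    """
    combined = []
    for slot in window_slots(now, period, interval):
        total = 0
        for first, values in graphs:
            index = (slot - first) // interval
            # Values outside the window, or from weeks ago, never match.
            if 0 <= index < len(values) and values[index] is not None:
                total += values[index]
        combined.append(total)
    return combined
```

With a window anchored at `now`, a relay whose "3_days" graph ended weeks ago contributes nothing, and a relay that joined a few hours ago only fills the last few slots instead of shrinking the whole graph.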
Also, I think you would want to do datetime.strptime(max(first), ...) here: https://github.com/kloesing/challenger/blob/master/challenge.py#L177-L178 Otherwise you're just taking the last relay's first and last values as the new_first and new_last.
Another very good point. Will fix.
Thanks for the review!
All the best, Karsten
On 04.04.2014 19:13, Karsten Loesing wrote:
Christian, Lukas, everyone,
I learned today that we should have something working in a week or two. That's why I started hacking on this today and produced some code:
https://github.com/kloesing/challenger
Here are a few things I could use help with:
- Anybody want to help turning this script into a web app, possibly using Flask? See the first next step in README.md.
- Lukas, you announced OnionPy on tor-dev@ the other day. Want to look into the "Add local cache for ..." bullet points under "Next steps"? Is this something OnionPy could support? Want to write the glue code?
- Christian, want to help write the graphing code that visualizes the `combined-*.json` files produced by that tool? The README.md suggests a few possible graphs.
Sure, should I create a new repo for the website with graphing code or work directly in the kloesing/challenger repository?
On 06/04/14 21:29, Christian wrote:
On 04.04.2014 19:13, Karsten Loesing wrote:
Christian, Lukas, everyone,
I learned today that we should have something working in a week or two. That's why I started hacking on this today and produced some code:
https://github.com/kloesing/challenger
Here are a few things I could use help with:
- Anybody want to help turning this script into a web app, possibly using Flask? See the first next step in README.md.
- Lukas, you announced OnionPy on tor-dev@ the other day. Want to look into the "Add local cache for ..." bullet points under "Next steps"? Is this something OnionPy could support? Want to write the glue code?
- Christian, want to help write the graphing code that visualizes the `combined-*.json` files produced by that tool? The README.md suggests a few possible graphs.
Sure, should I create a new repo for the website with graphing code or work directly in the kloesing/challenger repository?
My hope is that we can turn my script into a Flask web app which serves JSON data which is then graphed by your JavaScript that is embedded into the HTML. So it probably makes sense to have everything in a single repository. I'd say feel free to clone kloesing/challenger and send me pull requests. And feel free to create new directories as needed, we can still move around things later.
All the best, Karsten
On 07.04.2014 10:43, Karsten Loesing wrote:
On 06/04/14 21:29, Christian wrote:
On 04.04.2014 19:13, Karsten Loesing wrote:
Christian, Lukas, everyone,
I learned today that we should have something working in a week or two. That's why I started hacking on this today and produced some code:
https://github.com/kloesing/challenger
Here are a few things I could use help with:
- Anybody want to help turning this script into a web app, possibly using Flask? See the first next step in README.md.
- Lukas, you announced OnionPy on tor-dev@ the other day. Want to look into the "Add local cache for ..." bullet points under "Next steps"? Is this something OnionPy could support? Want to write the glue code?
- Christian, want to help write the graphing code that visualizes the `combined-*.json` files produced by that tool? The README.md suggests a few possible graphs.
Sure, should I create a new repo for the website with graphing code or work directly in the kloesing/challenger repository?
My hope is that we can turn my script into a Flask web app which serves JSON data which is then graphed by your JavaScript that is embedded into the HTML. So it probably makes sense to have everything in a single repository. I'd say feel free to clone kloesing/challenger and send me pull requests. And feel free to create new directories as needed, we can still move around things later.
I sent you a pull request with the first working version: https://github.com/kloesing/challenger/pull/2 . The UI is temporary, but it works so far.
All the best, Karsten
tor-relays@lists.torproject.org