I'm trying to understand how the bandwidth authorities work, and reading the spec [0] got me only partway to an understanding, so I'm trying to use Stem to see what the bwauth measurements look like in practice. I'm working off the "Votes by Bandwidth Authorities" example on the Stem webpage [1]. The query there returns a RouterStatusEntryV3, so I've been looking at the "bandwidth" and "measured" fields to try to understand what's going on [2]. "Bandwidth" maxes out at 10,000 (kb/s??) but "measured" doesn't seem to have the same ceiling, which made me realize that they aren't in the same units. Damian mentioned on IRC that "measured" might be returning the bwauth weight rather than a bandwidth, but what is the meaning of that weight? Does a higher "measured" value mean a higher bandwidth, or a higher bandwidth relative to what the relay advertises? In other words, if I sorted the descriptors by "measured" value, what would that order mean?
Separately, is there a way (using Stem or some other tool) to see the raw bwauth measurements rather than the weights? Is that a calculation I can reverse? I haven't looked into the historical data on CollecTor yet, but ideally, I would like to use the historical data to figure out how effective the bwauth measurements seem to be in different situations (for example, the relay misconfigured with a very high bandwidth this past February seems to have produced confusing bwauth measurements [3]). If I'm looking for interesting events in the historical bwauth data, would I be looking for high "measured" values, rapid changes, or ...?
Thanks!
Anna
PhD Student, University of Washington

[0] https://gitweb.torproject.org/torflow.git/blob/HEAD:/NetworkScanners/BwAutho...
[1] https://stem.torproject.org/tutorials/examples/votes_by_bandwidth_authoritie...
[2] https://stem.torproject.org/api/descriptor/router_status_entry.html
[3] https://lists.torproject.org/pipermail/tor-talk/2014-February/032094.html
Hi Anna. Glad you're interested in digging into the directory authorities! This is certainly a space that could use some love.
I'm working off the "Votes by Bandwidth Authorities" example on the Stem webpage (https://stem.torproject.org/tutorials/examples/votes_by_bandwidth_authoritie...).
Oh dear! Just noticed that example sucks for figuring out who is and isn't a bandwidth authority. Made a little tweak so the world sucks a little less...
https://gitweb.torproject.org/stem.git/commitdiff/e130863
You mentioned on IRC that "measured" might be returning the bwauth weight rather than a bandwidth, but what is the meaning of that weight? Does a higher "measured" value mean a higher bandwidth, or a higher bandwidth relative to what it advertises?
My understanding is that a higher 'measured' simply means 'the bandwidth authorities think you should use this relay more', which in turn is based on how much traffic the bandwidth authorities think it can/should handle.
In other words, if I sorted the descriptors by "measured" value, what would that order mean?
I *think* that would be the ordering of 'relays who receive the most tor client traffic due to having a more highly weighted heuristic for relay selection'.
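To make that ordering concrete, here is a minimal sketch that reads the "w" lines of a vote and sorts relays by their "Measured" value. It uses only the standard library rather than Stem, and the sample vote text (nicknames, identities, addresses, numbers) is entirely made up for illustration. In Stem, the same two values are exposed as the "bandwidth" and "measured" attributes of each router status entry.

```python
# Hypothetical excerpt in the style of a vote's router status entries.
# All nicknames, identities, addresses and numbers are invented.
SAMPLE_VOTE = """\
r relayA idA digA 2014-11-21 22:00:00 192.0.2.1 9001 0
s Fast Running
w Bandwidth=10000 Measured=250
r relayB idB digB 2014-11-21 22:00:00 192.0.2.2 9001 0
s Fast Running
w Bandwidth=800 Measured=42000
r relayC idC digC 2014-11-21 22:00:00 192.0.2.3 9001 0
s Running
w Bandwidth=50
"""


def parse_weights(vote_text):
    """Return (nickname, bandwidth, measured) tuples from 'r' and 'w' lines.

    'measured' is None when the authority did not include a Measured= value.
    """
    entries = []
    nickname = None
    for line in vote_text.splitlines():
        if line.startswith("r "):
            # The nickname is the first field after the 'r' keyword.
            nickname = line.split(" ")[1]
        elif line.startswith("w ") and nickname is not None:
            fields = dict(
                item.split("=", 1) for item in line.split(" ")[1:] if "=" in item
            )
            bandwidth = int(fields["Bandwidth"]) if "Bandwidth" in fields else None
            measured = int(fields["Measured"]) if "Measured" in fields else None
            entries.append((nickname, bandwidth, measured))
            nickname = None
    return entries


def sort_by_measured(entries):
    """Sort relays that have a Measured= value, highest weight first."""
    measured_only = [entry for entry in entries if entry[2] is not None]
    return sorted(measured_only, key=lambda entry: entry[2], reverse=True)


if __name__ == "__main__":
    for nickname, bandwidth, measured in sort_by_measured(parse_weights(SAMPLE_VOTE)):
        print(nickname, bandwidth, measured)
```

Relays without a "Measured" value are dropped before sorting, since they cannot be placed in that ordering.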
That said, this is an area I'm honestly not that familiar with. I'm looping in Sebastian, Karsten, and Roger. As mentioned on IRC, Sebastian has touched the Bandwidth Authorities most recently, so he's likely the most knowledgeable at present about this space.
Karsten is the maintainer of our metrics space (http://metrics.torproject.org) and a descriptor guru, while Roger... well, knows all the things. He also has a special affinity for research, so since you're a PhD student he'll probably be especially interested to hear your plans.
Separately, is there a way (using Stem or some other tool) to see the raw bwauth measurements rather than the weights?
I don't believe this is exposed anywhere, so only the bandwidth authority operators have this. And by 'have' I mean 'maybe in their logs, or possibly not even surfaced at all'.
Is that a calculation I can reverse?
Maybe run a bandwidth authority of your own? This could be a terrible idea. Sebastian would know.
I haven't looked into the historical data on CollecTor yet, but ideally, I would like to use the historical data to figure out how effective the bwauth measurements seem to be in different situations (for example, the relay misconfigured with a very high bandwidth this past February seems to have confused the bwauths https://lists.torproject.org/pipermail/tor-talk/2014-February/032094.html).
Agreed, it would be nice for CollecTor to have bandwidth authority information. However, with a few small exceptions (like rdns and geoip lookups) CollecTor is simply a distilled version of what's in the consensus. That is to say, by directly collecting descriptor information like you are, you already have a superset of what CollecTor provides.
Cheers! -Damian
Hi there,
On 21 Nov 2014, at 23:44, Damian Johnson atagar@torproject.org wrote:
In other words, if I sorted the descriptors by "measured" value, what would that order mean?
I *think* that would be the ordering of 'relays who receive the most tor client traffic due to having a more highly weighted heuristic for relay selection'.
that would be accurate, is my understanding
That said, this is an area I'm honestly not that familiar with. I'm looping in Sebastian, Karsten, and Roger. As mentioned on IRC, Sebastian has touched the Bandwidth Authorities most recently, so he's likely the most knowledgeable at present about this space.
I've tried fixing stuff and have mostly given up. I'm not too familiar with it, and probably can't help too much. I'll try to answer questions if there are any, tho.
Separately, is there a way (using Stem or some other tool) to see the raw bwauth measurements rather than the weights?
I don't believe this is exposed anywhere, so only the bandwidth authority operators have this. And by 'have' I mean 'maybe in their logs, or possibly not even surfaced at all'.
We could publish those. Let's ask Karsten if he thinks that'd be worthwhile?
Is that a calculation I can reverse?
Maybe run a bandwidth authority of your own? This could be a terrible idea. Sebastian would know.
You can look at the votes before a consensus was formed; they'll contain the values for each measuring bwauth. Running your own bwauth might be interesting, but it's probably not very useful if you want to learn values for the deployed network.
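As a rough sketch of what comparing votes could look like, the snippet below lines up the "Measured" values that different authorities report for the same relay and summarizes how far apart they are. The authority names ("authA" and so on), fingerprints, and numbers are hypothetical, and the input is assumed to be already parsed into plain dictionaries. The median is included only as a convenient summary statistic.

```python
from statistics import median


def measured_by_relay(votes):
    """Group Measured values by relay.

    votes: {authority_name: {fingerprint: measured_or_None}}
    returns: {fingerprint: {authority_name: measured}}
    """
    per_relay = {}
    for authority, entries in votes.items():
        for fingerprint, measured in entries.items():
            if measured is not None:  # skip authorities that did not measure it
                per_relay.setdefault(fingerprint, {})[authority] = measured
    return per_relay


def summarize(per_relay):
    """Summarize the spread of Measured values each relay received."""
    summary = {}
    for fingerprint, by_authority in per_relay.items():
        values = sorted(by_authority.values())
        summary[fingerprint] = {
            "count": len(values),
            "min": values[0],
            "max": values[-1],
            "median": median(values),
        }
    return summary


if __name__ == "__main__":
    # Hypothetical, already-parsed votes from three measuring authorities.
    example_votes = {
        "authA": {"FP1": 100, "FP2": None},
        "authB": {"FP1": 300, "FP2": 50},
        "authC": {"FP1": 200},
    }
    for fingerprint, stats in summarize(measured_by_relay(example_votes)).items():
        print(fingerprint, stats)
```

A relay whose minimum and maximum differ widely across authorities might be one place to start when looking for interesting events.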
Cheers Sebastian
On 22/11/14 01:53, Sebastian Hahn wrote:
On 21 Nov 2014, at 23:44, Damian Johnson atagar@torproject.org wrote:
Separately, is there a way (using Stem or some other tool) to see the raw bwauth measurements rather than the weights?
I don't believe this is exposed anywhere, so only the bandwidth authority operators have this. And by 'have' I mean 'maybe in their logs, or possibly not even surfaced at all'.
We could publish those. Let's ask Karsten if he thinks that'd be worthwhile?
Mike and I discussed this a few years ago in the context of a deliverable where we were supposed to "get bwauth and torperf data up on metrics.tp.o (#2394, #2534)":
https://trac.torproject.org/projects/tor/wiki/org/sponsors/SponsorF/Year1
I'm afraid I can't find Mike's exact statement anymore, but it was something along the lines of: "the bandwidth authority measurement setup is so artificial, I can't imagine how anyone would use the raw measurement data for anything useful." I cc'ed Mike to correct me if my memory is wrong, or to say if he changed his opinion.
That being said, I don't see any harm in publishing raw bandwidth authority data if it's for research purposes. Let's just not create the whole infrastructure for making the data available on CollecTor, at least until we're more certain that that's worthwhile.
I'm also cc'ing Aaron who has worked quite a bit on the bandwidth authority code.
All the best, Karsten
Thanks all for the responses!
On Fri, Nov 21, 2014 at 4:53 PM, Sebastian Hahn sebastian@torproject.org wrote:
Hi there,
On 21 Nov 2014, at 23:44, Damian Johnson atagar@torproject.org wrote:
In other words, if I sorted the descriptors by "measured" value, what would that order mean?
I *think* that would be the ordering of 'relays who receive the most tor client traffic due to having a more highly weighted heuristic for relay selection'.
that would be accurate, is my understanding
Is there documentation of why this "heuristic for relay selection" does not correlate that well with "bandwidth" in the descriptor? I've attached a couple of scatter plots pulled from moria1's "measured" and "bandwidth" values for each descriptor a couple hours ago (and the plots look similar from the other bwauths). One shows all values, the other shows the bottom 75% of values (sorted by measurements), and neither shows as much of a correlation as I would expect. Are there factors other than bandwidth that contribute to this "heuristic for relay selection"?
Thanks, Anna
On 06/12/14 00:26, Anna Kornfeld Simpson wrote:
Thanks all for the responses!
On Fri, Nov 21, 2014 at 4:53 PM, Sebastian Hahn sebastian@torproject.org wrote:
Hi there,
On 21 Nov 2014, at 23:44, Damian Johnson atagar@torproject.org wrote:
In other words, if I sorted the descriptors by "measured" value, what would that order mean?
I *think* that would be the ordering of 'relays who receive the most tor client traffic due to having a more highly weighted heuristic for relay selection'.
that would be accurate, is my understanding
Is there documentation of why this "heuristic for relay selection" does not correlate that well with "bandwidth" in the descriptor? I've attached a couple of scatter plots pulled from moria1's "measured" and "bandwidth" values for each descriptor a couple hours ago (and the plots look similar from the other bwauths). One shows all values, the other shows the bottom 75% of values (sorted by measurements), and neither shows as much of a correlation as I would expect. Are there factors other than bandwidth that contribute to this "heuristic for relay selection"?
Hi Anna,
I don't have answers, but maybe ideas for further investigations:
- Not sure if this was mentioned before, but did you take a look at the spec? https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/R...
- Maybe try removing bandwidth values close to 10000, or just values exactly at 10000. IIRC, values are capped at that value. (Removing just those values may be more accurate than removing the top 25%.)
- Very small bandwidth values might result from newly started or restarted relays. (Advertised) bandwidth values are "the volume of traffic, both incoming and outgoing, that a relay is willing to sustain, as configured by the operator and claimed to be observed from recent data transfers." If a relay didn't observe larger data transfers, the reported bandwidth value will be small, but the (past) measurements might still be large. Maybe compare this for single relays over time.
- There's an interesting pattern at 1024 (?) kB/s. Maybe there are more at 512 kB/s and others. Can you reduce the amount of overplotting in the graph? In R/ggplot2, you'd set the "alpha" value to something smaller than 1, so that dots become somewhat transparent. Could be that these patterns are normal, because operators tend to pick certain bandwidth rates more often than others.
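
The second suggestion above (dropping values at the cap before looking for a correlation) could be sketched in Python roughly as follows. The cap of 10000 is taken from this thread and is an assumption here, the (bandwidth, measured) pairs are made up, and the Pearson coefficient is just one possible measure of correlation.

```python
import math

CAP = 10000  # assumption: the cap on "bandwidth" values mentioned in this thread


def drop_capped(pairs, cap=CAP):
    """Keep only (bandwidth, measured) pairs whose bandwidth is below the cap."""
    return [(bandwidth, measured) for bandwidth, measured in pairs if bandwidth < cap]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical (bandwidth, measured) pairs; the last two sit at the cap.
    sample = [(100, 200), (200, 400), (300, 600), (10000, 50), (10000, 90000)]
    print("all values:    ", pearson(sample))
    print("without capped:", pearson(drop_capped(sample)))
```

For the overplotting point, matplotlib's scatter function also accepts an "alpha" argument that makes dots partially transparent, much like the ggplot2 setting.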
All the best, Karsten
On 21/11/14 23:44, Damian Johnson wrote:
Agreed, it would be nice for CollecTor to have bandwidth authority information. However, with a few small exceptions (like rdns and geoip lookups) CollecTor is simply a distilled version of what's in the consensus. That is to say, by directly collecting descriptor information like you are, you already have a superset of what CollecTor provides.
Minor clarification: I think you're confusing Onionoo with CollecTor.
Onionoo indeed distills data obtained from CollecTor and adds things like rDNS and GeoIP lookups. But Onionoo is probably not the right tool for researching bandwidth authorities.
CollecTor simply fetches descriptors from the directory authorities and makes them available. There's no difference between using Stem to fetch recent votes or using votes from CollecTor's descriptor archives.
All the best, Karsten