Filename: 276-lower-bw-granularity.txt
Title: Report bandwidth with lower granularity in consensus documents
Author: Nick Mathewson
Created: 20-Feb-2017
Status: Open
Target: 0.3.1.x-alpha
1. Overview
This document proposes that, in order to limit the bandwidth needed for networkstatus diffs, we lower the granularity with which bandwidth is reported in consensus documents.
Making this change will reduce the total compressed diff download volume by around 10%.
2. Motivation
Consensus documents currently report bandwidth values as the median of the measured bandwidth values in the votes. (Or as the median of all votes' values if there are not enough measurements.) And when voting, in turn, authorities simply report whatever measured value they most recently encountered, clipped to 3 significant base-10 figures.
This means that, from one consensus to the next, these weights change very often and with little significance: a large fraction of bandwidth transitions are under 2% in magnitude.
As we begin to use consensus diffs, each change will take space to transmit. So reducing the number of changes will lower client bandwidth requirements significantly.
3. Proposal
I propose that we round the bandwidth values as they are placed in the votes to no more than two significant digits. In addition, for values beginning with decimal "2" through "4", we should round the first two digits to the nearest multiple of 2. For values beginning with decimal "5" through "9", we should round to the nearest multiple of 5.
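A minimal Python sketch of this rounding rule (illustrative only: round_bw() is a hypothetical name, not a function in the tor source, and it rounds directly to the nearest allowed value rather than rounding twice, which can differ slightly at bucket boundaries):

    def round_bw(bw):
        """Coarsen an integer consensus bandwidth value as proposed above."""
        if bw < 10:
            return bw                        # single digit: leave unchanged
        # Find the power of ten that leaves two significant digits.
        exponent = len(str(bw)) - 2
        scale = 10 ** exponent
        # Choose the coarsening step from the leading decimal digit.
        leading = bw // (scale * 10)
        if leading >= 5:
            step = 5 * scale                 # leading digit 5-9: multiples of 5
        elif leading >= 2:
            step = 2 * scale                 # leading digit 2-4: multiples of 2
        else:
            step = scale                     # leading digit 1: keep two digits
        # Round to the nearest multiple of the chosen step.
        return step * ((bw + step // 2) // step)

For example, round_bw(2468) yields 2400, and round_bw(987654) yields 1000000.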
This change does not require a consensus method; it will take effect once enough authorities have upgraded.
4. Analysis
The rounding proposed above will not round any value by more than 5%, so the overall impact on bandwidth balancing should be small.
In order to assess the bandwidth savings of this approach, I smoothed the January 2017 consensus documents' Bandwidth fields, using scripts from [1]. I found that if clients download consensus diffs once an hour, they can expect 11-13% mean savings after xz or gz compression. For two-hour intervals, the savings is 8-10%; for three-hour or four-hour intervals, the savings is only 6-8%. After that point, we start seeing diminishing returns, with only 1-2% savings on a 72-hour interval's diff.
[1] https://github.com/nmathewson/consensus-diff-analysis
5. Open questions:
Is there a greedier smoothing algorithm that would produce better results?
Is there any reason to think this amount of smoothing would not be safe?
Would a time-aware smoothing mechanism work better?
On 25 Feb 2017, at 03:25, Nick Mathewson nickm@torproject.org wrote:
Filename: 276-lower-bw-granularity.txt Title: Report bandwidth with lower granularity in consensus documents Author: Nick Mathewson Created: 20-Feb-2017 Status: Open Target: 0.3.1.x-alpha
1. Overview
This document proposes that, in order to limit the bandwidth needed for networkstatus diffs, we lower the granularity with which bandwidth is reported in consensus documents.

...

3. Proposal
I propose that we round the bandwidth values as they are placed in the votes to two no more than significant digits.
Misordered sentence this is
In addition, for values beginning with decimal "2" through "4", we should round the first two digits to the nearest multiple of 2. For values beginning with decimal "5" through "9", we should round to the nearest multiple of 5.
    Value    Rounding         Percentage   Distinct Values
    0-9      0                0%           10
    10-20    0.5/10-0.5/20    5%-2.5%      10
    20-50    1/20-1/50        5%-2%        15
    50-100   2.5/50-2.5/100   5%-2.5%      10

    (repeat the pattern for 100-200 etc.)
    (lower bounds are inclusive, upper bounds are exclusive)
This reduces each set of 1000 three-significant-figure bandwidths to 35 distinct values.
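A quick sanity check of the "35 distinct values" figure, for the 100-999 decade, using the step sizes from the table above (illustrative Python, not from any tor script):

    reachable = (list(range(100, 200, 10))      # leading digit 1: multiples of 10
                 + list(range(200, 500, 20))    # leading digits 2-4: multiples of 20
                 + list(range(500, 1000, 50)))  # leading digits 5-9: multiples of 50
    print(len(reachable))                       # 10 + 15 + 10 = 35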
It's worth noting that the existing bandwidth authority scripts round to 3 significant figures, so the overall rounding is up to 5.5%.
This change does not require a consensus method; it will take effect once enough authorities have upgraded.
The change will take effect progressively as authorities upgrade: since the median value is used, when one authority upgrades, 1/5 of the bandwidths will be rounded (on average).
Once all authorities upgrade, all bandwidths will be rounded like this.
4. Analysis
The rounding proposed above will not round any value by more than 5%, so the overall impact on bandwidth balancing should be small.
5% *more* than the existing rounding.
...
5. Open questions:
Is there a greedier smoothing algorithm that would produce better results?
Rounding to one significant digit changes 10 by up to 50%, so that's clearly unacceptable.
If we wanted 12.5% to be the maximum change, we could round approximately double (with some adjustments for scale).
For values beginning with decimal "1", we should round the first two digits to the nearest multiple of 2. For values beginning with decimal "2" through "4", we should round the first two digits to the nearest multiple of 5. For values beginning with decimal "5" through "9", we should round to one significant digit.
    Value    Rounding         Percentage   Distinct Values
    0-9      0                0%           10
    10-20    1/10-1/20        10%-5%       5
    20-50    2.5/20-2.5/50    12.5%-4%     6
    50-100   5/50-5/100       10%-5%       5

    (repeat the pattern for 100-200 etc.)
    (lower bounds are inclusive, upper bounds are exclusive)
This reduces each set of 1000 three-significant-figure bandwidths to 16 distinct values.
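The same sanity check for this coarser scheme, again over the 100-999 decade (illustrative Python):

    reachable = (list(range(100, 200, 20))       # leading digit 1: multiples of 20
                 + list(range(200, 500, 50))     # leading digits 2-4: multiples of 50
                 + list(range(500, 1000, 100)))  # leading digits 5-9: multiples of 100
    print(len(reachable))                        # 5 + 6 + 5 = 16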
Would a scheme like this reduce the diffs by enough to make it worthwhile?
Is there any reason to think this amount of smoothing would not be save?
s/save/safe/
Given that the existing variance is around 2% and the existing rounding is 0.5%, rounding by 5% seems quite sensible.
(Typical variance *between* bandwidth authorities is much more than this anyway.)
It seems hard to game this scheme, and it would only result in a 5% boost anyway.
The same arguments apply to rounding by up to 12.5%. Anything more seems risky.
Would a time-aware smoothing mechanism work better?
Perhaps, but the complexity might outweigh the benefits.
T
-- Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B ricochet:ekmygaiu4rzgsk6n xmpp: teor at torproject dot org

------------------------------------------------------------------------
On Sun, Feb 26, 2017 at 7:15 AM, teor teor2345@gmail.com wrote:
On 25 Feb 2017, at 03:25, Nick Mathewson nickm@torproject.org wrote:
Filename: 276-lower-bw-granularity.txt Title: Report bandwidth with lower granularity in consensus documents Author: Nick Mathewson Created: 20-Feb-2017 Status: Open Target: 0.3.1.x-alpha
1. Overview
This document proposes that, in order to limit the bandwidth needed for networkstatus diffs, we lower the granularity with which bandwidth is reported in consensus documents.

...

3. Proposal
I propose that we round the bandwidth values as they are placed in the votes to two no more than significant digits.
Misordered sentence this is
Fixed; thanks! ;)
In addition, for values beginning with decimal "2" through "4", we should round the first two digits the nearest multiple of 2. For values beginning with decimal "5" though "9", we should round to the nearest multiple of 5.
    Value    Rounding         Percentage   Distinct Values
    0-9      0                0%           10
    10-20    0.5/10-0.5/20    5%-2.5%      10
    20-50    1/20-1/50        5%-2%        15
    50-100   2.5/50-2.5/100   5%-2.5%      10

    (repeat the pattern for 100-200 etc.)
    (lower bounds are inclusive, upper bounds are exclusive)
This reduces each set of 1000 3 significant figure bandwidths to 35 distinct values.
It's worth noting that the existing bandwidth authority scripts round to 3 significant figures, so the overall rounding is up to 5.5%.
This change does not require a consensus method; it will take effect once enough authorities have upgraded.
The change will take effect progressively as authorities upgrade: since the median value is used, when one authority upgrades, 1/5 of the bandwidths will be rounded (on average).
Once all authorities upgrade, all bandwidths will be rounded like this.
I've used your wording here.
Is there a greedier smoothing algorithm that would produce better results?
Rounding to one significant digit changes 10 by up to 50%, so that's clearly unacceptable.
If we wanted 12.5% to be the maximum change, we could round approximately double (with some adjustments for scale).
For values beginning with decimal "1", we should round the first two digits the nearest multiple of 2. For values beginning with decimal "2" through "4", we should round the first two digits the nearest multiple of 5. For values beginning with decimal "5" though "9", we should round to one significant digit.
    Value    Rounding         Percentage   Distinct Values
    0-9      0                0%           10
    10-20    1/10-1/20        10%-5%       5
    20-50    2.5/20-2.5/50    12.5%-4%     6
    50-100   5/50-5/100       10%-5%       5

    (repeat the pattern for 100-200 etc.)
    (lower bounds are inclusive, upper bounds are exclusive)
This reduces each set of 1000 3 significant figure bandwidths to 16 distinct values.
Would a scheme like this reduce the diffs by enough to make it worthwhile?
I simulated this scheme, and it didn't pan out: the average reduction was less than 2% over the scheme currently in 276. (That is, if we were already doing the 276 scheme, this change would save no more than 2%.)
Is there any reason to think this amount of smoothing would not be save?
s/save/safe/
Given the existing variance is around 2%, and the existing rounding of 0.5%, rounding by 5% seems quite sensible.
(Typical variance *between* bandwidth authorities is much more than this anyway.)
It seems hard to game this scheme, and it would only result in a 5% boost anyway.
The same arguments apply to rounding by up to 12.5%. Anything more seems risky.
Would a time-aware smoothing mechanism work better?
Perhaps, but the complexity might outweigh the benefits.
T
-- Tim Wilson-Brown (teor)
On 27 Feb 2017, at 11:35, Nick Mathewson nickm@alum.mit.edu wrote:
Is there any reason to think this amount of smoothing would not be save?
... Given the existing variance is around 2%, and the existing rounding of 0.5%, rounding by 5% seems quite sensible.
(Typical variance *between* bandwidth authorities is much more than this anyway.)
I was wrong: mikeperry raised some serious issues with this scheme.
Issue 1: Increased Inaccuracy
This scheme increases the inaccuracy of the bandwidth authority votes, compared to the bandwidth measurements. This effect is particularly pronounced for high-bandwidth relays, because the bandwidths are distributed according to a power law.
(Mike, feel free to correct me if I've got this wrong.)
For example, the top relay is approximately 1% of the network, or 750 Mbps. The existing variance due to rounding is 0.5% of that, or 4 Mbps.
The variance due to this scheme would be 5.5%, or 41 Mbps.
If the top 10 relays were rounded down, which would happen in approximately 1 in 2**10 consensuses (once every 6 weeks), this could distribute ~500 Mbps of client bandwidth to the rest of the relays arbitrarily and without warning.
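The arithmetic behind those figures, for reference (the inputs are the approximations quoted above, not measurements):

    top_relay_mbps = 750
    print(top_relay_mbps * 0.005)   # ~3.75 Mbps: variance under the existing ~0.5% rounding
    print(top_relay_mbps * 0.055)   # ~41 Mbps: variance under the proposed ~5.5% rounding
    print(2 ** 10 / 24 / 7)         # ~6.1: weeks between 1-in-2**10 hourly consensuses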
This effect gets worse as relay performance increases due to multithreading or optimisation or processor improvements.
Issue 2: Different Levels of Inaccuracy
As a separate but related issue, rounding to 3 significant digits means that the error ratio is much higher when those digits are 100..., compared with when those digits are 999...
The maximum error ratios vary by 10x:

    100-101      0.00499
    ...
    999-1000     0.00050
This is an existing bug, and contributes to network instability. We should fix it anyway, and this is a good opportunity to do so.
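As a rough cross-check (a sketch of the worst-case relative error of nearest-integer rounding; the exact figures above may have been computed slightly differently):

    print(0.5 / 100.5)   # ~0.0050: worst case for values just above 100
    print(0.5 / 999.5)   # ~0.0005: worst case for values just below 1000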
Is there a greedier smoothing algorithm that would produce better results?
...
To address issue 1, I suggest that we report higher bandwidths more accurately, and lower bandwidths less accurately.
Here's a sketch of how that could work:
    Top 1%:    3+1 significant digits (900 geometric values per decile)
    Next 9%:   2+1 significant digits (90 geometric values per decile)
    Next 90%:  1+1 significant digits (9 geometric values per decile)
    0-100:     set values: 0, 1, 5, 22 (4 geometric values for 4 deciles)

    (0-100 is special, because the most common values are 0 and 20.)
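A rough Python sketch of this percentile-dependent precision idea (this is not teor's bwround.py; round_sig() and smooth() are made-up names, and it uses plain significant-digit rounding rather than the geometric rounding suggested just below for issue 2):

    import math

    def round_sig(value, digits):
        """Round a positive integer to at most `digits` significant figures."""
        if value < 10 ** digits:
            return value                        # already short enough
        scale = 10 ** (int(math.floor(math.log10(value))) - digits + 1)
        return scale * int(round(value / scale))

    def smooth(bandwidths):
        """Round each bandwidth, keeping more precision for bigger relays."""
        ranked = sorted(bandwidths, reverse=True)
        n = len(ranked)
        top_1pct = ranked[max(n // 100 - 1, 0)]     # threshold for the top 1%
        top_10pct = ranked[max(n // 10 - 1, 0)]     # threshold for the top 10%
        out = []
        for bw in bandwidths:
            if bw >= top_1pct:
                out.append(round_sig(bw, 4))        # "3+1" significant digits
            elif bw >= top_10pct:
                out.append(round_sig(bw, 3))        # "2+1" significant digits
            elif bw > 100:
                out.append(round_sig(bw, 2))        # "1+1" significant digits
            else:
                # 0-100 is special-cased to a handful of set values.
                out.append(min((0, 1, 5, 22), key=lambda v: abs(v - bw)))
        return out

With a full consensus (several thousand relays) the rank thresholds approximate the real 1% and 10% percentiles; as noted later in the thread, this is why the function needs to see all the bandwidth values at once.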
To address issue 2, I suggest that we round bandwidths geometrically, rather than linearly.
Here's a sketch of how that could work when rounding 100-1000 to 2 significant digits:
Normal linear rounding yields 90 values:

    100, 110, 120, ... , 980, 990, 1000

Using the linear midpoints (a+b)/2:

    105, 115, ... , 985, 995

Geometric rounding yields 90 values (using an extra significant digit):

    100, 103, 105, ... , 950, 975, 1000

Using the geometric midpoints sqrt(a*b):

    101, 104, ... , 962, 987
The maximum geometric error ratios vary by up to 2x, and that's due to the rounding to 3 significant digits (otherwise, all the ratios would be equal):

    100-103      0.01489
    ...
    975-1000     0.01274
But this is much better than linear rounding, where the ratios vary by up to 10x.
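A small sketch of geometric (log-space) rounding over 100-1000 that reproduces the series above (illustrative Python; geometric_grid() and round_geometric() are made-up names, and this is not the attached script):

    import math

    def geometric_grid(low, high, count):
        """Return count+1 geometrically spaced values from low to high."""
        ratio = (high / low) ** (1.0 / count)
        return [low * ratio ** i for i in range(count + 1)]

    def round_geometric(value, grid):
        """Snap value to the nearest grid point in log space (geometric midpoints)."""
        return min(grid, key=lambda g: abs(math.log(value) - math.log(g)))

    grid = [round(g) for g in geometric_grid(100, 1000, 90)]
    print(grid[:3], grid[-3:])          # -> [100, 103, 105] [950, 975, 1000]
    print(round_geometric(962, grid))   # -> 950, since 962 is below the geometric midpoint ~962.4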
I've attached a script that calculates these geometric series: it's configured to calculate the 900 values for Top 1% rounding, their geometric error ratios, and their linear error differences.
(Changing precision to 2+1 and 1+1 gets you the 90 and 9 series, changing precision to 1, values to 3, and high to 100.0 gets you the tail-end series.)
This would be relatively easy to implement in C, or it could be precalculated and stored using ~2 kilobytes (1003 * uint16_t).
And as we're also doing work on compression:
There are also opportunities for table-driven compression: each value can be specified using 10 bits for the table index, plus a few bits for the power of 10 of the multiplier:
    Top:  750 Mbps * 1024 / 8 = 96000 KBps = 9599 * 10**1 (1 bit)
    1%:   100 Mbps * 1024 / 8 = 12800 KBps =  129 * 10**2 (2 bits)
    10%:   25 Mbps * 1024 / 8 =  3200 KBps =   36 * 10**2 (2 bits)
    99%:    4 Mbps * 1024 / 8 =   512 KBps =   46 * 10**1 (1 bit)
    End:               20 KBps =    20 KBps =   22 * 10**0 (1 bit)
6 extra bits (2 bytes total) would be sufficient to cover a 10,000-fold increase in maximum individual relay bandwidth (up to 8 Tbps).
T
-- Tim Wilson-Brown (teor)
------------------------------------------------------------------------
On Tue, Feb 28, 2017 at 6:42 AM, teor teor2345@gmail.com wrote:
Hi, Tim!
This looks pretty plausible to me. Do you have time to write up a quick&dirty python function that actually performs the smoothing? If so, I can test it out on the January consensuses and measure the impact on compressed diff sizes.
On 1 Mar 2017, at 04:11, Nick Mathewson nickm@alum.mit.edu wrote:
On Tue, Feb 28, 2017 at 6:42 AM, teor teor2345@gmail.com wrote:
Hi, Tim!
This looks pretty plausible to me. Do you have time to write up a quick&dirty python function that actually performs the smoothing?
Well, it's a function, but it's not pretty, because it needs to know all the consensus bandwidth values to find the percentile to do the smoothing.
If so, I can test it out on the January consensuses and measure the impact on compressed diff sizes.
To generate a list of bandwidths, run:

    cat cached-microdesc-consensus | grep ^w | cut -d= -f2 | cut -d" " -f1 > bandwidth_list
To generate a list of pairs (bandwidth rounded_bandwidth), run:

    ./bwround.py bandwidth_list
There's an alternative scheme available in the script; to use it, comment out the line:

    rounding_configs = excess_rounding
(The alternative scheme rounds the 99th percentile to 2 significant figures rather than 3.)
I'd be interested in how each scheme performs.
T
-- Tim Wilson-Brown (teor)
------------------------------------------------------------------------