Greetings relay operators!
Tor has now embarked on a two-year scalability project aimed, in part, at improving network performance.
The first step will be to measure performance on the public network in order to establish a baseline. We'll likely be adjusting circuit window size, cell scheduling (KIST), and circuit build timeout (CBT) parameters over the coming months, both in contained experiments and on the live network.
This announcement is about the parameters of KIST, our cell scheduler.
Roughly a year ago, we discovered that all Tor clients are capped at ~3MB/sec of maximum outbound throughput due to how the scheduler operates. I won't get into the details, but if you are curious, they are here:
https://gitlab.torproject.org/tpo/core/tor/-/issues/29427
We now believe that the entire network, not only clients, is actually capped at ~3MB/sec per channel (a channel is a connection from client to relay or from relay to relay, also called an OR connection).
We've recently conducted experiments with chutney [1], which operates on the loopback interface, and we indeed hit those limits.
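For reference, a chutney run looks something like this (from chutney's README as I recall it; the network template names may differ):

    $ git clone https://git.torproject.org/chutney.git && cd chutney
    $ ./chutney configure networks/basic
    $ ./chutney start networks/basic
    $ ./chutney verify networks/basic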
KIST has a parameter named KISTSchedRunInterval, currently set to 10 msec, and that is our culprit. By lowering it to 2 msec, our experiments showed the cap going from 3MB/sec to ~5MB/sec, with bursts a bit higher.
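Side note for the curious: the same knob also exists as a torrc option, so a relay can override the consensus value locally. Please don't do that while the experiment runs; this excerpt is only a sketch of the knob, assuming a tor version that ships the KIST scheduler:

    # torrc excerpt: pin the scheduler to KIST and run it every 2 msec,
    # overriding the consensus-provided interval (local experiments only)
    Schedulers KIST
    KISTSchedRunInterval 2 msec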
Now, why was it set to 10 msec in the first place? Again, without getting into the technical details of the KIST paper [2], our cell scheduler requires a "grace period" in order to accumulate cells and then prioritize across many circuits using an EWMA algorithm that tor has been using for a long time now. Without it, a very loud transfer can clog the pipes (at the TCP level) by always getting scheduled and filling the TCP buffers, leaving nothing for the quieter circuits.
It is important to note that the goal of EWMA in tor is to prioritize quiet circuits: for example, an SSH session will be prioritized over a bulk HTTP transfer, so that "likely interactive" connections are not delayed and stay snappy.
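To give a flavor of how that prioritization works, here is a tiny, hypothetical C sketch of the EWMA idea. The real code lives in tor's circuitmux_ewma.c; the names and the half-life constant below are made up for illustration:

    #include <math.h>

    #define EWMA_HALFLIFE_SEC 66.0  /* assumed half-life; tor's is tunable */

    struct circ_prio {
      double cell_count;   /* exponentially decayed count of recent cells */
      double last_update;  /* time of the last decay, in seconds */
    };

    /* Decay the counter, then charge the circuit for cells it just sent.
     * Quiet circuits keep a low count, so they win the next scheduling
     * round over loud bulk-transfer circuits. */
    static void
    ewma_charge(struct circ_prio *c, double now, int cells)
    {
      double elapsed = now - c->last_update;
      c->cell_count *= pow(0.5, elapsed / EWMA_HALFLIFE_SEC);
      c->cell_count += cells;
      c->last_update = now;
    }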
But lowering this to 2 msec means less time to accumulate cells and, in theory, worse cell prioritization.
However, we think this will not be a problem, because we believe the network is underloaded. And because of the 3MB/sec cap per channel, tor sends bursts of cells instead of a constant stream, so relays are processing less than they possibly could. Again, all of this in theory.
All in all, going to 2 msec should improve speed at the very least and not make the network worse.
We want to test and measure this for a couple of weeks, then transition to a higher value, repeating until we are back at 10 msec, so we can cleanly compare the effect on EWMA priority and performance. One possible schedule is a 2 msec, 5 msec, 10 msec transition.
Yesterday, a request was made to our 9 directory authorities to set this consensus parameter:
KISTSchedRunInterval=2
We are still missing one authority for this param to take effect network-wide. Hopefully that will happen in the coming hours or day.
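Once that last authority votes for it, the value will show up on the "params" line of the consensus. An illustrative excerpt (other params elided), plus one way to check from a relay, assuming the default Debian data directory:

    params ... KISTSchedRunInterval=2 ...

    $ grep -o 'KISTSchedRunInterval=[0-9]*' /var/lib/tor/cached-consensus
    KISTSchedRunInterval=2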
This is where we need your help. We would like you to report on this thread any noticeable changes in CPU, RAM, or bandwidth usage. In other words, anything that deviates from the "average" you've been seeing is worth reporting.
We do NOT expect big changes for your relay(s), but there could reasonably be a change in bandwidth throughput, so some of you could see a traffic increase; it's unclear at the moment.
Huge thanks to everyone here! We will carefully monitor this change and if things go bad, we'll revert it as fast as we can! Thus, your help becomes extremely important!
Cheers! David
[1] https://git.torproject.org/chutney.git/
[2] https://arxiv.org/pdf/1709.01044.pdf
> KISTSchedRunInterval=2
>
> We are still missing one authority for this param to take effect network-wide. Hopefully that will happen in the coming hours or day.
Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?

I noticed some unusual things today (exits having a non-zero guard probability). Did you change more parameters than this one, or was this the only one?
On 15 Oct (23:40:34), nusenu wrote:
>> KISTSchedRunInterval=2
>>
>> We are still missing one authority for this param to take effect network-wide. Hopefully that will happen in the coming hours or day.
>
> Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?
The consensus of October 15th at 16:00 UTC was the first one voted with the change.
> I noticed some unusual things today (exits having a non-zero guard probability). Did you change more parameters than this one, or was this the only one?
We did not.
That's worth looking into!?
David
On Thu, Oct 15, 2020 at 11:40:34PM +0200, nusenu wrote:
> Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?
One can learn this from the recent consensus documents, e.g. at https://collector.torproject.org/recent/relay-descriptors/consensuses/
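For example, something like this (hypothetical fetch; CollecTor's recent/ directory only keeps the last few days, and file names follow the YYYY-MM-DD-HH-00-00-consensus pattern):

    $ curl -s https://collector.torproject.org/recent/relay-descriptors/consensuses/2020-10-15-16-00-00-consensus \
        | grep -E '^(valid-after|params)'
    valid-after 2020-10-15 16:00:00
    params ... KISTSchedRunInterval=2 ...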
And I agree that we should have a central experiment page (e.g. on gitlab) that lists the experiments, when we ran them, when the network changes occurred, what we expected to find, and what we *did* find.
David or Mike, can you make sure that page happens?
> I noticed some unusual things today (exits having a non-zero guard probability). Did you change more parameters than this one, or was this the only one?
No, that was the only change.
We had a good discussion with Florentin et al on #tor-dev just now, where we concluded that yes, we're still in "case 3be (E scarce)", but the math still allows a little bit of use of exits for other roles: check out the networkstatus_compute_bw_weights_v10() function in src/feature/dirauth/dirvote.c.
As far as we can tell, we are still in the "exit scarce" case of Mike's weight voodoo, but his math allows exits to be used a little bit in non-exit roles even in this case.
    /* Wed: weight for Guard+Exit-flagged relays used in the exit position */
    Wed = (weight_scale*(D - 2*E + G + M))/(3*D);
    /* Wgd: weight for Guard+Exit-flagged relays used in the guard position */
    Wgd = (weight_scale - Wed)/2;
And Wed in this case is 9849 rather than 10000.
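For concreteness, with weight_scale at its default of 10000 and integer division:

    Wgd = (10000 - 9849)/2 = 75

so a Guard+Exit relay's bandwidth is weighted at 75/10000, about 0.75%, for the guard position: that is the small non-zero guard probability showing up in nusenu's graph.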
So, to say it much more plainly, we are just barely on the other side of the line from "exit capacity is so scarce that exits will only ever be used for exiting."
Mike was expecting some rebalancing to be done by the bwauths once we shifted the KIST interval, but I don't know whether we're seeing that rebalancing or whether this is a coincidence.
--Roger
Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...

Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...

And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
On 16 Oct (10:49:43), nusenu wrote:
> Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...
To help you out here with this line from your graph:
"2020-10-15 ?? first Tor dir auths change KISTSchedRunInterval from 10 to 2"
These are the 3 authorities that notified us of the change, along with the most accurate timestamps I have:

longclaw -> Oct 14 at 16:05:08 UTC
moria1   -> Oct 14 before 16:00 UTC (the exact time is unknown; we would
            need to dig into the votes, but Roger said it was changed on
            moria1 "earlier today", that is, before this time)
bastet   -> Oct 15 at 15:26:47 UTC

Three authorities need to vote on a parameter for it to reach consensus, so the Oct 15th 16:00 UTC consensus is the first one to include the change.
Keep in mind that it would take at most ~2h for ALL relays to pick up that change.
> Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...
>
> And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
It was pointed out by Jaym on IRC; notice here the bump in Exit capacity around mid-September:
http://rougmnvswfsmd4dq.onion/bandwidth-flags.html?start=2020-08-18&end=...
That could likely be a reason for this sudden change in probabilities.
Now, _maybe_ the KIST change, which in theory increases bandwidth throughput, allowed those Exits to push more traffic, and thus might be contributing to the increased Guard+Exit probability we are seeing in your graph.
Let's keep a close eye on your graph!
Thanks! David
On 10/16/20 3:49 AM, nusenu wrote:
> Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...
Yes let's keep an eye on this. I doubt it is directly related, but it could be a side effect.
However, I suspect that the KIST change will most affect Guards, especially those used by loud clients. It will allow them to handle much more traffic from loud clients, and probably get higher consensus values as a result.
> Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...
I share this concern. It seems plausible and even likely to me that Exits are more likely to be surveilled than non-Exits, which makes them more dangerous to use in both entry and exit positions. Additionally, the use of an Exit in the Guard position leaks information, since you will never use that Exit to connect anywhere, and this is visible over a long period of time, leading to Guard discovery.
I want to remove the ability for Exits to become Guards entirely. In addition to the correlation and Guard discovery issues, it has historically caused much excess complexity for load balancing.
If Exits can't also become Guards, the load balancing equations become way more legible and no longer have the "poorly defined constraint" problem. This means the complicated scarcity cases from the solution go away: https://gitlab.torproject.org/tpo/core/torspec/-/blob/master/proposals/265-l...
> And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
Hrm... this will be a function of how many clients choose that Exit. This process will take months, because of the long guard rotation period. If we keep flapping in and out of Exits-as-Guards, they are unlikely to accumulate many clients.
The guard rotation period is another source of load balancing pain.
For outstanding issues with our attempt at solving it, see: https://gitlab.torproject.org/tpo/core/tor/-/issues/16255
On Fri, Oct 16, 2020 at 10:49:43AM +0200, nusenu wrote:
> Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...
Sounds good. I think, based on your graph, that it is a coincidence that we launched the "KISTSchedRunInterval=2" experiment right around this time. That is, your graph shows that the consensus weights for whether to use exit relays solely for exiting went from "100%, always do it" to "a tiny bit less than 100%" *before* we changed the KISTSchedRunInterval value.
I guess we will get another data point when we turn KISTSchedRunInterval back to its default of 10ms, and (we hypothesize) it has no impact on the weights.
> Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...
Yep. I'm not worried in the short term here, since one of the features of our guard design is that just because a relay suddenly has a chance of being used as a Guard, that doesn't mean it becomes *your* guard. You (your Tor client) already have your guard, and you'll be sticking with it for the next few weeks probably.
So this question matters over the coming weeks or months: each time clients rotate to a new guard, which is typically an infrequent event, they will have a tiny chance of picking one of these exits as their guard.
*That* said: since the chance of picking any of these exits as your guard is really tiny, they will have very few users using them as guards, which puts those users in a weird position. For example, if you're one of the very rare clients who picked an exit as your guard, then that choice acts as a better fingerprint for you if you move around and there is an attacker in a position to watch your local network in each new location.
I don't think we ever intended that edge case to happen, and it's kind of an uncomfortable situation to be in.
I wonder if the right fix is: "if you have the Exit flag, you don't get the Guard flag"?
I think it would mean that the tiny amount of excess capacity on exit relays would get pushed into rarely-but-still-sometimes being used as somebody's middle hop, which seems to me like a much better outcome.
> And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
I think load on these exits would increase, not decrease. They're already going to be used for exiting. That is, if you need an exit for your circuit, you're going to use one. The impact of these weights will actually *increase* traffic on exits, because they will be used for exiting like normal (there's no choice) but *also* they will be (very rarely at present, but still more than the 'never' from last week) used in other circuit positions too.
--Roger
On 15 Oct (19:26:09), Roger Dingledine wrote:
> On Thu, Oct 15, 2020 at 11:40:34PM +0200, nusenu wrote:
>> Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?
>
> One can learn this from the recent consensus documents, e.g. at https://collector.torproject.org/recent/relay-descriptors/consensuses/
>
> And I agree that we should have a central experiment page (e.g. on gitlab) that lists the experiments, when we ran them, when the network changes occurred, what we expected to find, and what we *did* find.
>
> David or Mike, can you make sure that page happens?
This is the page describing what we planned to work on.
https://gitlab.torproject.org/tpo/core/team/-/wikis/NetworkTeam/Sponsor61/Pe...
We are still very early in the KIST experiment here, so we will update the page with the latest findings today or very soon, once we start to see/measure the effects.
David
Okay. Is there a page for critters with wee brains? An ELI5 or even ELI3 would be great.
nifty
> This is the page describing what we planned to work on.
>
> https://gitlab.torproject.org/tpo/core/team/-/wikis/NetworkTeam/Sponsor61/Pe...
>
> We are still very early in the KIST experiment here, so we will update the page with the latest findings today or very soon, once we start to see/measure the effects.
>
> David
-- 7h1/NAPdaaGpI8WG6X4FtryAZZ4EhnznUVVLqIf/04A=
On 10/15/20 3:14 PM, David Goulet wrote:
> This is where we need your help. We would like you to report on this thread any noticeable changes in CPU, RAM, or bandwidth usage. In other words, anything that deviates from the "average" you've been seeing is worth reporting.
Maybe completely unrelated, but at first glance it seems that

    iftop -B -P -N -n -b -i enp4s0 -F 5.9.158.75/32 -F [2a01:4f8:190:514a::2]/64

at a host running 2 Tor relays on 0.4.5.0-alpha-dev now shows more IPv6 connections among the top ones (wrt throughput).