Greetings relay operators!
Tor has now embarked on a two-year scalability project aimed, in part, at improving network performance.
The first steps will be to measure performance on the public network in order to come up with a baseline. We'll likely be adjusting circuit window size, cell scheduling (KIST) and circuit build timeout (CBT) parameters over the next months in both contained experiments and on the live network.
This announcement is about the parameters of KIST, our cell scheduler.
Roughly a year ago, we discovered that all tor clients are capped at ~3MB/sec of maximum outbound throughput because of how the scheduler operates. I won't get into the details here, but if you are curious, they are in this ticket:
https://gitlab.torproject.org/tpo/core/tor/-/issues/29427
We now believe that the entire network, not only clients, is actually capped at 3MB/sec per channel (a channel is a connection between client -> relay or relay -> relay, also called an OR connection).
We've recently conducted experiments with chutney [1], which operates on the loopback interface, and we indeed hit those limits.
KIST has a parameter named KISTSchedRunInterval, which is currently set to 10 msec, and that is our culprit. When we lowered it to 2 msec, our experiment showed the cap going from 3MB/sec to ~5MB/sec, with bursts a bit higher.
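As a rough back-of-the-envelope illustration (not a measurement), the figures above can be turned into an average per-run budget: a channel moving X bytes per second with a scheduler run every N msec accounts for about X * N / 1000 bytes per run. The sketch below assumes 1 MB = 10^6 bytes and a 514-byte cell on the wire; both values, and the helper name, are assumptions made for illustration only.

```python
CELL_BYTES = 514  # assumed on-the-wire cell size, for illustration only


def bytes_per_run(throughput_mb_per_sec, run_interval_msec):
    """Average bytes one channel moves per scheduler run at a given throughput."""
    return throughput_mb_per_sec * 1_000_000 * run_interval_msec / 1000


# The two (cap, interval) pairs quoted above from the chutney experiment.
for mb_s, msec in [(3, 10), (5, 2)]:
    per_run = bytes_per_run(mb_s, msec)
    print(f"{mb_s} MB/sec at {msec} msec: ~{per_run / 1000:.0f} KB per run, "
          f"~{per_run / CELL_BYTES:.0f} cells")
```

Under these assumptions the observed caps work out to roughly 30 KB per run at 10 msec and roughly 10 KB per run at 2 msec, so in that experiment throughput did not scale linearly with the shorter interval.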
Now, the question is: why was it set to 10 msec in the first place? Again, without getting into the technical details of the KIST paper [2], our cell scheduler requires a "grace period" in order to accumulate cells and then prioritize across many circuits using an EWMA (exponentially weighted moving average) algorithm, which tor has been using for a long time now. Without this, a very loud transfer can clog the pipes (at the TCP level) by always being scheduled and filling the TCP buffers, leaving nothing for the quieter circuits.
It is important to note that the goal of EWMA in tor is to prioritize quiet circuits. For example, an SSH session will be prioritized over a bulk HTTP transfer. This is so that "likely" interactive connections are not delayed and stay snappy.
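To make the idea concrete, here is a toy model of EWMA-style prioritization. It is a minimal sketch and not tor's actual implementation: the class, the function names, and the 30 second half-life are all hypothetical. Each circuit keeps a count of recently sent cells that decays exponentially over time, and the scheduler picks the circuit with the lowest decayed count.

```python
class Circuit:
    """Toy circuit that tracks an exponentially decayed count of sent cells."""

    def __init__(self, name):
        self.name = name
        self.ewma = 0.0         # decayed count of recently sent cells
        self.last_update = 0.0  # time (seconds) of the last decay

    def decay_to(self, now, halflife):
        # Halve the stored activity for every `halflife` seconds elapsed.
        elapsed = now - self.last_update
        self.ewma *= 0.5 ** (elapsed / halflife)
        self.last_update = now

    def record_cells(self, n, now, halflife):
        # Decay old activity first, then add the newly sent cells.
        self.decay_to(now, halflife)
        self.ewma += n


def pick_next(circuits, now, halflife):
    """Return the circuit with the lowest decayed activity (the quietest)."""
    for c in circuits:
        c.decay_to(now, halflife)
    return min(circuits, key=lambda c: c.ewma)


HALFLIFE = 30.0  # seconds; hypothetical value chosen for this example

ssh = Circuit("ssh")
bulk = Circuit("bulk")
for t in range(10):
    bulk.record_cells(100, now=float(t), halflife=HALFLIFE)  # loud transfer
ssh.record_cells(2, now=9.0, halflife=HALFLIFE)              # a few keystrokes

print(pick_next([ssh, bulk], now=10.0, halflife=HALFLIFE).name)  # prints "ssh"
```

The quiet circuit wins because its decayed count stays far below that of the bulk transfer. This comparison is only meaningful if cells from several circuits are waiting at the same time, which is what the accumulation period is for.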
But lowering this to 2 msec means less time to accumulate cells and, in theory, worse cell prioritization.
However, we think this will not be a problem because we believe the network is underloaded. And because of this 3MB/sec cap per channel, tor is sending bursts of cells instead of a constant stream of cells, and thus relays are processing less than they possibly could. Again, all of this is in theory.
All in all, going to 2 msec should improve speed at the very least and not make the network worse.
We want to test that and measure it for a couple of weeks, then transition to a higher value, and keep doing that until we get back to 10 msec, so we can clearly compare the effect on EWMA priority and performance.
One possibility would be a 2 msec, 5 msec, 10 msec transition.
Yesterday, a request was made to our 9 directory authorities to set this consensus parameter:
KISTSchedRunInterval=2
We are still missing 1 authority for this param to take effect network wide. Hopefully, that should happen within the coming hours or day.
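If you want to check whether the parameter has landed, consensus documents carry a "params" line made of space-separated key=value pairs. Below is a minimal sketch, assuming you already have the consensus text as a string (fetching it is left out). The function name is hypothetical and the sample text is made up for illustration.

```python
def consensus_param(consensus_text, name):
    """Return the integer value of `name` from the "params" line, or None."""
    for line in consensus_text.splitlines():
        if line.startswith("params "):
            # Skip the "params" keyword, then split each key=value pair.
            for pair in line.split()[1:]:
                key, _, value = pair.partition("=")
                if key == name:
                    return int(value)
    return None


# Made-up excerpt for illustration; a real consensus has many more lines.
sample = (
    "network-status-version 3\n"
    "params CircuitPriorityHalflifeMsec=30000 KISTSchedRunInterval=2\n"
)

print(consensus_param(sample, "KISTSchedRunInterval"))  # prints 2
```

A result of None means the parameter is not set in the consensus you looked at.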
This is where we need your help. We would like you to notify us on this thread about any noticeable changes in CPU, RAM, or BW usage. In other words, anything that deviates from the "average" you've been seeing is worth reporting to us.
We do NOT expect big changes for your relay(s), but there could reasonably be a change in bandwidth throughput, so some of you could see a traffic increase. This is unclear at the moment.
Huge thanks to everyone here! We will carefully monitor this change and, if things go bad, we'll revert it as fast as we can! This is why your help is extremely important!
Cheers! David
[1] https://git.torproject.org/chutney.git/
[2] https://arxiv.org/pdf/1709.01044.pdf