Greetings relay operators!
Tor has now embarked on a two-year scalability project aimed, in part, at improving network performance.
The first step will be to measure performance on the public network in order to establish a baseline. We'll likely be adjusting circuit window size, cell scheduling (KIST), and circuit build timeout (CBT) parameters over the coming months, both in contained experiments and on the live network.
This announcement is about the parameters of KIST, our cell scheduler.
Roughly a year ago, we discovered that all Tor clients are capped at ~3MB/sec of maximum outbound throughput due to how the scheduler operates. I won't get into the details, but if you are curious, they are here:
https://gitlab.torproject.org/tpo/core/tor/-/issues/29427
We now believe that the entire network, not only clients, is actually capped at ~3MB/sec per channel (a channel is a connection from client to relay or from relay to relay, also called an OR connection).
We've recently conducted experiments with chutney [1], which operates on the loopback interface, and we indeed hit those limits.
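For reference, a chutney run looks something like this (from chutney's README as I recall it; the network template names may differ):

    $ git clone https://git.torproject.org/chutney.git && cd chutney
    $ ./chutney configure networks/basic
    $ ./chutney start networks/basic
    $ ./chutney verify networks/basic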
KIST has a parameter named KISTSchedRunInterval, currently set to 10 msec, and that is our culprit. By lowering it to 2 msec, our experiments showed the cap going from 3MB/sec to ~5MB/sec, with bursts a bit higher.
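Side note for the curious: the same knob also exists as a torrc option, so a relay can override the consensus value locally. Please don't do that while the experiment runs; this excerpt is only a sketch of the knob, assuming a tor version that ships the KIST scheduler:

    # torrc excerpt: pin the scheduler to KIST and run it every 2 msec,
    # overriding the consensus-provided interval (local experiments only)
    Schedulers KIST
    KISTSchedRunInterval 2 msec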
Now, why was it set to 10 msec in the first place? Again, without getting into the technical details of the KIST paper [2], our cell scheduler requires a "grace period" in order to accumulate cells and then prioritize across many circuits using an EWMA algorithm that tor has been using for a long time now. Without it, a very loud transfer can clog the pipes (at the TCP level) by always getting scheduled and filling the TCP buffers, leaving nothing for the quieter circuits.
It is important to note that the goal of EWMA in tor is to prioritize quiet circuits: for example, an SSH session will be prioritized over a bulk HTTP transfer, so that "likely interactive" connections are not delayed and stay snappy.
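To give a flavor of how that prioritization works, here is a tiny, hypothetical C sketch of the EWMA idea. The real code lives in tor's circuitmux_ewma.c; the names and the half-life constant below are made up for illustration:

    #include <math.h>

    #define EWMA_HALFLIFE_SEC 66.0  /* assumed half-life; tor's is tunable */

    struct circ_prio {
      double cell_count;   /* exponentially decayed count of recent cells */
      double last_update;  /* time of the last decay, in seconds */
    };

    /* Decay the counter, then charge the circuit for cells it just sent.
     * Quiet circuits keep a low count, so they win the next scheduling
     * round over loud bulk-transfer circuits. */
    static void
    ewma_charge(struct circ_prio *c, double now, int cells)
    {
      double elapsed = now - c->last_update;
      c->cell_count *= pow(0.5, elapsed / EWMA_HALFLIFE_SEC);
      c->cell_count += cells;
      c->last_update = now;
    }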
But lowering this to 2 msec means less time to accumulate cells and, in theory, worse cell prioritization.
However, we think this will not be a problem, because we believe the network is underloaded. And because of the 3MB/sec cap per channel, tor sends bursts of cells instead of a constant stream, so relays are processing less than they possibly could. Again, all of this in theory.
All in all, going to 2 msec should improve speed at the very least and not make the network worse.
We want to test and measure this for a couple of weeks, then transition to a higher value, repeating until we are back at 10 msec, so we can cleanly compare the effect on EWMA priority and performance. One possible schedule is a 2 msec, 5 msec, 10 msec transition.
Yesterday, a request was made to our 9 directory authorities to set this consensus parameter:
KISTSchedRunInterval=2
We are still missing one authority for this param to take effect network-wide. Hopefully that will happen in the coming hours or day.
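Once that last authority votes for it, the value will show up on the "params" line of the consensus. An illustrative excerpt (other params elided), plus one way to check from a relay, assuming the default Debian data directory:

    params ... KISTSchedRunInterval=2 ...

    $ grep -o 'KISTSchedRunInterval=[0-9]*' /var/lib/tor/cached-consensus
    KISTSchedRunInterval=2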
This is where we need your help. We would like you to report on this thread any noticeable changes in CPU, RAM, or bandwidth usage. In other words, anything that deviates from the "average" you've been seeing is worth reporting.
We do NOT expect big changes for your relay(s), but there could reasonably be a change in bandwidth throughput, so some of you could see a traffic increase; it's unclear at the moment.
Huge thanks to everyone here! We will carefully monitor this change and if things go bad, we'll revert it as fast as we can! Thus, your help becomes extremely important!
Cheers! David
[1] https://git.torproject.org/chutney.git/
[2] https://arxiv.org/pdf/1709.01044.pdf
> KISTSchedRunInterval=2
>
> We are still missing one authority for this param to take effect network-wide. Hopefully that will happen in the coming hours or day.
Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?

I noticed some unusual things today (exits having a non-zero guard probability). Did you change more parameters than this one, or was this the only one?
On 15 Oct (23:40:34), nusenu wrote:
>> KISTSchedRunInterval=2
>>
>> We are still missing one authority for this param to take effect network-wide. Hopefully that will happen in the coming hours or day.
>
> Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?
The consensus of October 15th at 16:00 UTC was the first one voted with the change.
> I noticed some unusual things today (exits having a non-zero guard probability). Did you change more parameters than this one, or was this the only one?
We did not.
That's worth looking into!?
David
On Thu, Oct 15, 2020 at 11:40:34PM +0200, nusenu wrote:
> Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?
One can learn this from the recent consensus documents, e.g. at https://collector.torproject.org/recent/relay-descriptors/consensuses/
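For example, something like this (hypothetical fetch; CollecTor's recent/ directory only keeps the last few days, and file names follow the YYYY-MM-DD-HH-00-00-consensus pattern):

    $ curl -s https://collector.torproject.org/recent/relay-descriptors/consensuses/2020-10-15-16-00-00-consensus \
        | grep -E '^(valid-after|params)'
    valid-after 2020-10-15 16:00:00
    params ... KISTSchedRunInterval=2 ...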
And I agree that we should have a central experiment page (e.g. on gitlab) that lists the experiments, when we ran them, when the network changes occurred, what we expected to find, and what we *did* find.
David or Mike, can you make sure that page happens?
> I noticed some unusual things today (exits having a non-zero guard probability). Did you change more parameters than this one, or was this the only one?
No, that was the only change.
We had a good discussion with Florentin et al on #tor-dev just now, where we concluded that yes, we're still in "case 3be (E scarce)", but the math still allows a little bit of use of exits for other roles: check out the networkstatus_compute_bw_weights_v10() function in src/feature/dirauth/dirvote.c.
As far as we can tell, we are still in the "exit scarce" case of Mike's weight voodoo, but his math allows exits to be used a little bit in non-exit roles even in this case.
    /* Wed: weight for Guard+Exit-flagged relays used in the exit position */
    Wed = (weight_scale*(D - 2*E + G + M))/(3*D);
    /* Wgd: weight for Guard+Exit-flagged relays used in the guard position */
    Wgd = (weight_scale - Wed)/2;
And Wed in this case is 9849 rather than 10000.
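For concreteness, with weight_scale at its default of 10000 and integer division:

    Wgd = (10000 - 9849)/2 = 75

so a Guard+Exit relay's bandwidth is weighted at 75/10000, about 0.75%, for the guard position: that is the small non-zero guard probability showing up in nusenu's graph.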
So, to say it much more plainly, we are just barely on the other side of the line from "exit capacity is so scarce that exits will only ever be used for exiting."
Mike was expecting some rebalancing to be done by the bwauths once we shifted the KIST interval, but I don't know whether we're seeing that rebalancing or whether this is a coincidence.
--Roger
Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...

Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...

And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
On 16 Oct (10:49:43), nusenu wrote:
> Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...
To help you out here with this line from your graph:
"2020-10-15 ?? first Tor dir auths change KISTSchedRunInterval from 10 to 2"
These are the 3 authorities that notified us of the change, along with the most accurate timestamps I have:

longclaw -> Oct 14 at 16:05:08 UTC
moria1   -> Oct 14 before 16:00 UTC (the exact time is unknown; we would
            need to dig into the votes, but Roger said it was changed on
            moria1 "earlier today", that is, before this time)
bastet   -> Oct 15 at 15:26:47 UTC

Three authorities need to vote on a parameter for it to reach consensus, so the Oct 15th 16:00 UTC consensus is the first one to include the change.
Keep in mind that it would take at most ~2h for ALL relays to pick up that change.
> Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...
>
> And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
It was pointed out by Jaym on IRC; notice here the bump in Exit capacity around mid-September:
http://rougmnvswfsmd4dq.onion/bandwidth-flags.html?start=2020-08-18&end=...
That could likely be a reason for this sudden change in probabilities.
Now, _maybe_ the KIST change, which in theory increases bandwidth throughput, allowed those Exits to push more traffic, and thus might be contributing to the increased Guard+Exit probability we are seeing in your graph.
Let's keep a close eye on your graph!
Thanks! David
On 10/16/20 3:49 AM, nusenu wrote:
> Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...
Yes let's keep an eye on this. I doubt it is directly related, but it could be a side effect.
However, I suspect that the KIST change will most affect Guards, especially those used by loud clients. It will allow them to handle much more traffic from loud clients, and probably get higher consensus values as a result.
> Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...
I share this concern. It seems plausible and even likely to me that Exits are more likely to be surveilled than non-Exits, which makes them more dangerous to use in both entry and exit positions. Additionally, the use of an Exit in the Guard position leaks information, since you will never use that Exit to connect anywhere, and this is visible over a long period of time, leading to Guard discovery.
I want to remove the ability for Exits to become Guards entirely. In addition to the correlation and Guard discovery issues, it has historically caused much excess complexity for load balancing.
If Exits can't also become Guards, the load balancing equations become way more legible and no longer have the "poorly defined constraint" problem. This means the complicated scarcity cases from the solution go away: https://gitlab.torproject.org/tpo/core/torspec/-/blob/master/proposals/265-l...
> And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
Hrm... this will be a function of how many clients choose that Exit. This process will take months, because of the long guard rotation period. If we keep flapping in and out of Exits-as-Guards, they are unlikely to accumulate many clients.
The guard rotation period is another source of load balancing pain.
For outstanding issues with our attempt at solving it, see: https://gitlab.torproject.org/tpo/core/tor/-/issues/16255
On Fri, Oct 16, 2020 at 10:49:43AM +0200, nusenu wrote:
> Let's see when this graph stops growing: https://cryptpad.fr/code/#/2/code/view/1uaA141Mzk91n1EL5w0AGM7zucwFGsLWzt-Es...
Sounds good. I think, based on your graph, that it is a coincidence that we launched the "KISTSchedRunInterval=2" experiment right around this time. That is, your graph shows that the consensus weights for whether to use exit relays solely for exiting went from "100%, always do it" to "a tiny bit less than 100%" *before* we changed the KISTSchedRunInterval value.
I guess we will get another data point when we turn KISTSchedRunInterval back to its default of 10ms, and (we hypothesize) it has no impact on the weights.
> Why is this relevant? It puts more entities into an end-to-end correlation position than there used to be: https://nusenu.github.io/OrNetStats/#tor-relay-operators-in-end-to-end-corre...
Yep. I'm not worried in the short term here, since one of the features of our guard design is that just because a relay suddenly has a chance of being used as a Guard, that doesn't mean it becomes *your* guard. You (your Tor client) already have your guard, and you'll be sticking with it for the next few weeks probably.
So this question matters over the coming weeks or months: each time clients rotate to a new guard, which is typically an infrequent event, they will have a tiny chance of picking one of these exits as their guard.
*That* said: since the chance of picking any of these exits as your guard is really tiny, they will have very few users using them as guards, which puts those users in a weird position. For example, if you're one of the very rare clients who picked an exit as your guard, then that choice acts as a better fingerprint for you if you move around and there is an attacker in a position to watch your local network in each new location.
I don't think we ever intended that edge case to happen, and it's kind of an uncomfortable situation to be in.
I wonder if the right fix is: "if you have the Exit flag, you don't get the Guard flag"?
I think it would mean that the tiny amount of excess capacity on exit relays would get pushed into rarely-but-still-sometimes being used as somebody's middle hop, which seems to me like a much better outcome.
> And it might also decrease exit traffic on exits when a Tor client chooses an exit as guard.
I think load on these exits would increase, not decrease. They're already going to be used for exiting. That is, if you need an exit for your circuit, you're going to use one. The impact of these weights will actually *increase* traffic on exits, because they will be used for exiting like normal (there's no choice) but *also* they will be (very rarely at present, but still more than the 'never' from last week) used in other circuit positions too.
--Roger
On 15 Oct (19:26:09), Roger Dingledine wrote:
> On Thu, Oct 15, 2020 at 11:40:34PM +0200, nusenu wrote:
>> Since it is in effect by now (https://consensus-health.torproject.org/#consensusparams), could you publish the exact timestamp when it came into effect?
>
> One can learn this from the recent consensus documents, e.g. at https://collector.torproject.org/recent/relay-descriptors/consensuses/
>
> And I agree that we should have a central experiment page (e.g. on gitlab) that lists the experiments, when we ran them, when the network changes occurred, what we expected to find, and what we *did* find.
>
> David or Mike, can you make sure that page happens?
This is the page describing what we planned to work on.
https://gitlab.torproject.org/tpo/core/team/-/wikis/NetworkTeam/Sponsor61/Pe...
We are still very early in the KIST experiment here, so we will update the page with the latest findings today or very soon, once we start to see/measure the effects.
David
Okay. Is there a page for critters with wee brains? An ELI5 or even ELI3 would be great.
nifty
> This is the page describing what we planned to work on.
>
> https://gitlab.torproject.org/tpo/core/team/-/wikis/NetworkTeam/Sponsor61/Pe...
>
> We are still very early in the KIST experiment here, so we will update the page with the latest findings today or very soon, once we start to see/measure the effects.
>
> David
-- 7h1/NAPdaaGpI8WG6X4FtryAZZ4EhnznUVVLqIf/04A=
On 10/15/20 3:14 PM, David Goulet wrote:
> This is where we need your help. We would like you to report on this thread any noticeable changes in CPU, RAM, or bandwidth usage. In other words, anything that deviates from the "average" you've been seeing is worth reporting.
Maybe completely unrelated, but at first glance it seems that

    iftop -B -P -N -n -b -i enp4s0 -F 5.9.158.75/32 -F [2a01:4f8:190:514a::2]/64

at a host running 2 Tor relays on 0.4.5.0-alpha-dev now shows more IPv6 connections among the top ones (wrt throughput).