On Fri, Oct 22, 2021 at 05:38:27PM -0400, Nick Mathewson wrote:
> In this circumstance, we _could_ say that we only build circuits to G1, wait for them to succeed or fail, and only try G2 if we see that the circuits to G1 have failed completely. But that delays in the case that G1 is down.
> Instead, the first time we get a circuit request, we try to build one circuit to G1. On the next circuit request, if the circuit to G1 isn't done yet, we launch a circuit to G2 instead. The next request (if the G1 and G2 circuits are still pending) goes to G3, and so on. But (here's the critical part!) we don't actually _use_ the circuit to G2 unless the circuit to G1 fails, and we don't actually _use_ the circuit to G3 unless the circuits to G1 and G2 both fail.
> This approach causes Tor clients to check the status of multiple possible guards in parallel, while not actually _using_ any guard until we're sure that all the guards we'd rather use are down.
On reflection, this design (both our current behavior, and also that same behavior in your proposed new design) is kind of bizarre.
I've written up my thoughts as a GitLab ticket for torspec (https://gitlab.torproject.org/tpo/core/torspec/-/issues/68), but I'll paste them here too.
There are two suboptimal things about this approach:
(1) We're potentially touching a whole lot more guards than we need to. For example, imagine we've gone offline and managed to mark our primary guards down, but then we come back online and we're running Ricochet, and we have 100 contacts. We then launch 100 new circuits, one per contact, which causes us to start connections to the next 100 guards in our list. That's a lot of surface area, impacting both security (many new guards learn that I'm a Tor user) and network load.
(2) Why should the number of new guards that we try in parallel be a function of the number of circuits we're hoping to build? If it's a good idea to try several in parallel in case the first one is slow to fail, then shouldn't we do that even if there's only one circuit waiting? And from the other side, if we have ten circuits waiting, why should that map to testing ten new guards, when it is super unlikely that we're going to end up using that tenth guard?
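To make the coupling in (1) and (2) concrete, here is a toy model of the current behavior. It is only a sketch: the function name and the guard names are made up for illustration, and it assumes every earlier attempt is still pending when the next circuit request arrives.

```python
def guards_probed_per_request(pending_requests, guard_list):
    """Toy model: each circuit request that arrives while earlier attempts
    are still pending launches a connection to the next guard in the list."""
    probed = []
    for i in range(pending_requests):
        if i >= len(guard_list):
            break  # no more guards left to try
        probed.append(guard_list[i])
    return probed


# Hypothetical list of non-primary guards, in preference order.
guards = [f"guard{i}" for i in range(1, 201)]

print(len(guards_probed_per_request(100, guards)))  # 100 contacts -> 100 guards touched
print(len(guards_probed_per_request(1, guards)))    # 1 circuit -> no parallelism at all
```

The number of guards we touch is exactly the number of circuits waiting, which is the part that seems arbitrary in both directions.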
Here is a concrete alternative design: if our primary guards are down, and we don't yet have a guard that we know we want to use, and there is at least one pending circuit waiting for a guard, then always try to keep three new guard attempts in flight. This way we get the parallel-attempt feature even if we don't have multiple circuits waiting, and we also limit our surface area and focus our guard attempts on the ones most likely to actually be used.
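A rough sketch of that rule, to show that the number of guards contacted no longer depends on the number of waiting circuits. All names here (GuardProber, top_up, attempt_finished) are hypothetical, not anything in the Tor or Arti code. It also simplifies one thing on purpose: the first attempt to succeed is treated as the guard we want, and the "only use G2 if G1 has failed" ordering rule is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

MAX_IN_FLIGHT = 3  # the "three" from the proposal; the exact number is a tunable


@dataclass
class GuardProber:
    candidates: List[str]               # non-primary guards, in preference order
    primaries_down: bool = True
    usable_guard: Optional[str] = None  # set once some attempt succeeds
    pending_circuits: int = 0
    in_flight: Set[str] = field(default_factory=set)
    next_index: int = 0

    def top_up(self) -> List[str]:
        """Launch attempts until MAX_IN_FLIGHT are pending.

        Returns the guards newly contacted by this call."""
        launched: List[str] = []
        # Only probe when all three conditions from the proposal hold.
        if (not self.primaries_down
                or self.usable_guard is not None
                or self.pending_circuits < 1):
            return launched
        while (len(self.in_flight) < MAX_IN_FLIGHT
               and self.next_index < len(self.candidates)):
            guard = self.candidates[self.next_index]
            self.next_index += 1
            self.in_flight.add(guard)
            launched.append(guard)
        return launched

    def attempt_finished(self, guard: str, succeeded: bool) -> List[str]:
        """Record the outcome of one attempt, then refill the window."""
        self.in_flight.discard(guard)
        if succeeded and self.usable_guard is None:
            self.usable_guard = guard
        return self.top_up()


prober = GuardProber(candidates=[f"guard{i}" for i in range(1, 11)],
                     pending_circuits=100)
print(prober.top_up())                           # three guards, despite 100 circuits
print(prober.attempt_finished("guard1", False))  # one failure -> one replacement
print(prober.attempt_finished("guard2", True))   # success -> stop probing
```

In the Ricochet example this touches four guards rather than 100, and a client with a single waiting circuit still gets three attempts in parallel.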
--Roger