Greetings everyone,
After some discussions between arma, mikeperry, asn and me, it is time for a famous tor-dev@ email thread to try to get to a consensus if need be.
In the past weeks, a set of HSv3 reachability issues has been found regarding *service* intro points (IP):
- hs-v3: Service can keep unused intro points in its list https://bugs.torproject.org/31561
- hs-v3: Service can pick more than HiddenServiceNumIntroductionPoints intro points https://bugs.torproject.org/31548
- hs-v3: Stop using ip->circuit_established flag https://bugs.torproject.org/32094
- hs-v3: Client can re-pick bad intro points https://bugs.torproject.org/31541
- hs-v3: Service circuit retry limit should not close a valid circuit https://bugs.torproject.org/31652
Long story short, a couple of weeks ago we almost merged a new behavior on the service side with #31561 that would have ditched an intro point if its circuit timed out instead of retrying it. (Today, a service always retries its intro points up to 3 times on any type of circuit failure.)
And here comes the core of the discussion of this thread: should a service retry an intro point on failure, or simply ditch it and pick a new one?
Some 7 years ago, the ticket below was created, and roughly 4 years ago we implemented a mechanism that makes a service retry establishing the intro point circuit up to 3 times when it collapses (except in very specific cases where it wouldn't):
https://bugs.torproject.org/8239
HSv3 has tried to be at feature parity with v2 there, which it mostly is now that the above bugs have been mostly fixed.
All that being said, regarding the retry feature, there are pros and cons. I'll try to organize them below based on many ad hoc discussions in the past and what I can get from all the tickets up to this day (there could be more! this is just what I could recall and find in the tickets):
== Pros ==
The primary original argument for retrying is based on the mobile use case. If a .onion is running on a cellphone and the network happens to be bad all of a sudden, the service is better off re-establishing the intro circuits, which lets the client's retry attempts finally succeed after a bit instead of the client having to re-fetch a descriptor and go to the new intro points.
Thus, in theory, it is mostly a reachability argument.
One question that can arise from this is: Will the client be able to reconnect using the old intro points by the time the service has re-established its circuits?
In other words, does the retry behavior of the *client* allow enough time for the service to stabilize in the mobile use case? I'm curious to learn from people with experience with this!
== Cons ==
Recently, mikeperry raised concerns about the retry behavior altogether and proposed simply ditching the intro point each time instead of retrying.
(@Mike, I do invite you to comment here as you have mentioned rationales for this many times but I don't have enough IRC backlog :S).
== Pros _and_ Cons at the same time ==
There is a possible Guard discovery attack argument against retrying. But it is nuanced: it depends on what exactly constitutes a failure and on when the service should retry vs ditch.
Quote from https://trac.torproject.org/projects/tor/ticket/8239#comment:6
FWIW, it's also worth mentioning that making HSes more stubborn towards old IPs might also allow guard discovery attacks from the IP. That is the IP kills incoming circuits, till a compromised middle node is selected, and since the HS is stubborn it will keep on establishing new circuits.
This was mentioned by waldo here: https://lists.torproject.org/pipermail/tor-dev/2014-May/006843.html
... which is where the "what is the failure" question becomes important, as arma mentions in the same ticket:
That's why you should only stick to your intro point when it's your network that failed (that is, the connection between you and your guard), not the intro circuit. (This is what I meant in the body of the bug in the 'main tricky point' sentence.)
We have had this discussion in Tor many times before, on how to detect "network failures" vs "circuit failures". In other words, if the link to your Guard fails, that would be enough to consider it a network failure and thus retry the intro point.
But if the circuit collapses due to, let's say, a DESTROY or TRUNCATED cell, then it could be the IP closing it for the purpose of an attack and thus you would select a new intro point. But it could also be that the middle node died... That one has many false positives.
Soooooo, to repeat what I first said at the beginning, today an HSv3 will _always_ retry up to 3 times regardless of the reason why the circuit collapsed.
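To make the two options concrete, here is a minimal sketch of the decision a service would make when an intro circuit collapses. This is not Tor code: the `Failure` categories, the function names, and the choice to treat a timeout as a reason to ditch under the refined policy are all assumptions for illustration. Only the limit of 3 retries and the guard-link vs DESTROY/TRUNCATED distinction come from this thread.

```python
from enum import Enum

class Failure(Enum):
    # Hypothetical failure categories, named for this sketch only.
    GUARD_LINK_DOWN = "guard_link_down"      # link between us and our guard failed
    CIRCUIT_DESTROYED = "circuit_destroyed"  # DESTROY or TRUNCATED cell received
    CIRCUIT_TIMEOUT = "circuit_timeout"      # circuit build timed out

MAX_RETRIES = 3  # retry limit described in this thread

def always_retry(failure, retries_so_far):
    """Today's behavior as described here: retry on any failure, up to the limit."""
    return "retry" if retries_so_far < MAX_RETRIES else "pick_new"

def retry_on_network_failure_only(failure, retries_so_far):
    """Refined option: keep the intro point only when our own network failed."""
    if failure is Failure.GUARD_LINK_DOWN and retries_so_far < MAX_RETRIES:
        return "retry"
    # Anything else could be the intro point (or a dead middle node), so
    # this sketch assumes we move on to a new intro point.
    return "pick_new"
```

With the refined policy, an intro point that keeps killing circuits gets dropped on the first DESTROY instead of getting three more circuits through fresh middle nodes, at the price of also dropping it when a middle node simply died.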
Should that behavior get more refined with the network failure vs circuit close argument? Should we stop retrying altogether? Should we change the retry behavior client side to better match the latency of the mobile use case?
Whatever we decide, most importantly, we need to document the *why* of this retry/ditch behavior, and thus this email thread is, I hope, a good start to keep a record of the discussions/arguments.
Cheers! David
Hi David,
On 29/10/2019 14:52, David Goulet wrote:
> Long story short, a couple of weeks ago we almost merged a new behavior on the service side with #31561 that would have ditched an intro point if its circuit timed out instead of retrying it. (Today, a service always retries its intro points up to 3 times on any type of circuit failure.)
Thanks for not merging this yet. :-)
> The primary original argument for retrying is based on the mobile use case. If a .onion is running on a cellphone and the network happens to be bad all of a sudden, the service is better off re-establishing the intro circuits, which lets the client's retry attempts finally succeed after a bit instead of the client having to re-fetch a descriptor and go to the new intro points.
> Thus, in theory, it is mostly a reachability argument.
> One question that can arise from this is: Will the client be able to reconnect using the old intro points by the time the service has re-established its circuits?
> In other words, does the retry behavior of the *client* allow enough time for the service to stabilize in the mobile use case? I'm curious to learn from people with experience with this!
For what it's worth, we used to run into the following problem with Briar:
* Device X tries to connect to device Y's hidden service
* X has a cached descriptor for Y's HS
* Since the time when X cached the descriptor, Y has lost its guard connection, so it's built new intro circuits to new intro points
* After multiple connection attempts, X gives up on the intro points in the cached descriptor and fetches a new descriptor
* This causes a delay in X connecting to Y
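The scenario above can be sketched as a toy model of what the client goes through. This is only an illustration: `connect_steps`, the intro point labels, and the default of 2 attempts per intro point are made-up assumptions, not Tor's actual client logic or limits.

```python
def connect_steps(cached_ips, live_ips, attempts_per_ip=2):
    """List the steps a client takes, starting from a cached descriptor.

    cached_ips: intro points listed in the client's cached descriptor.
    live_ips: set of intro points the service is using right now.
    attempts_per_ip: hypothetical number of tries before giving up on one.
    """
    steps = []
    for ip in cached_ips:
        if ip in live_ips:
            steps.append(f"introduce via {ip}: ok")
            return steps
        # Stale intro point: every attempt is wasted time for the user.
        steps.extend([f"introduce via {ip}: failed"] * attempts_per_ip)
    steps.append("fetch new descriptor")
    if live_ips:
        steps.append(f"introduce via {sorted(live_ips)[0]}: ok")
    return steps

# Service picked new intro points after losing its guard connection.
stale = connect_steps(["A", "B", "C"], {"D", "E", "F"})
# Service reused its old intro points once the network came back.
reused = connect_steps(["A", "B", "C"], {"A", "B", "C"})
```

In this model `stale` contains six failed introductions and a descriptor fetch before the first success, while `reused` succeeds on the first step.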
A typical mobile device loses its guard connection frequently - not necessarily because it loses internet access, but because it switches between wifi and mobile data. So the scenario above was very common.
Before the HS behaviour was changed to reuse the old intro points, we had to maintain a patch against Tor to add a controller command for flushing a cached HS descriptor before trying to connect. This essentially made the client's descriptor cache redundant, so it was a slight loss of efficiency, but better than trying a bunch of stale intro points and then fetching a new descriptor anyway.
If you're considering switching back to the old behaviour, I'd like to discuss whether we could make one of the following changes to continue supporting the mobile HS use case:
1. Add a controller command for flushing an HS descriptor
2. Add a controller command for notifying Tor that we lost/gained internet access, or switched between wifi and mobile data, so Tor knows that (a) its guard connection may be dead, and (b) its intro circuits may be dead, but not due to an attack by the intro points, so it can safely reuse the intro points
3. If intro circuits are closed due to DisableNetwork changing from 0 to 1, remember this and reuse the intro points when the network is re-enabled
Android notifies apps of connectivity changes, so Briar could easily pass this information on to Tor via a new controller command or by setting DisableNetwork. (The general problem of detecting whether our internet connectivity is broken for some definition of broken remains hard, but fortunately we don't need to solve that to handle the common cases of switching between wifi and mobile data, and losing mobile signal, which the OS can tell us about.)
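As a rough sketch of the DisableNetwork route, an app could translate OS connectivity events into control-port commands like this. The event names and `commands_for_event` are hypothetical, and toggling the option on a wifi/mobile switch is an assumption of this sketch; `SETCONF` and `DisableNetwork` are the existing control command and option. Whether Tor then reuses the intro points is exactly what change 3 above proposes, not something this code achieves by itself.

```python
def commands_for_event(event):
    """Map a (hypothetical) connectivity event to Tor control-port commands.

    The commands would be written to an already authenticated control
    connection; control-port lines are terminated with CRLF.
    """
    disable = "SETCONF DisableNetwork=1\r\n"
    enable = "SETCONF DisableNetwork=0\r\n"
    if event == "lost":
        return [disable]
    if event == "gained":
        return [enable]
    if event == "switched":
        # Assumption: toggle so Tor drops connections bound to the old network.
        return [disable, enable]
    raise ValueError(f"unknown connectivity event: {event!r}")
```

For example, a switch from wifi to mobile data would send `SETCONF DisableNetwork=1` followed by `SETCONF DisableNetwork=0`.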
My one-sided two cents. ;-)
Cheers, Michael