Greetings everyone,
After some discussions between arma, mikeperry, asn and me, it is time for a famous tor-dev@ email thread to try to reach consensus.
In the past weeks, a set of HSv3 reachability issues has been found regarding *service* intro points (IP):
- hs-v3: Service can keep unused intro points in its list https://bugs.torproject.org/31561
- hs-v3: Service can pick more than HiddenServiceNumIntroductionPoints intro points https://bugs.torproject.org/31548
- hs-v3: Stop using ip->circuit_established flag https://bugs.torproject.org/32094
- hs-v3: Client can re-pick bad intro points https://bugs.torproject.org/31541
- hs-v3: Service circuit retry limit should not close a valid circuit https://bugs.torproject.org/31652
Long story short, a couple of weeks ago we almost merged a new behavior on the service side with #31561 that would have ditched an intro point if its circuit timed out instead of retrying it. (Today, a service always retries its intro point up to 3 times on any type of circuit failure.)
And here comes the core of the discussion of this thread: Should we retry an intro point on failure, or simply ditch it on failure and pick a new one?
Some 7 years ago, this ticket was created, and roughly 4 years ago we implemented a mechanism that makes a service retry establishing the intro point circuit up to 3 times when it collapses (except for a few very specific cases where it doesn't):
https://bugs.torproject.org/8239
HSv3 tried to stay at feature parity with v2 there, and now that the above bugs have been mostly fixed, it largely is.
With all that said, regarding the retry feature, there are pros and cons. I'll try to organize them below based on many ad hoc discussions in the past and what I can get from all the tickets up to this day (there could be more! this is just what I could recall and find in the tickets):
== Pros ==
The primary original argument for retrying is based on the mobile use case. If a .onion is running on a cellphone and the network suddenly goes bad, the service is better off re-establishing the intro circuits, so that the client's retry attempts eventually succeed after a bit instead of the client having to re-fetch a descriptor and go to the new intro points.
Thus, in theory, it is mostly a reachability argument.
One question that arises from this is: Will the client be able to reconnect using the old intro points by the time the service has re-established them?
In other words, does the retry behavior of the *client* allow enough time for the service to stabilize in the mobile use case? I'm curious to learn from people with experience with this!
== Cons ==
Recently, mikeperry raised concerns about the retry behavior altogether and proposed to simply ditch the intro point each time instead of retrying.
(@Mike, I do invite you to comment here as you mentioned rationales for this many times but I don't have enough IRC backlog :S).
== Pros _and_ Cons at the same time ==
There is a possible Guard discovery attack argument against retrying. But it is nuanced: it depends on what exactly constitutes a failure and when the service should retry vs. ditch.
Quote from https://trac.torproject.org/projects/tor/ticket/8239#comment:6
FWIW, it's also worth mentioning that making HSes more stubborn towards old IPs might also allow guard discovery attacks from the IP. That is the IP kills incoming circuits, till a compromised middle node is selected, and since the HS is stubborn it will keep on establishing new circuits.
This was mentioned by waldo here: https://lists.torproject.org/pipermail/tor-dev/2014-May/006843.html
... which is where the "what is the failure" question becomes important, as arma mentions in the same ticket:
That's why you should only stick to your intro point when it's your network that failed (that is, the connection between you and your guard), not the intro circuit. (This is what I meant in the body of the bug in the 'main tricky point' sentence.)
We have had this discussion many times before in Tor, on how to detect "network failures" vs "circuit failures". In other words, if the link to your Guard fails, that would be enough to consider it a network failure and thus retry the intro point.
But if the circuit collapses due to, let's say, a DESTROY or TRUNCATED cell, then it could be the IP closing it for the purpose of an attack, and thus you would select a new intro point. But it could also be that the middle node died... That one has many false positives.
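To make the two options concrete, here is a rough Python sketch of the decision a service would take when an intro circuit fails. This is not tor's code: the names (FailureCause, current_policy, refined_policy, IntroPoint) are made up for illustration, and treating a timeout as "ditch" in the refined policy is only an assumption borrowed from what #31561 would have done.

```python
from dataclasses import dataclass
from enum import Enum, auto

MAX_INTRO_RETRIES = 3  # the "up to 3 times" retry budget discussed in this thread


class FailureCause(Enum):
    GUARD_LINK_DOWN = auto()  # link between the service and its guard failed
    CIRCUIT_CLOSED = auto()   # DESTROY or TRUNCATED cell received
    TIMEOUT = auto()          # intro circuit timed out


class Action(Enum):
    RETRY_SAME_IP = auto()
    PICK_NEW_IP = auto()


@dataclass
class IntroPoint:
    name: str          # hypothetical identifier, for illustration only
    retries: int = 0   # retries already spent on this intro point


def current_policy(ip: IntroPoint, cause: FailureCause) -> Action:
    """Today's behavior: retry regardless of cause until the budget is spent."""
    if ip.retries < MAX_INTRO_RETRIES:
        ip.retries += 1
        return Action.RETRY_SAME_IP
    return Action.PICK_NEW_IP


def refined_policy(ip: IntroPoint, cause: FailureCause) -> Action:
    """Retry only when our own network failed; otherwise ditch the IP.

    Assumption: a timeout is treated like a circuit close (ditch).
    """
    if cause is FailureCause.GUARD_LINK_DOWN and ip.retries < MAX_INTRO_RETRIES:
        ip.retries += 1
        return Action.RETRY_SAME_IP
    return Action.PICK_NEW_IP
```

Under current_policy, an IP that keeps sending DESTROY gets three fresh circuits out of the service before being dropped; under refined_policy it gets none, at the cost of also ditching the IP when it was really the middle node that died.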
Soooooo, to repeat what I first said at the beginning, today an HSv3 will _always_ retry up to 3 times regardless of the reason why the circuit collapsed.
Should that behavior be refined along the lines of the network failure vs circuit close argument? Should we stop retrying altogether? Should we change the retry behavior client side to better match the latency of the mobile use case?
Whatever we decide, most importantly, we need to document the *why* of this retry/ditch behavior and thus this email thread is I hope a good start to keep a record of the discussions/arguments.
Cheers!
David