Onion Service Intro Point Retry Behavior - tor-dev

29 Oct 2019


Greetings everyone,
After some discussions between arma, mikeperry, asn and me, it is time for a
famous tor-dev@ email thread to try to reach a consensus, if need be.
In the past few weeks, a set of HSv3 reachability issues has been found
regarding *service* intro points (IP):
    - hs-v3: Service can keep unused intro points in its list
      https://bugs.torproject.org/31561
    - hs-v3: Service can pick more than HiddenServiceNumIntroductionPoints
      intro points
      https://bugs.torproject.org/31548
    - hs-v3: Stop using ip->circuit_established flag
      https://bugs.torproject.org/32094
    - hs-v3: Client can re-pick bad intro points
      https://bugs.torproject.org/31541
    - hs-v3: Service circuit retry limit should not close a valid circuit
      https://bugs.torproject.org/31652
Long story short, a couple of weeks ago we almost merged a new behavior on the
service side with #31561 that would have ditched an intro point if its circuit
timed out instead of retrying it. (Today, a service always retries its intro
point up to 3 times on any type of circuit failure.)
And here comes the core of the discussion of this thread: should a service
retry an intro point on failure, or simply ditch it and pick a new one?
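
To make the two options concrete, here is a minimal sketch of the decision a
service takes when an intro circuit fails. This is not tor's code: the
`IntroPoint` class, the `on_circuit_failure` function and the `retry_enabled`
switch are hypothetical names used only for illustration. The only value
taken from this thread is the limit of 3 retries.

```python
from dataclasses import dataclass

# "Up to 3 times", as described in this thread.
MAX_INTRO_CIRC_RETRIES = 3


@dataclass
class IntroPoint:
    """Hypothetical stand-in for a service-side intro point record."""
    name: str
    circuit_failures: int = 0


def on_circuit_failure(ip: IntroPoint, retry_enabled: bool) -> str:
    """Decide what to do after the circuit to `ip` failed.

    Returns "retry" to rebuild a circuit to the same intro point, or
    "ditch" to drop it and pick a new one.
    """
    ip.circuit_failures += 1
    if retry_enabled and ip.circuit_failures <= MAX_INTRO_CIRC_RETRIES:
        return "retry"
    return "ditch"
```

With `retry_enabled=True` this models today's behavior (three retries, then
ditch). With `retry_enabled=False` it models the "ditch on first failure"
proposal.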
Some 7 years ago, the ticket below was created, and roughly 4 years ago we
implemented a mechanism that makes a service retry establishing the intro
point circuit up to 3 times when it collapses (except for very specific cases
where we wouldn't):
https://bugs.torproject.org/8239
HSv3 has tried to stay at feature parity with v2 on this, up until now, when
the above bugs have mostly been fixed.
That all being said, regarding the retry feature, there are pros and cons.
I'll try to organize them below based on many ad hoc discussions in the past
and what I can get from all the tickets up to this day (there could be more!
This is just what I could recall and find in the tickets):
== Pros ==
The primary original argument for retrying is based on the mobile use case. If
a .onion is running on a cellphone and the network happens to be bad all of a
sudden, the service is better off re-establishing the intro circuits, which
would make the retry attempts of the client finally succeed after a bit
instead of the client having to re-fetch a descriptor and go to the new intro
points.
Thus, in theory, it is mostly a reachability argument.
One question that can arise from this is: Will the client be able to reconnect
using the old intro points by the time the service has re-established them?
In other words, does the retry behavior of the *client* allow enough time for
the service to stabilize for the mobile use case? I'm curious to learn from
people with experience with this!
== Cons ==
Recently, mikeperry raised concerns about the retry behavior altogether and
proposed to simply ditch the intro point each time instead of retrying.
(@Mike, I do invite you to comment here as you mentioned rationales for this
many times but I don't have enough IRC backlog :S).
== Pros _and_ Cons at the same time ==
There is a possible Guard discovery attack argument against retrying. But it
is nuanced: it depends on what exactly constitutes a failure and when the
service should retry vs ditch.
Quote from https://trac.torproject.org/projects/tor/ticket/8239#comment:6
    FWIW, it's also worth mentioning that making HSes more stubborn towards
    old IPs might also allow guard discovery attacks from the IP. That is the
    IP kills incoming circuits, till a compromised middle node is selected,
    and since the HS is stubborn it will keep on establishing new circuits.
    This was mentioned by waldo here:
    https://lists.torproject.org/pipermail/tor-dev/2014-May/006843.html
... which is where the "what is the failure" question becomes important, as
arma mentions in the same ticket:
    That's why you should only stick to your intro point when it's your
    network that failed (that is, the connection between you and your guard),
    not the intro circuit. (This is what I meant in the body of the bug in the
    'main tricky point' sentence.)
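
As a rough illustration of why stubbornness matters for the guard discovery
concern quoted above, here is a back-of-the-envelope model. It assumes (my
assumption, not something measured) that each rebuilt circuit picks its
middle node independently and that the attacker controls a fixed fraction `f`
of the middle selection probability. The function name is hypothetical.

```python
def p_compromised_middle(f: float, circuits: int) -> float:
    """Probability that at least one of `circuits` independently built
    circuits goes through an attacker-controlled middle node, where `f`
    is the attacker's share of middle selection probability (0.0 to 1.0).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0.0 and 1.0")
    if circuits < 0:
        raise ValueError("circuits must be non-negative")
    # Complement of "every one of the circuits avoided the attacker".
    return 1.0 - (1.0 - f) ** circuits
```

Under this model the probability only grows with the number of circuits the
intro point can force the service to rebuild, which is why a cap on retries
(or no retry at all) limits what a malicious intro point can learn.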
We have had this discussion many times before in Tor, on how to detect
"network failures" vs "circuit failures". In other words, if the link to your
Guard fails, that would be enough to consider it a network failure and thus
retry the intro point.
But if the circuit collapses due to, let's say, a DESTROY or TRUNCATED cell,
then it could be the IP closing it for the purpose of an attack, and thus you
would select a new intro point. But it could also be that the middle node
died... That signal has many false positives.
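
The "network failure vs circuit failure" distinction above could be sketched
as follows. This is only an illustration of the idea, not how tor classifies
failures: the `Failure` enum and `next_action` function are hypothetical, and
detecting these cases reliably is exactly the hard part.

```python
from enum import Enum


class Failure(Enum):
    """Hypothetical reasons an intro circuit went away."""
    GUARD_LINK_DOWN = "guard_link_down"  # our connection to the Guard failed
    DESTROY_CELL = "destroy"             # circuit closed from somewhere on the path
    TRUNCATED_CELL = "truncated"         # circuit cut short beyond some hop


def next_action(failure: Failure) -> str:
    """Retry the same intro point only when our own network failed."""
    if failure is Failure.GUARD_LINK_DOWN:
        return "retry_same_intro_point"
    # DESTROY/TRUNCATED might be the intro point attacking us, or just a
    # middle node that died: this branch has false positives.
    return "pick_new_intro_point"
```

Compared with the current "always retry up to 3 times" behavior, this trades
some reachability (a dead middle node now costs an intro point) for less
exposure to an intro point that keeps killing circuits on purpose.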
Soooooo, to repeat what I first said at the beginning, today an HSv3 will
_always_ retry up to 3 times regardless of the reason why the circuit
collapsed.
Should that behavior get more refined with the network failure vs circuit
close argument? Should we stop retrying altogether? Should we change the retry
behavior client side to better match the latency of the mobile use case?
Whatever we decide, most importantly, we need to document the *why* of this
retry/ditch behavior, and thus this email thread is, I hope, a good start to
keep a record of the discussions/arguments.
Cheers!
David
-- 
5qZaRu0+AqSNqiaTmTpzcIEztqeYQIq7AAfzKdg/2cs=