Re: [tor-dev] Proposal: The move to two guard nodes

18 Apr 2018


Roger Dingledine:
...
On Wed, Apr 11, 2018 at 11:15:44AM +0000, Mike Perry wrote:
...
...
To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?
By all path restrictions I mean for the last hop of the circuit and the
first (though vanguards would be simpler if we got rid of them for other
hops, too).
Can you lay out for us the things to think about in the Vanguard design?
Last I checked there were quite a few Vanguard design variants, ranging
from "two vanguards per guard, tree style" to some sort of mesh.
In particular, it would be convenient if there is a frontrunner design
that really would benefit from relaxing many path restrictions, and a
frontrunner design that is not so tied together to the path restriction
question.
There are two frontrunner forms. One has no path restrictions, the other
would try to perform restriction checks on each layer to ensure that it
is valid and doesn't leak info about other layers or prevent circuit
creation.
They are otherwise the same. Both are mesh; both are tunable in the
number of guards and rotation times in each layer.
I am leaning towards "no restrictions" for vanguards for 0.3.4 because
it is simpler, and it did not strike me that the arguments in their
favor justified trying to implement them quickly in a way that might
cause reachability or path influence risks.
...
...
But I do mean all restrictions, not just guard node choice.
The adversary also gets to force you to use a second network path
whenever they want via the /16 and node family restrictions.
Can you give us a specific example here, for this phrase "network
path"? When you say "second network path" are you thinking in the
Vanguard world?
Second path to entry into the Tor network (and a second guard),
regardless of vanguards.
...
...
...
I'd like to hear more about the "cleverly crafted exit policy" attack.
Another way to do this type of exit rotation attack is to cause
a client to look up a DNS name where you control the resolver, and keep
timing out on the DNS response. The client will then retry the stream
request with a new exit. The same thing can also be done by timing out
the TCP handshake to a server you control. Both of these attacks can be
done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or
family as the guard, and then the client uses a second network path for
an unmultiplexed request at a time you control.
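To make the rotation concrete, here is a minimal sketch of the conflict check the attacker is waiting for. It is not Tor's code: the relay dictionaries, the `family` sets, and the function names are all hypothetical, and real family handling is more involved than a shared label.

```python
import ipaddress


def same_slash16(addr_a, addr_b):
    """Return True if two IPv4 addresses fall in the same /16 network."""
    net_a = ipaddress.ip_network(f"{addr_a}/16", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


def conflicts(guard, exit_relay):
    """Hypothetical restriction check: same relay, same /16, or shared family."""
    if guard["id"] == exit_relay["id"]:
        return True
    if same_slash16(guard["addr"], exit_relay["addr"]):
        return True
    # Assumption: a family is modelled as a set of shared labels.
    return bool(guard["family"] & exit_relay["family"])


def retries_until_conflict(guard, exit_sequence):
    """Count forced exit rotations until an exit conflicts with the guard.

    Returns None if no exit in the sequence conflicts.
    """
    for attempt, exit_relay in enumerate(exit_sequence, start=1):
        if conflicts(guard, exit_relay):
            return attempt
    return None
```

Each timed-out DNS lookup or TCP handshake advances the client one step through `exit_sequence`; the attacker only has to keep going until the function would return.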
The three fixes that come to mind are:
(A) "Have two guards": so you can pick any exit you like, and then just
use the guard that doesn't conflict with the exit you picked.
(B) "Add a bonus hop when needed": First relax the /16 and family
restrictions, so the remaining issue is reuse of your guard. Then if
you find that you just chose your guard as your exit, insert an extra
hop in the middle of that circuit.
(C) "Exits can't be Guards": First relax the /16 and family restrictions,
so the remaining issue is reuse of your guard. Then notice that due
to exit scarcity, guards aren't actually used in the exit position
anyway. Then enforce that rule (so they can't be in the future either).
All three of these choices have downsides. But all three of them look
like improvements over the current situation -- because of how crappy
the current situation is.
(Rejected option (D): "Just start allowing it": Relax the /16 and
family restrictions, and also relax the rule where relays refuse a
circuit that goes right back where it came from. Giving the middle node
that much information about the circuit just wigs me out.)
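As a rough illustration of how (A) and (B) differ, here is a sketch under simplifying assumptions. Relays are plain dictionaries with hypothetical `id` and `net16` keys, and family checks are left out; none of this mirrors actual Tor data structures.

```python
def conflicts(guard, exit_relay):
    """Simplified restriction: same relay or same /16 prefix (no families)."""
    return guard["id"] == exit_relay["id"] or guard["net16"] == exit_relay["net16"]


def pick_guard_for_exit(guards, exit_relay):
    """Option (A): keep the chosen exit, use a guard that does not conflict."""
    for guard in guards:
        if not conflicts(guard, exit_relay):
            return guard
    return None  # both guards conflict; the caller needs some other fallback


def path_with_bonus_hop(guard, middle, exit_relay, extra_middle):
    """Option (B): if the guard was also picked as exit, add a middle hop."""
    if guard["id"] == exit_relay["id"]:
        return [guard, middle, extra_middle, exit_relay]
    return [guard, middle, exit_relay]
```

With (A) the path length never changes, but both guards see traffic. With (B) there is one guard, and the extra hop keeps either middle from seeing the circuit go right back where it came from.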
Also, notice that I think Mike's proposed design will turn out to be some
combination of "A" and also something like "B" or "C", because even if
you start with two guards, if you don't add a new guard right when your
first guard goes down, you might find yourself in the situation where
you have one working guard, and you pick it as your exit, and now you
need to do *something*.
The one-guard-down case does impact things. But even when this does
happen (which should be rare), it should only be true for a small window
of time before the consensus updates.
The "down" guard should either be temporarily overloaded, or fully down
and kicked off the consensus. I think we should only add a new guard
when one falls out of the consensus, or both are unreachable/unusable.
This is why I think it is OK to take an incremental approach and
start with A, and roll out things like B and C and other restriction
relaxations.
During these edge cases, the most important property that we should
strive to preserve is overall reachability. I don't like situations
where the adversary gains information by certain nodes being overloaded
or down. In my view, trying to make smart decisions to minimize exposure
to more nodes is secondary to overall reachability. (Loss of
reachability allows a *non-network* adversary to gain information about
how clients are using our network. That strikes me as a lower resource,
more dangerous attack than the unknown risk of possible partial network
observers. In other words, I believe we made the right short-term call
in #14917 in terms of preserving reachability.)
...
...
Our path restrictions also cause normal exiting clients to use a second
guard for unmultiplexed activity, at adversary controlled times, or just
periodically at random.
Just to make sure I understand: at least on the current network,
that's because of the /16 rule and the family rule, and not because of
the "if the exit you picked turns out to be your guard too, move to a
different guard" rule, because exits aren't normally used for guards on
our current network?
On more examination though, that's not something to rely on with our
current design, since I bet there are weird edge cases like a relay
loses its Guard flag, but it's still your Guard so you keep using it
(depending on the advice from #17773), but now the weightings
let you pick it for your Exit, and oops.
Another problematic example would be a relay that you picked as your
Guard, and later it opened up its exit policy and became an Exit.
I am in favor of preventing guards from being exits. Intuitively, it
means fewer "one stop shop" surveillance points to see both entry and
exit traffic. It also makes flag-based load balancing equations much
simpler, and makes it easier to account for padding overhead.
...
So if I wanted to try to flesh out my "Then enforce that rule" approach
above, we would need to (1) Have dir auths take away the Guard flag from
relays that can be used as Exits, and (2) Make sure that clients know
that if their guards lose the Guard flag, they should treat them as being
no longer guardworthy. I think we're doing that second one right now,
based on my latest reading of #17773, so this would actually be a pretty
easy change. But still, it's not exactly elegant.
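The two steps could be sketched roughly as below. This is only an illustration: treating the `Exit` flag as "can be used as an Exit" is an assumption, and the function names are made up rather than taken from the dir auth or client code.

```python
def assign_guard_flag(relay_flags):
    """Step (1), dir auth side: never hand out Guard to a relay that is an Exit.

    Assumption: the Exit flag stands in for "can be used as an Exit".
    """
    flags = set(relay_flags)
    if "Exit" in flags:
        flags.discard("Guard")
    return flags


def still_guardworthy(consensus_flags):
    """Step (2), client side: a guard that lost the Guard flag is dropped.

    consensus_flags is None when the relay is absent from the consensus.
    """
    return consensus_flags is not None and "Guard" in consensus_flags
```

A relay that opens up its exit policy after being picked as a guard would lose `Guard` at the next vote, and clients would then stop treating it as a guard.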
In the world where we keep path restrictions, these would be my rules:
1. Two equal guards, chosen from not the same /16 or family
2. Choose each vanguard layer's members such that each layer has at least
   one node from a unique /16 and family.
3. Build paths in a strict order, from last hop towards guard. If you
   can't build a path with this ordering, start over with a sampled guard.
   (With rule #1 and #2, this should be very rare and should mean that
   a guard is marked down locally but still marked up in the consensus.)
4. No guards as exits (Not needed but do it anyway for other reasons).
Then under these rules, you decide to use a new primary guard, if:
0. When a guard leaves the consensus, replace it with a new primary
   guard.
1. Temporarily pick a new guard when your two primaries are locally down
   or unusable (ie step #3 above fails).
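A minimal sketch of the two replacement triggers above, assuming guards are identified by opaque strings and that the caller supplies the consensus membership and local usability sets. The function name and return shape are hypothetical.

```python
def guard_actions(primaries, in_consensus, locally_usable):
    """Decide what to do with the primary guards.

    Trigger 0: any primary missing from the consensus gets replaced.
    Trigger 1: if no primary is locally usable, pick a temporary guard.
    """
    replace = [g for g in primaries if g not in in_consensus]
    all_down = all(g not in locally_usable for g in primaries)
    return {"replace": replace, "temporary_guard": all_down}
```

Note that a single guard being locally down triggers nothing here: the client keeps using the other primary until the consensus catches up.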
...
...
...
...
However, while removing path restrictions will solve the immediate
  problem, it will not address other instances where Tor temporarily opts
  to use a second guard due to congestion, OOM, or failure of its primary
  guard, and we're still running into bugs where this can be adversarially
  controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George
has been putting some energy into looking more at them. The bugs that
we don't understand are especially worrying, since it's hard to know
how bad they are. Moving to two guards might put a bit of a bandaid on
the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny
service to clients at a particular guard, *without* letting those
clients move to a second guard. This enables confirmation attacks, and
these confirmation attacks can be extended to guard discovery attacks by
DoSing guards one at a time until an onion service fails.
I would find non-onion-service examples more compelling here, since I
want to avoid falling back into the "well, onion services need special
treatment to be safe, so we have to choose between hurting normal clients
and hurting onion services" trap.
How is this for an alternative scenario to be considering: the attacking
website gives the Tor Browser user some page content that causes the
browser to initiate periodic events. Then it starts congesting guards
one at a time until the events stop arriving.
Are those two scenarios basically equivalent in terms of the confirmation
attacks you are worrying about? I hope yes, and now I can stop getting
distracted by wondering if going to this effort is worth it only to
protect onion services? :)
Yes.
...
...
...
But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do,
it means that there will be a risk of attacks that can confirm guard
choice, because the downtime during this transition can never be hidden
without at least some redundancy.
How's this for another option: clients have two guards, but they have
a first guard and a backup guard. They do the traffic padding to both
of them, to ensure continuous netflow sessions in their local ISP's
logs. But they try to send most of their traffic over the first guard,
thus avoiding most of the "increased surface area" concerns about using
two guards at once. And we try to reduce the frequency of situations where
they can't use their first guard. But in the "transition" situations
that we decide we need to keep, they use their backup guard, and it's
already available and ready and that netflow session is already active
in the eyes of their ISP.
This approach isn't conflux (yet), but it's not incompatible with later
changing things so we do conflux.
It also doesn't get us the lower variance of performance that having
two equally used guards would get us. But I am ok with that for now,
at least until somebody has done some performance analysis to show that
we're really suffering now and we would stop suffering then.
FYI, we actually do have one form of this info in figure 10 of
https://www.freehaven.net/anonbib/cache/wpes12-cogs.pdf
We get the largest performance gains from going from one guard to two,
in terms of reducing the variance (flatness) of that CDF.
Qualitatively, this means way fewer users who try Tor and experience a
very slow Tor, telling their friends that it is too slow and should not
be used. This is a real thing. Web UX folks have found that it happens
with perf variances in the sub-second range with websites.
...
It adds load onto the relays, by almost doubling the number of sockets
used by guards for clients, and also by adding more bandwidth load from
the padding cells to/from the backup guard. (How much bandwidth load is
this, per client?)
And it doesn't actually provide as much "real" cover traffic onto the
backup guard in most situations, so somebody who can look more thoroughly
at the traffic flows will still be able to distinguish a transition
event from the first to the backup. Maybe that's a problem? Or maybe
the netflow level adversary that we declared in the threat model can't
do that, and a real attacker would be able to see the traffic details
anyway, so we're fine^W^Wno worse off than before?
There are a couple things here that make me think we may still be worse
off.
1. The netflow padding is not designed to simulate client traffic. It is
designed to aggregate client traffic together over time in the
adversary's logs. Instead of seeing a discrete "520KB xfer in this 15
second period, 80KB in that one, and 2300KB in that one, and then
silence for 25 minutes", the adversary records "2900KB traffic total in
this half hour". For this aggregation to help, there really needs to be
other traffic during that half hour. This is why I keep saying that more
concurrent activity is better than only using the second guard
sometimes. (WTF-PAD could do things like you describe above, but we need
to program histograms+state machines for that).
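The aggregation effect in point 1 can be shown with a toy model that just sums traffic into fixed-width windows. The timestamps are assumptions chosen to match the example figures, and real netflow records are cut by active and inactive timeouts rather than by a clean window boundary.

```python
def bucket_records(events, window_seconds):
    """Sum (timestamp_seconds, kilobytes) events into fixed-width windows."""
    buckets = {}
    for timestamp, kilobytes in events:
        start = (timestamp // window_seconds) * window_seconds
        buckets[start] = buckets.get(start, 0) + kilobytes
    return sorted(buckets.items())


# Three bursts in consecutive 15 second periods, then silence.
events = [(0, 520), (15, 80), (30, 2300)]

fine_grained = bucket_records(events, 15)    # one record per burst
aggregated = bucket_records(events, 1800)    # one record per half hour
```

At 15 second resolution the three bursts stay distinct. At half hour resolution they collapse into a single 2900KB record, which only hides anything if other activity lands in the same record.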
2. Detection of when to switch to this second guard seems complicated
and error prone, and if it results in unavailability, it is strictly
worse. If it switches to the second guard at the first sign of
RESOURCELIMIT and path selection issues, well, then you're adding a lot
of complexity for how much benefit? (It is also complexity that could be
manipulated by the adversary.)
...
Whereas that first risk does seem plausible to me -- worth trying to
reduce. I think we should start by enumerating as many scary scenarios
as we can (where scary means "currently we would shift away from our
first guard"), and then fix as many of them as we can. Then we should
look at the remaining scenarios where we would switch over to using our
backup guard (like, when our first guard isn't able to build new circuits
for us), and decide if the cost of the additional load on the network is
worth hiding that transition timing from a netflow-level client-side-ISP
adversary. I can see the answer being "yes, it's worth it", but I think it
will be useful to have a good handle on which transition scenarios remain.
Well, "fixing" the largest, most frequent, and adversary controlled
classes of these requires:
1. Removing path restrictions.
2. Recognizing DoS attacks and differentiating them from bad network
   conditions.
#2 is what worries me. Any solution to #2 that is agile enough to avoid
downtime strikes me as no better than "switch to guard #2 with
probability 1/2 after a RESOURCELIMIT or any other circuit failure"
(which is what the code would do today with two equal guards), and a
hell of a lot more complex (with risk of a downtime signal or adversary
path influence if we get it wrong).
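For reference, the "probability 1/2" baseline falls out of nothing more than picking uniformly between two equal guards for each new circuit. The sketch below is a toy simulation of that, not Tor's guard selection code; the names are hypothetical.

```python
import random


def pick_guard(guards, rng):
    """Two equal guards: each new circuit picks one uniformly at random."""
    return rng.choice(guards)


def switch_fraction(trials, rng):
    """Estimate how often the retry after a failure lands on the other guard."""
    guards = ["guard1", "guard2"]
    switched = 0
    for _ in range(trials):
        failed = pick_guard(guards, rng)  # circuit that hit the failure
        retry = pick_guard(guards, rng)   # independent pick for the retry
        if retry != failed:
            switched += 1
    return switched / trials
```

No failure classification is involved at all, which is the point of the comparison: anything smarter has to beat this without adding a downtime signal.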
-- 
Mike Perry
