Roger Dingledine:
On Wed, Apr 11, 2018 at 11:15:44AM +0000, Mike Perry wrote:
To be clear, the design I've been considering here is simply allowing reuse between the guard hop and the final hop, when it can't be avoided. I don't mean to allow the guard (or its family) to show up as all four hops in the path. Is that the same as what you meant, or did you mean something more thorough?
By all path restrictions I mean for the last hop of the circuit and the first (though vanguards would be simpler if we got rid of them for other hops, too).
Can you lay out for us the things to think about in the Vanguard design? Last I checked there were quite a few Vanguard design variants, ranging from "two vanguards per guard, tree style" to some sort of mesh.
In particular, it would be convenient if there is a frontrunner design that really would benefit from relaxing many path restrictions, and a frontrunner design that is not so tied together to the path restriction question.
There are two frontrunner forms. One has no path restrictions, the other would try to perform restriction checks on each layer to ensure that it is valid and doesn't leak info about other layers or prevent circuit creation.
They are otherwise the same. Both are mesh; both are tunable in the number of guards and rotation times in each layer.
I am leaning towards "no restrictions" for vanguards for 0.3.4 because it is simpler, and it did not strike me that the arguments in their favor justified trying to implement them quickly in a way that might cause reachability or path influence risks.
But I do mean all restrictions, not just guard node choice. The adversary also gets to force you to use a second network path whenever they want via the /16 and node family restrictions.
Can you give us a specific example here, for this phrase "network path"? When you say "second network path" are you thinking in the Vanguard world?
Second path to entry into the Tor network (and a second guard), regardless of vanguards.
I'd like to hear more about the "cleverly crafted exit policy" attack
Another way to do this type of exit rotation attack is to cause a client to look up a DNS name where you control the resolver, and keep timing out on the DNS response. The client will then retry the stream request with a new exit. The same thing can also be done by timing out the TCP handshake to a server you control. Both of these attacks can be done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or family as the guard, and then the client uses a second network path for an unmultiplexed request at a time you control.
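To make the retry loop above concrete, here is a minimal sketch of the check the attacker is waiting to trip. The Relay type, the mutual-family rule, and the function names are illustrative assumptions and are not taken from the tor source; only the /16 and family restrictions themselves come from the discussion.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Relay:
    # Hypothetical minimal relay record, for illustration only.
    nickname: str
    address: str
    family: frozenset = frozenset()


def same_slash16(addr_a, addr_b):
    """True if both IPv4 addresses fall in the same /16."""
    net = ipaddress.ip_network(f"{addr_a}/16", strict=False)
    return ipaddress.ip_address(addr_b) in net


def conflicts(guard, exit_relay):
    """True if this guard may not share a circuit with this exit."""
    if guard.nickname == exit_relay.nickname:
        return True
    if same_slash16(guard.address, exit_relay.address):
        return True
    # Assumption: a family only counts when both relays declare each other.
    return (exit_relay.nickname in guard.family
            and guard.nickname in exit_relay.family)


def retries_until_conflict(guard, exit_sequence):
    """Count forced exit retries until one conflicts with the guard.

    Returns the 1-based attempt number, or None if no exit conflicts.
    """
    for attempt, exit_relay in enumerate(exit_sequence, start=1):
        if conflicts(guard, exit_relay):
            return attempt
    return None
```

Each timed-out DNS lookup or TCP handshake advances exit_sequence by one; the attempt at which a conflict appears is the moment the client would move to a second network path.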
The three fixes that come to mind are
(A) "Have two guards": so you can pick any exit you like, and then just use the guard that doesn't conflict with the exit you picked.
(B) "Add a bonus hop when needed": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then if you find that you just chose your guard as your exit, insert an extra hop in the middle of that circuit.
(C) "Exits can't be Guards": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then notice that due to exit scarcity, guards aren't actually used in the exit position anyway. Then enforce that rule (so they can't be in the future either).
All three of these choices have downsides. But all three of them look like improvements over the current situation -- because of how crappy the current situation is.
(Rejected option (D): "Just start allowing it": Relax the /16 and family restrictions, and also relax the rule where relays refuse a circuit that goes right back where it came from. Giving the middle node that much information about the circuit just wigs me out.)
Also, notice that I think Mike's proposed design will turn out to be some combination of "A" and also something like "B" or "C", because even if you start with two guards, if you don't add a new guard right when your first guard goes down, you might find yourself in the situation where you have one working guard, and you pick it as your exit, and now you need to do *something*.
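A rough sketch of option (A), including the corner case just described where only one guard is usable. The function name and the conflicts/is_up callbacks are hypothetical; this illustrates the selection logic only, not how tor implements guard selection.

```python
def pick_guard_for_exit(guards, exit_relay, conflicts, is_up):
    """Option (A): use the first working guard that does not conflict
    with the already-chosen exit.

    guards     -- primary guards, in preference order
    conflicts  -- callable(guard, exit) -> bool (assumed restriction check)
    is_up      -- callable(guard) -> bool (assumed local reachability view)

    Returns None when every working guard conflicts; the caller then
    needs something like option (B) or (C).
    """
    for guard in guards:
        if is_up(guard) and not conflicts(guard, exit_relay):
            return guard
    return None
```

With two working guards this always finds a usable one unless both conflict with the exit. With one guard down, a None result is exactly the "now you need to do *something*" situation.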
The one-guard-down case does impact things. But even when this does happen (which should be rare), it should only be true for a small window of time before the consensus updates.
The "down" guard should either be temporarily overloaded, or fully down and kicked off the consensus. I think we should only add a new guard when one falls out of the consensus, or both are unreachable/unusable.
This is why I think it is OK to take an incremental approach and start with A, and roll out things like B and C and other restriction relaxations.
During these edge cases, the most important property that we should strive to preserve is overall reachability. I don't like situations where the adversary gains information by certain nodes being overloaded or down. In my view, trying to make smart decisions to minimize exposure to more nodes is secondary to overall reachability. (Loss of reachability allows a *non-network* adversary to gain information about how clients are using our network. That strikes me as a lower resource, more dangerous attack than the unknown risk of possible partial network observers. In other words, I believe we made the right short-term call in #14917 in terms of preserving reachability.)
Our path restrictions also cause normal exiting clients to use a second guard for unmultiplexed activity, at adversary controlled times, or just periodically at random.
Just to make sure I understand: at least on the current network, that's because of the /16 rule and the family rule, and not because of the "if the exit you picked turns out to be your guard too, move to a different guard" rule, because exits aren't normally used for guards on our current network?
On more examination though, that's not something to rely on with our current design, since I bet there are weird edge cases like a relay loses its Guard flag, but it's still your Guard so you keep using it (depending on the advice from #17773), but now the weightings let you pick it for your Exit, and oops.
Another problematic example would be a relay that you picked as your Guard, and later it opened up its exit policy and became an Exit.
I am in favor of preventing guards from being exits. Intuitively, it means fewer "one stop shop" surveillance points that see both entry and exit traffic. It also makes flag-based load balancing equations much simpler, and makes it easier to account for padding overhead.
So if I wanted to try to flesh out my "Then enforce that rule" approach above, we would need to (1) Have dir auths take away the Guard flag from relays that can be used as Exits, and (2) Make sure that clients know that if their guards lose the Guard flag, they should treat them as being no longer guardworthy. I think we're doing that second one right now, based on my latest reading of #17773, so this would actually be a pretty easy change. But still, it's not exactly elegant.
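As a toy illustration of steps (1) and (2), with flags modeled as plain sets of strings. Both function names are made up for this sketch, and it does not reflect how the dir auths or clients actually store or vote on flags.

```python
def strip_guard_from_exits(relay_flags):
    """Step (1): drop the Guard flag from any relay that has the Exit flag.

    relay_flags maps relay nickname -> set of flag names.
    Returns a new mapping; the input is left untouched.
    """
    return {
        name: (flags - {"Guard"}) if "Exit" in flags else set(flags)
        for name, flags in relay_flags.items()
    }


def still_guardworthy(name, consensus_flags):
    """Step (2): a client keeps a guard only while it carries Guard."""
    return "Guard" in consensus_flags.get(name, set())
```

A relay that opens up its exit policy after being picked as a guard would lose Guard at the next vote under this rule, and clients applying step (2) would then stop treating it as a guard.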
In the world where we keep path restrictions, these would be my rules:
1. Two equal guards, chosen from not the same /16 or family.
2. Choose each vanguard layer's members such that each layer has at least one node from a unique /16 and family.
3. Build paths in a strict order, from last hop towards guard. If you can't build a path with this ordering, start over with a sampled guard. (With rule #1 and #2, this should be very rare and should mean that a guard is marked down locally but still marked up in the consensus.)
4. No guards as exits (Not needed but do it anyway for other reasons).
Then, under these rules, you move to a new guard in two cases:
0. When a guard leaves the consensus, replace it with a new primary guard.
1. When your two primaries are locally down or unusable (i.e. step #3 above fails), temporarily pick a new guard.
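Rule 3 and the temporary fallback can be sketched roughly as follows. The function, its parameters, and the "no hop may conflict with any hop already chosen" check are assumptions made for illustration; candidate selection is uniform here, whereas real path selection is bandwidth-weighted.

```python
import random


def build_path(exit_candidates, vanguard_layers, primary_guards,
               sampled_guards, conflicts, rng=random):
    """Pick hops in strict order from the last hop towards the guard.

    vanguard_layers is ordered from the layer nearest the exit to the
    layer nearest the guard. conflicts(a, b) is the assumed /16 and
    family check. Returns the path guard-first, or None on failure.
    """
    def pick(candidates, chosen):
        usable = [c for c in candidates
                  if not any(conflicts(c, prev) for prev in chosen)]
        return rng.choice(usable) if usable else None

    def attempt(guards):
        chosen = []
        for candidates in [exit_candidates, *vanguard_layers, guards]:
            hop = pick(candidates, chosen)
            if hop is None:
                return None  # cannot satisfy the ordering with these guards
            chosen.append(hop)
        return list(reversed(chosen))  # present the path guard-first

    # Try the primaries first; only if that fails, start over with a
    # sampled guard (the temporary case in the second list).
    return attempt(primary_guards) or attempt(sampled_guards)
```

Because the guard is chosen last, the earlier hops never depend on which guard is in use, so the choice of exit or vanguards cannot be steered by the guard restriction.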
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George has been putting some energy into looking more at them. The bugs that we don't understand are especially worrying, since it's hard to know how bad they are. Moving to two guards might put a bit of a bandaid on the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny service to clients at a particular guard, *without* letting those clients move to a second guard. This enables confirmation attacks, and these confirmation attacks can be extended to guard discovery attacks by DoSing guards one at a time until an onion service fails.
I would find non-onion-service examples more compelling here, since I want to avoid falling back into the "well, onion services need special treatment to be safe, so we have to choose between hurting normal clients and hurting onion services" trap.
How is this for an alternative scenario to be considering: the attacking website gives the Tor Browser user some page content that causes the browser to initiate periodic events. Then it starts congesting guards one at a time until the events stop arriving.
Are those two scenarios basically equivalent in terms of the confirmation attacks you are worrying about? I hope yes, and now I can stop getting distracted by wondering if going to this effort is worth it only to protect onion services? :)
Yes.
But I wonder if we're looking at this backwards, and the primary question we should be asking is "How can we protect the transition between guards?" Then one of the potential answers to consider is "Maybe we should start out with two guards rather than just one." Framing it that way, are there more options that we should consider too? For example, removing the ability of the non-local attacker to trigger a transition? Then there would still be visibility of a transition, but the (non-local) attacker can't impact the timing of the transition. How much does that solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do, it means that there will be a risk of attacks that can confirm guard choice, because the downtime during this transition can never be hidden without at least some redundancy.
How's this for another option: clients have two guards, but they have a first guard and a backup guard. They do the traffic padding to both of them, to ensure continuous netflow sessions in their local ISP's logs. But they try to send most of their traffic over the first guard, thus avoiding most of the "increased surface area" concerns about using two guards at once. And we try to reduce the frequency of situations where they can't use their first guard. But in the "transition" situations that we decide we need to keep, they use their backup guard, and it's already available and ready and that netflow session is already active in the eyes of their ISP.
This approach isn't conflux (yet), but it's not incompatible with later changing things so we do conflux.
It also doesn't get us the lower variance of performance that having two equally used guards would get us. But I am ok with that for now, at least until somebody has done some performance analysis to show that we're really suffering now and we would stop suffering then.
FYI, we actually do have one form of this info in figure 10 of https://www.freehaven.net/anonbib/cache/wpes12-cogs.pdf
We get the largest performance gains from going from one guard to two, in terms of reducing the variance (flatness) of that CDF.
Qualitatively, this means way fewer users who try Tor and experience a very slow Tor, telling their friends that it is too slow and should not be used. This is a real thing. Web UX folks have found that it happens with perf variances in the sub-second range with websites.
It adds load onto the relays, by almost doubling the number of sockets used by guards for clients, and also by adding more bandwidth load from the padding cells to/from the backup guard. (How much bandwidth load is this, per client?)
And it doesn't actually provide as much "real" cover traffic onto the backup guard in most situations, so somebody who can look more thoroughly at the traffic flows will still be able to distinguish a transition event from the first to the backup. Maybe that's a problem? Or maybe the netflow level adversary that we declared in the threat model can't do that, and a real attacker would be able to see the traffic details anyway, so we're fine^W^Wno worse off than before?
There are a couple things here that make me think we may still be worse off.
1. The netflow padding is not designed to simulate client traffic. It is designed to aggregate client traffic together over time in the adversary's logs. Instead of seeing a discrete "520KB xfer in this 15 second period, 80KB in that one, and 2300KB in that one, and then silence for 25 minutes", the adversary records "2900KB traffic total in this half hour". For this aggregation to help, there really needs to be other traffic during that half hour. This is why I keep saying that more concurrent activity is better than only using the second guard sometimes. (WTF-PAD could do things like you describe above, but we need to program histograms+state machines for that).
2. Detection of when to switch to this second guard seems complicated and error prone, and if it results in unavailability, it is strictly worse. If it switches to the second guard at the first sign of RESOURCELIMIT and path selection issues, well, then you're adding a lot of complexity for how much benefit (and also complexity that could be manipulated by the adversary).
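To illustrate the aggregation described in point 1 above, here is a small sketch that collapses per-burst transfer records into per-window totals, the way a netflow-style log would once padding keeps the flow alive. The function and record layout are made up for the example; the byte counts are the ones from point 1, and the burst start times are invented.

```python
def aggregate(records, window_seconds):
    """Sum (start_second, kilobytes) records into fixed-size windows.

    Returns a dict mapping window index -> total kilobytes.
    """
    totals = {}
    for start, kilobytes in records:
        window = start // window_seconds
        totals[window] = totals.get(window, 0) + kilobytes
    return totals


# Three discrete bursts, then silence (start times are illustrative).
bursts = [(0, 520), (15, 80), (30, 2300)]

fine_grained = aggregate(bursts, 15)    # {0: 520, 1: 80, 2: 2300}
half_hour = aggregate(bursts, 30 * 60)  # {0: 2900}
```

The half-hour record carries only the 2900KB total. If nothing else shares that window, the total still describes just this one activity, which is why concurrent traffic matters for the aggregation to help.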
Whereas that first risk does seem plausible to me -- worth trying to reduce. I think we should start by enumerating as many scary scenarios as we can (where scary means "currently we would shift away from our first guard"), and then fix as many of them as we can. Then we should look at the remaining scenarios where we would switch over to using our backup guard (like, when our first guard isn't able to build new circuits for us), and decide if the cost of the additional load on the network is worth hiding that transition timing from a netflow-level client-side-ISP adversary. I can see the answer being "yes, it's worth it", but I think it will be useful to have a good handle on which transition scenarios remain.
Well, "fixing" the largest, most frequent, and adversary controlled classes of these requires:
1. Removing path restrictions.
2. Recognizing DoS attacks and differentiating them from bad network conditions.
#2 is what worries me. Any solution to #2 that is agile enough to avoid downtime strikes me as no better than "switch to guard #2 with probability 1/2 after a RESOURCELIMIT or any other circuit failure" (which is what the code would do today with two equal guards), and a hell of a lot more complex (with risk of a downtime signal or adversary path influence if we get it wrong).