On Wed, Apr 11, 2018 at 11:15:44AM +0000, Mike Perry wrote:
To be clear, the design I've been considering here is simply allowing reuse between the guard hop and the final hop, when it can't be avoided. I don't mean to allow the guard (or its family) to show up as all four hops in the path. Is that the same as what you meant, or did you mean something more thorough?
By all path restrictions I mean for the last hop of the circuit and the first (though vanguards would be simpler if we got rid of them for other hops, too).
Can you lay out for us the things to think about in the Vanguard design? Last I checked there were quite a few Vanguard design variants, ranging from "two vanguards per guard, tree style" to some sort of mesh.
In particular, it would be convenient if there is a frontrunner design that really would benefit from relaxing many path restrictions, and a frontrunner design that is not so tied to the path restriction question.
But I do mean all restrictions, not just guard node choice. The adversary also gets to force you to use a second network path whenever they want via the /16 and node family restrictions.
Can you give us a specific example here, for this phrase "network path"? When you say "second network path" are you thinking in the Vanguard world?
We're not using one guard in the current Tor. We're using two, and the second one is only used for unmultiplexed activity. That is one property I don't like about our "let's pretend to use one guard" status quo.
Right, I agree.
I'd like to hear more about the "cleverly crafted exit policy" attack.
Another way to do this type of exit rotation attack is to cause a client to look up a DNS name where you control the resolver, and keep timing out on the DNS response. The client will then retry the stream request with a new exit. The same thing can also be done by timing out the TCP handshake to a server you control. Both of these attacks can be done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or family as the guard, and then the client uses a second network path for an unmultiplexed request at a time you control.
Hm! Yes, this is a yucky one. (I don't think just an img tag would be enough, because Tor will try a few circuits and then give up. You'd need some sort of javascript or refresh chain or the like that generates new addresses and tries them in succession. But that's totally feasible.)
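To get a rough feel for how many forced retries this attack needs, here is a minimal sketch. It assumes each retry picks an exit independently, and that some fraction p_conflict of the exit selection weight conflicts with the client's guard (same /16 or same family). Both the independence and any value you plug in for p_conflict are assumptions for illustration, not measurements of the network.

```python
import math


def expected_retries(p_conflict: float) -> float:
    """Mean number of exit picks until one conflicts with the guard.

    Under the independence assumption the count is geometric,
    so the mean is 1 / p_conflict.
    """
    if not 0.0 < p_conflict <= 1.0:
        raise ValueError("p_conflict must be in (0, 1]")
    return 1.0 / p_conflict


def retries_for_confidence(p_conflict: float, confidence: float) -> int:
    """Smallest n such that P(at least one conflict in n picks) >= confidence."""
    if not 0.0 < p_conflict <= 1.0:
        raise ValueError("p_conflict must be in (0, 1]")
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must be in [0, 1)")
    if p_conflict == 1.0:
        return 1
    # Solve 1 - (1 - p)^n >= confidence for n.
    n = math.log(1.0 - confidence) / math.log(1.0 - p_conflict)
    return max(1, math.ceil(n))
```

The point is only that the attacker's cost scales like 1/p_conflict stream retries, which a refresh chain can generate cheaply.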
This one is also yucky because we could also imagine a different way to pick your path, where when you're selecting your exit, you avoid choosing exits which would conflict with your guard, and thus you'll never be pushed off of your guard. But then the destination website can do this same attack over time and notice which exit you never try to use. So this is a case where to blend in best, we *need* to be willing to use all of the potential exits.
But since normal exit circuits are three hops, if we simply relax the path restrictions, we could be making a circuit of the form "A - B - A", which would not only stand out as weird to B, but actually right now a relay in B's position will refuse such a circuit. Bad news all around.
The three fixes that come to mind are
(A) "Have two guards": so you can pick any exit you like, and then just use the guard that doesn't conflict with the exit you picked.
(B) "Add a bonus hop when needed": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then if you find that you just chose your guard as your exit, insert an extra hop in the middle of that circuit.
(C) "Exits can't be Guards": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then notice that due to exit scarcity, guards aren't actually used in the exit position anyway. Then enforce that rule (so they can't be in the future either).
All three of these choices have downsides. But all three of them look like improvements over the current situation -- because of how crappy the current situation is.
(Rejected option (D): "Just start allowing it": Relax the /16 and family restrictions, and also relax the rule where relays refuse a circuit that goes right back where it came from. Giving the middle node that much information about the circuit just wigs me out.)
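To make the differences between (A), (B), and (C) concrete, here is a minimal sketch of each as a path-selection rule. The Relay type, its fields, and the function names are all hypothetical and made up for illustration; none of this is how the tor codebase structures path selection.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Relay:
    name: str
    addr: str                          # dotted-quad IPv4 address
    family: frozenset = frozenset()    # names of declared family members
    is_guard: bool = False
    is_exit: bool = False


def same_slash16(a: Relay, b: Relay) -> bool:
    """True if the first two octets of the addresses match."""
    return a.addr.split(".")[:2] == b.addr.split(".")[:2]


def conflicts(a: Relay, b: Relay) -> bool:
    """Current-style restrictions: same relay, same /16, or same family."""
    return (a.name == b.name or same_slash16(a, b)
            or b.name in a.family or a.name in b.family)


def option_a_guard(guards: List[Relay], exit_: Relay) -> Optional[Relay]:
    """(A) Keep the restrictions; use the first guard that does not conflict."""
    for g in guards:
        if not conflicts(g, exit_):
            return g
    return None  # no usable guard: the client still has to do *something*


def option_b_path(guard: Relay, middle: Relay, extra: Relay,
                  exit_: Relay) -> List[Relay]:
    """(B) Restrictions relaxed; add a bonus hop only when guard == exit."""
    if guard.name == exit_.name:
        return [guard, middle, extra, exit_]
    return [guard, middle, exit_]


def option_c_exits(relays: List[Relay]) -> List[Relay]:
    """(C) Restrictions relaxed; relays with the Guard flag are never exits."""
    return [r for r in relays if r.is_exit and not r.is_guard]
```

Note that option_a_guard can return None when every guard conflicts, which is the case where (A) alone is not enough and you need something like (B) or (C) as well.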
Also, notice that I think Mike's proposed design will turn out to be some combination of "A" and also something like "B" or "C", because even if you start with two guards, if you don't add a new guard right when your first guard goes down, you might find yourself in the situation where you have one working guard, and you pick it as your exit, and now you need to do *something*.
Our path restrictions also cause normal exiting clients to use a second guard for unmultiplexed activity, at adversary controlled times, or just periodically at random.
Just to make sure I understand: at least on the current network, that's because of the /16 rule and the family rule, and not because of the "if the exit you picked turns out to be your guard too, move to a different guard" rule, because exits aren't normally used for guards on our current network?
On more examination though, that's not something to rely on with our current design, since I bet there are weird edge cases like a relay loses its Guard flag, but it's still your Guard so you keep using it (depending on the advice of the year from #17773), but now the weightings let you pick it for your Exit, and oops.
Another problematic example would be a relay that you picked as your Guard, and later it opened up its exit policy and became an Exit.
So if I wanted to try to flesh out my "Then enforce that rule" approach above, we would need to (1) Have dir auths take away the Guard flag from relays that can be used as Exits, and (2) Make sure that clients know that if their guards lose the Guard flag, they should treat them as being no longer guardworthy. I think we're doing that second one right now, based on my latest reading of #17773, so this would actually be a pretty easy change. But still, it's not exactly elegant.
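A minimal sketch of what steps (1) and (2) could look like, covering the edge case where a guard later opens up its exit policy. The function names and the allows_exit parameter are hypothetical stand-ins; the real dir auth rules for assigning the Exit flag are more involved than a single boolean.

```python
def assign_flags(candidate_flags, allows_exit):
    """Dir auth side, step (1): a relay usable as an Exit never gets Guard.

    candidate_flags is the set of flags the relay would otherwise receive.
    """
    flags = set(candidate_flags)
    if allows_exit:
        flags.add("Exit")
        flags.discard("Guard")
    return flags


def still_guardworthy(consensus_flags):
    """Client side, step (2): a guard that lost the Guard flag is retired."""
    return "Guard" in consensus_flags


# A relay you picked as your guard, before and after it opens its exit policy.
before = assign_flags({"Guard", "Running"}, allows_exit=False)
after = assign_flags({"Guard", "Running"}, allows_exit=True)
```

With both rules in place, still_guardworthy(before) is true and still_guardworthy(after) is false, so the client moves off that relay before it can ever pick it in the exit position.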
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George has been putting some energy into looking more at them. The bugs that we don't understand are especially worrying, since it's hard to know how bad they are. Moving to two guards might put a bit of a bandaid on the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny service to clients at a particular guard, *without* letting those clients move to a second guard. This enables confirmation attacks, and these confirmation attacks can be extended to guard discovery attacks by DoSing guards one at a time until an onion service fails.
I would find non-onion-service examples more compelling here, since I want to avoid falling back into the "well, onion services need special treatment to be safe, so we have to choose between hurting normal clients and hurting onion services" trap.
How is this for an alternative scenario to be considering: the attacking website gives the Tor Browser user some page content that causes the browser to initiate periodic events. Then it starts congesting guards one at a time until the events stop arriving.
Are those two scenarios basically equivalent in terms of the confirmation attacks you are worrying about? I hope yes, and now I can stop getting distracted by wondering if going to this effort is worth it only to protect onion services? :)
You keep focusing on the performance aspects of conflux, but that is not the argument I am making. My arguments for conflux in Section 4 are about resilience to congestion, downtime, circuit killing, and DoS, as well as traffic analysis resistance. I see the performance benefits as secondary.
I like conflux in theory, but somebody needs to do the other 90% of the work to make it a concrete thing that we can consider.
I continue to think "Tor should switch to two guards, because one day we should design and deploy conflux" is a terrible reason to switch to two guards now.
So I didn't mean to mix the conflux discussion and the performance discussion. I meant to mostly ignore the conflux discussion (because it is a future proposal, not this one), while also making sure that we don't forget the potential performance benefits of having two guards in general.
But I wonder if we're looking at this backwards, and the primary question we should be asking is "How can we protect the transition between guards?" Then one of the potential answers to consider is "Maybe we should start out with two guards rather than just one." Framing it that way, are there more options that we should consider too? For example, removing the ability of the non-local attacker to trigger a transition? Then there would still be visibility of a transition, but the (non-local) attacker can't impact the timing of the transition. How much does that solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do, it means that there will be a risk of attacks that can confirm guard choice, because the downtime during this transition can never be hidden without at least some redundancy.
How's this for another option: clients have two guards, but they have a first guard and a backup guard. They do the traffic padding to both of them, to ensure continuous netflow sessions in their local ISP's logs. But they try to send most of their traffic over the first guard, thus avoiding most of the "increased surface area" concerns about using two guards at once. And we try to reduce the frequency of situations where they can't use their first guard. But in the "transition" situations that we decide we need to keep, they use their backup guard, and it's already available and ready and that netflow session is already active in the eyes of their ISP.
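Here is a minimal sketch of that first-guard-plus-backup idea, just to pin down the moving parts. The class and field names are hypothetical, and "usable" is a stand-in for whatever set of transition scenarios we decide to keep.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GuardState:
    name: str
    usable: bool = True   # hypothetical: "can currently build circuits for us"


class GuardPair:
    """Two open, padded connections; traffic prefers the first guard."""

    def __init__(self, first: GuardState, backup: GuardState):
        self.first = first
        self.backup = backup

    def padded_connections(self) -> List[str]:
        # Both connections stay open and padded at all times, so the ISP
        # sees two continuous netflow sessions whichever guard is in use.
        return [self.first.name, self.backup.name]

    def guard_for_new_circuit(self) -> Optional[str]:
        # Prefer the first guard; fall back only in a transition scenario.
        if self.first.usable:
            return self.first.name
        if self.backup.usable:
            return self.backup.name
        return None
```

The set of padded connections does not change when the client falls back, which is the property that hides the transition from a netflow-level observer.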
This approach isn't conflux (yet), but it's not incompatible with later changing things so we do conflux.
It also doesn't get us the lower variance of performance that having two equally used guards would get us. But I am ok with that for now, at least until somebody has done some performance analysis to show that we're really suffering now and we would stop suffering then.
It adds load onto the relays, by almost doubling the number of sockets used by guards for clients, and also by adding more bandwidth load from the padding cells to/from the backup guard. (How much bandwidth load is this, per client?)
And it doesn't actually provide as much "real" cover traffic onto the backup guard in most situations, so somebody who can look more thoroughly at the traffic flows will still be able to distinguish a transition event from the first to the backup. Maybe that's a problem? Or maybe the netflow level adversary that we declared in the threat model can't do that, and a real attacker would be able to see the traffic details anyway, so we're fine^W^Wno worse off than before?
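On the per-client bandwidth question above, a back-of-envelope sketch. Every default here is an assumption to be replaced with real numbers: the cell size on the wire, the mean gap between padding cells on an otherwise idle connection, and whether padding flows in one or both directions. TLS and TCP overhead are ignored.

```python
SECONDS_PER_DAY = 86400


def padding_bytes_per_second(cell_bytes: int = 514,
                             mean_interval_s: float = 5.5,
                             directions: int = 2) -> float:
    """Average padding load on one idle connection (all defaults assumed)."""
    if mean_interval_s <= 0:
        raise ValueError("mean_interval_s must be positive")
    return directions * cell_bytes / mean_interval_s


def padding_bytes_per_day(cell_bytes: int = 514,
                          mean_interval_s: float = 5.5,
                          directions: int = 2) -> float:
    """Same estimate scaled to a client that stays connected all day."""
    return SECONDS_PER_DAY * padding_bytes_per_second(
        cell_bytes, mean_interval_s, directions)
```

Since the backup connection is mostly idle, this is close to an upper bound for it: real traffic on a connection replaces padding rather than adding to it.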
Assuming this design meets all of our goals, let's examine two variants of it to make sure we understand what we're actually trading off. In particular, consider a design where we maintain (and pad) these two connections, vs a design where we maintain a connection to our first guard and then launch a connection to the backup guard on demand. The downside of keeping the backup connection open is the extra network-wide socket and bandwidth load on relays, while the downsides of launching a connection on demand are the risk that a local netflow-level ISP can see when we transition to using the backup guard, plus the risk that a remote attacker who can cripple guards will be able to notice the delay in the "launch on demand" case but could not distinguish the delay in the "two connections" case.
That second risk doesn't seem so scary to me, since local handshakes should be a small fraction of the overall time it takes to build and use a new circuit. But above you say "the downtime during this transition can never be hidden without at least some redundancy", so if you think this risk is scary, I'd like to hear more details about why. (Maybe the design you were concerned about was one where we just freeze in place and fail when we don't want to use our first guard? I agree, that's a bad design, and we can do better, for example by "be willing to use the second guard".)
Whereas that first risk does seem plausible to me -- worth trying to reduce. I think we should start by enumerating as many scary scenarios as we can (where scary means "currently we would shift away from our first guard"), and then fix as many of them as we can. Then we should look at the remaining scenarios where we would switch over to using our backup guard (like, when our first guard isn't able to build new circuits for us), and decide if the cost of the additional load on the network is worth hiding that transition timing from a netflow-level client-side-ISP adversary. I can see the answer being "yes, it's worth it", but I think it will be useful to have a good handle on which transition scenarios remain.
--Roger