In-line below for ease of comment. Also available at: https://gitweb.torproject.org/user/mikeperry/torspec.git/tree/proposals/xxx-...
===========================
Filename: xxx-two-guard-nodes.txt
Title: The move to two guard nodes
Author: Mike Perry
Created: 2018-03-22
Supersedes: Proposal 236
0. Background
Back in 2014, Tor moved from three guard nodes to one guard node[1,2,3].
We made this change primarily to limit points of observability of entry into the Tor network for clients and onion services, as well as to reduce the ability of an adversary to track clients as they move from one internet connection to another by their choice of guards.
1. Proposed changes
1.1. Switch to two guards per client
When this proposal becomes effective, clients will switch to using two guard nodes. The guard node selection algorithms of Proposal 271 will remain unchanged. Instead of having one primary guard "in use", Tor clients will always use two.
This will be accomplished by setting the guard-n-primary-guards-to-use consensus parameter to 2, as well as guard-n-primary-guards to 2. (Section 3.1 covers the reason for both parameters). This is equivalent to using the torrc option NumEntryGuards=2, which can be used for testing behavior prior to the consensus update.
1.2. Enforce Tor's path restrictions across this guard layer
In order to ensure that Tor can always build circuits using two guards without resorting to a third, they must be chosen such that Tor's path restrictions could still build a path with at least one of them, regardless of the other nodes in the path.
In other words, we must ensure that both guards are not chosen from the same /16 or the same node family. In this way, Tor will always be able to build a path using these guards, preventing the use of a third guard.
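A minimal sketch of this check, to make the constraint concrete. The guard records (dicts with "fp", "addr", and "family" keys) and the function names are hypothetical and are not Tor's data structures; the family test is simplified to set membership.

```python
import ipaddress
from typing import Optional


def same_slash16(addr_a: str, addr_b: str) -> bool:
    """True if two IPv4 addresses share the same /16 prefix."""
    net_a = ipaddress.ip_network(f"{addr_a}/16", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


def guards_conflict(g1: dict, g2: dict) -> bool:
    """True if the two guards could not coexist as our guard pair."""
    if g1["fp"] == g2["fp"]:
        return True
    if same_slash16(g1["addr"], g2["addr"]):
        return True
    # Simplified family check (assumption): a family is a set of fingerprints.
    return g2["fp"] in g1["family"] or g1["fp"] in g2["family"]


def pick_second_guard(first: dict, candidates: list) -> Optional[dict]:
    """Return the first candidate that shares no /16 or family with `first`."""
    for cand in candidates:
        if not guards_conflict(first, cand):
            return cand
    return None
```

Here pick_second_guard() walks an already-ordered candidate list; how that list is sampled and ordered is left to the unchanged Proposal 271 algorithm.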
2. Discussion
2.1. Why two guards?
The main argument for switching to two guards is that because of Tor's path restrictions, we're already using two guards, but we're using them in a suboptimal and potentially dangerous way.
Tor's path restrictions enforce the condition that the same node cannot appear twice in the same circuit, nor can nodes from the same /16 subnet or node family be used in the same circuit.
Tor's paths are also built such that the exit node is chosen first and held fixed during guard node choice, as are the IP, HSDIR, and RPs for onion services. This means that whenever one of these nodes happens to be the guard[4], or be in the same /16 or node family as the guard, Tor will build that circuit using a second "primary" guard, as per proposal 271[7].
Worse still, the choice of RP, IP, and exit can all be controlled by an adversary (to varying degrees), enabling them to force the use of a second guard at will.
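The forced fallback described above can be illustrated with a small sketch. It assumes a simplified relay record (hypothetical "fp", "slash16", and "family" fields) and a client that walks its primary guards in order; it is not the Proposal 271 code.

```python
def conflicts(guard: dict, last_hop: dict) -> bool:
    """True if path restrictions forbid using `guard` with this final hop."""
    return (
        guard["fp"] == last_hop["fp"]
        or guard["slash16"] == last_hop["slash16"]
        or last_hop["fp"] in guard["family"]
    )


def guard_index_for_circuit(primary_guards: list, last_hop: dict):
    """Index of the first primary guard usable with the fixed final hop."""
    for i, guard in enumerate(primary_guards):
        if not conflicts(guard, last_hop):
            return i
    return None  # no primary guard is usable for this circuit


g1 = {"fp": "G1", "slash16": "192.0", "family": set()}
g2 = {"fp": "G2", "slash16": "198.51", "family": set()}
ordinary_exit = {"fp": "X", "slash16": "203.0", "family": set()}
adversary_rp = g1  # the adversary names our first guard as the RP

# Normal case: the first primary guard is used.
normal_choice = guard_index_for_circuit([g1, g2], ordinary_exit)
# Adversarial case: the client is pushed onto its second primary guard.
forced_choice = guard_index_for_circuit([g1, g2], adversary_rp)
```

With a single regularly used guard, the second case is exactly the rarely used, unmultiplexed connection that the rest of this section worries about.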
Because this happens somewhat infrequently in normal operation, a fresh TLS connection will typically be created to the second "primary" guard, and that TLS connection will be used only for the circuit for that particular request. This property makes all sorts of traffic analysis attacks easier, because this TLS connection will not benefit from any multiplexing.
This is more serious than traffic injection via an already in-use guard because the lack of multiplexing means that the data retention level required to gain information from this activity is very low, and may exist for other reasons. To gain information from this behavior, an adversary needs only connection 5-tuples + timestamps, as opposed to detailed timeseries data that is polluted by other concurrent activity and padding.
In the most severe form of this attack, the adversary can take a suspect list of Tor client IP addresses (or the list of all Guard node IP addresses) and observe when secondary Tor connections are made to them at the time when they cycle through all guards as RPs for connections to an onion service. This adversary does not require collusion on the part of observers beyond the ability to provide 5-tuple connection logs (which ISPs may retain for reasons such as netflow accounting, IDS, or DoS protection systems).
A fully passive adversary can also make use of this behavior. Clients unlucky enough to pick guard nodes in heavily used /16s or in large node families will tend to make use of a second guard more frequently even without effort from the adversary. In these cases, the lack of multiplexing also means that observers along the path to this secondary guard gain more information per observation.
2.2. Why not MOAR guards?
We do not want to increase the number of observation points for client activity into the Tor network[1]. We merely want better multiplexing for the cases where this already happens.
2.3. Can you put some numbers on that?
The Changing of the Guards[13] paper studies this from a few different angles, but one of the crucially missing graphs is how long a client can expect to run with N guards before it chooses a malicious guard.
However, we do have tables in section 3.2.1 of proposal 247 that cover this[14]. There are three tables there: one for a 1% adversary, one for a 5% adversary, and one for a 10% adversary. You can see the probability of adversary success for one and two guards in terms of the number of rotations needed before the adversary's node is chosen. Not surprisingly, with two guards the adversary gets to compromise clients roughly twice as quickly, but the timescales are still rather large even for the 10% adversary: they only have a 50% chance of success after 4 rotations, which will take about 14 months with Tor's 3.5-month guard rotation.
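As a rough cross-check of those figures, here is a sketch under a deliberately simple model: every guard pick is independent and lands on an adversary node with probability equal to the adversary's share of guard selection weight. This model is an assumption made for illustration and is not necessarily the one behind the proposal 247 tables.

```python
import math


def p_compromised(adversary_fraction: float, n_guards: int, rotations: int) -> float:
    """Chance that at least one pick was malicious after `rotations` rotations."""
    picks = n_guards * rotations
    return 1.0 - (1.0 - adversary_fraction) ** picks


def rotations_until(adversary_fraction: float, n_guards: int, target: float = 0.5) -> int:
    """Smallest number of rotations at which p_compromised() reaches `target`."""
    picks_needed = math.log(1.0 - target) / math.log(1.0 - adversary_fraction)
    return math.ceil(picks_needed / n_guards)


ROTATION_MONTHS = 3.5  # guard rotation period quoted in the text

two_guard_rotations = rotations_until(0.10, n_guards=2)  # 4 rotations
one_guard_rotations = rotations_until(0.10, n_guards=1)  # 7 rotations
two_guard_months = two_guard_rotations * ROTATION_MONTHS  # 14.0 months
```

Under this model the 10% adversary crosses 50% after 4 rotations with two guards versus 7 with one, which is in line with the "roughly twice as quickly" statement above.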
2.4. What about guard fingerprinting?
More guards also means more fingerprinting[8]. However, even one guard may be enough to fingerprint a user who moves around in the same area, if that guard is low bandwidth or there are not many Tor users in that area.
Furthermore, our use of separate directory guards (and three of them) means that we're not really changing the situation much with the addition of another regular guard. Right now, directory guard use alone is enough to track all Tor users across the entire world.
While the directory guard problem could be fixed[12] (and should be fixed), it is still the case that another mechanism should be used for the general problem of guard-vs-location management[9].
3. Alternatives
There are two other solutions that also avoid the use of a secondary guard in the path restriction case.
3.1. Eliminate path restrictions entirely
If Tor decided to stop enforcing /16, node family, and also allowed the guard node to be chosen twice in the path, then under normal conditions, it should retain the use of its primary guard.
This approach is not as extreme as it seems on its face. In fact, it is hard to come up with arguments against removing these restrictions. Tor's /16 restriction is of questionable utility against monitoring, and it can be argued that since only good actors use node family, it gives influence over path selection to bad actors in ways that are worse than the benefit it provides to paths through good actors[10,11].
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
While using two guards means twice the surface area for these types of bugs, it also means that instances where they happen simultaneously on both guards (thus forcing a third guard) are much less likely than with just one guard. (In the passive adversary model, consider that one guard fails at any point with probability P1 and the other with probability P2. If we assume that such passive failures are independent events, both guards would fail concurrently with probability P1*P2. Even if the events are correlated, the chance of concurrent failure is still at most MIN(P1,P2).)
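A worked instance of that arithmetic, using made-up failure probabilities purely for illustration:

```python
def concurrent_failure(p1: float, p2: float) -> tuple:
    """Return (probability if independent, upper bound under any correlation)."""
    independent = p1 * p2
    upper_bound = min(p1, p2)  # P(A and B) can never exceed P(A) or P(B)
    return independent, upper_bound


# Hypothetical values: guard one is down 5% of the time, guard two 2%.
independent, upper_bound = concurrent_failure(0.05, 0.02)
```

With these example values, both guards are down together about 0.1% of the time if failures are independent, and never more than 2% of the time however strongly they are correlated.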
Note that for this analysis to hold, we have to ensure that nodes that are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause us to consider other primary guards beyond the two we have chosen. This is accomplished by setting guard-n-primary-guards to 2 (in addition to setting guard-n-primary-guards-to-use to 2). With this parameter set, the proposal 271 algorithm will avoid considering more than our two guards, unless *both* are down at once.
3.2. No Guard-flagged nodes as exit, RP, IP, or HSDIRs
Similar to 3.1, we could instead forbid the use of Guard-flagged nodes for the exit, IP, RP, and HSDIR positions.
This solution has two problems: First, like 3.1, it also does not handle the case where resource exhaustion could force the use of a second guard. Second, it requires clients to upgrade to the new behavior and stop using Guard-flagged nodes before it can be deployed.
4. The future is confluxed
An additional benefit of using a second guard is that it enables us to eventually use conflux[6].
Conflux works by giving circuits a 256-bit cookie that is sent to the exit/RP, and circuits that are then built to the same exit/RP with the same cookie can then be fused together. Throughput estimates are used to balance traffic between these circuits, depending on their performance.
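To make the cookie mechanism concrete, here is an illustrative sketch only. The class and function names are hypothetical, and the proportional split is a stand-in for whatever scheduling a real conflux design (see [6]) would use.

```python
import secrets
from collections import defaultdict


class ExitLinkTable:
    """Exit/RP-side table that groups circuits presenting the same cookie."""

    def __init__(self):
        self._by_cookie = defaultdict(list)

    def link(self, cookie: bytes, circuit_id: int) -> list:
        """Register a circuit and return all circuits fused under its cookie."""
        if len(cookie) != 32:
            raise ValueError("cookie must be 256 bits (32 bytes)")
        self._by_cookie[cookie].append(circuit_id)
        return list(self._by_cookie[cookie])


def traffic_shares(throughputs: dict) -> dict:
    """Split traffic across fused circuits in proportion to throughput."""
    total = sum(throughputs.values())
    return {cid: rate / total for cid, rate in throughputs.items()}


cookie = secrets.token_bytes(32)  # client-chosen 256-bit cookie
table = ExitLinkTable()
table.link(cookie, circuit_id=1)          # circuit via the first guard
fused = table.link(cookie, circuit_id=2)  # circuit via the second guard
shares = traffic_shares({1: 3.0, 2: 1.0})
```

In this toy example the two circuits are fused under one cookie, and the faster circuit carries three quarters of the traffic.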
We have unfortunately signaled to the research community that conflux is not worth pursuing, because of our insistence on a single guard. While not relevant to this proposal (indeed, conflux requires its own proposal and also concurrent research), it is worth noting that whichever way we go here, the door remains open to conflux because of its utility against similar issues.
If our conflux implementation includes packet acking, then circuits can still survive the loss of one guard node due to DoS, OOM, or other failures because the second half of the path will remain open and usable (see the probability of concurrent failure arguments in Section 3.1).
If exits remember this cookie for a short period of time after the last circuit is closed, the technique can be used to protect against DoS/OOM/guard downtime conditions that take down both guard nodes or destroy many circuits to confirm both guard node choices. In these cases, circuits could be rebuilt along an alternate path and resumed without end-to-end circuit connectivity loss. This same technique will also make things like ephemeral bridges (i.e., Snowflake/Flashproxy) more usable, because bridge uptime will no longer be so crucial to usability. It will also improve mobile usability by allowing us to resume connections after mobile Tor apps are briefly suspended, or if the user switches between cell and wifi networks.
Furthermore, it is likely that conflux will also be useful against traffic analysis and congestion attacks. Since the load balancing is dynamic and hard to predict by an external observer and also increases overall traffic multiplexing, traffic correlation and website traffic fingerprinting attacks will become harder, because the adversary can no longer be sure what percentage of the traffic they have seen (depending on their position and other potential concurrent activity). Similarly, it should also help dampen congestion attacks, since traffic will automatically shift away from a congested guard.
References:
1. https://blog.torproject.org/improving-tors-anonymity-changing-guard-paramete...
2. https://trac.torproject.org/projects/tor/ticket/12206
3. https://gitweb.torproject.org/torspec.git/tree/proposals/236-single-guard-no...
4. https://trac.torproject.org/projects/tor/ticket/14917
5. https://trac.torproject.org/projects/tor/ticket/25347#comment:14
6. https://www.cypherpunks.ca/~iang/pubs/conflux-pets.pdf
7. https://gitweb.torproject.org/torspec.git/tree/proposals/271-another-guard-s...
8. https://trac.torproject.org/projects/tor/ticket/9273#comment:3
9. https://tails.boum.org/blueprint/persistent_Tor_state/
10. https://trac.torproject.org/projects/tor/ticket/6676#comment:3
11. https://bugs.torproject.org/15060
12. https://trac.torproject.org/projects/tor/ticket/10969
13. https://www.freehaven.net/anonbib/cache/wpes12-cogs.pdf
14. https://gitweb.torproject.org/torspec.git/tree/proposals/247-hs-guard-discov...
On Sat, Mar 31, 2018 at 2:52 AM, Mike Perry mikeperry@torproject.org wrote:
In-line below for ease of comment. Also available at: https://gitweb.torproject.org/user/mikeperry/torspec.git/tree/proposals/xxx-...
===========================
Filename: xxx-two-guard-nodes.txt Title: The move to two guard nodes Author: Mike Perry Created: 2018-03-22 Supersedes: Proposal 236
Added as proposal 291!
Mike Perry mikeperry@torproject.org writes:
<snip>
3.1. Eliminate path restrictions entirely
If Tor decided to stop enforcing /16, node family, and also allowed the guard node to be chosen twice in the path, then under normal conditions, it should retain the use of its primary guard.
This approach is not as extreme as it seems on its face. In fact, it is hard to come up with arguments against removing these restrictions. Tor's /16 restriction is of questionable utility against monitoring, and it can be argued that since only good actors use node family, it gives influence over path selection to bad actors in ways that are worse than the benefit it provides to paths through good actors[10,11].
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
Hello Mike,
IMO we should not portray removing the above path restrictions as something extreme until we have good evidence that those path restrictions offer something positive in the cases we are examining. Personally, I see this proposal's result of making Sybil attacks twice as fast (section 2.3) as an equally radical result.
That said, I feel that this proposal is valuable and I'm not trying to say that I don't like this proposal, or that I don't buy the arguments. I'm trying to say that I don't know how to weigh the tradeoffs here so that I gain confidence, because I'm not sure how people are trying to attack Tor clients right now.
The way I see it is that if we adopt this proposal:
+ We are better defended against active attacks like congestion attacks and OOM/DoS attacks.
+ We improve network health by reducing congestion to certain guards.
- Sybil attacks can be performed two times more quickly.
IMO, we should not rush this decision for 034, given that it's a consensus parameter change that can happen instantaneously. However, we should do the following soon:
1) Accept that there is no single best guard topology, and fix our codebase to work well with either one guard or two guards, so that we are ready for when we flip the switch. Perhaps we can fix #25753/#25705/etc. in a way that works well both now and in the 2-guard future?
2) Investigate our current prop#271 codebase and make sure that the paragraph below will work as intended if we do this proposal.
3) Involve more people in this (Roger, NRL, etc.) and have them think about this, to gain more confidence.
Do you think this approach is too slow or backwards?
Just to speed it up, I just did (2) below:
Note that for this analysis to hold, we have to ensure that nodes that are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause us to consider other primary guards beyond the two we have chosen. This is accomplished by setting guard-n-primary-guards to 2 (in addition to setting guard-n-primary-guards-to-use to 2). With this parameter set, the proposal 271 algorithm will avoid considering more than our two guards, unless *both* are down at once.
OK, the above paragraph is basically the juice of this proposal! I spent all day today investigating how this would work! The results are very positive, but also not 100% straightforward because of the various intricacies of prop#271.
[First of all, there is no way to simulate the above topology using the config file because if you set NumEntryGuards=2 in your torrc, Tor will set up 4 primary guards because of the way get_n_primary_guards() works. So I hacked my Tor client to *have* 2 primary guards (guard-n-primary-guards), and *use* 2 primary guards (guard-n-primary-guards-to-use).]
The good part: This topology works exactly how the proposal wants it to work. Because of the way primary guards work, you will have 2 primary guards, and if one of them goes down you will always use the other primary, instead of falling back to a third guard. That's excellent, but it's also abusing the primary guard feature: in a good way, but not in the way we intended it to be used.
Here are the side-effects from this abuse:
- By reducing the number of primaries from three to two, it's more likely that all primaries can be down at a given time. Prop#271 was written with an inherent assumption that one of the primaries will always be reachable, because when all of them are down the code goes into an "oh shit! bad reachability!" mode which was mainly designed for network-down scenarios (like no-internet-land, or tunnels).
I'm referring to the UPDATE_WAITING section of prop#271 and entry_guards_upgrade_waiting_circuits() in our codebase, which takes care of this situation. This behavior will basically delay circuits on non-primary guards until a primary guard goes online. You can test this behavior by blocking connections to all your primaries using iptables. I did this today, and while Tor worked fine after some time, there were delays and broken circuits. It's very likely we can optimize this behavior if we want, so this is not really a blocker for this proposal, but something we should think about and experiment with...
We might also want to consider writing code to block clients from skipping to lower-priority primary guards if higher-priority primary guards are still reachable and guard-n-primary-guards-to-use > 1, so that we can have more primary guards than we need without skipping them when one of them goes down. That would allow us to get both the effect of prop#291 while maintaining the original use of primary guards.
- If we set the number of primary guards to 2 and we leave NumDirectoryGuards at 3, then NumDirectoryGuards will not work as intended, and we will actually always use our two primary guards for dirinfo as long as one of them is reachable. This is not a huge problem, and might be a feature, but it is not the way we were intending to use NumDirectoryGuards (see #13908 and https://lists.torproject.org/pipermail/tor-dev/2014-May/006820.html).
Other than the above side-effects, Tor worked fine all day and only connected to the primary guards, even when I blocked connections to one of them. It was actually quite nice to see!
---
Hope this was useful and let me know if you have questions!
Mike Perry mikeperry@torproject.org writes:
<snip>
3.1. Eliminate path restrictions entirely
If Tor decided to stop enforcing /16, node family, and also allowed the guard node to be chosen twice in the path, then under normal conditions, it should retain the use of its primary guard.
This approach is not as extreme as it seems on its face. In fact, it is hard to come up with arguments against removing these restrictions. Tor's /16 restriction is of questionable utility against monitoring, and it can be argued that since only good actors use node family, it gives influence over path selection to bad actors in ways that are worse than the benefit it provides to paths through good actors[10,11].
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
Seems like the above paragraph is our main argument against removing path restrictions.
Might be worth pointing out that if congestion/OOM attacks are in our threat model against the current single guard design, then the same adversary can force prop#291 to open a connection to the *third* guard by first doing an OOM/congestion attack against one of your first two guards, and then pushing you to your third guard using a path restriction attack (#14917).
Thought that I should mention that because it might be an argument for both moving to two guards and also lifting some path restrictions...
On Sat, Mar 31, 2018 at 06:52:51AM +0000, Mike Perry wrote:
The main argument for switching to two guards is that because of Tor's path restrictions, we're already using two guards, but we're using them in a suboptimal and potentially dangerous way.
Tor's path restrictions enforce the condition that the same node cannot appear twice in the same circuit, nor can nodes from the same /16 subnet or node family be used in the same circuit.
Tor's paths are also built such that the exit node is chosen first and held fixed during guard node choice, as are the IP, HSDIR, and RPs for onion services. This means that whenever one of these nodes happens to be the guard[4], or be in the same /16 or node family as the guard, Tor will build that circuit using a second "primary" guard, as per proposal 271[7].
Worse still, the choice of RP, IP, and exit can all be controlled by an adversary (to varying degrees), enabling them to force the use of a second guard at will.
I agree with you that we should do something about this bug, where Tor clients will switch to a rarely used guard in some situations. Our fix from ticket #14917 was not a good fix. More on that below in Section 3.1.
Not surprisingly, the two guard adversary gets to compromise clients roughly twice as quickly, but the timescales are still rather large even for the 10% adversary: they only have 50% chance of success after 4 rotations, which will take about 14 months with Tor's 3.5 month guard rotation.
Three thoughts here:
(A) You're right, 14 months doesn't sound bad here.
(B) This calculation was ignoring churn, right? That is, guards going away before you wanted to rotate from them. So another way to phrase that would be "once eight of your guards have gone away, you're in bad shape"? Looking at it that way, it seems like two guards is more than twice as scary as one, since *either* of them going away moves you one step closer on the path. Not the end of the world, but worth noticing. And maybe partially solvable by your "when one of your two goes away, stick to the remaining one" design; more on that below.
(C) Similarly, we should be sure to remember the network adversary here too. I don't know a simple way to reason about it well. Using more guards over time could be *less* than twice as scary, because sometimes the network paths overlap so you don't expose as much new surface area as you might have. And using more guards over time could be *more* than twice as scary, if the question is whether your traffic ever goes over that one bad place, since you have an exponentially low chance to *never* pick a guard where your traffic to/from that guard travels over the bad place. It really depends on your location, the guard locations, the Internet topology, and a bunch of other confusing factors.
Furthermore, our use of separate directory guards (and three of them) means that we're not really changing the situation much with the addition of another regular guard. Right now, directory guard use alone is enough to track all Tor users across the entire world.
Shit, you're right. The guard set fingerprint issue remains right now, because we never solved the directory guard side of it. :(
While the directory guard problem could be fixed[12] (and should be fixed), it is still the case that another mechanism should be used for the general problem of guard-vs-location management[9].
The part that freaks me out about all the designs I've seen here is the attack where the local adversary advertises a series of local wireless addresses, first to make you keep generating new guard contexts (similar to forcing quick guard rotation), or second to guess-and-check whether you've already got a guard context for some wireless address in the next city over. Maybe it can be solved by proper UI ("we'll just delegate the decision to the user"), but hoo boy. But that's a separate proposal fortunately. :)
3.1. Eliminate path restrictions entirely
I'm increasingly a fan of this option, the more I read these threads.
Let's examine the two attacker assumptions behind two of the attacks we're worried about.
Attack one: the client's local ISP collects coarse netflow logs, and these logs aren't detailed enough to allow a traffic volume detection attack on an existing long-lived TLS flow, so the connection to that first guard is safe; but a connection to that second guard will be unusual and not multiplexed and at exactly the time of the adversary-controlled circuit that triggered it, so that second guard, because it is used so rarely, is dangerous to use.
Attack two: if the client uses its guard as the first hop of its circuit and also the adversary-requested fourth hop, then the guard can do pairwise traffic correlation attacks on all of its circuits and realize that these two circuits it has are really two pieces of the same circuit.
This second attack seems weird to me. One reason is because in attack one we're brushing aside the traffic analysis as hard, whereas in attack two we're assuming it's trivial and perfect. But the simpler reason is: if your guard is going to participate in a traffic correlation attack against you, then it could just as easily team up with some other relay that the adversary picked. That is, avoiding reusing your guard on the other end of the circuit isn't going to save you if your guard is out to get you.
Part of why it's hard to compare these two attacks directly is because one is a client-side-observer adversary and the other is a relay-level adversary.
Let's look at "attack one" from a relay-level-adversary perspective: if your first guard is bad, you're screwed already. But if that second guard might be bad, you really want to do anything you can do to not reach out to it even once.
And "attack two" from the client-side-observer-level-adversary perspective: well, if the attacker is watching the *client*, there's no visible hint that it's reusing its guard later in the path -- and that's the whole point. But if the attacker is watching the *relay*, then suddenly we don't have as much diversity of traffic location as we thought we had. That is, even if your relay is nice, somebody watching the relay's network could do the pairwise correlation attacks we described earlier.
Another part of what bothers me about attack two -- the one where the adversary gives you your fourth hop -- is that the adversary has *other* hops in their side of the circuit, and you don't even know about them. What if they chose your guard for their middle hop? Or for *their* guard? There's nothing you can do about those cases, because you can't know that they're happening. My conclusion is that if we can't solve significant instances of this attack, we should be wary of paying a large price to solve only a piece of it.
If Tor decided to stop enforcing /16, node family, and also allowed the guard node to be chosen twice in the path, then under normal conditions, it should retain the use of its primary guard.
To be clear, the design I've been considering here is simply allowing reuse between the guard hop and the final hop, when it can't be avoided. I don't mean to allow the guard (or its family) to show up as all four hops in the path. Is that the same as what you meant, or did you mean something more thorough?
I think "can't be avoided" means HSDir, IP, RP -- which I note are all onion service related circuits.
I'd like to hear more about the "cleverly crafted exit policy" attack, and I wonder if we can't solve that differently. For example, if it's about making you do a request to a port that only one exit relay allows, and ha ha whoops your guard was on the same /16 as that exit relay... maybe it's time for the dir auths to not advertise super rare ports? This was one of the topics in the users-get-routed paper too.
One non-starter idea would be to move onion-service-related Tors to two guards, and leave other Tors at one guard. It's a non-starter because of course advertising which you are to your local network is no good. But that idea gave me a different perspective on this discussion: I wonder how much this design decision comes down to making all Tors use two guards in order to protect the onion-service-related Tors, which are the only ones who actually need it?
This approach is not as extreme as it seems on its face. In fact, it is hard to come up with arguments against removing these restrictions. Tor's /16 restriction is of questionable utility against monitoring, and it can be argued that since only good actors use node family, it gives influence over path selection to bad actors in ways that are worse than the benefit it provides to paths through good actors[10,11].
Yep.
One remaining feature for MyFamily though is that relay operators can say "No, even though I run these eight relays, I'm not in a position to do traffic correlation attacks on users, because I told the users to not put me in that position." This angle of the feature is about protecting relays, not about protecting clients.
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George has been putting some energy into looking more at them. The bugs that we don't understand are especially worrying, since it's hard to know how bad they are. Moving to two guards might put a bit of a bandaid on the issues, but it can't be our long-term plan for fixing them.
Note that for this analysis to hold, we have to ensure that nodes that are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause us to consider other primary guards beyond the two we have chosen. This is accomplished by setting guard-n-primary-guards to 2 (in addition to setting guard-n-primary-guards-to-use to 2). With this parameter set, the proposal 271 algorithm will avoid considering more than our two guards, unless *both* are down at once.
I like this general idea of not immediately replacing guards so long as you have a working one. In fact, we used to do something similar back in the day: https://blog.torproject.org/improving-tors-anonymity-changing-guard-paramete... says (emphasis mine) """ Tor 0.2.3's entry guard behavior is "choose three guards, ***adding another one if two of those three go down*** but going back to the original ones if they come back up, and also throw out (aka rotate) a guard 4-8 weeks after you chose it." """
There are still some fiddly decisions to make here. For example, as you say we probably shouldn't replace a guard just because we failed to connect to one of our guards once. We might decide that it's time to add a new second guard if the consensus tells us that one of them is down (so we have confirmation that it isn't down for just us, it's down for everybody). Or we might decide to wait on adding a new one even if it really is down, because maybe it'll come back soon. But how long do we wait? And if, while we're down to one, we encounter one of these situations where the requested fourth hop overlaps with our remaining guard, what do we do?
In fact, here's a hopefully useful insight that I've just realized: you're not concerned about one guard vs two guards, you're concerned about *transitioning* between guards. It's that moment when you're starting to use a new guard, if the attacker can observe that you're doing it, and especially if the attacker can make you do it, that is vulnerable. And starting with two guards can help, in that it postpones the time until you're forced to transition, and maybe also because if we do it right it can make the transition less visible.
But I wonder if we're looking at this backwards, and the primary question we should be asking is "How can we protect the transition between guards?" Then one of the potential answers to consider is "Maybe we should start out with two guards rather than just one." Framing it that way, are there more options that we should consider too? For example, removing the ability of the non-local attacker to trigger a transition? Then there would still be visibility of a transition, but the (non-local) attacker can't impact the timing of the transition. How much does that solve? Need to think more.
3.2. No Guard-flagged nodes as exit, RP, IP, or HSDIRs
Similar to 3.1, we could instead forbid the use of Guard-flagged nodes for the exit, IP, RP, and HSDIR positions.
This solution has two problems: First, like 3.1, it also does not handle the case where resource exhaustion could force the use of a second guard. Second, it requires clients to upgrade to the new behavior and stop using Guard flagged nodes before it can be deployed.
I'm not much of a fan of this approach (it seems so inelegant!), but I find the two problems that you identified to be unsatisfying for ruling it out. I wonder if we can find some stronger arguments against this approach?
Otherwise I might find myself starting to like it. :)
One stronger argument might be: "the attacker can always use Guard-flagged nodes for other hops on its half of the circuit, and you wouldn't even be able to know that it's doing it, so if the goal is to never have a circuit with your guard both at your end and also reused elsewhere in the circuit, sorry you can't achieve that goal, so stop messing stuff up while trying to achieve what can only ever be a partial solution."
4. The future is confluxed
An additional benefit of using a second guard is that it enables us to eventually use conflux[6].
I think the performance benefits are the main arguments in favor of doing two guards. In fact, I still think that it's mainly a performance-vs-safety tradeoff.
I agree with George that moving to two guards now so that we can maybe do Conflux later is doing it the wrong way round. Since it's so easy to switch to two guards, that should be one of the very easy steps in moving to Conflux when we do, and taking the safety hit now in exchange for the potential performance benefit later doesn't seem best.
But there's another performance argument we shouldn't forget: if you have two guards, you're much more likely to have at least one guard that's adequately fast. Right now some of the guards are fast (relative to others), and some are slow (relative to others). If you get one of the lower-end guards, your Tor performance is sad -- for months! We tried to mitigate that issue when we switched to one guard, by raising the required bandwidth to get the Guard flag, so there would be no truly terrible guards. But still, some guards are more equal than others.
This issue came up especially in the context of the December/January CPU overload attacks, where some guards were overwhelmed by circuit creation requests, and if you had a happy guard, lucky you, but if you had a sad guard, you might as well delete your Tor Browser and try again.
Now, in an ideal world we should come up with fixes for all of those other issues, for example by taking the Guard flag away from relays that can't be great guards. But in the world we live in right now, we can relieve some of that pressure-to-be-perfect by giving people two guards.
But if we're only going on a performance vs safety basis, I don't see a huge rush to trade off safety until we have a better handle on what sort of performance benefits we'd actually get, and until we've compared to other low-hanging performance fruit.
In summary:
(1) I think we should fix the bug from #14917 where the attacker can push us off our guard just by naming our guard as the HSDir/IP/RP, and I think we should fix it by being willing to reuse our guard when it can't be avoided. That step will resolve some, but not all, of the pressure about moving to two guards. Then
(2) Hopefully the above discussion has helped us move forward on the remaining reasons for switching to two guards. To me the two biggest questions left to resolve are (a) how best to protect the vulnerable transition to a new guard, and if two guards is the best idea we've got for that, and (b) how big an issue is it really that having only one guard can sometimes give you a low-performance guard, and if two guards is the best idea we've got for that one too.
--Roger
Roger Dingledine:
On Sat, Mar 31, 2018 at 06:52:51AM +0000, Mike Perry wrote:
3.1. Eliminate path restrictions entirely
I'm increasingly a fan of this option, the more I read these threads.
Let's examine the two attacker assumptions behind two of the attacks we're worried about.
Attack one: the client's local ISP collects coarse netflow logs, and these logs aren't detailed enough to allow a traffic volume detection attack on an existing long-lived TLS flow, so the connection to that first guard is safe; but a connection to that second guard will be unusual and not multiplexed and at exactly the time of the adversary-controlled circuit that triggered it, so that second guard, because it is used so rarely, is dangerous to use.
Attack two: if the client uses its guard as the first hop of its circuit and also the adversary-requested fourth hop, then the guard can do pairwise traffic correlation attacks on all of its circuits and realize that these two circuits it has are really two pieces of the same circuit.
This second attack seems weird to me. One reason is because in attack one we're brushing aside the traffic analysis as hard, whereas in attack two we're assuming it's trivial and perfect. But the simpler reason is: if your guard is going to participate in a traffic correlation attack against you, then it could just as easily team up with some other relay that the adversary picked. That is, avoiding reusing your guard on the other end of the circuit isn't going to save you if your guard is out to get you.
I agree. I am not concerned about attack two. But we're not choosing between just these two attacks.
To be clear, the design I've been considering here is simply allowing reuse between the guard hop and the final hop, when it can't be avoided. I don't mean to allow the guard (or its family) to show up as all four hops in the path. Is that the same as what you meant, or did you mean something more thorough?
By all path restrictions I mean for the last hop of the circuit and the first (though vanguards would be simpler if we got rid of them for other hops, too). But I do mean all restrictions, not just guard node choice. The adversary also gets to force you to use a second network path whenever they want via the /16 and node family restrictions. And it happens naturally all the time.
We're not using one guard in the current Tor. We're using two, and the second one is only used for unmultiplexed activity. That is one property I don't like about our "let's pretend to use one guard" status quo.
The second thing I don't like is that one guard is fragile, which enables confirmation attacks when it can be made to go down.
I think "can't be avoided" means HSDir, IP, RP -- which I note are all onion service related circuits.
I'd like to hear more about the "cleverly crafted exit policy" attack, and I wonder if we can't solve that differently. For example, if it's about making you do a request to a port that only one exit relay allows, and ha ha whoops your guard was on the same /16 as that exit relay... maybe it's time for the dir auths to not advertise super rare ports? This was one of the topics in the users-get-routed paper too.
Yes that is the one I was talking about.
However, another way to do this type of exit rotation attack is to cause a client to look up a DNS name where you control the resolver, and keep timing out on the DNS response. The client will then retry the stream request with a new exit. The same thing can also be done by timing out the TCP handshake to a server you control. Both of these attacks can be done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or family as the guard, and then the client uses a second network path for an unmultiplexed request at a time you control.
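To make the condition the attacker is waiting for concrete, here is a minimal sketch of the /16 and family check, written in Python rather than taken from the Tor source. The `Relay` record, its `family` label, and the function names are all hypothetical, and real family handling (mutual declarations in descriptors) is more involved than a single label.

```python
import ipaddress
from typing import NamedTuple, Optional, Sequence


class Relay(NamedTuple):
    """Hypothetical relay record, not Tor's real data structure."""
    ident: str
    addr: str                  # IPv4 address as a string
    family: Optional[str]      # simplified: one family label, or None


def same_slash16(addr_a: str, addr_b: str) -> bool:
    """True if both IPv4 addresses fall inside the same /16 network."""
    net_a = ipaddress.ip_network(f"{addr_a}/16", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


def conflicts(guard: Relay, exit_relay: Relay) -> bool:
    """True if this guard cannot be used with this exit under the restrictions."""
    if guard.ident == exit_relay.ident:
        return True
    if same_slash16(guard.addr, exit_relay.addr):
        return True
    return guard.family is not None and guard.family == exit_relay.family


def retries_until_conflict(guard: Relay, exits: Sequence[Relay]) -> Optional[int]:
    """Count forced exit rotations until an exit conflicts with the guard."""
    for attempt, exit_relay in enumerate(exits, start=1):
        if conflicts(guard, exit_relay):
            return attempt
    return None
```

With a single guard, the attempt number returned here is the point at which the client has to reach for a second guard.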
One non-starter idea would be to move onion-service-related Tors to two guards, and leave other Tors at one guard. It's a non-starter because of course advertising which you are to your local network is no good. But that idea gave me a different perspective on this discussion: I wonder how much this design decision comes down to making all Tors use two guards in order to protect the onion-service-related Tors, which are the only ones who actually need it?
Our path restrictions also cause normal exiting clients to use a second guard for unmultiplexed activity, at adversary controlled times, or just periodically at random.
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George has been putting some energy into looking more at them. The bugs that we don't understand are especially worrying, since it's hard to know how bad they are. Moving to two guards might put a bit of a bandaid on the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny service to clients at a particular guard, *without* letting those clients move to a second guard. This enables confirmation attacks, and these confirmation attacks can be extended to guard discovery attacks by DoSing guards one at a time until an onion service fails.
Bringing back CREATE_FAST could help with this piece, I suppose, but it doesn't solve OOM attacks...
Note that for this analysis to hold, we have to ensure that nodes that are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause us to consider other primary guards beyond the two we have chosen. This is accomplished by setting guard-n-primary-guards to 2 (in addition to setting guard-n-primary-guards-to-use to 2). With this parameter set, the proposal 271 algorithm will avoid considering more than our two guards, unless *both* are down at once.
I like this general idea of not immediately replacing guards so long as you have a working one. In fact, we used to do something similar back in the day: https://blog.torproject.org/improving-tors-anonymity-changing-guard-paramete... says (emphasis mine) """ Tor 0.2.3's entry guard behavior is "choose three guards, ***adding another one if two of those three go down*** but going back to the original ones if they come back up, and also throw out (aka rotate) a guard 4-8 weeks after you chose it." """
There are still some fiddly decisions to make here. For example, as you say we probably shouldn't replace a guard just because we failed to connect to one of our guards once. We might decide that it's time to add a new second guard if the consensus tells us that one of them is down (so we have confirmation that it isn't down for just us, it's down for everybody). Or we might decide to wait on adding a new one even if it really is down, because maybe it'll come back soon. But how long do we wait? And if, while we're down to one, we encounter one of these situations where the requested fourth hop overlaps with our remaining guard, what do we do?
If I were to drop everything to build the Tor I think should exist, I would do the following:
1. Use two guards, replacing them only when both are unreachable, or when one leaves the consensus.
2. Make path restrictions not as strict (for cases like the one above).
3. Use conflux (which also needs less strict/no path restrictions).
4. Build it on QUIC.
I would do them in that order because I think we get the most benefit from #1, and we get some benefit from #2 still (as you point out above).
You keep focusing on the performance aspects of conflux, but that is not the argument I am making. My arguments for conflux in Section 4 are about resilience to congestion, downtime, circuit killing, and DoS, as well as traffic analysis resistance. I see the performance benefits as secondary.
(I also think the best arguments for QUIC are in the reliability direction, because fixed queues mean no adversary-provoked OOMing.)
In fact, here's a hopefully useful insight that I've just realized: you're not concerned about one guard vs two guards, you're concerned about *transitioning* between guards. It's that moment when you're starting to use a new guard, if the attacker can observe that you're doing it, and especially if the attacker can make you do it, that is vulnerable. And starting with two guards can help, in that it postpones the time until you're forced to transition, and maybe also because if we do it right it can make the transition less visible.
The transition aspect is a big piece of it, but I think we're also running into a fragility problem, which makes the transition signal very loud in many cases.
But I wonder if we're looking at this backwards, and the primary question we should be asking is "How can we protect the transition between guards?" Then one of the potential answers to consider is "Maybe we should start out with two guards rather than just one." Framing it that way, are there more options that we should consider too? For example, removing the ability of the non-local attacker to trigger a transition? Then there would still be visibility of a transition, but the (non-local) attacker can't impact the timing of the transition. How much does that solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do, it means that there will be a risk of attacks that can confirm guard choice, because the downtime during this transition can never be hidden without at least some redundancy.
In summary:
(1) I think we should fix the bug from #14917 where the attacker can push us off our guard just by naming our guard as the HSDir/IP/RP, and I think we should fix it by being willing to reuse our guard when it can't be avoided. That step will resolve some, but not all, of the pressure about moving to two guards. Then
Without removing all path restrictions that apply to first and last hop, we're still actually using two guards, and using them at times that the adversary gets to control if they want, or just randomly otherwise.
(2) Hopefully the above discussion has helped us move forward on the remaining reasons for switching to two guards. To me the two biggest questions left to resolve are (a) how best to protect the vulnerable transition to a new guard, and if two guards is the best idea we've got for that, and (b) how big an issue is it really that having only one guard can sometimes give you a low-performance guard, and if two guards is the best idea we've got for that one too.
Transitions will always be noisy with one guard, because it is fragile to DoS, congestion, OOM, circuit failure, onionskin overload, etc etc etc. How can you provide resiliency under arbitrary and partial failure without any redundancy?
On Wed, Apr 11, 2018 at 11:15:44AM +0000, Mike Perry wrote:
To be clear, the design I've been considering here is simply allowing reuse between the guard hop and the final hop, when it can't be avoided. I don't mean to allow the guard (or its family) to show up as all four hops in the path. Is that the same as what you meant, or did you mean something more thorough?
By all path restrictions I mean for the last hop of the circuit and the first (though vanguards would be simpler if we got rid of them for other hops, too).
Can you lay out for us the things to think about in the Vanguard design? Last I checked there were quite a few Vanguard design variants, ranging from "two vanguards per guard, tree style" to some sort of mesh.
In particular, it would be convenient if there is a frontrunner design that really would benefit from relaxing many path restrictions, and a frontrunner design that is not so tied together to the path restriction question.
But I do mean all restrictions, not just guard node choice. The adversary also gets to force you to use a second network path whenever they want via the /16 and node family restrictions.
Can you give us a specific example here, for this phrase "network path"? When you say "second network path" are you thinking in the Vanguard world?
We're not using one guard in the current Tor. We're using two, and the second one is only used for unmultiplexed activity. That is one property I don't like about our "let's pretend to use one guard" status quo.
Right, I agree.
I'd like to hear more about the "cleverly crafted exit policy" attack
another way to do this type of exit rotation attack is to cause a client to look up a DNS name where you control the resolver, and keep timing out on the DNS response. The client will then retry the stream request with a new exit. The same thing can also be done by timing out the TCP handshake to a server you control. Both of these attacks can be done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or family as the guard, and then the client uses a second network path for an unmultiplexed request at a time you control.
Hm! Yes, this is a yucky one. (I don't think just an img tag would be enough, because Tor will try a few circuits and then give up. You'd need some sort of javascript or refresh chain or the like that generates new addresses and tries them in succession. But that's totally feasible.)
This one is also yucky because we could also imagine a different way to pick your path, where when you're selecting your exit, you avoid choosing exits which would conflict with your guard, and thus you'll never be pushed off of your guard. But then the destination website can do this same attack over time and notice which exit you never try to use. So this is a case where to blend in best, we *need* to be willing to use all of the potential exits.
But since normal exit circuits are three hops, if we simply relax the path restrictions, we could be making a circuit of the form "A - B - A", which would not only stand out as weird to B, but actually right now a relay in B's position will refuse such a circuit. Bad news all around.
The three fixes that come to mind are
(A) "Have two guards": so you can pick any exit you like, and then just use the guard that doesn't conflict with the exit you picked.
(B) "Add a bonus hop when needed": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then if you find that you just chose your guard as your exit, insert an extra hop in the middle of that circuit.
(C) "Exits can't be Guards": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then notice that due to exit scarcity, guards aren't actually used in the exit position anyway. Then enforce that rule (so they can't be in the future either).
All three of these choices have downsides. But all three of them look like improvements over the current situation -- because of how crappy the current situation is.
(Rejected option (D): "Just start allowing it": Relax the /16 and family restrictions, and also relax the rule where relays refuse a circuit that goes right back where it came from. Giving the middle node that much information about the circuit just wigs me out.)
Also, notice that I think Mike's proposed design will turn out to be some combination of "A" and also something like "B" or "C", because even if you start with two guards, if you don't add a new guard right when your first guard goes down, you might find yourself in the situation where you have one working guard, and you pick it as your exit, and now you need to do *something*.
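As a rough illustration of how "A" might combine with something like "B", the sketch below picks a non-conflicting guard when one exists and otherwise relaxes the /16 and family checks, adding a bonus hop only when the sole usable guard is also the chosen exit. It is Python for readability, not Tor code; the `Relay` record, the `family` label, and every function name are assumptions made for this example.

```python
import ipaddress
from typing import List, NamedTuple, Optional, Sequence


class Relay(NamedTuple):
    """Hypothetical relay record, not Tor's real data structure."""
    ident: str
    addr: str
    family: Optional[str]


def conflicts(guard: Relay, exit_relay: Relay) -> bool:
    """Same relay, same /16, or same (simplified) family label."""
    if guard.ident == exit_relay.ident:
        return True
    net = ipaddress.ip_network(f"{guard.addr}/16", strict=False)
    if ipaddress.ip_address(exit_relay.addr) in net:
        return True
    return guard.family is not None and guard.family == exit_relay.family


def plan_path(usable_guards: Sequence[Relay], exit_relay: Relay,
              middle: Relay, bonus_middle: Relay) -> List[str]:
    """Return relay identifiers for the circuit, first hop first."""
    if not usable_guards:
        raise ValueError("no usable guard")
    # Option A: with two guards, one of them normally does not conflict.
    for guard in usable_guards:
        if not conflicts(guard, exit_relay):
            return [guard.ident, middle.ident, exit_relay.ident]
    # Fallback in the spirit of option B: /16 and family are relaxed, and
    # an extra hop is inserted only if the guard itself is the exit.
    guard = usable_guards[0]
    if guard.ident == exit_relay.ident:
        return [guard.ident, middle.ident, bonus_middle.ident, exit_relay.ident]
    return [guard.ident, middle.ident, exit_relay.ident]
```

The fallback branch is only reached when every usable guard conflicts with the exit, which with two well-chosen guards should mean that one of them is currently down.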
Our path restrictions also cause normal exiting clients to use a second guard for unmultiplexed activity, at adversary controlled times, or just periodically at random.
Just to make sure I understand: at least on the current network, that's because of the /16 rule and the family rule, and not because of the "if the exit you picked turns out to be your guard too, move to a different guard" rule, because exits aren't normally used for guards on our current network?
On more examination though, that's not something to rely on with our current design, since I bet there are weird edge cases like a relay loses its Guard flag, but it's still your Guard so you keep using it (depending on the advice del año from #17773), but now the weightings let you pick it for your Exit, and oops.
Another problematic example would be a relay that you picked as your Guard, and later it opened up its exit policy and became an Exit.
So if I wanted to try to flesh out my "Then enforce that rule" approach above, we would need to (1) Have dir auths take away the Guard flag from relays that can be used as Exits, and (2) Make sure that clients know that if their guards lose the Guard flag, they should treat them as being no longer guardworthy. I think we're doing that second one right now, based on my latest reading of #17773, so this would actually be a pretty easy change. But still, it's not exactly elegant.
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George has been putting some energy into looking more at them. The bugs that we don't understand are especially worrying, since it's hard to know how bad they are. Moving to two guards might put a bit of a bandaid on the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny service to clients at a particular guard, *without* letting those clients move to a second guard. This enables confirmation attacks, and these confirmation attacks can be extended to guard discovery attacks by DoSing guards one at a time until an onion service fails.
I would find non-onion-service examples more compelling here, since I want to avoid falling back into the "well, onion services need special treatment to be safe, so we have to choose between hurting normal clients and hurting onion services" trap.
How is this for an alternative scenario to be considering: the attacking website gives the Tor Browser user some page content that causes the browser to initiate periodic events. Then it starts congesting guards one at a time until the events stop arriving.
Are those two scenarios basically equivalent in terms of the confirmation attacks you are worrying about? I hope yes, and now I can stop getting distracted by wondering if going to this effort is worth it only to protect onion services? :)
You keep focusing on the performance aspects of conflux, but that is not the argument I am making. My arguments for conflux in Section 4 are about resilience to congestion, downtime, circuit killing, and DoS, as well as traffic analysis resistance. I see the performance benefits as secondary.
I like conflux in theory, but somebody needs to do the other 90% of the work to make it a concrete thing that we can consider.
I continue to think "Tor should switch to two guards, because one day we should design and deploy conflux" is a terrible reason to switch to two guards now.
So I didn't mean to mix the conflux discussion and the performance discussion. I meant to mostly ignore the conflux discussion (because it is a future proposal, not this one), while also making sure that we don't forget the potential performance benefits of having two guards in general.
But I wonder if we're looking at this backwards, and the primary question we should be asking is "How can we protect the transition between guards?" Then one of the potential answers to consider is "Maybe we should start out with two guards rather than just one." Framing it that way, are there more options that we should consider too? For example, removing the ability of the non-local attacker to trigger a transition? Then there would still be visibility of a transition, but the (non-local) attacker can't impact the timing of the transition. How much does that solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do, it means that there will be a risk of attacks that can confirm guard choice, because the downtime during this transition can never be hidden without at least some redundancy.
How's this for another option: clients have two guards, but they have a first guard and a backup guard. They do the traffic padding to both of them, to ensure continuous netflow sessions in their local ISP's logs. But they try to send most of their traffic over the first guard, thus avoiding most of the "increased surface area" concerns about using two guards at once. And we try to reduce the frequency of situations where they can't use their first guard. But in the "transition" situations that we decide we need to keep, they use their backup guard, and it's already available and ready and that netflow session is already active in the eyes of their ISP.
This approach isn't conflux (yet), but it's not incompatible with later changing things so we do conflux.
It also doesn't get us the lower variance of performance that having two equally used guards would get us. But I am ok with that for now, at least until somebody has done some performance analysis to show that we're really suffering now and we would stop suffering then.
It adds load onto the relays, by almost doubling the number of sockets used by guards for clients, and also by adding more bandwidth load from the padding cells to/from the backup guard. (How much bandwidth load is this, per client?)
And it doesn't actually provide as much "real" cover traffic onto the backup guard in most situations, so somebody who can look more thoroughly at the traffic flows will still be able to distinguish a transition event from the first to the backup. Maybe that's a problem? Or maybe the netflow level adversary that we declared in the threat model can't do that, and a real attacker would be able to see the traffic details anyway, so we're fine^W^Wno worse off than before?
Assuming this design meets all of our goals, let's examine two variants of it to make sure we understand what we're actually trading off. In particular, consider a design where we maintain (and pad) these two connections, vs a design where we maintain a connection to our first guard and then launch a connection to the backup guard on demand. The downside of keeping the backup connection open is the extra network-wide socket and bandwidth load on relays, while the downsides of launching a connection on demand are the risk that a local netflow-level ISP can see when we transition to using the backup guard, plus the risk that a remote attacker who can cripple guards will be able to notice the delay in the "launch on demand case" but could not distinguish the delay in the "two connections" case.
That second risk doesn't seem so scary to me, since local handshakes should be a small fraction of the overall time it takes to build and use a new circuit. But above you say "the downtime during this transition can never be hidden without at least some redundancy", so if you think this risk is scary, I'd like to hear more details about why. (Maybe the design you were concerned about was one where we just freeze in place and fail when we don't want to use our first guard? I agree, that's a bad design, and we can do better, for example by "be willing to use the second guard".)
Whereas that first risk does seem plausible to me -- worth trying to reduce. I think we should start by enumerating as many scary scenarios as we can (where scary means "currently we would shift away from our first guard"), and then fix as many of them as we can. Then we should look at the remaining scenarios where we would switch over to using our backup guard (like, when our first guard isn't able to build new circuits for us), and decide if the cost of the additional load on the network is worth hiding that transition timing from a netflow-level client-side-ISP adversary. I can see the answer being "yes, it's worth it", but I think it will be useful to have a good handle on which transition scenarios remain.
--Roger
Roger Dingledine:
On Wed, Apr 11, 2018 at 11:15:44AM +0000, Mike Perry wrote:
To be clear, the design I've been considering here is simply allowing reuse between the guard hop and the final hop, when it can't be avoided. I don't mean to allow the guard (or its family) to show up as all four hops in the path. Is that the same as what you meant, or did you mean something more thorough?
By all path restrictions I mean for the last hop of the circuit and the first (though vanguards would be simpler if we got rid of them for other hops, too).
Can you lay out for us the things to think about in the Vanguard design? Last I checked there were quite a few Vanguard design variants, ranging from "two vanguards per guard, tree style" to some sort of mesh.
In particular, it would be convenient if there is a frontrunner design that really would benefit from relaxing many path restrictions, and a frontrunner design that is not so tied together to the path restriction question.
There are two frontrunner forms. One has no path restrictions, the other would try to perform restriction checks on each layer to ensure that it is valid and doesn't leak info about other layers or prevent circuit creation.
They are otherwise the same. Both are mesh; both are tunable in the number of guards and rotation times in each layer.
I am leaning towards "no restrictions" for vanguards for 0.3.4 because it is simpler, and it did not strike me that the arguments in favor of the restriction checks justified trying to implement them quickly in a way that might cause reachability or path influence risks.
But I do mean all restrictions, not just guard node choice. The adversary also gets to force you to use a second network path whenever they want via the /16 and node family restrictions.
Can you give us a specific example here, for this phrase "network path"? When you say "second network path" are you thinking in the Vanguard world?
Second path to entry into the Tor network (and a second guard), regardless of vanguards.
I'd like to hear more about the "cleverly crafted exit policy" attack
another way to do this type of exit rotation attack is to cause a client to look up a DNS name where you control the resolver, and keep timing out on the DNS response. The client will then retry the stream request with a new exit. The same thing can also be done by timing out the TCP handshake to a server you control. Both of these attacks can be done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or family as the guard, and then the client uses a second network path for an unmultiplexed request at a time you control.
The three fixes that come to mind are
(A) "Have two guards": so you can pick any exit you like, and then just use the guard that doesn't conflict with the exit you picked.
(B) "Add a bonus hop when needed": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then if you find that you just chose your guard as your exit, insert an extra hop in the middle of that circuit.
(C) "Exits can't be Guards": First relax the /16 and family restrictions, so the remaining issue is reuse of your guard. Then notice that due to exit scarcity, guards aren't actually used in the exit position anyway. Then enforce that rule (so they can't be in the future either).
All three of these choices have downsides. But all three of them look like improvements over the current situation -- because of how crappy the current situation is.
(Rejected option (D): "Just start allowing it": Relax the /16 and family restrictions, and also relax the rule where relays refuse a circuit that goes right back where it came from. Giving the middle node that much information about the circuit just wigs me out.)
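Fix (A) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Relay` type, its `net16` and `family` fields, and the helper names are hypothetical and do not correspond to anything in the Tor codebase.

```python
from typing import NamedTuple, Optional, Sequence

class Relay(NamedTuple):
    name: str
    net16: str   # first two octets of the address, e.g. "10.1" (assumed field)
    family: str  # hypothetical family label, "" when the relay has none

def compatible(a: Relay, b: Relay) -> bool:
    """True if the two relays may appear together in one circuit."""
    if a.name == b.name or a.net16 == b.net16:
        return False
    return not (a.family and a.family == b.family)

def guard_for_exit(guards: Sequence[Relay], exit_relay: Relay) -> Optional[Relay]:
    """Fix (A): pick the exit first, then use a guard that does not conflict."""
    for guard in guards:
        if compatible(guard, exit_relay):
            return guard
    return None

g1 = Relay("G1", "10.1", "famA")
g2 = Relay("G2", "10.2", "")
exit_relay = Relay("E", "10.1", "")   # shares a /16 with G1

chosen = guard_for_exit([g1, g2], exit_relay)
```

A `None` result means no listed guard is usable for that exit, for example when only one guard is up, which is the situation where something like (B) or (C) would still be needed.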
Also, notice that I think Mike's proposed design will turn out to be some combination of "A" and also something like "B" or "C", because even if you start with two guards, if you don't add a new guard right when your first guard goes down, you might find yourself in the situation where you have one working guard, and you pick it as your exit, and now you need to do *something*.
The one-guard-down case does impact things. But even when this does happen (which should be rare), it should only be true for a small window of time before the consensus updates.
The "down" guard should either be temporarily overloaded, or fully down and kicked off the consensus. I think we should only add a new guard when one falls out of the consensus, or both are unreachable/unusable.
This is why I think it is OK to take an incremental approach and start with A, and roll out things like B and C and other restriction relaxations.
During these edge cases, the most important property that we should strive to preserve is overall reachability. I don't like situations where the adversary gains information by certain nodes being overloaded or down. In my view, trying to make smart decisions to minimize exposure to more nodes is secondary to overall reachability. (Loss of overall reachability allows a *non-network* adversary to gain information about how clients are using our network. That strikes me as a lower resource, more dangerous attack than the unknown risk of possible partial network observers. In other words, I believe we made the right short-term call in #14917 in terms of preserving reachability.)
Our path restrictions also cause normal exiting clients to use a second guard for unmultiplexed activity, at adversary controlled times, or just periodically at random.
Just to make sure I understand: at least on the current network, that's because of the /16 rule and the family rule, and not because of the "if the exit you picked turns out to be your guard too, move to a different guard" rule, because exits aren't normally used for guards on our current network?
On more examination though, that's not something to rely on with our current design, since I bet there are weird edge cases like a relay loses its Guard flag, but it's still your Guard so you keep using it (depending on the advice from #17773), but now the weightings let you pick it for your Exit, and oops.
Another problematic example would be a relay that you picked as your Guard, and later it opened up its exit policy and became an Exit.
I am in favor of preventing guards from being exits. Intuitively, it means fewer "one stop shop" surveillance points to see both entry and exit traffic. It also makes flag-based load balancing equations much simpler, and makes it easier to account for padding overhead.
So if I wanted to try to flesh out my "Then enforce that rule" approach above, we would need to (1) Have dir auths take away the Guard flag from relays that can be used as Exits, and (2) Make sure that clients know that if their guards lose the Guard flag, they should treat them as being no longer guardworthy. I think we're doing that second one right now, based on my latest reading of #17773, so this would actually be a pretty easy change. But still, it's not exactly elegant.
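The two steps in that paragraph could look roughly like the Python sketch below. It is an illustration only: the function names are made up, flags are modeled as plain string sets, and "can be used as an Exit" is approximated by the presence of an `Exit` flag, which is an assumption rather than how the directory authorities would necessarily decide it.

```python
def strip_guard_from_exits(flags_by_relay):
    """Step (1), dir-auth side: a relay usable as an exit never keeps Guard."""
    return {
        name: (set(flags) - {"Guard"} if "Exit" in flags else set(flags))
        for name, flags in flags_by_relay.items()
    }

def still_guardworthy(guard_name, consensus_flags):
    """Step (2), client side: a guard that lost the Guard flag is not guardworthy."""
    return "Guard" in consensus_flags.get(guard_name, set())

# Hypothetical votes before the rule is applied.
votes = {
    "relayA": {"Guard", "Stable"},
    "relayB": {"Guard", "Exit", "Stable"},  # opened its exit policy
    "relayC": {"Exit"},
}
consensus = strip_guard_from_exits(votes)
```

Under this model a client that had picked `relayB` as a guard would stop treating it as one as soon as the consensus drops its Guard flag.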
In the world where we keep path restrictions, these would be my rules:
1. Two equal guards, not chosen from the same /16 or family.
2. Choose the members of each vanguard layer such that each layer has at least one node from a unique /16 and family.
3. Build paths in a strict order, from last hop towards guard. If you can't build a path with this ordering, start over with a sampled guard. (With rule #1 and #2, this should be very rare and should mean that a guard is marked down locally but still marked up in the consensus.)
4. No guards as exits (not needed, but do it anyway for other reasons).
Then under these rules, you decide to use a new primary guard if:
0. A guard leaves the consensus: replace it with a new primary guard.
1. Your two primaries are locally down or unusable (i.e., step #3 above fails): temporarily pick a new guard.
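The two conditions for bringing in a new primary guard can be written as a small decision function. This is a hedged Python sketch, not Tor's guard code: the function name, the tuple return value, and the set-based inputs are all hypothetical.

```python
def guard_action(primaries, in_consensus, locally_up):
    """Decide what to do with the primary guard set.

    primaries    -- names of the two primary guards
    in_consensus -- set of relay names present in the current consensus
    locally_up   -- set of relay names this client can currently use
    """
    # Condition 0: a primary that left the consensus is replaced for good.
    missing = [g for g in primaries if g not in in_consensus]
    if missing:
        return ("replace", missing)
    # Condition 1: both primaries unusable locally, so borrow a guard for now.
    if not any(g in locally_up for g in primaries):
        return ("temporary", [])
    # Otherwise keep using whichever primary works.
    return ("keep", [])

primaries = ["G1", "G2"]
one_down = guard_action(primaries, {"G1", "G2"}, {"G2"})
both_down = guard_action(primaries, {"G1", "G2"}, set())
dropped = guard_action(primaries, {"G2"}, {"G2"})
```

Note that a single locally-down guard yields "keep": the client leans on the remaining primary rather than exposing itself to a third relay.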
However, while removing path restrictions will solve the immediate problem, it will not address other instances where Tor temporarily opts to use a second guard due to congestion, OOM, or failure of its primary guard, and we're still running into bugs where this can be adversarially controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George has been putting some energy into looking more at them. The bugs that we don't understand are especially worrying, since it's hard to know how bad they are. Moving to two guards might put a bit of a bandaid on the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny service to clients at a particular guard, *without* letting those clients move to a second guard. This enables confirmation attacks, and these confirmation attacks can be extended to guard discovery attacks by DoSing guards one at a time until an onion service fails.
I would find non-onion-service examples more compelling here, since I want to avoid falling back into the "well, onion services need special treatment to be safe, so we have to choose between hurting normal clients and hurting onion services" trap.
How is this for an alternative scenario to be considering: the attacking website gives the Tor Browser user some page content that causes the browser to initiate periodic events. Then it starts congesting guards one at a time until the events stop arriving.
Are those two scenarios basically equivalent in terms of the confirmation attacks you are worrying about? I hope yes, and now I can stop getting distracted by wondering if going to this effort is worth it only to protect onion services? :)
Yes.
But I wonder if we're looking at this backwards, and the primary question we should be asking is "How can we protect the transition between guards?" Then one of the potential answers to consider is "Maybe we should start out with two guards rather than just one." Framing it that way, are there more options that we should consider too? For example, removing the ability of the non-local attacker to trigger a transition? Then there would still be visibility of a transition, but the (non-local) attacker can't impact the timing of the transition. How much does that solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do, it means that there will be a risk of attacks that can confirm guard choice, because the downtime during this transition can never be hidden without at least some redundancy.
How's this for another option: clients have two guards, but they have a first guard and a backup guard. They do the traffic padding to both of them, to ensure continuous netflow sessions in their local ISP's logs. But they try to send most of their traffic over the first guard, thus avoiding most of the "increased surface area" concerns about using two guards at once. And we try to reduce the frequency of situations where they can't use their first guard. But in the "transition" situations that we decide we need to keep, they use their backup guard, and it's already available and ready and that netflow session is already active in the eyes of their ISP.
This approach isn't conflux (yet), but it's not incompatible with later changing things so we do conflux.
It also doesn't get us the lower variance of performance that having two equally used guards would get us. But I am ok with that for now, at least until somebody has done some performance analysis to show that we're really suffering now and we would stop suffering then.
FYI, we actually do have one form of this info in figure 10 of https://www.freehaven.net/anonbib/cache/wpes12-cogs.pdf
We get the largest performance gains from going from one guard to two, in terms of reducing the variance (flatness) of that CDF.
Qualitatively, this means way fewer users who try Tor and experience a very slow Tor, telling their friends that it is too slow and should not be used. This is a real thing. Web UX folks have found that it happens with perf variances in the sub-second range with websites.
It adds load onto the relays, by almost doubling the number of sockets used by guards for clients, and also by adding more bandwidth load from the padding cells to/from the backup guard. (How much bandwidth load is this, per client?)
And it doesn't actually provide as much "real" cover traffic onto the backup guard in most situations, so somebody who can look more thoroughly at the traffic flows will still be able to distinguish a transition event from the first to the backup. Maybe that's a problem? Or maybe the netflow level adversary that we declared in the threat model can't do that, and a real attacker would be able to see the traffic details anyway, so we're fine^W^Wno worse off than before?
There are a couple things here that make me think we may still be worse off.
1. The netflow padding is not designed to simulate client traffic. It is designed to aggregate client traffic together over time in the adversary's logs. Instead of seeing a discrete "520KB xfer in this 15 second period, 80KB in that one, and 2300KB in that one, and then silence for 25 minutes", the adversary records "2900KB traffic total in this half hour". For this aggregation to help, there really needs to be other traffic during that half hour. This is why I keep saying that more concurrent activity is better than only using the second guard sometimes. (WTF-PAD could do things like you describe above, but we need to program histograms+state machines for that).
2. Detection of when to switch to this second guard seems complicated and error prone, and if it results in unavailability, it is strictly worse. If it switches to the second guard at the first sign of RESOURCELIMIT and path selection issues, well, then you're adding a lot of complexity for how much benefit (and also complexity that could be manipulated by the adversary).
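The aggregation effect in point 1 can be checked with a few lines of Python. The burst sizes are the ones from the example above; the timestamps and the bucketing function are assumptions added purely to make the arithmetic concrete.

```python
def bucket_totals(bursts, window_s):
    """Sum transfer sizes (KB) into fixed-width time windows.

    bursts   -- list of (start_time_in_seconds, kilobytes)
    window_s -- width of one log record's time window, in seconds
    """
    totals = {}
    for start_s, kb in bursts:
        bucket = start_s // window_s
        totals[bucket] = totals.get(bucket, 0) + kb
    return totals

# Three bursts in consecutive 15-second periods (timestamps are assumed),
# followed by silence for the rest of the half hour.
bursts = [(0, 520), (15, 80), (30, 2300)]

fine_grained = bucket_totals(bursts, 15)     # what unpadded flow records expose
half_hour = bucket_totals(bursts, 30 * 60)   # what the padded connection exposes
```

With 15-second records the three bursts stay distinct; with one half-hour record they collapse into a single 2900KB total, which only hides anything if other traffic shares that window.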
Whereas that first risk does seem plausible to me -- worth trying to reduce. I think we should start by enumerating as many scary scenarios as we can (where scary means "currently we would shift away from our first guard"), and then fix as many of them as we can. Then we should look at the remaining scenarios where we would switch over to using our backup guard (like, when our first guard isn't able to build new circuits for us), and decide if the cost of the additional load on the network is worth hiding that transition timing from a netflow-level client-side-ISP adversary. I can see the answer being "yes, it's worth it", but I think it will be useful to have a good handle on which transition scenarios remain.
Well, "fixing" the largest, most frequent, and adversary controlled classes of these requires:
1. Removing path restrictions.
2. Recognizing DoS attacks and differentiating them from bad network conditions.
#2 is what worries me. Any solution to #2 that is agile enough to avoid downtime strikes me as no better than "switch to guard #2 with probability 1/2 after a RESOURCELIMIT or any other circuit failure" (which is what the code would do today with two equal guards), and a hell of a lot more complex (with risk of a downtime signal or adversary path influence if we get it wrong).
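The "probability 1/2" baseline can be illustrated with a toy simulation. This Python sketch assumes, purely for illustration, that each circuit attempt picks one of two equal guards uniformly and independently; it is not a model of Tor's actual guard selection code, and the function name and seed are arbitrary.

```python
import random

def switch_fraction(trials, seed=0):
    """Fraction of retries that land on the other guard after a failure."""
    rng = random.Random(seed)
    guards = ("guard1", "guard2")
    switched = 0
    for _ in range(trials):
        failed_on = rng.choice(guards)   # circuit that hit RESOURCELIMIT etc.
        retry_on = rng.choice(guards)    # independent uniform pick for the retry
        switched += retry_on != failed_on
    return switched / trials

frac = switch_fraction(100_000)
```

Under that assumption the retry moves to the other guard about half the time with no detection logic at all, which is the bar any smarter DoS-recognition scheme would have to beat.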