Hello,
I'm doing research on improving Tor security via path selection. This work is with Ryan Wails, Aaron Johnson, Prateek Mittal, and Olivier Pereira. In our work, we ran into load-balancing problems that affected our experiments. We identified the problem to originate in the specification and implementation of Proposal 271, "Another algorithm for guard selection", which was implemented in tor-0.3.0.1-alpha ~2 years ago and is thus now in tor stable. In short, Prop 271 causes guards to be selected with probabilities different than their weights due to the way it samples many guards and then chooses primary guards from that sample. We are suggesting a straightforward fix to the problem, which is, roughly speaking, to choose primary guards in the order in which they were sampled. We have created a patch implementing this fix for the case affecting our experiments, which would improve the current situation. We are further suggesting that Tor apply the technique throughout the guard-selection logic.
In more detail, Prop 271 chooses guards via a multi-step process: 1. It chooses 20 distinct guards (and sometimes more) by sampling without replacement with probability proportional to consensus weight. 2. It produces two subsets of the sample: (1) "filtered" guards, which are guards that satisfy various torrc constraints, and (2) "confirmed" guards, which are guards through which a circuit has been constructed. 3. The "primary" guards (i.e. the actual guards used for circuits) are chosen from the confirmed and filtered subsets. I'm ignoring the additional "usable" subsets for clarity. This description is based on Section 4.6 of the specification (https://gitweb.torproject.org/torspec.git/tree/guard-spec.txt).
The primary guards are selected uniformly at random from the filtered guards when no confirmed guards exist. No confirmed guards appear to exist until some primary guards have been selected, and so when Tor is started the first time the primary guards always come only from the filtered set. The uniformly-random selection causes a bias in primary-guard selection away from consensus weights and towards a more uniform selection of guards. As just an example of the problem, if there were only 20 guards in the network, the sampled set would be all guards and primary guard selection would be entirely uniformly random, ignoring weights entirely. This bias is worse the larger the sampled set is relative to the entire set of guards, and it has a significant effect on Tor simulations, which are typically on smaller networks. We believe that this issue has only a limited effect on Tor currently due to the relatively large number of guards.
However, the issue is potentially worse when there exist confirmed guards. In this case, primary guards are chosen from the intersection of confirmed and filtered guards in the order they were added to the confirmed set. That order seems to be the order in which circuits were successfully created through the guards, which would thus be both somewhat non-deterministic (because it depends on when a circuit successfully finishes) and random (because circuits are only attempted through primary guards, which are initially uniformly-random filtered guards). The non-determinism especially could yield guard selection probabilities that deviate quite a bit from the consensus weights.
This design has both performance and security implications. It potentially reduces the performance of the Tor network by making the load unbalanced. It also affects the correctness of performance analyses done with Shadow simulations, and the error is likely much larger due to the smaller size of simulation networks, which commonly include 100 or fewer guards. The design also reduces Tor's security by increasing the number of clients that an adversary running small relays can observe. In addition, an adversary has to wait less time than it should after it starts a malicious guard to be chosen by a client. This weakness occurs because the malicious guard only needs to enter the sampled list to have a chance to be chosen as primary, rather than having to wait until all previously-sampled guards have already expired.
We propose a solution that fits well within the existing guard-selection algorithm. Our solution is to select primary guards in the order they were sampled. This ordering should be applied after the filtering and/or confirmed guard sets are constructed as normal. That is, primary guards should be selected from the filtered guards (if no guards are both confirmed and filtered) or from the set of confirmed and filtered guards (if such guards exist) in the order they were initially sampled. This solution guarantees that each primary guard is selected (without replacement) from all guards with a probability that is proportional to its consensus weight.
We include a patch that applies this ordering only when primary guards are selected from filtered guards. This fixes the problem in Shadow simulations because confirmed guards never exist when primary guards are being selected. However, to solve all issues in the real Tor network, we recommend that you apply similar logic to the case that primary guards are selected from confirmed guards.
Note that the issue of sampling skewing guard selection seems to have been raised in proposal discussions, which is documented in the proposal (although the proposed solution is less practical):
[Paul Syverson in a conversation at the Wilmington Meeting 2017 says that we should look into how we're doing this sampling. Essentially, his concern is that, since we are sampling by bandwidth at first (when we choose the `sampled` set), then later there is another bias—when
trying to
build circuits (and hence marking guards as confirmed) we select those which completed a usable circuit first (and hence have the lowest latency)—that this sort of "doubly skewed" selection may "snub" some low-consensus-weight guards and leave them unused completely. Thus the issue is primarily that we're not allocating network resources efficiently. Mine and Nick's guard algorithm simulation code never checked what percentage of possible guards the algorithm reasonably allowed clients to use; this would be an interesting thing to check in simulation at some point. If it does turn out to be a problem, Paul's intuition for a fix is to select uniformly at random to obtain the `sampled` set, then weight by bandwidth when trying to build circuits and marking guards as confirmed. —isis]
Best,
Florentin
On Mon, Oct 14, 2019 at 07:56:29AM +0000, Florentin Rochet wrote:
In short, Prop 271 causes guards to be selected with probabilities different than their weights due to the way it samples many guards and then chooses primary guards from that sample.
Agreed. As you say, Paul identified this one a few years ago too. As far as I understand the prop#271 design, I agree that it's an issue.
We are suggesting a straightforward fix to the problem, which is, roughly speaking, to choose primary guards in the order in which they were sampled.
This looks like a good solution to the issue -- the ordering of the guards as we select them is proportional to their weight, so let's just use them in the order that we selected them.
One of the tricky features of the prop#271 guard selection design is that it won't just keep on choosing guards if many are unreachable, but rather it will stop after a while, so a bad ISP can't totally control what guard you pick. I think that feature is left untouched by your design change, since we're choosing from among only the same set as before, just in a different order. But please think about whether that is true.
We have created a patch implementing this fix for the case affecting our experiments, which would improve the current situation. We are further suggesting that Tor apply the technique throughout the guard-selection logic.
Can you help us make sure we think of all the places you've already thought of? :)
We believe that this issue has only a limited effect on Tor currently due to the relatively large number of guards.
That makes sense to me too -- most of the guards that are chosen in the 20 will be fast, so choosing uniformly from them will often give you a fast one.
The design also reduces Tor's security by increasing the number of clients that an adversary running small relays can observe. In addition, an adversary has to wait less time than it should after it starts a malicious guard to be chosen by a client. This weakness occurs because the malicious guard only needs to enter the sampled list to have a chance to be chosen as primary, rather than having to wait until all previously-sampled guards have already expired.
This part makes me wonder about another angle to this problem: proper load balancing when we choose our guards on one date but then make decisions about them on a different date.
For example, if we sample all these guards on day 0, and then use the first guard for a week, and then move to the second guard... but the weights have changed in that time... what will that do to our load balancing? One extreme case would be a relay that has a really high weight for a while, and then later turns out to have much lower bandwidth. It gets into a bunch of guard lists at first (but mostly not #1 since that's how the probabilities work), and then slowly clients shift load to it as their #1 guard goes away.
In an ideal world we would want to take into account current guard weights, when we're shifting from one guard to the next, rather than making that decision way earlier before we actually turn out to need the guards. Maybe that argues for delaying more of the decisions?
Note that this question is about yet another improvement that could be made to the guard part of path selection, and I think it's orthogonal to the improvement you are proposing.
Thanks, --Roger
Hello,
On 14/10/2019 13:29, Roger Dingledine wrote:
On Mon, Oct 14, 2019 at 07:56:29AM +0000, Florentin Rochet wrote:
We are suggesting a straightforward fix to the problem, which is, roughly speaking, to choose primary guards in the order in which they were sampled.
This looks like a good solution to the issue -- the ordering of the guards as we select them is proportional to their weight, so let's just use them in the order that we selected them.
One of the tricky features of the prop#271 guard selection design is that it won't just keep on choosing guards if many are unreachable, but rather it will stop after a while, so a bad ISP can't totally control what guard you pick. I think that feature is left untouched by your design change, since we're choosing from among only the same set as before, just in a different order. But please think about whether that is true.
Yes, so I think this patch is actually improving this goal since the bad ISP has first to exhaust the sampled list to have a chance to get in.
We have created a patch implementing this fix for the case affecting our experiments, which would improve the current situation. We are further suggesting that Tor apply the technique throughout the guard-selection logic.
Can you help us make sure we think of all the places you've already thought of? :)
Sure! This is mostly code cleanup (as the confirmed ordering is not meaningful anymore), tests unit cleanup and a bit of spec rewriting (the reverse order of steps might be preferable, though :)
I would be happy reviewing any final patch. I could also do the patch myself after my current work is submitted.
<skip> > The design also reduces Tor's security by increasing the > number of clients that an adversary running small relays can observe. In > addition, an adversary has to wait less time than it should after it > starts a malicious guard to be chosen by a client. This weakness occurs > because the malicious guard only needs to enter the sampled list to have > a chance to be chosen as primary, rather than having to wait until all > previously-sampled guards have already expired. This part makes me wonder about another angle to this problem: proper load balancing when we choose our guards on one date but then make decisions about them on a different date.
For example, if we sample all these guards on day 0, and then use the first guard for a week, and then move to the second guard... but the weights have changed in that time... what will that do to our load balancing? One extreme case would be a relay that has a really high weight for a while, and then later turns out to have much lower bandwidth. It gets into a bunch of guard lists at first (but mostly not #1 since that's how the probabilities work), and then slowly clients shift load to it as their #1 guard goes away.
In an ideal world we would want to take into account current guard weights, when we're shifting from one guard to the next, rather than making that decision way earlier before we actually turn out to need the guards. Maybe that argues for delaying more of the decisions?
Note that this question is about yet another improvement that could be made to the guard part of path selection, and I think it's orthogonal to the improvement you are proposing.
I guess the amount of work is also dependent of how far we want to go. As you mention, there are still load-balancing problems when "we choose our guards on one date but then make decisions about them on a different date". It is possible to get this fixed by removing the sampled list and, instead, keeping an history a previously sampled guards. When moving to the next guard, we could consider *current* weights and make the decision. The history should resist attacks that try to force clients onto compromised guards, using relays that are part of the history if they're still available (in sample order), and by tracking its size.
That second improvement seems to be a deeper refactoring of the proposal and the code, but it could be interesting, especially when hundreds of relays get an EoL network-reject. There might be some domino effect at play here, because some % of clients got a rotation of their EoL guard.
Is there any argument not to do both?
1) Applying this patch and extending the technique throughout the guard selection logic should be straightforward and solve most of the problem (should be fast to get). 2) Deeper refactoring of the proposal as you mention (note, this shouldn't make the step 1 useless, as the logic should be reused here).
I've opened a trac ticket for 1) https://trac.torproject.org/projects/tor/ticket/32088
Best,
Florentin
On Tue, Oct 15, 2019 at 5:43 AM Florentin Rochet florentin.rochet@uclouvain.be wrote:
I've opened a trac ticket for 1) https://trac.torproject.org/projects/tor/ticket/32088
(This is now proposal 310.)