David Goulet <dgoulet@torproject.org> writes:
On 30 May (09:49:26), David Goulet wrote:
Greetings!
[snip]
Hi everyone,
I'm writing here to update on where we are about the introduction rate limiting at the intro point feature.
The branch of #15516 (https://trac.torproject.org/15516) is ready to be merged upstream which implements a simple rate/burst combo for controlling the amount of INTRODUCE2 cells that are relayed to the service.
Great stuff! Thanks for the update!
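For intuition, the rate/burst combo can be pictured as a standard token bucket. The sketch below is NOT tor's actual implementation from #15516; the class and parameter names are hypothetical, and the burst value is illustrative (25/sec is one of the rates mentioned later in the thread):

```python
import time

class TokenBucket:
    """Sketch of a rate/burst limiter for INTRODUCE2 cells.

    Hypothetical names and values -- only the general mechanism
    (tokens refill at `rate` per second, up to `burst`) is intended."""

    def __init__(self, rate, burst):
        self.rate = rate              # cells relayed per second, steady state
        self.burst = burst            # short-term allowance above the rate
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, never exceeding the burst.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True               # relay this INTRODUCE2 cell to the service
        return False                  # over the limit: drop it (and NACK the client)

# A flood of 1000 back-to-back cells: roughly the burst gets through.
bucket = TokenBucket(rate=25, burst=200)
allowed = sum(bucket.allow() for _ in range(1000))
```

The rejection path is the interesting part for what follows: every cell over the limit turns into a NACK back to the client.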
<snip>
The bad news is that availability is _not_ improved. One big reason is that the rate-limit defense, once engaged at the intro point, sends a NACK back to the client. A vanilla tor client will stop using that introduction point for 120 seconds once it gets 3 NACKs from it. This leads to tor quickly giving up on the connection attempt and telling the client that the .onion is unreachable.
We hacked a tor client to play along and ignore the NACKs, to see how long it would take to reach the service. On average, a client needed roughly 70 seconds and more than 40 NACKs.
However, this varied a _lot_ during our experiments, with many outliers ranging from 8 seconds and 1 NACK up to 160 seconds and 88 NACKs. (For this, the SocksTimeout had to be bumped quite a bit.)
That makes sense.
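A toy probability model makes the asymmetry concrete (the per-attempt success chance is a made-up illustrative number, not a measurement):

```python
# Assume each introduction attempt independently slips under the rate limit
# with probability p. p = 0.02 is an illustrative guess, not measured data.
p = 0.02

# A vanilla client abandons the intro point after 3 NACKs, so it succeeds
# only if one of its first 3 attempts gets through:
p_vanilla_success = 1 - (1 - p) ** 3     # ~0.059

# A client that ignores NACKs keeps retrying; attempts until success follow
# a geometric distribution with mean 1/p:
expected_attempts_patient = 1 / p        # 50 attempts

print(round(p_vanilla_success, 3), expected_attempts_patient)
```

With these made-up numbers the vanilla client fails ~94% of the time, while the patient client needs ~50 attempts, the same ballpark as the 40+ NACKs observed in the experiment.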
So it seems like this change will shift the UX of clients visiting DoSed onion services sideways (not better, not worse), right? Clients will immediately see a "Can't connect" page in their browser, since the SOCKS connection will abort after getting 3 NACKs. Is that the case?
This change also alters the load that legitimate clients put on the network, since they will now immediately try all three introduction points by extending the introduction circuit two times. This means legitimate clients will be slightly more damaging to the network, but the DoS attacker will be much less damaging; and since the attacker causes almost all the damage here, this seems like a net positive change.
There is an avenue of improvement here: make the intro point send a specific NACK reason (like "under heavy load"), which the client would interpret as "I should retry soon-ish", possibly allowing it to connect after many seconds (or until the SocksTimeout is hit).
More bad news there! We can't do that anytime soon because of a bug that crashes clients if an unknown status code (that is, a new NACK value) is sent back: https://trac.torproject.org/30454. So yeah... quite unfortunate, but also a superb reason for everyone out there to upgrade :).
Do we have any view on what's the ideal client behavior here? Is "retrying soon-ish" actually something we want to do? Does it have security implications?
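One candidate for "retry soon-ish" is capped exponential backoff. The sketch below is just one possible policy with made-up numbers, not a proposed design; note also that a fixed schedule could be fingerprintable, which is why the sketch has an (off-by-default) jitter knob:

```python
import random

def backoff_schedule(base=2.0, cap=60.0, attempts=6, jitter=0.0):
    """Seconds a client could wait before each retry after an
    'under heavy load' NACK. All parameters are hypothetical;
    jitter blurs the otherwise fixed, fingerprintable timing."""
    return [min(cap, base * (2 ** i)) + random.uniform(0, jitter)
            for i in range(attempts)]

print(backoff_schedule())  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0] with jitter=0
```

Whether this kind of retry loop is actually desirable (and how it interacts with the SocksTimeout) is exactly the open question above.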
<snip>
Overall, this rate-limit feature does two things:
1) Reduce the overall network load.
Soaking up the introduction requests at the intro point keeps the service from creating pointless rendezvous circuits, which makes this less of an amplification attack.
I think it would be really useful to get a baseline of how much we "Reduce the overall network load" here, given that this is the reason we are doing this.
That is, it would be great to get a graph with how many rendezvous circuits and/or bandwidth attackers can induce to the network right now by attacking a service, and what's the same number if we do this feature with different parameters.
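As a starting point for that baseline, here is a back-of-envelope amplification estimate; every number below is an assumption to be replaced by real measurements:

```python
# Crude amplification estimate in raw bytes. All constants are assumptions.
CELL_BYTES = 514               # wire size of one Tor cell (link protocol v4+)

# One attacker-sent INTRODUCE1 cell...
attacker_bytes = CELL_BYTES

# ...makes the service build a ~3-hop rendezvous circuit. Very roughly,
# assume 2 cells per hop (extend + reply) just to construct it, ignoring
# retries, padding, and the rendezvous handshake itself:
RV_HOPS = 3
CELLS_PER_HOP = 2
induced_bytes = RV_HOPS * CELLS_PER_HOP * CELL_BYTES

amplification = induced_bytes / attacker_bytes
print(amplification)           # 6.0x in raw bytes
```

Even this crude model ignores the relay-side CPU cost of the circuit crypto, which is likely the bigger multiplier; that is what the requested graphs would pin down.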
2) Keep the service usable.
The tor daemon doesn't go into massive CPU overload and thus can actually be used properly during the attack.
The problem with (2) is the availability part: it is close to impossible for a legitimate vanilla tor client to reach the service without lots of luck. However, if, say, the tor daemon were configured with two .onion services, one public and one private with client authorization, then the private one would remain totally usable, since the tor daemon is not CPU-overloaded.
That's more like a "Keep the service CPU usable, but not the service itself" ;)
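For concreteness, the two-onion setup described above could look roughly like this in torrc (paths and ports are hypothetical; for v3 services, client authorization is enabled by dropping `.auth` files into the service's `authorized_clients/` directory rather than via a torrc option):

```
# Public onion service: exposed to everyone (and to the DoS).
HiddenServiceDir /var/lib/tor/public_onion/
HiddenServicePort 80 127.0.0.1:8080

# Private onion service on the same daemon: reachable only by clients
# whose keys are listed in /var/lib/tor/private_onion/authorized_clients/
HiddenServiceDir /var/lib/tor/private_onion/
HiddenServicePort 80 127.0.0.1:8081
```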
<snip>
At this point, we don't have a good grasp of what happens in terms of CPU if the rate or burst is bumped up, or even how availability is affected. During our experimentation, we did observe a "sort of" linear progression between CPU usage and rate, but we barely scratched the surface since the rate was only changed from 25 to 50 to 75.
I wonder how we can get a better grasp of this given that we are about to deploy it on the real network. Perhaps some graphs of the effect of these parameters on (1) and (2) above would be useful.
In particular, I think it would be smart, and not a huge delay, to wait until Stockholm before we merge this, so that we can discuss it in person with more people and come up with exact parameters, client behaviors, etc.
Thanks again! :)