Greetings!
As some of you know, a number of onion services were, or still are, under heavy DDoS on the network. More specifically, they are bombarded with introduction requests (INTRODUCE2 cells), each of which forces the service to open a rendezvous circuit, creating a ton of circuits.
This basically leads to a resource exhaustion attack on the service side, with its CPU massively used for path selection, opening new circuits, and continuously handling INTRODUCE2 cells.
Unfortunately, our circuit-level flow control does not apply to the service introduction circuit, which means the intro point is allowed, by the Tor protocol, to send an arbitrarily large amount of cells down the circuit. For the service, this means that even after the DoS has stopped, it would still receive massive amounts of cells, because some are either in flight on the circuit or queued at the intro point, ready to be sent towards the service.
That all being said, our short-term goal here is to add INTRODUCE2 rate-limiting (similar to the Guard DoS subsystem deployed early last year, but much simpler) *at* the intro point. The goal is to soak up the introduction load directly at the intro points, which would help reduce the load on the network overall and thus preserve its health.
Please have a look at https://trac.torproject.org/15516 for some discussions and ongoing code work. We are at the point where we have a branch that rate limits INTRODUCE2 cells at the intro point, but we need to figure out proper values for the rate per second and the allowed burst.
One naive approach is to see how many cells an attacker can send towards a service. George and I conducted an experiment where, with 10 *modified* tor clients bombarding a service much faster than 1 request per second (what vanilla tor does if asked to connect a lot), we saw ~15000 INTRODUCE2 cells at the service in 1 minute. This varies by the thousands depending on different factors, but overall it is a good average for our experiment.
That works out to 15000/60 = 250 cells per second.
Considering that this is an absurd amount of INTRODUCE2 cells (maybe?), we could set a rate of, let's say, a fifth of that, meaning 50 per second, with a burst of 200.
Over the normal 3 intro points a service has, that means 150 introductions per second are allowed, with a total burst of 600. In other words, 150 clients can reach the service every second, up to a burst of 600 at once. This will probably ring alarm bells for very popular services that get 1000+ users a second, so please check the next section.
I'm not that excited about hardcoded network-wide values, which is why the next section is more exciting, but it is much more work for us!
One step further: we have not really decided yet whether this is something we want, or have time, to tackle, but an idea here would be for a service to inform the intro point, using the ESTABLISH_INTRO cell payload, of the parameters it wants for its DoS defenses. So, let's say, a very popular .onion with OnionBalance and 10 intro points could tell its intro points that it wants much higher values for the DoS defenses (or even turn them off).
However, that doesn't change the initial building block of being able to rate limit at the introduction point. As a second step, we can add this new type of ESTABLISH_INTRO cell. It is always dicey to introduce a new cell, since for instance it would leak to the intro point that the service is ">= version". Thus, this needs to be done carefully.
Time for your thoughts and help! :)
Thanks everyone! David
Hi,
On 30 May 2019, at 23:49, David Goulet dgoulet@torproject.org wrote:
Over the normal 3 intro points a service has, it means 150 introduction per-second are allowed with a burst of 600 in total. Or in other words, 150 clients can reach the service every second up to a burst of 600 at once. This probably will ring alarms bell for very popular services that probably gets 1000+ users a second so please check next section.
Do we know how many introduce cells are sent to popular services?
How can the operators of these services find out their current introduce rate?
T
On 31 May (00:46:56), teor wrote:
Hi,
On 30 May 2019, at 23:49, David Goulet dgoulet@torproject.org wrote:
Over the normal 3 intro points a service has, it means 150 introduction per-second are allowed with a burst of 600 in total. Or in other words, 150 clients can reach the service every second up to a burst of 600 at once. This probably will ring alarms bell for very popular services that probably gets 1000+ users a second so please check next section.
Do we know how many introduce cells are sent to popular services?
How can the operators of these services find out their current introduce rate?
Yes good point.
The only thing we have available is the heartbeat that should read like so:
log_notice(LD_HEARTBEAT,
           "Our onion service%s received %u v2 and %u v3 INTRODUCE2 cells "
           "and attempted to launch %d rendezvous circuits.",
           num_services == 1 ? "" : "s",
           hs_stats_get_n_introduce2_v2_cells(),
           hs_stats_get_n_introduce2_v3_cells(),
           hs_stats_get_n_rendezvous_launches());
Those counters don't get reset, so to get the rate one needs to compare two heartbeats (the default interval is every 6h).
Thus, if any big popular service out there (no need to give the .onion) can tell us the rate they see, that would be grand!
Thanks! David
Hello, can someone answer some questions I have about how these attacks work?
As far as I understand, INTRODUCE2 cells are sent by Introduction Points directly to the Hidden Service. But this only happens after a client sends an INTRODUCE1 cell to the Introduction Point.
Now the question is: do we allow more than 1 INTRODUCE1 per client circuit? If so, why? Or does the attack work because the client makes a new circuit/connection to the intro point each time it sends an INTRODUCE1?
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
On Fri, May 31, 2019 at 08:15:16PM +0200, juanjo wrote:
As far as I understand INTRODUCE2 cells are sent by Introduction Points directly to the Hidden Service. But this only happens after a Client sends the INTRODUCE1 cell to the Introduction Point.
Now the question is, do we allow more than 1 INTRODUCE1 per client circuit? If this is right, why? Or the attack is working because the client makes a new circuit/connection to the I.P. each time for sending a INTRODUCE1?
It's that second one. Some jerk is pretending to be many Tor users, and since it's an anonymity system, it's hard to tell which ones are the jerk and which ones are other users.
For the "oops you can send more than one intro1 cell per intro circuit" bug, see https://bugs.torproject.org/15515 (fixed in Tor 0.2.4.27)
--Roger
It is nice to try to stop this DoS vulnerability at the network design level.
Can we have an estimate of when these anti-DoS features will be released? 0.4.1.x or 0.4.2.x?
And it just came to my mind reading this: to stop these attacks, we could implement some authentication based on Proof of Work or something like that. This means that to launch such an attack, the attacker (at the client level) would have to compute the PoW and would need a lot of computing power, while normal clients/users need almost no change. This is actually what PoW is very useful for.
On Thu, May 30, 2019 at 09:03:40PM +0200, juanjo wrote:
And just came to my mind reading this, that to stop these attacks we could implement some authentication based on Proof of Work or something like that. This means that to launch such an attack the attacker (client level) should compute the PoW and must have many computing power, while normal clients/users don't need almost any change. Actually this is what PoW is very useful.
Check out https://bugs.torproject.org/25066 for more details on this idea.
There are still some interesting design questions to be resolved before it's really a proposed idea.
--Roger
Ok, thanks. I was actually thinking about PoW at the Introduction Point itself, but it would need an added round trip, like some sort of "authentication-based PoW" before allowing the client to send the INTRODUCE1 cell. At least it would make the overhead for clients higher than for the intro point, as the clients would need to compute the PoW function and the intro point only to verify it. So if right now the cost of the attack is "low", we can add an overhead of +10 to the client and only +2 to the intro point (for example), and the hidden service doesn't need to do anything.
I will write down my idea and send it here.
juanjo juanjo@avanix.es writes:
Ok, thanks, I was actually thinking about PoW on the Introduction Point itself, but it would need to add a round trip, like some sort of "authentication based PoW" before allowing to send the INTRODUCE1 cell. At least it would make the overhead of clients higher than I.P. as the clients would need to compute the PoW function and the I.P. only to verify it. So if right now the cost of the attack is "low" we can add an overhead of +10 to the client and only +2 to the I.P. (for example) and the hidden service doesn't need to do anything.
Also see the idea in (b) (1) here: https://lists.torproject.org/pipermail/tor-dev/2019-April/013790.html and how it couples with the "rendezvous approver" from ticket #16059. Given a generic system there, adding proof-of-work is a possibility.
Another option would be to add the proof-of-work in the public parts of INTRO1 and have the introduction point verify it which is not covered in our email above.
Proof-of-work systems could be something to consider, although tuning a proof-of-work system so that it denies attackers and still allows normal clients to visit (without e.g. burning the battery of mobile clients) is an open problem AFAIK.
George Kadianakis desnacked@riseup.net writes:
Here is how this could work after a discussion with dgoulet and arma on IRC:
1) Service enables DoS protection in its torrc.
2) Service uploads descriptor with PoW parameters.
3) Service sends special flag in its ESTABLISH_INTRO to its intro points that says "Enable PoW defences".
4) Clients fetch the descriptor and parse the PoW parameters, and now need to complete a PoW before they can send a valid INTRO1 cell; otherwise it gets dropped by the intro point.
All of the above seems like it could work for some use cases.
As said above, I doubt there are parameters that would both help against DoS and still let people pleasantly visit such onion services on a battery-constrained mobile phone, but that choice is up to the onion service. The onion service can turn this feature on and off whenever it wants. And mobile clients could also refuse to visit such sites to avoid battery/CPU burn.
All of the above seems doable, but it's significant work. We first need a proposal to discuss, and then there is lots of code to be written...
George Kadianakis desnacked@riseup.net writes:
FWIW, thinking about this more, I think it's quite unlikely that we will find a non-interactive PoW system here (like hashcash) whose parameters would allow a legit client to compute a PoW in a reasonable time frame, while still preventing a motivated attacker with GPUs from computing hundreds or thousands of them in a single second (which can be enough to DoS a service).
We should look into parameters and tuning for non-interactive PoW systems, or we could look into interactive proof-of-work systems like CAPTCHAs (or something else), which would be additional work but might suit us more.
David Goulet dgoulet@torproject.org writes:
Greetings!
<snip>
Hello, I'm here to brainstorm about this suggested feature. I don't have a precise plan forward here, so I'm just talking.
Unfortunately, our circuit-level flow control does not apply to the service introduction circuit which means that the intro point is allowed, by the Tor protocol, to send an arbitrary large amount of cells down the circuit. This means for the service that even after the DoS has stopped, it would still receive massive amounts of cells because some are either inflight on the circuit or queued at the intro point ready to be sent (towards the service).
== SENDME VS Token bucket
So it seems like we are going with a token bucket approach (#15516) to rate-limit introduce cells, even though the rest of the Tor protocol uses SENDME cells. Are we reinventing the wheel here?
That being all said, our short-term goal here is to add INTRODUCE2
rate-limiting (similar to the Guard DoS subsystem deployed early last year) *at* the intro point but much simpler. The goal is to soak up the introduction load directly at the intro points which would help reduce the load on the network overall and thus preserve its health.
== We need to understand the effects of this feature:
First of all, the main thing to note here is that this feature primarily intends to improve network health against DoS adversaries. It achieves this by greatly reducing the number of useless rendezvous circuits opened by the victim service, which in turn improves the health of guard nodes (when guard nodes break, circuits start retrying endlessly, and hell begins).
We don't know how this feature will impact the availability of an attacked service. Right now, my hypothesis is that even with this feature enabled, an attacked service will remain unusable. That's because an attacker who spams INTRO1 cells will always saturate the intro point, and innocent clients with a browser will be very unlikely to get service (kind of like sitting under a waterfall and trying to fill a glass with your spit). That said, with this defense, the service won't be at 100% CPU, so perhaps the innocent clients who manage to sneak in will get service, whereas now they don't anyway.
IMO, it's very important to understand exactly how this feature will impact the availability of the service: if it does not help the availability of the service, then victim operators will be incentivized to disable the feature (or crank up the limits), which means we will not improve the health of the network, which is our primary goal here.
---
== Why are we doing all this?
Another thing I wanted to mention here is the second-order effect we are facing. The only reason we are doing all this is that attackers are incentivized to attack onion services. Perhaps the best thing we could do here is create tools that make denial of service attacks less effective against onion services, which would make attackers stop performing them, and hence we wouldn't need to implement rate limits to protect the network in case they do. Right now, the best things we have in that direction are the incomplete-but-plausible design of [0] and the inelegant 1b from [1].
This is especially true since, to get this rate-limiting feature deployed to the whole network, we need all relays (intro points) to upgrade to the new version, so we are looking years into the future anyway.
[0]: https://lists.torproject.org/pipermail/tor-dev/2019-May/013849.html
     https://lists.torproject.org/pipermail/tor-dev/2019-June/013862.html
[1]: https://lists.torproject.org/pipermail/tor-dev/2019-April/013790.html
One naive approach is to see how much cells an attack can send towards a service. George and I have conducted experiment where with 10 *modified* tor clients bombarding a service at a much faster rate than 1 per-second (what vanilla tor does if asked to connect a lot), we see in 1 minute ~15000 INTRODUCE2 cells at the service. This varies in the thousands depending on different factors but overall that is a good average of our experiment.
This means that 15000/60 = 250 cells per second.
Considering that this is an absurd amount of INTRODUCE2 cells (maybe?), we can put a rate per second of let say a fifth meaning 50 and a burst of 200.
Over the normal 3 intro points a service has, it means 150 introduction per-second are allowed with a burst of 600 in total. Or in other words, 150 clients can reach the service every second up to a burst of 600 at once. This probably will ring alarms bell for very popular services that probably gets 1000+ users a second so please check next section.
I'm not that excited about hardcoded network wide values so this is why the next section is more exciting but much more work for us!
Yes, I'm also very afraid of imposing network-wide values here. What happens to hypothetical onion services that outperform the hard limits we impose, even when they are not being DoSed? The limits above are extremely low compared to normal busy websites on the clearnet, so by activating them we are basically putting hard limits on the adoption of onion services.
Perhaps that's something we want to do anyway, because not knowing how many clients an onion service can support is also not ideal, but we should really think twice (and then again twice) before doing it and also talk to some people who manage busy sites in the onionspace and outside of it.
== What about false positives?
Also, given that the rate limiting happens at the intro point layer, how does a service learn that it's getting DoSed? Are we looking at a special IP->HS cell that says "we are throttling your clients"? How much should we overengineer here?
== What's the ideal client behavior when the limit gets hit?
So given that these hard limits can be hit quite easily by an attacker, what is the client behavior when they get hit? Will normal clients keep retrying intro points until they get service, continuously extending their circuits? This behavior is particularly important for the availability of the service under this feature.
---
These are some thoughts I have about this. As you can see I'm also confused and thinking about this topic :)
On 06 Jun (20:03:52), George Kadianakis wrote:
David Goulet dgoulet@torproject.org writes:
Greetings!
<snip>
Hello, I'm here to brainstorm about this suggested feature. I don't have a precise plan forward here, so I'm just talking.
Unfortunately, our circuit-level flow control does not apply to the service introduction circuit which means that the intro point is allowed, by the Tor protocol, to send an arbitrary large amount of cells down the circuit. This means for the service that even after the DoS has stopped, it would still receive massive amounts of cells because some are either inflight on the circuit or queued at the intro point ready to be sent (towards the service).
== SENDME VS Token bucket
So it seems like we are going with a token bucket approach (#15516) to rate-limit introduce cells, even tho the rest of the Tor protocol is using SENDME cells. Are we reinventing the wheel here?
I see these as two different approaches.
Relying on the flow control protocol here would be nice in principle, because the intro point would not relay anything until the service asks for more data. But this is heavily influenced by circuit latency. It could be that the service could handle 10 times what it received, but because the SENDME takes 2 seconds to reach the intro point, we lose precious "work time".
I think if we rely on flow control, it will severely impact very popular hidden services that have a nice OnionBalance setup and all. I have no numbers to back that up, but that is my intuition.
The token bucket approach is more flexible, _especially_ with the idea of the ESTABLISH_INTRO cell carrying parameters for the token bucket knobs.
That being all said, our short-term goal here is to add INTRODUCE2
rate-limiting (similar to the Guard DoS subsystem deployed early last year) *at* the intro point but much simpler. The goal is to soak up the introduction load directly at the intro points which would help reduce the load on the network overall and thus preserve its health.
== We need to understand the effects of this feature:
First of all, the main thing to note here is that this is a feature that primarily intends to improve network health against DoS adversaries. It achieves this by greatly reducing the amount of useless rendezvous circuits opened by the victim service, which then improves the health of guard nodes (when guard nodes breaks, circuit start retrying endlessly, and hell begins).
We don't know how this feature will impact the availability of an attacked service. Right now, my hypothesis is that even with this feature enabled, an attacked service will remain unusable. That's because an attacker who spams INTRO1 cells will always saturate the intro point and innocent clients with a browser will be very unlikely to get service (kinda like sitting under a waterfall and trying to fill a glass with your spit). That said, with this defense, the service won't be 100% CPU, so perhaps innocent clients who manage to sneak in will get service, whereas now they don't anyhow.
IMO, it's very important to understand exactly how this feature will impact the availability of the service: If this feature does not help the availability of the service, then victim operators will be incentivized to disable the feature (or crank up the limits) which means that we will not improve the health of the network, which is our primary goal here.
This is an experiment we can easily run: saturate the intro points of a service we control, run a client in a loop trying to reconnect, and see the success rate. I'm also expecting very, very low reachability, but who knows, we could be surprised, and at least we'll have data points.
== Why are we doing all this?
Another thing I wanted to mention here is the second order effect we are facing. The only reason we are doing all this is because attackers are incentived into attacking onion services. Perhaps the best thing we could do here is to create tools to make denial of service attacks less effective against onion services, which would make attackers stop performing them, and hence we won't need to implement rate-limits to protect the network in case they do. Right now the best things we have towards that direction is the incomplete-but-plausible design of [0] and the inelegant 1b from [1].
This is especially true since to get this rate-limiting feature deployed to the whole network we need all relays (intro points) to upgrade to the new version so we are looking at years in the future anyway.
https://lists.torproject.org/pipermail/tor-dev/2019-June/013862.html
My two cents here are that all those features could complement each other over time. Proof-of-work and rate limiting can work well together.
But at this juncture, what I most want fixed is the fact that services are being used for an amplification attack. This was disastrous during the 2018 DDoS, saturating guard nodes constantly. We fixed it by adding DoS defenses at the guard level, which stopped the client madness, but not the service side of things.
Soaking up the huge load at the intro point is an easy avenue for us to pursue, with a very direct impact on the health of the network. And it is something we can always disable with a consensus parameter if shit hits the fan.
One naive approach is to see how many cells an attacker can send towards a service. George and I conducted an experiment in which 10 *modified* tor clients bombarded a service at a much faster rate than 1 per second (what vanilla tor does if asked to connect a lot); in 1 minute we saw ~15000 INTRODUCE2 cells at the service. This varies by thousands depending on different factors, but overall that is a good average for our experiment.
This means that 15000/60 = 250 cells per second.
Considering that this is an absurd amount of INTRODUCE2 cells (maybe?), we can set a rate per second of, let's say, a fifth of that, meaning 50, and a burst of 200.
Across the normal 3 intro points a service has, that means 150 introductions per second are allowed, with a total burst of 600. In other words, 150 clients can reach the service every second, up to a burst of 600 at once. This will probably ring alarm bells for very popular services that get 1000+ users a second, so please check the next section.
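For intuition, the rate/burst combo behaves like a standard token bucket. Here is a minimal sketch (illustrative only, not tor's actual implementation) using the rate=50/burst=200 values from above:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills `rate` tokens per second, holds at
    most `burst` tokens. Illustrative sketch, not tor's implementation."""
    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full: allows an initial burst
        self.last = now if now is not None else time.monotonic()

    def allow(self, now=None):
        now = now if now is not None else time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # relay the INTRODUCE2 cell to the service
        return False      # over the limit: drop / NACK

# With rate=50, burst=200: an idle bucket absorbs 200 cells at once,
# then admits ~50 per second afterwards.
bucket = TokenBucket(rate=50, burst=200, now=0.0)
absorbed = sum(bucket.allow(now=0.0) for _ in range(300))
print(absorbed)  # 200: only the burst gets through instantly
```

Note the per-intro-point nature of the limit: with 3 intro points a service effectively gets 3 independent buckets, hence the aggregate 150/s and 600-burst figures above.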
I'm not that excited about hardcoded network-wide values, which is why the next section is more exciting, but much more work for us!
Yes, I'm also very afraid of imposing network-wide values here. What happens to hypothetical onion services that outperform the hard limits we impose, even when they are not DoSed? The limits above are extremely low compared to normal busy websites on the clearnet, so by activating them we are basically putting hard limits on the adoption of onion services.
Perhaps that's something we want to do anyway, because not knowing how many clients an onion service can support is also not ideal, but we should really think twice (and then again twice) before doing it and also talk to some people who manage busy sites in the onionspace and outside of it.
They need to be at least consensus parameters so the entire network can adapt if the default values end up being very bad or, worse, ineffective.
Second thing is that I'm thinking more and more that this feature is not complete/useful without a way for the service operator to have control over those knobs. Fortunately, we have #30790 in the pipe for this.
== What about false positives?
Also given that the rate limiting happens on the intro point layer here, how does a service learn that it's getting DoSed? Are we looking at a special IP->HS cell that says "we are throttling your clients"? How much to overengineer here?
For now, it would go unnoticed by the operator, which I'm not that worried about. The likely scenario here is that users start complaining to the service operator that they can't reach it.
== What's the ideal client behavior when the limit gets hit?
So given that these hard limits can be hit quite easily by an attacker, what is the client behavior when they are hit? Will normal clients keep retrying intro points until they get service, continuously extending their circuits? This behavior is particularly important for the availability of the service under this feature.
The code right now, in #15516, will send a NACK. The reason is that we want legit clients to re-extend instead of creating a new intro circuit. More efficient and less pressure on the network.
After getting NACKed by all introduction points, the client will stop retrying. It will be allowed to retry when the "failure cache" cleans up, which currently happens after a 5-minute timeout, or if new intro points are found in a new descriptor.
I'm in favor of the re-extend option here, which is the behavior clients encounter in normal circumstances. It is also the one that creates less pressure.
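The failure-cache behavior described above can be sketched as follows (hypothetical names and structures, not tor's actual implementation): a NACKed intro point is avoided until its entry times out after 5 minutes, or until a fresh descriptor replaces the intro points.

```python
FAILURE_TIMEOUT = 5 * 60  # seconds; the 5-minute cleanup described above

class IntroFailureCache:
    """Illustrative sketch of the client-side failure cache; names and
    layout are made up for illustration, not tor's data structures."""
    def __init__(self):
        self.nacked = {}  # intro point id -> time of last NACK

    def note_nack(self, intro_id, now):
        self.nacked[intro_id] = now

    def usable_intro_points(self, all_intro_ids, now):
        # Expired entries are cleaned up, re-allowing retries.
        self.nacked = {i: t for i, t in self.nacked.items()
                       if now - t < FAILURE_TIMEOUT}
        return [i for i in all_intro_ids if i not in self.nacked]

    def new_descriptor(self, new_intro_ids):
        # A fresh descriptor with new intro points also re-allows retries.
        self.nacked = {i: t for i, t in self.nacked.items()
                       if i in new_intro_ids}

cache = IntroFailureCache()
for ip in ("A", "B", "C"):          # all three intro points NACK us
    cache.note_nack(ip, now=0)
print(cache.usable_intro_points(["A", "B", "C"], now=10))   # []
```

After the timeout elapses (or a new descriptor arrives), `usable_intro_points` returns the full list again and the client may retry.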
Cheers! David
On 30 May (09:49:26), David Goulet wrote:
Greetings!
[snip]
Hi everyone,
I'm writing here to update on where we are about the introduction rate limiting at the intro point feature.
The branch of #15516 (https://trac.torproject.org/15516) is ready to be merged upstream which implements a simple rate/burst combo for controlling the amount of INTRODUCE2 cells that are relayed to the service.
As previously detailed in this thread, the default values are a rate of 25 introductions per second and a burst of 200. These values can be controlled by consensus parameters, meaning they can be changed network-wide.
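Conceptually, the consensus-parameter mechanism is just a lookup with hardcoded fallbacks; a small sketch (the parameter names below are illustrative, not necessarily the ones used in the tor source):

```python
# Hardcoded fallback defaults, used when the network consensus does not
# override them. Names here are illustrative placeholders.
DEFAULTS = {
    "DoSIntroRatePerSec": 25,
    "DoSIntroBurstPerSec": 200,
}

def effective_param(name, consensus_params):
    """Use the network-wide consensus value if present, else the default."""
    return consensus_params.get(name, DEFAULTS[name])

# No consensus override: the hardcoded default applies.
print(effective_param("DoSIntroRatePerSec", {}))                       # 25
# Directory authorities vote a lower rate network-wide:
print(effective_param("DoSIntroRatePerSec",
                      {"DoSIntroRatePerSec": 10}))                     # 10
```

This is what lets the whole network adapt quickly if the defaults turn out to be badly chosen, without shipping a new tor release.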
We first asked big service operators. I'm not going to detail the values they provided us in private, but from what they gave us, those defaults are quite large enough to sustain heavy traffic.
The second thing we did was experimental testing to see how CPU usage and availability are affected. We tested this with 3 _fast_ introduction points and then with 3 rate-limited introduction points.
The good news is that once the attack stops, the rain of introduction requests to the service stops very quickly.
With the default rate/burst values, on an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (8 cores), the tor service CPU doesn't go above ~60% (on a single core), and it drops almost to 0 as soon as the attack ends.
The bad news is that availability is _not_ improved. One of the big reasons is that the rate-limit defenses, once engaged at the intro point, send back a NACK to the client. A vanilla tor client will stop using an introduction point for 120 seconds if it gets 3 NACKs from it. This leads to tor quickly giving up on the connection attempt and telling the client that the .onion is unreachable.
We hacked a tor client to play along and ignore the NACKs, to see how much time it would take to reach the service. On average, a client needed roughly 70 seconds and more than 40 NACKs.
However, it varied a _lot_ during our experiments with many outliers from 8 seconds with 1 NACK up to 160 seconds with 88 NACKs. (For this, the SocksTimeout had to be bumped quite a bit).
There is an avenue of improvement here: make the intro point send a specific NACK reason (like "under heavy load"), which the client would interpret as "I should retry soon-ish", possibly letting it connect after many seconds (or until the SocksTimeout).
More bad news there! We can't do that anytime soon because of a bug that basically crashes clients if an unknown status code (that is, a new NACK value) is sent back: https://trac.torproject.org/30454. So yeah... quite unfortunate, but also a superb reason for everyone out there to upgrade :).
One piece of good news: having fast intro points instead of slow ones doesn't seem to change much about the overall load on the service, so for now our experiment shows it doesn't matter.
Overall, this rate limit feature does two things:
1. Reduce the overall network load.
Soaking the introduction requests at the intro point helps avoid the service creating pointless rendezvous circuits which makes it "less" of an amplification attack.
2. Keep the service usable.
The tor daemon doesn't go in massive CPU load and thus can be actually used properly during the attack.
The problem with (2) is availability: it is close to impossible for a legit client on vanilla tor to reach the service without lots of luck. However, if the tor daemon were configured with 2 .onions, one public and one private with client authorization, then the second .onion would be totally usable because the tor daemon is not CPU-overloaded.
The third thing we did: in order to make this feature a bit more "malleable", we are working on https://trac.torproject.org/30924, which is proposal 305.
In short, torrc options are added so an operator can change the rate/burst that the intro points will use. We can do that via the ESTABLISH_INTRO cell, which gets an extension defining the DoS defense parameters (proposal 305).
That way, a service operator can disable this feature, or turn the knobs on the rate/burst in order to basically adjust the defenses.
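For a rough picture of what such an ESTABLISH_INTRO extension could look like on the wire, here is a sketch following the type/length/value pattern that proposal 305 describes. The type codes and field sizes below are assumptions for illustration, not the final spec:

```python
import struct

# Illustrative encoding of a DoS-parameters extension for ESTABLISH_INTRO.
# All type codes and sizes here are assumptions, not the final proposal-305
# wire format.
EXT_TYPE_DOS_PARAMS = 1
PARAM_RATE_PER_SEC = 1
PARAM_BURST_PER_SEC = 2

def encode_dos_extension(rate, burst):
    # Each param: 1-byte type + 8-byte unsigned value, network byte order.
    params = (struct.pack("!BQ", PARAM_RATE_PER_SEC, rate)
              + struct.pack("!BQ", PARAM_BURST_PER_SEC, burst))
    body = struct.pack("!B", 2) + params  # N_PARAMS, then the params
    # Extension framing: 1-byte type, 1-byte length, body.
    return struct.pack("!BB", EXT_TYPE_DOS_PARAMS, len(body)) + body

ext = encode_dos_extension(rate=25, burst=200)
print(len(ext))  # 21: 2 bytes framing + 1 byte count + 2 * 9 bytes of params
```

A rate of 0 could then mean "defense disabled", which is one way the operator-controlled knobs described above could be expressed.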
At this point in time, we don't have a good grasp on what happens in terms of CPU if the rate or the burst is bumped up, or even how availability is affected. During our experimentation, we did observe a "sort of" linear progression between CPU usage and rate. But we barely scratched the surface, since the rate was only changed from 25 to 50 to 75, and that is it.
We would require much more experimentation which is something we want to avoid as much as possible on the real network.
Finally, many more changes are cooking. One in particular is https://trac.torproject.org/projects/tor/ticket/26294, which will make tor only rotate its intro points once the number of introduction requests is between 150k and 300k (a random value in that range), where currently it is between 16k and 32k. See the ticket for the benefits here, which mostly help with (1).
There has been much talk about a client PoW (see the proposal 305 thread on this list) which in theory would help out with service availability.
We will also soon merge upstream https://trac.torproject.org/24962, which goes one step further in denying single-hop connections to the HSDir/Intro, in order to shut down as much as possible the Tor2web connections (or any attacker that speeds things up on their side by single-hopping).
We are making progress here... This is really a non-trivial problem, and solutions for service availability are not that simple. Our priority is to protect the network as much as possible and then move to possible solutions for availability.
I'll stop for now. Huge thanks to everyone who provided service logs, ideas, code review and future testers :).
Cheers! David
David Goulet dgoulet@torproject.org writes:
On 30 May (09:49:26), David Goulet wrote:
Greetings!
[snip]
Hi everyone,
I'm writing here to update on where we are about the introduction rate limiting at the intro point feature.
The branch of #15516 (https://trac.torproject.org/15516) is ready to be merged upstream which implements a simple rate/burst combo for controlling the amount of INTRODUCE2 cells that are relayed to the service.
Great stuff! Thanks for the update!
<snip>
The bad news is that availability is _not_ improved. One of the big reasons is that the rate-limit defenses, once engaged at the intro point, send back a NACK to the client. A vanilla tor client will stop using an introduction point for 120 seconds if it gets 3 NACKs from it. This leads to tor quickly giving up on the connection attempt and telling the client that the .onion is unreachable.
We hacked a tor client to play along and ignore the NACKs, to see how much time it would take to reach the service. On average, a client needed roughly 70 seconds and more than 40 NACKs.
However, it varied a _lot_ during our experiments with many outliers from 8 seconds with 1 NACK up to 160 seconds with 88 NACKs. (For this, the SocksTimeout had to be bumped quite a bit).
That makes sense.
So it seems like this change will shift the UX of clients visiting DoSed onion services in a sideways direction (not better/worse), right? Clients will immediately see a "Can't connect" page in their browser since the SOCKS conn will abort after getting 3 NACKs. Is that the case?
This change also affects the network impact of legitimate clients, since they will now immediately try all three introduction points by extending the introduction circuit two times. This means legitimate clients will be slightly more damaging to the network, but the DoS attacker will be much less damaging, and since the DoS attacker causes all the damage here this seems like a net positive change.
There is an avenue of improvement here: make the intro point send a specific NACK reason (like "under heavy load"), which the client would interpret as "I should retry soon-ish", possibly letting it connect after many seconds (or until the SocksTimeout).
More bad news there! We can't do that anytime soon because of a bug that basically crashes clients if an unknown status code (that is, a new NACK value) is sent back: https://trac.torproject.org/30454. So yeah... quite unfortunate, but also a superb reason for everyone out there to upgrade :).
Do we have any view on what's the ideal client behavior here? Is "retrying soon-ish" actually something we want to do? Does it have security implications?
<snip>
Overall, this rate limit feature does two things:
Reduce the overall network load.
Soaking the introduction requests at the intro point helps avoid the service creating pointless rendezvous circuits which makes it "less" of an amplification attack.
I think it would be really useful to get a baseline of how much we "Reduce the overall network load" here, given that this is the reason we are doing this.
That is, it would be great to get a graph with how many rendezvous circuits and/or bandwidth attackers can induce to the network right now by attacking a service, and what's the same number if we do this feature with different parameters.
Keep the service usable.
The tor daemon doesn't go in massive CPU load and thus can be actually used properly during the attack.
The problem with (2) is availability: it is close to impossible for a legit client on vanilla tor to reach the service without lots of luck. However, if the tor daemon were configured with 2 .onions, one public and one private with client authorization, then the second .onion would be totally usable because the tor daemon is not CPU-overloaded.
That's more like a "Keep the service CPU usable, but not the service itself" ;)
<snip>
At this point in time, we don't have a good grasp on what happens in terms of CPU if the rate or the burst is bumped up, or even how availability is affected. During our experimentation, we did observe a "sort of" linear progression between CPU usage and rate. But we barely scratched the surface, since the rate was only changed from 25 to 50 to 75, and that is it.
I wonder how we can get a better grasp on this, given that we are about to deploy it on the real net. Perhaps some graphs showing the effect of these parameters on (1) and (2) above would be useful.
In particular, I think it would be smart, and not a huge delay, to wait until Stockholm before we merge this, so that we can discuss it in person with more people and come up with exact parameters, client behaviors, etc.
Thanks again! :)
On 04/07/2019 12:46, George Kadianakis wrote:
David Goulet dgoulet@torproject.org writes:
Overall, this rate limit feature does two things:
Reduce the overall network load.
Soaking the introduction requests at the intro point helps avoid the service creating pointless rendezvous circuits which makes it "less" of an amplification attack.
I think it would be really useful to get a baseline of how much we "Reduce the overall network load" here, given that this is the reason we are doing this.
That is, it would be great to get a graph with how many rendezvous circuits and/or bandwidth attackers can induce to the network right now by attacking a service, and what's the same number if we do this feature with different parameters.
If you're going to do this comparison, I wonder if it would be worth including a third option in the comparison: dropping excess INTRODUCE2 cells at the service rather than NACKing them at the intro point.
In terms of network load, it seems like this would fall somewhere between the status quo and the intro point rate-limiting mechanism: excess INTRODUCE2 cells would be relayed from the intro point to the service (thus higher network load than intro point rate-limiting), but they wouldn't cause rendezvous circuits to be built (thus lower network load than the status quo).
Unlike intro point rate-limiting, a backlog of INTRODUCE2 cells would build up in the intro circuits if the attacker was sending cells faster than the service could read and discard them, so I'd expect availability to be affected for some time after the attack stopped, until the service had drained the backlog.
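A back-of-envelope way to see this backlog effect (all numbers below are made up for illustration; it relies on the fact, noted earlier in the thread, that circuit-level flow control does not bound what the intro point can queue towards the service):

```python
def drain_time(attack_rate, service_rate, attack_duration):
    """How long after the attack stops does the service keep reading
    queued INTRODUCE2 cells? Rates in cells/sec, duration in seconds.
    Purely illustrative arithmetic, not a model of the real network."""
    # Backlog grows by the excess of arrival rate over drain rate.
    backlog = max(0, attack_rate - service_rate) * attack_duration
    # After the attack, the service drains the backlog at its own rate.
    return backlog / service_rate

# E.g. an attacker sends 250 cells/s, the service discards 100 cells/s,
# and the attack lasts 60 s: 9000 queued cells take a further 90 s to
# drain after the attack stops.
print(drain_time(250, 100, 60))  # 90.0
```

So the drop-at-service variant trades the intro point's immediate NACK for a post-attack tail whose length depends on how far the attack rate exceeds the service's discard rate.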
Excess INTRODUCE2 cells would be dropped rather than NACKed, so legitimate clients would see a rendezvous timeout rather than an intro point failure; I'm not sure if that's good or bad.
On the other hand there would be a couple of advantages vs intro point rate-limiting: services could deploy the mechanism immediately without waiting for intro points to upgrade, and services could adjust their rate-limiting parameters quickly in response to local conditions (e.g. CPU load), without needing to define consensus parameters or a way for services to send custom parameters to their intro points.
Previously I'd assumed these advantages would be outweighed by the better network load reduction of intro point rate-limiting, but if there's an opportunity to measure how much network load is actually saved by each mechanism then maybe it's worth including this mechanism in the evaluation to make sure that's true?
I may have missed parts of the discussion, so apologies if this has already been discussed and ruled out.
Cheers, Michael