Hello list,
This thread summarizes and brainstorms various denial of service defences for onion services, following an in-depth discussion with David Goulet.
We've been thinking about denial of service defences for onion services lately. This is a recurrent topic that crops up every once in a while: the last time we had to tackle this issue was in early 2018, when we had to design a DoS mitigation subsystem because the network was crumbling down (https://trac.torproject.org/projects/tor/ticket/24902).
Unfortunately, while the DoS mitigation subsystem improved the health of the network and stopped the DoS attacks back then, it did not address the total space of possible attacks, and onion services and the network are still open to various attacks. The main DoS attack right now is the naive attack of flooding the service with too many introduction requests, and this is the attack that this post is gonna be dealing with.
We don't like DoS attacks because they cause two issues to Tor:
a) They damage the health of the Tor network, impacting every user.
b) They kill availability of legitimate onion services.
In this thread we will handle these two issues independently, as there is no single solution that improves both areas at once. We have some pretty good ideas on (a), but we would appreciate ideas on (b), so feel free to give us your input.
== a) Minimizing the damage to the network caused by DoS attacks:
Most of the damage caused during DoS attacks is from the circuits created by the attacker to introduce/rendezvous to the victim onion service, and also by the circuits created by the victim onion service as it tries to rendezvous with all those clients. An attacker can literally create tens of thousands of introduction circuits in less than a minute, which get amplified by the service launching that many rendezvous circuits. Not good.
Here are a few ways to reduce the damage to the network:
== 1) Rate limiting introduction circuits
There should be a way to rate-limit introductions so that services do not get overwhelmed. There are various places where we can rate-limit: we could rate-limit on the guard-layer, or on the intro-point layer or on the service-layer.
We have already attempted rate-limiting on the guard-layer with #24902, but it's hard to go deeper there because the guard does not know if the circuit is a DoS attacker, or a busy onion service, or 150 Tor users in an airport. We also think that rate-limiting on the service-layer won't do much good since that's too far down the circuit, and we are trying to reduce the operations the service has to do so that it doesn't get overwhelmed (see #15463 for various queue-management approaches for rate-limiting on the service side).
So we've been thinking of rate-limiting on the introduction point layer, since it's a nice soaking point that does not do much right now. See #15516 (comment 28) for a concrete proposal by arma which results in far less damage to the network (since evil traffic does not get carried through to the service-side introduction circuit, and no extra rendezvous circuits get launched), and also a swifter way for legit clients to know that an onion-service circuit won't work.
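To make the idea a bit more concrete, here is a minimal token-bucket sketch of what rate-limiting at the intro point could look like. This is not arma's proposal from #15516 and not Tor code: the class name, the parameters, and the "nack" reply are all assumptions made up for illustration.

```python
class IntroRateLimiter:
    """Hypothetical token bucket guarding one service-side intro circuit."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate_per_sec = rate_per_sec  # sustained introductions per second
        self.burst = burst                # largest burst we let through
        self.tokens = burst               # start with a full bucket
        self.last_refill = now

    def allow(self, now: float) -> bool:
        """Return True if one more introduction may be relayed at time `now`."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_introduce1(limiter: IntroRateLimiter, now: float) -> str:
    """Decide at the intro point whether to pass an introduction along.

    "relay" means forward it to the service; "nack" means refuse it right
    away (assumed reply), so the service never sees it and never launches
    a rendezvous circuit for it.
    """
    return "relay" if limiter.allow(now) else "nack"


if __name__ == "__main__":
    limiter = IntroRateLimiter(rate_per_sec=2.0, burst=3.0, now=0.0)
    # Five introductions arriving at the same instant, then one a second later.
    for t in (0.0, 0.0, 0.0, 0.0, 0.0, 1.0):
        print(t, handle_introduce1(limiter, t))
```

The point of doing this at the intro point is that refused introductions cost the service nothing, and the client learns quickly that it was refused instead of waiting on a rendezvous that will never happen.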
== 2) Stop needless circuit rotation on service-side
Right now, services will rotate their introduction circuits after a certain number of introductions (#26294). This means that during an attack, the service not only needs to handle thousands of fake introduction circuits, but also continuously tear down and recreate introduction circuits and publish new descriptors. See comment 8 on that ticket for a short-term proposal on how to improve the situation here, by not continuously rotating introduction points.
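To get a feel for how much churn count-based rotation causes under attack, here is a back-of-the-envelope sketch. The rotation threshold and the traffic rates are illustrative assumptions, not values taken from #26294, and the function name is made up.

```python
def rotations_per_hour(intro_rate_per_sec: float, rotate_after: int) -> float:
    """Rotations per hour for one intro point when rotation is triggered
    purely by the number of introductions it has received."""
    if rotate_after <= 0:
        raise ValueError("rotate_after must be positive")
    return intro_rate_per_sec * 3600.0 / rotate_after


# Assumed numbers, for illustration only.
ATTACK_RATE = 500.0    # introductions/sec, i.e. about 30,000 per minute
NORMAL_RATE = 0.5      # introductions/sec for a moderately busy service
ROTATE_AFTER = 16384   # introductions before the intro point is rotated

if __name__ == "__main__":
    attack = rotations_per_hour(ATTACK_RATE, ROTATE_AFTER)
    normal = rotations_per_hour(NORMAL_RATE, ROTATE_AFTER)
    print(f"under attack: {attack:.1f} rotations/hour")
    print(f"normal load:  {normal:.2f} rotations/hour")
```

Every one of those rotations means a fresh introduction circuit and a new descriptor upload, which is exactly the extra work we would like the service to avoid while it is being flooded.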
== 3) Optimize CPU performance on the service-side
Right now, onion services during an attack are actually CPU bound. See #30221 for various improvements we can make to the performance of services. However, improving CPU performance might have the opposite effect, since processing cells quicker means that the service will make even more rendezvous circuits.
== 4) Make sure attackers don't take shortcuts around the protocol
We should make sure that attackers don't take shortcuts around the Tor protocol to launch their attacks. Examples here involve requiring a proof-of-rendezvous from clients (#25066), and not allowing single-hop proxies to do introductions (#22689).
The above suggestions (maybe in priority order) are ways we can reduce the damage dealt to the network by DoS attackers. But that still does not make DoS attacks less effective against the targeted service. So here follows the section about improving service availability:
== b) Improve service availability during DoS attacks
Unfortunately, it's really hard to accurately stop DoS attacks in the Tor protocol. There is just no good way to distinguish between innocent clients trying to access content, and a bad actor trying to disable an onion service. Here is the main way we've thought of addressing this issue:
== 1) Binding the application-layer with the Tor introduction-layer
We think that the Tor protocol layer might not be the right place for handling DoS attacks. There are literally million-dollar companies trying hard to tackle this issue on the application-layer, where it's easier since you can do machine learning, give out captchas, zone out users, etc. And that's why we think that the solution to this issue lies on the application-layer and not on the Tor protocol layer.
In particular, a plausible solution here might involve the client embedding application-layer information (e.g. a username/password) in its INTRODUCE1 cell, which then gets passed to the service. The service can then check whether the given username/password should be allowed to connect (see "rendezvous approver" concept at #16059), and allow or reject the connection as it wishes. This way onion service operators can have complicated application-layer software that analyzes the activity of users and decides whether users should be allowed in or not (based on the number of introductions, or their application-layer (web) activity).
+===========================================+
|                Tor network                |
+===========================================+
    ^                                   ^
    |         +-----+                   |
    +-------->| Tor |-------------------+
     INTRO2   | HS  |  rendezvous circuit
      with    +-----+  only if approved
   user/pass    ^ |
                | v
           +----------+         +-------+
           |Rendezvous|<------->|sqlite?|
           |approver  |         +-------+
           +----------+
We think that this is a solution that could allow onion services to continue existing under high-load scenarios, since no rendezvous circuits would be established for unapproved clients during DoS scenarios (and we know that rendezvous circuits are what cause the most CPU/network/availability damage).
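As a rough illustration of the approver side of this idea, here is a small sketch of a membership check backed by sqlite. No such "rendezvous approver" software exists today, so the table layout, the function names, and the way the username/password would reach this code (presumably over the control port) are all assumptions.

```python
import hashlib
import hmac
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory membership database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE members ("
        "username TEXT PRIMARY KEY, salt BLOB NOT NULL, pw_hash BLOB NOT NULL)"
    )
    return db


def _hash_password(password: str, salt: bytes) -> bytes:
    # Salted, slow hash so the database never stores plaintext passwords.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def add_member(db: sqlite3.Connection, username: str, password: str, salt: bytes) -> None:
    db.execute(
        "INSERT INTO members (username, salt, pw_hash) VALUES (?, ?, ?)",
        (username, salt, _hash_password(password, salt)),
    )


def approve_introduction(db: sqlite3.Connection, username: str, password: str) -> bool:
    """Return True only if the credentials carried by the introduction are valid.

    The service would launch a rendezvous circuit only on True.
    """
    row = db.execute(
        "SELECT salt, pw_hash FROM members WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return False
    salt, pw_hash = row
    return hmac.compare_digest(_hash_password(password, salt), pw_hash)


if __name__ == "__main__":
    db = make_db()
    add_member(db, "alice", "correct horse", salt=bytes(16))  # fixed salt for the demo only
    print(approve_introduction(db, "alice", "correct horse"))
    print(approve_introduction(db, "alice", "wrong password"))
```

A real approver would probably also look at per-user introduction counts or application-layer activity before saying yes; the password check here is just the simplest possible policy.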
However, this is a very complicated solution from an engineering perspective given that it requires changes on the client-side (to enhance INTRO1 cells with application-layer data), and also involves various enhancements on the service-side (various control port commands to interact with the (nonexistent) "rendezvous approver" software, which in turn needs to interact with other application-layer software, e.g. sql databases to manage membership).
There are also serious UX concerns about what this would look like on the client-side. Also, how does this interact with client auth? And how does this interact with the intro-point-level rate limiting proposed above (onions should be given the option to disable intro-layer rate limiting)? How is this related to #17254?
All in all, we feel like we have pretty good options for reducing the damage that DoS attacks cause on our network, but we are still lacking easy and practical solutions for ensuring availability of onion services that are under DoS. For the next months, we plan to focus on reducing the damage on the network, since the damage on the network has a cumulative effect as circuits fail and get endlessly retried, to the point where nothing ends up working right. At the same time, we will be thinking of good solutions for keeping a high availability on services that receive DoS attacks.
We would love your feedback and suggestions.
Thanks!