I realize we're in the middle of the Christmas / New Year dead week, but it would be great if one of the developers could say something (anything) about the ongoing denial-of-service attacks.
My node crashed a second time a few days back, and while the second iteration of hardening appears to have held, I see many others cracking under the stress. If one opens a control channel and issues
setevents extended orconn
volumes of
650 ORCONN ... LAUNCHED ID=nnnn
650 ORCONN ... FAILED REASON=CONNECTREFUSED NCIRCS=x ID=nnnn
messages appear. At first I thought these were operators with some new mitigation, but I eventually realized these are relays that have crashed and have not yet dropped out of the consensus.
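For anyone who wants to watch these events programmatically, something like the following stem sketch will tally them. This is illustrative only and untested; it assumes a local ControlPort on 9051 with the usual cookie authentication, and the once-a-minute summary is just one way to count LAUNCHED vs. FAILED/CONNECTREFUSED events.

    # Illustrative sketch, not part of tor: tally ORCONN events via stem.
    import collections
    import time

    from stem.control import Controller, EventType

    counts = collections.Counter()

    def handle_orconn(event):
        # event.action is LAUNCHED/CONNECTED/FAILED/CLOSED; event.reason
        # (e.g. CONNECTREFUSED) is only set on FAILED/CLOSED events.
        key = event.action if not event.reason else '%s/%s' % (event.action, event.reason)
        counts[key] += 1

    with Controller.from_port(port=9051) as controller:
        controller.authenticate()  # cookie auth by default
        controller.add_event_listener(handle_orconn, EventType.ORCONN)
        while True:
            time.sleep(60)
            print(dict(counts))    # rough per-minute summary
            counts.clear()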
On Sat, Dec 30, 2017 at 03:33:23PM -0500, starlight.2017q4@binnacle.cx wrote:
I realize we're in the middle of the Christmas / New Year dead week, but it would be great if one of the developers could say something (anything) about the ongoing denial-of-service attacks.
Still alive! Some of us just finished 34c3, and some others of us are trying to be offline for the holidays.
Here are some hopefully useful things to say:
(0) Thanks everybody for your work keeping the network going in the meantime! I see that the total number of relays has dropped off a tiny bit: https://metrics.torproject.org/networksize.html but the overall capacity and load on the network has stayed about the same: https://metrics.torproject.org/bandwidth.html So I wouldn't say the sky is falling at this point.
(1) I don't currently have any reason to think this is an intentional denial-of-service attack. Rather, the overloading is a consequence of the 1million+ new Tor clients that appeared a few weeks ago: https://metrics.torproject.org/userstats-relay-country.html This point shouldn't make you feel completely better, since even an incidental denial-of-service can cause real problems. But the response when modeling it as a malicious attack should be quite different from the response when modeling it as emergent pain from a series of bugs or poor design choices. Whatever these new clients are doing, the Tor design + the current Tor network aren't prepared to handle doing it at that scale.
(2) Now, it is very reasonable to wonder what 1M new Tor clients are doing, especially if they are all running on a handful of IP addresses at hosting facilities like Hetzner and OVH. These are not all new humans running Tor Browser.
(2b) If anybody has great contacts at Hetzner or OVH and can help us get a message to whoever is running these clients, that would be grand. ("Hi, did you know that you're hurting the Tor network? The Tor people would love to talk to you to help you do whatever it is you're trying to do, in a less harmful way.")
(3) I took some steps on Dec 22 to reduce the load that these clients (well, all clients) are putting on the network in terms of circuit creates. It seems like maybe it helped a bit, or maybe it didn't, but I'm the only one who has posted any stats for comparison. You can read more here: https://trac.torproject.org/24716
(4) David identified some bugs and missing features that could be causing extra pain in this situation:
https://trac.torproject.org/24665
https://trac.torproject.org/24666
https://trac.torproject.org/24667
https://trac.torproject.org/24668
https://trac.torproject.org/24671
https://trac.torproject.org/24694
A few of these were fixed in the latest 0.3.2.8-rc release, but some of them involve design work, not just bug fixes, and some of those changes are probably not a wise idea to stick into a late-stage release candidate (or backport to an earlier stable). So, think of this as a great opportunity for us to fix some scalability issues that weren't issues until this new situation appeared. :)
setevents extended orconn
volumes of
650 ORCONN ... LAUNCHED ID=nnnn
650 ORCONN ... FAILED REASON=CONNECTREFUSED NCIRCS=x ID=nnnn
messages appear. At first I thought these were operators with some new mitigation, but I eventually realized these are relays that have crashed and have not yet dropped out of the consensus.
Hey, nice one. I've opened a ticket to try to improve that situation too: https://trac.torproject.org/24767
--Roger
At 18:25 12/30/2017 -0500, Roger Dingledine wrote:
Thank you Roger for your detailed reply.
I have some observations:
1) Aside from the newly arrived big-name-hoster instances, each generating _hundreds_of_thousands_ of connection requests _per_guard_per_day_, an additional contingent of dubious clients exists: hundreds of scattered client IPs behave in a distinctly bot-like manner and seem a likely source of the excess circuit-extend activity. These IPs have been active since late August of this year.
2) Intervals of extreme circuit-extend activity come and go in patterns that look like attacks to my eyes. In one such interval my guard relay was so overloaded before crashing that no normal user circuits could be created whatsoever. Nothing close to that has ever happened before.
3) I run an exit on a much more powerful machine. Normally the exit does not complain "assign_to_cpuworker failed," but recently the exit was attacked two different ways in rapid succession: first it was hit with a DDOS packet-saturation blast calibrated to overload the network interface but not so strong as to trigger the ISP's anti-DDOS system (which works well); the first attack had little effect. Then within two hours the exit was hit with a singular and massive circuit-extend attack that pegged the crypto-worker thread, generating thousands of "assign_to_cpuworker failed" messages. Both attacks degraded traffic flow noticeably but did not severely impact the exit. The attacker gave up (or accomplished their goal), presumably moving on to other targets.
4) Aside from "assign_to_cpuworker failed" overloading, the recent aggravation commenced with a "sniper attack" against my guard relay that resulted in a Linux OOM kill of the daemon. I brought it back up with a more appropriate MaxMemInQueues setting and they tried again exactly two times, then ceased. I am certain it was a sniper attack due to the subsequent attempts, and it appears the perpetrator was actively and consciously engaged in attacking a selected target.
https://trac.torproject.org/projects/tor/ticket/24737
Here are my two cents: the current stress activity is one or both of 1) a long-running guerrilla campaign to harass Tor relay operators and the Tor network, calibrated to avoid attracting an all-hands mitigation and the associated bad press, and 2) an effort to deanonymize hidden services with various guard-discovery and guard-substitution techniques.
In light of the above I suggest adding support for circuit-extend rate-limiting of some kind or another. I run Tor relays to help regular flesh-and-blood users, not to facilitate volume traffic/abuse initiated by dubious actors. I wish to favor the former and have no qualms hobbling and blocking the latter.
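To make the suggestion concrete, the kind of thing I have in mind is a simple per-source token bucket. The sketch below is purely illustrative: the RATE and BURST numbers are invented, and this is not how (or whether) tor itself implements anything today.

    # Purely illustrative token-bucket sketch of per-source circuit-extend
    # rate-limiting; RATE and BURST are made-up numbers.
    import time

    RATE = 50    # sustained extend requests per second allowed per source
    BURST = 200  # short-term burst allowance per source

    buckets = {}  # source id (e.g. previous-hop identity) -> (tokens, last_seen)

    def allow_extend(source):
        tokens, last = buckets.get(source, (BURST, time.monotonic()))
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill over elapsed time
        if tokens < 1.0:
            buckets[source] = (tokens, now)
            return False  # over budget: drop or deprioritize this request
        buckets[source] = (tokens - 1.0, now)
        return True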
On 12/31/2017 01:36 PM, starlight.2017q4@binnacle.cx wrote:
first it was hit with a DDOS packet-saturation blast calibrated to overload the network interface but not so strong as to trigger the ISP's anti-DDOS system (which works well); the first attack had little effect. Then within two hours the exit was hit with a singular and massive circuit-extend attack
IIRC I have observed this type of attack now and then at my exit over the last few years.
At 07:36 12/31/2017 -0500, I wrote:
. . . suggest adding support for circuit-extend rate-limiting of some kind or another. . .
Further in support of the request, for _12_hours_ preceding the most recent crash, the daemon reported:
Your computer is too slow to handle this many circuit creation requests. . . [450043 similar message(s) suppressed in last 60 seconds]
and for the attack on the fast exit machine:
[1091489 similar message(s) suppressed in last 60 seconds]
I see no reason _any_ router should _ever_ have to handle this volume of circuit requests. These are DoS attacks, no doubt whatsoever. A rate limit is needed to mitigate the problem.
At 07:36 12/31/2017 -0500, I wrote:
. . . suggest adding support for circuit-extend rate-limiting of some kind or another. . .
also:
Heartbeat: Tor's uptime is 10 days 0:00 hours, with 115583 circuits open. I've sent 5190.11 GB and received 5048.62 GB. Circuit handshake stats since last time: 538253/637284 TAP, 5878399/5922888 NTor.
Heartbeat: Tor's uptime is 10 days 6:00 hours, with 179761 circuits open. I've sent 5314.34 GB and received 5193.56 GB. Circuit handshake stats since last time: 34639/34885 TAP, *** 18741983/144697651 NTor. ***
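For anyone who wants to keep an eye on these ratios, the handshake counts can be pulled straight out of the Heartbeat lines in the notice log. A rough sketch follows; the regex assumes the standard Heartbeat wording (the *** above is my added emphasis, not part of the log line).

    # Rough sketch: print completed/requested handshake ratios from a tor
    # notice log, based on the Heartbeat wording quoted above.
    import re
    import sys

    PAT = re.compile(r'Circuit handshake stats since last time: '
                     r'(\d+)/(\d+) TAP, (\d+)/(\d+) NTor')

    for line in open(sys.argv[1]):
        m = PAT.search(line)
        if not m:
            continue
        tap_ok, tap_req, ntor_ok, ntor_req = map(int, m.groups())
        print('TAP %d/%d (%.0f%%)   NTor %d/%d (%.0f%%)' % (
            tap_ok, tap_req, 100.0 * tap_ok / max(tap_req, 1),
            ntor_ok, ntor_req, 100.0 * ntor_ok / max(ntor_req, 1)))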
Hi everybody,
Thanks for your patience. Here is a quick update -- hopefully we'll have another one in the upcoming days too.
On Sat, Dec 30, 2017 at 06:25:28PM -0500, Roger Dingledine wrote:
(0) Thanks everybody for your work keeping the network going in the meantime! I see that the total number of relays has dropped off a tiny bit: https://metrics.torproject.org/networksize.html but the overall capacity and load on the network has stayed about the same: https://metrics.torproject.org/bandwidth.html So I wouldn't say the sky is falling at this point.
This part is still true! :)
(1) I don't currently have any reason to think this is an intentional denial-of-service attack.
Actually, I now think there is an intentional component to it. But it's not as straightforward as we might have thought.
I think the pain started because somebody is trying to overload a set of onion services with rendezvous requests. But the real pain for the network as a whole comes when those onion services try to keep up with responding to the rendezvous requests.
Counterintuitively, by generating so many response circuits on the network, they're actually loading down the network enough that many of their response attempts will fail.
For one concrete example, when a v2 (that is, non-nextgen) onion service is building its response rendezvous circuit, the last hop in that circuit (the one to the rendezvous point) uses the old "TAP" circuit handshake, which takes a lot more cpu and is given much lower priority by that relay. So if people are flooding the relay with a bunch of circuit create requests, it will take an extra long time to get around to processing the TAP cell, which is part of why their rendezvous circuits are failing. That explanation also matches how people here observed a spike in TAP cells on their relays.
(2b) If anybody has great contacts at Hetzner or OVH and can help us get a message to whoever is running these clients, that would be grand. ("Hi, did you know that you're hurting the Tor network? The Tor people would love to talk to you to help you do whatever it is you're trying to do, in a less harmful way.")
We talked to some OVH abuse people who are Tor fans, who requested that we file a formal abuse ticket asking for contact. I did, and they passed it on to "the customer", but then the OVH Tor clients mysteriously vanished a few days later, with as far as I can tell no attempts at contact. https://metrics.torproject.org/userstats-relay-country.html?start=2017-10-15...
The Hetzner clients still remain so far: https://metrics.torproject.org/userstats-relay-country.html?start=2017-10-15... and we've actually heard from some of them, who are onion service operators trying to keep up with the load.
But the number of people we have heard from only explains a tiny fraction of the "million plus" new users in Germany, so there are still some good mysteries left.
But again, it seems that (some of) these connections from OVH and Hetzner aren't really the origin of the problem. So defenses that focus only on stopping these "attacks" are leaving out a big piece of the puzzle.
(3) I took some steps on Dec 22 to reduce the load that these clients (well, all clients) are putting on the network in terms of circuit creates. It seems like maybe it helped a bit, or maybe it didn't, but I'm the only one who has posted any stats for comparison. You can read more here: https://trac.torproject.org/24716
Alas, I think these consensus param changes didn't make a huge difference. We still have the main change in place, but I plan to try backing it out sometime soon, to see if we see any difference.
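(For relay operators who want to see which consensus parameters are currently in force, one way, sketched below with stem and assuming a local ControlPort on 9051, is to fetch the cached consensus via GETINFO and print its "params" line. Just a sketch, not an official tool.)

    # Sketch: print the current consensus "params" line via the control port.
    # Uses the dir/status-vote/current/consensus GETINFO key from the
    # control spec; assumes ControlPort 9051 and a cached consensus.
    from stem.control import Controller

    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        consensus = controller.get_info('dir/status-vote/current/consensus')
        for line in consensus.splitlines():
            if line.startswith('params '):
                print(line)
                break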
The other directions we're working on fall into four categories:
A) Bugfixes and design changes to help onion services not overload the network when they're trying to respond to so many requests. That is, ways to make them more efficient at responding to the most actual users with the fewest wasted circuits.
B) Ways to block or throttle jerks who are trying to overload the onion services. I actually think I have a good way to do it for this particular attack, but I'd like to work harder to be a few steps ahead in the arms race first -- that is, move from "bump out these jerks" to "make it harder to use the Tor internal protocols for amplification attacks".
C) Mitigations that relays can use to be more fair with their available resources. This one is actually quite tough from a design perspective, because if one relay is really fast, meaning it could handle all of the create cells it receives, maybe it should nonetheless opt to fail some of them, for the good of the later relays in those circuits.
D) Talking to the humans involved to try to get them to stop and/or make things less bad in the meantime.
Hope that helps explain. More soon as we learn more and/or as we merge in defenses and/or as we get permission to share things from the people who have told us things.
Thanks, --Roger