Hi community,
Unfortunately, my otherwise stable tor guard relay has recently lost its Guard flag, once again, due to what I think is a new type of (D)DoS attack, either aimed directly at my relay, or at other relays inside the network with my relay used as a conduit.
It all started to go downhill a month ago, on January 10th of this year: the linux kernel OOM killer reaped the tor process multiple times in a row, because tor kept exceeding its MaxMemInQueues setting of 732 MB, the value tor itself suggested as optimal. The relay runs on a virtual machine with one dedicated CPU core (sadly without hardware AES acceleration, but that's off topic - by completely sandboxing and isolating the tor process, and then disabling all mitigations offered by the linux kernel, I still managed a peak throughput of 10 MB/s while keeping other users and processes safe and sound, which made this the fastest tor relay in my AS.. sorry, just bragging ;-)), 1024 megabytes of physical RAM, and a 512 megabyte swap partition (vm.swappiness was initially 90; I've lowered it to 70).
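For reference, the two settings mentioned above map onto configuration like the following; the values are simply the ones from this post, not recommendations:

```
# /etc/tor/torrc - cap the memory tor may use for queued cells
MaxMemInQueues 640 MB
```

```
# /etc/sysctl.d/99-vm.conf - the swappiness change mentioned above
vm.swappiness = 70
```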
The first time this happened, it came out of nowhere, and I wasn't closely monitoring my relay's metrics page - the result was three days of downtime and, with it, the loss of the Guard flag.
Since then, traffic on my relay has been limited to traffic coming from other relays; it is now effectively a middle-only relay. It never recovered from the attack, although I managed longer consecutive uptimes by lowering MaxMemInQueues, first to 704, then 672, and now 640 MB.
The only log entries I ever saw before the tor process got reaped are the following:
Feb 04 12:47:08 *hostname_redacted* tor[224]: Feb 04 12:47:08.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [93409 similar message(s) suppressed in last 60 seconds]
Feb 04 12:48:08 *hostname_redacted* tor[224]: Feb 04 12:48:08.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [42527 similar message(s) suppressed in last 60 seconds]
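As a rough gauge of the flood, the suppressed-message counters in those warn lines can be tallied per minute. A small sketch - the sample lines and host name below are hypothetical stand-ins shaped like the journald output quoted above:

```python
import re

# Hypothetical sample, shaped like the journald lines quoted above.
LOG = """\
Feb 04 12:47:08 host tor[224]: Feb 04 12:47:08.000 [warn] Your computer is too slow to handle this many circuit creation requests! [93409 similar message(s) suppressed in last 60 seconds]
Feb 04 12:48:08 host tor[224]: Feb 04 12:48:08.000 [warn] Your computer is too slow to handle this many circuit creation requests! [42527 similar message(s) suppressed in last 60 seconds]
"""

# Each warn line reports how many duplicates were suppressed in the last minute.
PATTERN = re.compile(r"\[(\d+) similar message\(s\) suppressed in last 60 seconds\]")

def suppressed_counts(text):
    """Return one suppressed-message count per matching warn line."""
    return [int(m.group(1)) for m in PATTERN.finditer(text)]

counts = suppressed_counts(LOG)
print(counts)            # counts per 60-second window
print(sum(counts) / 60)  # rough warnings per second across the sample
```

For the two windows above, that works out to well over two thousand suppressed warnings per second - a useful number to quote when reporting the incident.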
As you can tell by the date, this was today: after 3 weeks and 1 day, the process got OOM-killed again at noon. I noticed immediately, logged into the machine, installed updates, refreshed my pacman mirrorlist, the usual stuff, then rebooted - only to log in 5 minutes later and find tor using 100% of CPU time, with the log file getting spammed by this warning. It started just 3 seconds after the relay published its descriptor.
Clearly, this is some sort of targeted attack: either against my relay itself, or someone is abusing it to attack someone or something else inside or outside the tor network.
I did read the recent information on attacks against the directory authorities, but apparently a fix has been deployed on all of them, and judging by the bandwidth stats of some of them, it seems to be working, somewhat.
I wonder if that is the culprit here; this machine has been running tor relays since 2014, and I never had these problems with it before - even without lowering MaxMemInQueues.
To me, that's just further proof that this is a targeted attack involving my relay. And before you ask: I know some KVM hypervisors are oversold, but my hosting company stopped selling KVM machines backed by mechanical host HDDs a few years ago, so new customers can't be the reason for all of this.
Is there anything I can do to keep my relay as stable as possible until the attacker(s) stop hammering it? The traffic doesn't seem to get caught by the built-in DoS prevention, so I haven't tried tweaking the associated config options.
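In case tweaking them does turn out to be worth a try, the built-in DoS subsystem (available since tor 0.3.3.x) is controlled by torrc options along these lines. The option names are real, but the values below are only illustrative - they roughly mirror the usual consensus defaults and are not tuned advice:

```
# torrc sketch: built-in DoS mitigation (normally driven by consensus parameters)
DoSCircuitCreationEnabled 1
DoSCircuitCreationMinConnections 3
DoSCircuitCreationRate 3
DoSCircuitCreationBurst 90
DoSCircuitCreationDefenseType 2
DoSConnectionEnabled 1
DoSConnectionMaxConcurrentCount 50
DoSConnectionDefenseType 2
```

Note that these defenses key on per-address circuit-creation rates, which may be why a more distributed flood slips past them.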
Maybe someone from the tor team has a patch I could try applying? I'm not completely up to date on the newest tor shenanigans and pitfalls, so maybe this has already been posted on this mailing list, but I don't feel like reading a bazillion messages.
For the time being, MaxMemInQueues will stay at 640 megabytes, to (hopefully) at least keep the relay up and running, even though performance for anyone unlucky enough to build a circuit through it will be severely affected. (Despite all this, I'm still pushing around 11 TB of traffic a month - I just don't know how much of it is legitimate..) It's my duty as a relay operator to guarantee the safety and usability of my tor relay, so I'm very eager to find a solution for this.
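For scale, 11 TB a month works out to a fairly modest sustained average rate - a quick back-of-the-envelope check (decimal terabytes and a 30-day month assumed):

```python
# Back-of-the-envelope: average rate implied by ~11 TB of traffic per month.
TB = 10 ** 12                 # decimal terabyte
monthly_bytes = 11 * TB
seconds_per_month = 30 * 24 * 3600

avg_mbit_s = monthly_bytes * 8 / seconds_per_month / 10 ** 6
print(avg_mbit_s)             # roughly 34 Mbit/s sustained average
```

That average sits comfortably between the observed 16 Mbit/s dips and 80 Mbit/s peaks, so the monthly figure is at least self-consistent.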
Unfortunately, this is a live relay, otherwise I would probably try developing my own solution for the problem (including publishing it / making a pull request). But the repeated downtimes and restarts that working on the source code would require are absolutely not an option here, as they could disrupt hundreds, if not thousands, of clients.
If anyone could point me in a direction, I'd really appreciate it.
Thank you,
William
P.S.: I know it's not an error but a warning - bad wording on my side there.
Right now the relay appears to be semi-stable, still consuming much more memory than I remember from pre-2021 times, but that's fine - nothing dangerous yet.
At one point traffic peaks at 80 Mbit/s; at another it dips down to 16 Mbit/s for many minutes. Not sure whether that's the attacker or simply tor compressing consensus documents.. the log is still being spammed with the warning mentioned above.
Best Regards, William
2021-02-04 17:51 GMT, William Kane ttallink@googlemail.com:
tor-relays@lists.torproject.org