Hi community,
Unfortunately, my otherwise stable tor guard relay has recently lost
its guard flag, once again, due to what I think is a new type of
(D)DoS attack, either directly targeted at my tor relay, or aimed at
some other relays inside the network and facilitated through my
relay.
It all started to go downhill one month ago, on January 10th of this
year: the Linux kernel OOM killer decided to reap the tor process,
multiple times in a row. Tor was exceeding its MaxMemInQueues setting
of 732MB, the optimal value according to tor. The relay runs on a
virtual machine with a dedicated CPU core, 1024 megabytes of physical
RAM, and a swap partition with a size of 512 megabytes (vm.swappiness
initially was 90, I've changed it to 70). Sadly, the machine lacks
hardware AES acceleration, but that's off topic. By completely
sandboxing and isolating the tor process, and then disabling all
mitigations offered by the Linux kernel, I still managed to achieve a
peak throughput of 10mb/s while keeping other users and processes safe
and sound - this made the relay the fastest tor relay belonging to my
AS.. sorry, just bragging ;-)
The first time this happened, it came out of nowhere, so I wasn't
closely monitoring the metrics page of my relay. This led to a
downtime of 3 days, which in turn led to the loss of the guard flag.
Since then, traffic on my relay has been limited to traffic coming
from other relays, it is now exclusively a middle-only relay - it did
not recover from the attack, even though I managed to achieve longer
consecutive uptimes by tweaking MaxMemInQueues, first down to 704,
then 672, and now 640MB.
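For reference, this is a sketch of the corresponding torrc line as it
stands now (only this one option is shown, the rest of my torrc is
omitted):

```
# Cap on the memory tor may use for its queues.
# Lowered in steps from 732 MB (then 704, then 672) to the current value.
MaxMemInQueues 640 MB
```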
The only log entry I ever saw before the tor process got reaped is the
following one:
Feb 04 12:47:08 *hostname_redacted* tor[224]: Feb 04 12:47:08.000
[warn] Your computer is too slow to handle this many circuit creation
requests! Please consider using the MaxAdvertisedBandwidth config
option or choosing a more restricted exit policy. [93409 similar
message(s) suppressed in last 60 seconds]
Feb 04 12:48:08 *hostname_redacted* tor[224]: Feb 04 12:48:08.000
[warn] Your computer is too slow to handle this many circuit creation
requests! Please consider using the MaxAdvertisedBandwidth config
option or choosing a more restricted exit policy. [42527 similar
message(s) suppressed in last 60 seconds]
As you can tell by the date, this was today. After 3 weeks and 1 day,
the process got OOM-killed again this noon. I instantly noticed it,
logged into the machine, installed updates, updated my pacman
mirrorlist, the usual stuff. Then I rebooted the machine, only to log
in 5 minutes later and see tor using 100% of CPU time, with the log
file getting spammed by this warning - it started only 3 seconds after
the relay published its descriptor.
Clearly, this is some sort of targeted attack: either it is aimed at
my relay, or someone is abusing my relay to attack someone or
something else inside or outside the tor network.
I did read the recent information on attacks against the DirAuths, but
apparently a fix has been deployed on all of them, and checking the
bandwidth stats of some of them, it seems to be working, somewhat.
I wonder if this is the culprit here. This machine has been running
tor relays since 2014, and I never had these problems with it before -
even without lowering MaxMemInQueues.
To me, that's just further proof that this is a targeted attack using
my relay. And before you ask: I know some KVM hypervisors are
oversold, but my hosting company stopped selling KVM machines with
mechanical host HDDs a few years ago, so new customers can't be the
reason for all of this.
Is there anything I can do to make my relay as stable as possible
until the attacker(s) stop(s) hammering my relay? This doesn't seem to
get caught by the built-in DoS prevention, so I didn't try tweaking
the associated config options.
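For reference, the options I mean are the DoS* ones from the tor
manual. Below is a rough, untested sketch of what setting them
explicitly in torrc might look like. The numbers are placeholders
chosen for illustration only, not recommendations or anything I have
measured, and I have not checked whether they would even apply to the
traffic I'm seeing:

```
# Explicitly enable the circuit creation mitigation
# (instead of leaving the decision to the consensus).
DoSCircuitCreationEnabled 1
# Illustrative thresholds only: minimum concurrent connections from
# one address before it is considered, then the allowed rate and burst.
DoSCircuitCreationMinConnections 3
DoSCircuitCreationRate 3
DoSCircuitCreationBurst 90

# Explicitly enable the concurrent connection mitigation.
DoSConnectionEnabled 1
# Illustrative limit on concurrent connections from a single address.
DoSConnectionMaxConcurrentCount 50
```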
Maybe someone from the tor team has a patch I can try applying? I'm
not completely up to date on the newest tor shenanigans and pitfalls,
so maybe it has already been posted on this mailing list, but I don't
feel like reading a bazillion messages.
For the time being, MaxMemInQueues will stay at 640 megabytes, to
(hopefully) at least keep the relay up and running, even though
performance, for anyone unlucky enough to build a circuit through it,
will be severely affected. Despite all this, I'm still pushing around
11TB of traffic a month, I just don't know how much of it is
legitimate. It's my duty as a relay operator to guarantee the safety
and usability of my tor relay, so I'm very eager to find a solution
for this.
Unfortunately, this is a live relay. Otherwise I would probably try
developing my own solution for the problem (including publishing it /
making a pull request), but in this case the repeated downtimes and
restarts that working on the source code would require are absolutely
not an option, as they would possibly disrupt hundreds, if not
thousands, of clients.
If anyone could point me in a direction, I'd really appreciate it.
Thank you,
William