Gordon Morehouse wrote:
Jobiwan Kenobi:
I've been running a relay for about 3 months now. It runs on a 1.6 GHz single-core Atom with hyperthreading and 1 GB of RAM. It's on my home connection; I advertise only 176 KB/sec.
Under normal conditions, it pumps around 100KB/sec, it has around 600 connections, it uses around 20% CPU. Everything looks very healthy.
Lately I have seen a couple of incidents where the number of connections suddenly goes up to over 3000, traffic increases heavily, CPU usage goes well over 150% (out of possible 200). Traffic can go up to between 500 and 1000 KB/sec for long periods of time. Sometimes it seems that my relay just can't take it anymore. In the log, the ratio of TAP handshakes goes wild, and I get clock jump warnings. My clock does not jump. This is Tor hanging while allocating memory.
Welcome to the world of the Raspberry Pi / BeagleBone / CubieBoard operator, except normally we'd have crashed (without some defensive measures) before the clock jump thing - the Pi in particular has a known dodgy "clock."
Today I had a really bad episode, where the box started thrashing. When it became responsive again, Tor was left in a state where it was constantly downloading about 400KB/sec more than it was uploading. Normally I have a little bit more up than down, because I'm a directory server as well. I can not explain having a lot more down than up. I can fantasize that those hangs/clock jumps ("assuming established circuits no longer work") could leave 'half' circuits.
As best I can tell, probably that's a flood of incoming TAP requests or TLS handshakes.
Finally I restarted my relay (which I really don't like to do), and after a while it stabilized. At this point, my router shows a peak of almost 8000 NAT sessions.
Is this normal behavior of the network (esp. the sudden increase in connections) or is this another kind of attack/probe like what we've seen in early September? Is this because this machine is just too underpowered? Should I collect/provide any diagnostics? Have others seen similar events?
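On the diagnostics question, one thing that could be collected is a periodic count of open connections, flagged whenever it jumps well above its recent baseline. Below is a minimal sketch in Python. The function name, the window of 6 samples, the factor of 3, and the sample counts are all hypothetical; only the rough levels (around 600 normally, over 3000 during an incident) come from the description above.

```python
from statistics import median


def find_spikes(samples, window=6, factor=3.0):
    """Return indices of samples exceeding `factor` times the median
    of the `window` samples immediately before them.

    `window` and `factor` are illustrative defaults, not tuned values.
    """
    spikes = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Hypothetical connection counts, one per sampling interval.
counts = [590, 610, 600, 620, 605, 595, 3100, 3200, 640]
print(find_spikes(counts))  # [6, 7]
```

Using the median rather than the mean keeps one or two spike samples from dragging the baseline up, so the return to normal is not itself misread.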
Have a look at the Raspberry Pi threads and search for "circuit creation storms." I'm slowly developing a set of defensive iptables rules for low-power relays which you might want to have a look at, but as your machine is far more capable than a Pi, you'll need to adjust accordingly (and then, I hope, contribute back!)
https://github.com/gordon-morehouse/cipollini/tree/master/contrib/90_slowboa...
(Ignore the fail2ban stuff for now, I found a more efficient way to handle the problem with the help of a list reader.)
Thanks Gordon,
I'm not sure I can get iptables up and working on this box. It is more of an appliance. (Tho I did get a build environment on it to build Tor.)
Throttling incoming connections is probably not the answer in this case, as it can still build up to a large number. Throttling handshakes might, but that can't be done on network level.
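A rough way to see why a rate limit alone does not cap the total: by Little's law, the steady-state number of open connections is about the rate of new connections times the average time each one stays open. The sketch below uses made-up numbers (5 new connections per second, each open for 10 minutes); neither figure is measured from this relay.

```python
def steady_state_connections(new_per_sec, mean_lifetime_sec):
    """Little's law: average number open = arrival rate * mean time open."""
    return new_per_sec * mean_lifetime_sec


# Hypothetical: a limit of 5 new connections/sec, each lasting 600 s.
print(steady_state_connections(5, 600))  # 3000
```

So even a fairly strict limit on new connections can still let the count build up to the thousands if the connections are long-lived.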
Anyway, this 'attack' (if that's what it is .. millions of TAP handshakes per hour) doesn't kill my relay, but after those clock jump messages, it is left in a state where it downloads waaay more data than it uploads. As if the circuits it assumed to be no longer working are still sending data that doesn't get relayed.
If this is the case, would it be possible to detect this and either block those circuits, close those connections, or make no assumptions in the first place?
Another one of these is going right now as I write.
When I set my bandwidth rate to 256 KB, download fills that up while upload is at only 150 KB or so. When I set it to 1 MB, download fills that up while upload stays at roughly 150 KB. CPU is well over 100%.
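To put numbers on that imbalance, here is a small Python sketch of the arithmetic. It assumes the approximate figures above (about 150 KB/sec up in both cases) and takes 1 MB as 1024 KB; the function name is hypothetical. It also ignores directory traffic, so it is only a rough estimate.

```python
def unrelayed_share(down_kb_s, up_kb_s):
    """Return (surplus in KB/sec, surplus as a fraction of download)."""
    if down_kb_s <= 0:
        return 0, 0.0
    surplus = max(down_kb_s - up_kb_s, 0)
    return surplus, surplus / down_kb_s


# Rates from the two settings described above; 150 KB/sec up is approximate.
for down in (256, 1024):
    surplus, share = unrelayed_share(down, 150)
    print(f"{down} KB/sec down: {surplus} KB/sec not relayed ({share:.0%})")
```

With these inputs, roughly 41% of incoming traffic goes nowhere at the 256 KB setting and roughly 85% at the 1 MB setting, which fits the upload staying flat while the download fills whatever rate is allowed.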
Normally I have the rate at 10 Mbit, a bit less than my actual bandwidth, but advertise much less. When I don't have 3000+ connections, sometimes I see it do high volume for long times with relatively low CPU load.
This time I'm not going to relaunch it but let it recover on its own, with a lowered bandwidth rate, since most of it is going into a sinkhole anyway.
-Job