Last night, for 4.5 hours (until the VPS provider shut it down), one of my middle relays seems to have got into a bad state where it was using 100% CPU continuously.
It was not using much bandwidth, about 4 MB/s, but from the logs it seems it was getting a lot of circuit requests (70K/minute initially, 100K/minute at the end). Is there anything I can do about it?
Jul 20 19:41:20.000 [notice] Heartbeat: Tor's uptime is 157 days 18:00 hours, with 4642 circuits open. I've sent 13916.64 GB and received 13502.43 GB.
Jul 20 19:41:20.000 [notice] Average packaged cell fullness: 98.154%
Jul 20 19:41:20.000 [notice] TLS write overhead: 2%
Jul 20 19:41:20.000 [notice] Circuit handshake stats since last time: 8823/8823 TAP, 112411/112806 NTor.
Jul 21 00:52:06.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [398 similar message(s) suppressed in last 60 seconds]
Jul 21 00:53:06.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [46893 similar message(s) suppressed in last 60 seconds]
Jul 21 00:54:06.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [66339 similar message(s) suppressed in last 60 seconds]
Jul 21 00:55:06.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [69939 similar message(s) suppressed in last 60 seconds]
Jul 21 00:56:06.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [73055 similar message(s) suppressed in last 60 seconds]
[...]
Jul 21 04:19:08.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [101856 similar message(s) suppressed in last 60 seconds]
Jul 21 04:20:08.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [97494 similar message(s) suppressed in last 60 seconds]
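[Editorial note: as a hedged illustration of the option the warning suggests, a minimal torrc sketch is below. The bandwidth figure is only an illustrative value, not a recommendation, and the exit-policy suggestion does not apply to a middle relay.]

    # torrc sketch (illustrative value only): cap the bandwidth the relay
    # advertises to the directory authorities so it attracts less traffic.
    MaxAdvertisedBandwidth 2 MBytes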
On Fri, 21 Jul 2017 10:24:39 +0100 Pascal Terjan pterjan@gmail.com wrote:
Last night, for 4.5 hours (until the VPS provider shut it down), one of my middle relays seems to have got into a bad state where it was using 100% CPU continuously.
It was not using much bandwidth, about 4 MB/s, but from the logs it seems it was getting a lot of circuit requests (70K/minute initially, 100K/minute at the end). Is there anything I can do about it?
Your message prompted me to check logs, and on one relay I see the following:
Jul 19 03:26:03.000 [notice] Circuit handshake stats since last time: 4242/4242 TAP, 65583/65583 NTor.
Jul 19 09:26:03.000 [notice] Circuit handshake stats since last time: 4897/4897 TAP, 60996/60996 NTor.
Jul 19 15:26:03.000 [notice] Circuit handshake stats since last time: 3682/3682 TAP, 1759242/1759519 NTor.
Jul 19 21:26:04.000 [notice] Circuit handshake stats since last time: 3762/3762 TAP, 5946906/5947506 NTor.
Jul 20 03:26:04.000 [notice] Circuit handshake stats since last time: 1676/1676 TAP, 6794460/6795903 NTor.
Jul 20 09:26:04.000 [notice] Circuit handshake stats since last time: 4432/4433 TAP, 6991546/6993275 NTor.
Jul 20 15:26:04.000 [notice] Circuit handshake stats since last time: 7649/7650 TAP, 1437786/1437748 NTor.
Jul 20 21:26:04.000 [notice] Circuit handshake stats since last time: 7484/7484 TAP, 68388/68388 NTor.
Jul 21 03:26:04.000 [notice] Circuit handshake stats since last time: 6105/6105 TAP, 52027/52027 NTor.
Jul 21 09:26:04.000 [notice] Circuit handshake stats since last time: 5210/5210 TAP, 99365/99365 NTor.
That one has not been shut down (the VPS provider does not seem to have any issue with high CPU usage) and has worked normally since.
Your message prompted me to check logs, and on one relay I see the following:
Similar thing for me:
Jul 19 00:08:27.000 [notice] Circuit handshake stats since last time: 3571/3571 TAP, 41180/41180 NTor.
Jul 19 06:08:27.000 [notice] Circuit handshake stats since last time: 2054/2054 TAP, 29181/29181 NTor.
Jul 19 12:08:28.000 [notice] Circuit handshake stats since last time: 2773/2773 TAP, 26497/26497 NTor.
Jul 19 18:08:28.000 [notice] Circuit handshake stats since last time: 3970/3970 TAP, 31344/31344 NTor.
Jul 20 00:08:28.000 [notice] Circuit handshake stats since last time: 4096/4096 TAP, 41730/41730 NTor.
Jul 20 06:08:28.000 [notice] Circuit handshake stats since last time: 18285/18285 TAP, 54102/54102 NTor.
Jul 20 12:08:28.000 [notice] Circuit handshake stats since last time: 61136/61386 TAP, 378196/378339 NTor.
Jul 20 18:08:29.000 [notice] Circuit handshake stats since last time: 73297/73688 TAP, 566708/566892 NTor.
Jul 21 00:08:29.000 [notice] Circuit handshake stats since last time: 67165/67830 TAP, 572685/572851 NTor.
Jul 21 06:08:29.000 [notice] Circuit handshake stats since last time: 31988/32138 TAP, 521455/521536 NTor.
Jul 21 12:08:29.000 [notice] Circuit handshake stats since last time: 5523/5523 TAP, 222378/222432 NTor.
Also, there are too many "[warn] assign_to_cpuworker failed. Ignoring." lines in the logs.
-- Vort
Vort vvort@yandex.ru wrote:
Your message prompted me to check logs, and on one relay I see the following:
Similar thing for me:
Jul 19 00:08:27.000 [notice] Circuit handshake stats since last time: 3571/3571 TAP, 41180/41180 NTor.
Jul 19 06:08:27.000 [notice] Circuit handshake stats since last time: 2054/2054 TAP, 29181/29181 NTor.
Jul 19 12:08:28.000 [notice] Circuit handshake stats since last time: 2773/2773 TAP, 26497/26497 NTor.
Jul 19 18:08:28.000 [notice] Circuit handshake stats since last time: 3970/3970 TAP, 31344/31344 NTor.
Jul 20 00:08:28.000 [notice] Circuit handshake stats since last time: 4096/4096 TAP, 41730/41730 NTor.
Jul 20 06:08:28.000 [notice] Circuit handshake stats since last time: 18285/18285 TAP, 54102/54102 NTor.
Jul 20 12:08:28.000 [notice] Circuit handshake stats since last time: 61136/61386 TAP, 378196/378339 NTor.
Jul 20 18:08:29.000 [notice] Circuit handshake stats since last time: 73297/73688 TAP, 566708/566892 NTor.
Jul 21 00:08:29.000 [notice] Circuit handshake stats since last time: 67165/67830 TAP, 572685/572851 NTor.
Jul 21 06:08:29.000 [notice] Circuit handshake stats since last time: 31988/32138 TAP, 521455/521536 NTor.
Jul 21 12:08:29.000 [notice] Circuit handshake stats since last time: 5523/5523 TAP, 222378/222432 NTor.
Also, there are too many "[warn] assign_to_cpuworker failed. Ignoring." lines in the logs.
This sort of thing has been going on for many years. I used to refer to it as "mobbing". As nearly as I was ever able to determine, the behavior is an unintended consequence of hidden services. I found that I could greatly reduce the frequency of occurrence, but *not* to zero, by setting
HidServDirectoryV2 0
in my torrc file. My tentative conclusion was that the majority of these events are cases in which a relay has been selected as an HSDir to which a hidden service descriptor has been posted for a very popular hidden service, so by refusing to be a hidden service directory mirror, those cases can be eliminated. I never had a very satisfying hypothesis to explain the remaining minority of cases.
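[Editorial note: a minimal torrc sketch of the workaround Scott describes. Note the caveat in the next reply that this option is considered obsolete as of 0.2.9, so it only applies to old Tor versions that still recognize it.]

    # Decline to serve as a v2 hidden service directory (HSDir).
    # Only recognized by old Tor versions; see the reply below.
    HidServDirectoryV2 0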
Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org   *xor*   bennett at freeshell.org    *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************
I was under the impression that HidServDirectoryV2 was an obsolete config option. I run 0.2.9.11.
On 7/21/2017 3:42 AM, Scott Bennett wrote:
Vort vvort@yandex.ru wrote:
Your message prompted me to check logs, and on one relay I see the following:
Similar thing for me:
Jul 19 00:08:27.000 [notice] Circuit handshake stats since last time: 3571/3571 TAP, 41180/41180 NTor.
Jul 19 06:08:27.000 [notice] Circuit handshake stats since last time: 2054/2054 TAP, 29181/29181 NTor.
Jul 19 12:08:28.000 [notice] Circuit handshake stats since last time: 2773/2773 TAP, 26497/26497 NTor.
Jul 19 18:08:28.000 [notice] Circuit handshake stats since last time: 3970/3970 TAP, 31344/31344 NTor.
Jul 20 00:08:28.000 [notice] Circuit handshake stats since last time: 4096/4096 TAP, 41730/41730 NTor.
Jul 20 06:08:28.000 [notice] Circuit handshake stats since last time: 18285/18285 TAP, 54102/54102 NTor.
Jul 20 12:08:28.000 [notice] Circuit handshake stats since last time: 61136/61386 TAP, 378196/378339 NTor.
Jul 20 18:08:29.000 [notice] Circuit handshake stats since last time: 73297/73688 TAP, 566708/566892 NTor.
Jul 21 00:08:29.000 [notice] Circuit handshake stats since last time: 67165/67830 TAP, 572685/572851 NTor.
Jul 21 06:08:29.000 [notice] Circuit handshake stats since last time: 31988/32138 TAP, 521455/521536 NTor.
Jul 21 12:08:29.000 [notice] Circuit handshake stats since last time: 5523/5523 TAP, 222378/222432 NTor.
Also, there are too many "[warn] assign_to_cpuworker failed. Ignoring." lines in the logs.
This sort of thing has been going on for many years. I used to refer to it as "mobbing". As nearly as I was ever able to determine, the behavior is an unintended consequence of hidden services. I found that I could greatly reduce the frequency of occurrence, but *not* to zero, by setting
HidServDirectoryV2 0
in my torrc file. My tentative conclusion was that the majority of these events are cases in which a relay has been selected as an HSDir to which a hidden service descriptor has been posted for a very popular hidden service, so by refusing to be a hidden service directory mirror, those cases can be eliminated. I never had a very satisfying hypothesis to explain the remaining minority of cases.
Scott Bennett, Comm. ASMELG, CFIAG
This sort of thing has been going on for many years. I used to refer to it as "mobbing". As nearly as I was ever able to determine, the behavior is an unintended consequence of hidden services.
The same thing started to happen today, and I have noticed that 100% CPU usage spikes happen every hour and last for several minutes. During these spikes, all CPU cores are used and the stack trace points somewhere in the worker_thread_main() function. Also, today the relay has more connections than usual (5500 vs 2000-3000). Does this pattern match the characteristics of hidden service activity?
Jul 27 16:09:12.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Jul 27 17:09:13.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Jul 27 18:08:31.000 [notice] Circuit handshake stats since last time: 5198/5200 TAP, 3994625/3995090 NTor.
...
Jul 27 18:09:11.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Jul 27 19:09:11.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Jul 27 20:10:11.000 [warn] assign_to_cpuworker failed. Ignoring.
-- Vort
On 27.07.2017 at 19:48, Vort wrote:
This sort of thing has been going on for many years. I used to refer to it as "mobbing". As nearly as I was ever able to determine, the behavior is an unintended consequence of hidden services.
The same thing started to happen today, and I have noticed that 100% CPU usage spikes happen every hour and last for several minutes. During these spikes, all CPU cores are used and the stack trace points somewhere in the worker_thread_main() function. Also, today the relay has more connections than usual (5500 vs 2000-3000). Does this pattern match the characteristics of hidden service activity?
This happened to my relay as well; performance was degraded quite a lot, and I lost Guard state...
Strange ...
Sebastian
On 28 Jul 2017, at 03:48, Vort vvort@yandex.ru wrote:
This sort of thing has been going on for many years. I used to refer to it as "mobbing". As nearly as I was ever able to determine, the behavior is an unintended consequence of hidden services.
The same thing started to happen today, and I have noticed that 100% CPU usage spikes happen every hour and last for several minutes. During these spikes, all CPU cores are used and the stack trace points somewhere in the worker_thread_main() function. Also, today the relay has more connections than usual (5500 vs 2000-3000). Does this pattern match the characteristics of hidden service activity?
...
Jul 27 18:08:31.000 [notice] Circuit handshake stats since last time: 5198/5200 TAP, 3994625/3995090 NTor.
TAP is used for hidden services to connect to intro and rendezvous points, and you're not seeing many extra TAP connections.
So *if* this is related to hidden services, it is not connecting to the hidden service directly. Instead, it is sending (exit?) traffic through the relays in the hidden service circuit.
The upcoming link padding may partially defend against this, depending on whether guard nodes are being targeted. Otherwise, we would need to use circuit padding, which is an area of active research.
T
-- Tim Wilson-Brown (teor)
teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
xmpp: teor at torproject dot org
Jul 27 18:08:31.000 [notice] Circuit handshake stats since last time: 5198/5200 TAP, 3994625/3995090 NTor.
TAP is used for hidden services to connect to intro and rendezvous points, and you're not seeing many extra TAP connections.
So *if* this is related to hidden services, it is not connecting to the hidden service directly. Instead, it is sending (exit?) traffic through the relays in the hidden service circuit.
I have found that there are two patterns associated with "assign_to_cpuworker failed" errors. The first: heavy overload, millions of NTor handshakes, the consensus weight drops several-fold, and the relay can lose its Guard flag. The second: moderate overload, TAP handshakes slightly increased, weight not affected.
Normal stats:
Jul 24 18:08:29.000 [notice] Circuit handshake stats since last time: 4892/4892 TAP, 61208/61208 NTor.
Jul 25 00:08:29.000 [notice] Circuit handshake stats since last time: 3753/3753 TAP, 61775/61775 NTor.
Jul 25 06:08:29.000 [notice] Circuit handshake stats since last time: 3218/3218 TAP, 57756/57756 NTor.
Jul 25 12:08:29.000 [notice] Circuit handshake stats since last time: 3538/3538 TAP, 56631/56631 NTor.
Jul 25 18:08:29.000 [notice] Circuit handshake stats since last time: 4188/4188 TAP, 60672/60672 NTor.
Overload #1 stats:
Jul 27 12:08:31.000 [notice] Circuit handshake stats since last time: 4715/4715 TAP, 100785/100785 NTor.
Jul 27 18:08:31.000 [notice] Circuit handshake stats since last time: 5198/5200 TAP, 3994625/3995090 NTor.
Jul 28 00:08:31.000 [notice] Circuit handshake stats since last time: 2771/2773 TAP, 4172331/4174404 NTor.
Jul 28 06:08:31.000 [notice] Circuit handshake stats since last time: 1304/1305 TAP, 3899551/3899941 NTor.
Jul 28 12:08:32.000 [notice] Circuit handshake stats since last time: 1415/1416 TAP, 3802487/3803824 NTor.
Jul 28 18:08:32.000 [notice] Circuit handshake stats since last time: 1895/1895 TAP, 843496/843724 NTor.
Jul 29 00:08:32.000 [notice] Circuit handshake stats since last time: 1948/1948 TAP, 34055/34055 NTor.
Overload #2 stats:
Jul 30 06:08:33.000 [notice] Circuit handshake stats since last time: 9288/9288 TAP, 60425/60425 NTor.
Jul 30 12:08:33.000 [notice] Circuit handshake stats since last time: 31739/32038 TAP, 37301/37307 NTor.
Jul 30 18:08:33.000 [notice] Circuit handshake stats since last time: 40316/40993 TAP, 36967/36972 NTor.
Jul 31 00:08:34.000 [notice] Circuit handshake stats since last time: 36414/36830 TAP, 36726/36730 NTor.
Jul 31 06:08:31.000 [notice] Circuit handshake stats since last time: 21715/21801 TAP, 40564/40564 NTor.
I'm not sure what these differences mean, but maybe these stats can help distinguish the sources of the overload (or show that they are the same).
-- Vort
On Thu, Jul 27, 2017 at 08:48:35PM +0300, Vort wrote:
This sort of thing has been going on for many years. I used to refer to it as "mobbing". As nearly as I was ever able to determine, the behavior is an unintended consequence of hidden services.
The same thing started to happen today, and I have noticed that 100% CPU usage spikes happen every hour and last for several minutes. During these spikes, all CPU cores are used and the stack trace points somewhere in the worker_thread_main() function. Also, today the relay has more connections than usual (5500 vs 2000-3000). Does this pattern match the characteristics of hidden service activity?
Yes. Your relay is getting way more circuit create attempts than it usually does, and all your signs point to the "HSDir hotspot" phenomenon.
Each onion service has six relays each day that serve as the place for fetching its onion descriptor, and some onion services are super popular (for example, back in the day, the August 2013 botnet used an onion service for coordinating its millions of bots).
Tor isn't great at parallelizing all of its crypto, but it is good at parallelizing the circuit handshakes, which would be why all of your cores are being used.
And clients building paths to your relay will use a variety of middle hops, meaning more relays than usual will have made a direct connection to your relay lately.
If there's one super popular onion service, then 6/3719 = ~0.16% of the relays will feel the pain on any given day. More than one super popular onion service would make that number go up.
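[Editorial note, spelling out the arithmetic: 6/3719 ≈ 0.0016, i.e. about 0.16% of relays; assuming, say, three such popular services with no overlap in their HSDir sets, that would be roughly 18/3719 ≈ 0.48% of relays on a given day.]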
One fix would be to raise the bar for getting the HSDir flag, so relays with the flag are better able to handle the phenomenon. But shrinking the total number of HSDir relays can make other attacks easier. Another fix would be to spread the load for each onion service, or maybe for popular onion services, over more than six relays. But that would open up more surface area for attacks to e.g. measure popularity of an onion service. I think there is not an easy fix.
--Roger
On 28 Jul 2017 6:59 am, "Roger Dingledine" arma@mit.edu wrote:
On Thu, Jul 27, 2017 at 08:48:35PM +0300, Vort wrote:
This sort of thing has been going on for many years. I used to refer to it as "mobbing". As nearly as I was ever able to determine, the behavior is an unintended consequence of hidden services.
The same thing started to happen today, and I have noticed that 100% CPU usage spikes happen every hour and last for several minutes. During these spikes, all CPU cores are used and the stack trace points somewhere in the worker_thread_main() function. Also, today the relay has more connections than usual (5500 vs 2000-3000). Does this pattern match the characteristics of hidden service activity?
Yes. Your relay is getting way more circuit create attempts than it usually does, and all your signs point to the "HSDir hotspot" phenomenon.
Each onion service has six relays each day that serve as the place for fetching its onion descriptor, and some onion services are super popular (for example, back in the day, the August 2013 botnet used an onion service for coordinating its millions of bots).
Tor isn't great at parallelizing all of its crypto, but it is good at parallelizing the circuit handshakes, which would be why all of your cores are being used.
Thanks, that's useful information. Most of my relays have several cores but only use one; the one where I saw the problem was a 1-core VPS. So a way to handle this problem better would be for me to upgrade my single-core relays.
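[Editorial note: a hedged sketch of the related torrc knob. Tor hands circuit-handshake (onion-skin) work to a pool of worker threads; NumCPUs sets the pool size explicitly, while the default of 0 lets Tor detect the core count itself. This does not help on a single-core VPS.]

    # torrc sketch: allow up to 4 worker threads for onion-skin
    # (circuit handshake) processing; 0 = auto-detect (the default).
    NumCPUs 4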
And clients building paths to your relay will use a variety of middle hops, meaning more relays than usual will have made a direct connection to your relay lately.
If there's one super popular onion service, then 6/3719 = ~0.16% of the relays will feel the pain on any given day. More than one super popular onion service would make that number go up.
One fix would be to raise the bar for getting the HSDir flag, so relays with the flag are better able to handle the phenomenon. But shrinking the total number of HSDir relays can make other attacks easier. Another fix would be to spread the load for each onion service, or maybe for popular onion services, over more than six relays. But that would open up more surface area for attacks to e.g. measure popularity of an onion service. I think there is not an easy fix.
--Roger