0.2.4.17-rc on Pi, a couple weeks on

List overview All Threads
Download

newer

older

Re: [tor-relays] 0.2.4.17-rc on...

Tor Relay getting "Failed to...

Gordon Morehouse

18 Sep 2013 18 Sep '13

1:50 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512

Hey folks,

Here are my reports. First the good news: it's WAY more stable. Then the bad news: it still gets OOM-killed once in a while, possibly preventably.

THE GOOD (notes from September 14):

Here's my Pi relay since compilation of 0.2.4.17-rc:

Sep 14 11:50:59.000 [notice] Circuit handshake stats since last time: 13308/13324 TAP, 22/22 NTor. Sep 14 11:51:51.000 [notice] Heartbeat: Tor's uptime is 7 days 18:00 hours, with 434 circuits open. I've sent 42.85 GB and received 34.05 GB. Sep 14 11:51:51.000 [notice] Average packaged cell fullness: 98.369% Sep 14 11:51:51.000 [notice] TLS write overhead: 9%

That low NTor number is depressing. That high TAP number is scary. TAP is the old way and the way the botnet is using. NTor is post-0.2.4.8, I think.

The replay has settled into a fairly steady state (after losing its flags except Named) of sending 5-10KB more per sec than it gets. I have a feeling this is literally due to the TAP replies being bigger than the TAP requests. It's handling tens of thousands every 6 hours. Load average is steadyish at 0.7. Unlike its predecessor, though, it's not yet crashed under the new network conditions.

...By last night (18 Sept) it had settled in to no Stable flag, but forwarding an essentially random amount of traffic that looked similar minute-to-minute. Load was between 0.9 and 1.1 with tor, top, and nload being really the only active processes. There is still room for tuning with the Pi given that this load was attained relaying about 2Mbps when I looked.

THE BAD:

This morning:

Sep 18 04:50:59.000 [notice] Circuit handshake stats since last time: 34818/41100 TAP, 50/52 NTor. Sep 18 05:30:43.000 [warn] Your system clock just jumped 101 seconds forward; assuming established circuits no longer work.

The system clock thing is a Pi thing. I'm pretty sure I need to get much more aggressive with ntpd. (Pis have no battery backup and tend to have pretty bad clock drift.)

But then, 2 minutes later:

Sep 18 05:32:20 tulameen kernel: [2188444.188460] Out of memory: Kill process 7544 (tor) score 148 or sacrifice child Sep 18 05:32:20 tulameen kernel: [2188444.188475] Killed process 7544 (tor) total-vm:153352kB, anon-rss:115156kB, file-rss:36712kB

Coincidence? Bug? Smells like the latter. Shouldn't Tor be shedding memory if it closes all its circuits, not acquiring more?

Best, - -Gordon M.

...PGP SIGNATURE...

Show replies by date

Gordon Morehouse

18 Sep 18 Sep

2:20 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512

Addendum: restarting tor instantly puts my router into a tailspin this morning. This is a WRT54G (old school, 3.0 hardware, 200MHz MIPS). While that's old, there are many, many consumer routers out there with similar specs and worse firmware. In this case it causes major problems with DNS.

I'd like to figure out what is going on with this in order to prevent it from happening as part of the Cipollini project[1] so (when the time comes) we're not distributing images for Raspberry Pi which crash people's routers. :(

Request timeout for icmp_seq 847981 64 bytes from 192.168.1.1: icmp_seq=61550 ttl=64 time=1.136 ms Request timeout for icmp_seq 847983 Request timeout for icmp_seq 847984 Request timeout for icmp_seq 847985 64 bytes from 192.168.1.1: icmp_seq=61554 ttl=64 time=0.917 ms Request timeout for icmp_seq 847987 64 bytes from 192.168.1.1: icmp_seq=61556 ttl=64 time=0.929 ms Request timeout for icmp_seq 847989 Request timeout for icmp_seq 847990 64 bytes from 192.168.1.1: icmp_seq=61559 ttl=64 time=0.929 ms 64 bytes from 192.168.1.1: icmp_seq=61560 ttl=64 time=0.922 ms Request timeout for icmp_seq 847993 Request timeout for icmp_seq 847994

Best, - -Gordon M.

Gordon Morehouse:

- -- Sent from my thing that sends email.

...PGP SIGNATURE...

Gordon Morehouse

3:04 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512

Addendum to addendum: the router fail is definitely caused by Tor connections filling up the router's ip_conntrack table - once it gets near full, it somehow interferes with a couple other services on my router (especially DNSmasq) even if there is free RAM. I will need to figure out some iptables tricks for the Pi, which I've long known, to prevent this, just no time yet.

Note that somehow, due to a brief enough hiccup I guess, my Pi relay retained Named, Stable and Fast this morning, so as soon as I restarted it it was instantly slammed with thousands of connections.

I may need to do the kludge of rate-limiting incoming connections to the Tor ports for now, using iptables.

Also of note: regarding the ntp and time/clock issue: it appears that because I was using a particular stripped image of Raspbian, some spurious .conf and init.d files were left for the Raspbian 'ntp' package, which I purged, and ensured that only 'ntpdate' (for setting the clock at startup, run in /etc/rc.local) and 'openntpd' are installed.

Best, - -Gordon M.

Gordon Morehouse:

...

Addendum: restarting tor instantly puts my router into a tailspin this morning. This is a WRT54G (old school, 3.0 hardware, 200MHz MIPS). While that's old, there are many, many consumer routers out there with similar specs and worse firmware. In this case it causes major problems with DNS.

I'd like to figure out what is going on with this in order to prevent it from happening as part of the Cipollini project[1] so (when the time comes) we're not distributing images for Raspberry Pi which crash people's routers. :(

Request timeout for icmp_seq 847981 64 bytes from 192.168.1.1: icmp_seq=61550 ttl=64 time=1.136 ms Request timeout for icmp_seq 847983 Request timeout for icmp_seq 847984 Request timeout for icmp_seq 847985 64 bytes from 192.168.1.1: icmp_seq=61554 ttl=64 time=0.917 ms Request timeout for icmp_seq 847987 64 bytes from 192.168.1.1: icmp_seq=61556 ttl=64 time=0.929 ms Request timeout for icmp_seq 847989 Request timeout for icmp_seq 847990 64 bytes from 192.168.1.1: icmp_seq=61559 ttl=64 time=0.929 ms 64 bytes from 192.168.1.1: icmp_seq=61560 ttl=64 time=0.922 ms Request timeout for icmp_seq 847993 Request timeout for icmp_seq 847994

Best, -Gordon M.

Gordon Morehouse:

_______________________________________________ tor-relays mailing list tor-relays@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

- -- Sent from my thing that sends email.

...PGP SIGNATURE...

josh

5:16 p.m.

You may be able to increase the ip_conntrack_max on your router. I had a terrible verizon dsl router that would have its connection tracking capacity exhausted by pings to games servers. I was able to partially resolve the problem my telnetting (yea I know) into the router and setting the ip_conntrack_max from 1000 to 65000. You might also want to reduce the amount of time TCP spends in TIME-WAIT.

Ultimately I replaced the router with a pi based solution with much greater resources.

On 2013-09-18 11:04, Gordon Morehouse wrote:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512

Addendum to addendum: the router fail is definitely caused by Tor connections filling up the router's ip_conntrack table - once it gets near full, it somehow interferes with a couple other services on my router (especially DNSmasq) even if there is free RAM. I will need to figure out some iptables tricks for the Pi, which I've long known, to prevent this, just no time yet.

Note that somehow, due to a brief enough hiccup I guess, my Pi relay retained Named, Stable and Fast this morning, so as soon as I restarted it it was instantly slammed with thousands of connections.

I may need to do the kludge of rate-limiting incoming connections to the Tor ports for now, using iptables.

Also of note: regarding the ntp and time/clock issue: it appears that because I was using a particular stripped image of Raspbian, some spurious .conf and init.d files were left for the Raspbian 'ntp' package, which I purged, and ensured that only 'ntpdate' (for setting the clock at startup, run in /etc/rc.local) and 'openntpd' are installed.

Best,

-Gordon M.

Gordon Morehouse:

...
Addendum: restarting tor instantly puts my router into a tailspin this morning. This is a WRT54G (old school, 3.0 hardware, 200MHz MIPS). While that's old, there are many, many consumer routers out there with similar specs and worse firmware. In this case it causes major problems with DNS.

I'd like to figure out what is going on with this in order to prevent it from happening as part of the Cipollini project[1] so (when the time comes) we're not distributing images for Raspberry Pi which crash people's routers. :(

Request timeout for icmp_seq 847981 64 bytes from 192.168.1.1: icmp_seq=61550 ttl=64 time=1.136 ms Request timeout for icmp_seq 847983 Request timeout for icmp_seq 847984 Request timeout for icmp_seq 847985 64 bytes from 192.168.1.1: icmp_seq=61554 ttl=64 time=0.917 ms Request timeout for icmp_seq 847987 64 bytes from 192.168.1.1: icmp_seq=61556 ttl=64 time=0.929 ms Request timeout for icmp_seq 847989 Request timeout for icmp_seq 847990 64 bytes from 192.168.1.1: icmp_seq=61559 ttl=64 time=0.929 ms 64 bytes from 192.168.1.1: icmp_seq=61560 ttl=64 time=0.922 ms Request timeout for icmp_seq 847993 Request timeout for icmp_seq 847994

Best, -Gordon M.

Gordon Morehouse:

_______________________________________________ tor-relays mailing list tor-relays@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

Sent from my thing that sends email. -----BEGIN PGP SIGNATURE-----

iQEcBAEBCgAGBQJSOcEIAAoJED/jpRoe7/ujY6QIAImt9T6uaH6OYIZsLkmNAwTm 3d+QyDVAz/tewS732QOqhnqqB4eMAnWsec7wNQB0ZmD5H1pkqFDlZqNxQqeAF/Zv VNNM2IG8nCJGLuvkKE24ta/qpwpLAZY6LvObzTNh9IxYfIteMY4+zU06XRd5jS1J QN5+RPMOAhL50kaGjVW65r2lDB5/XQdBEoIA3LI4yVCaEUCtBEzC3S3jlzPIxqR7 LVrBACMi0W6A43m3OMvxpejFWMahoATYiZVYmZWc/LysGgmyn70rav47rh9/0psh gRvnHAF+5YHytgSrDxW1+H9fmA0PnAlbv8YGNkvwLCXGo39oChUc9W34Im9kuSc= =x7pi -----END PGP SIGNATURE----- _______________________________________________ tor-relays mailing list tor-relays@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

Gordon Morehouse

19 Sep 19 Sep

1:14 a.m.

On Wed, 18 Sep 2013 13:16:48 -0400, josh josh@allensw.com wrote:

...

You may be able to increase the ip_conntrack_max on your router. I had

I can, and have, but eventually its 16MB of RAM becomes a problem. ;)

The bigger deal, though, is I'm attempting to cobble together a set of scripts and best practices to allow a Raspberry Pi model B (512MB) to be turned into a plug-and-forget relay. Thus it can't be crashing consumer routers - even crappy ones - or messing up DNS or video streaming or or or.

...

a terrible verizon dsl router that would have its connection tracking capacity exhausted by pings to games servers. I was able to partially resolve the problem my telnetting (yea I know) into the router and setting the ip_conntrack_max from 1000 to 65000. You might also want to reduce the amount of time TCP spends in TIME-WAIT.

Definitely shortened the TCP timeouts at the router, with the intent to eventually move that into the Pi itself if feasible and useful.

...

Ultimately I replaced the router with a pi based solution with much greater resources.

My old WRT54G is pretty long in the tooth these days... still amazingly capable though.

Best, -Gordon M.

Roger Dingledine

18 Sep 18 Sep

8:43 p.m.

On Wed, Sep 18, 2013 at 06:50:46AM -0700, Gordon Morehouse wrote:

...

The replay has settled into a fairly steady state (after losing its flags except Named) of sending 5-10KB more per sec than it gets. I have a feeling this is literally due to the TAP replies being bigger than the TAP requests.

TAP replies and requests are both packaged in normal 512-byte Tor cells.

Typically extra write bytes happen because of people fetching directory information from your relay. That trend has increased over the past month, with five million more Tor clients online (and updating their directory info) than before.

You can compare by turning your dirport to 0. Unless you did that already?

...

Sep 18 05:30:43.000 [warn] Your system clock just jumped 101 seconds forward; assuming established circuits no longer work.

Shouldn't Tor be shedding memory if it closes all its circuits, not acquiring more?

Ah -- that log message about established circuits is talking about client circuits. It doesn't close circuits that it's relaying for other people.

(In fact, it doesn't even close all the circuits which it had created. It leaves open the ones that have streams on them, in hope that the streams will still work.)

--Roger

Gordon Morehouse

19 Sep 19 Sep

12:43 a.m.

On Wed, 18 Sep 2013 16:43:25 -0400, Roger Dingledine arma@mit.edu wrote:

...

On Wed, Sep 18, 2013 at 06:50:46AM -0700, Gordon Morehouse wrote:

...
The replay has settled into a fairly steady state (after losing its flags except Named) of sending 5-10KB more per sec than it gets. I have a feeling this is literally due to the TAP replies being bigger than the TAP requests.

TAP replies and requests are both packaged in normal 512-byte Tor cells.

Typically extra write bytes happen because of people fetching directory information from your relay. That trend has increased over the past month, with five million more Tor clients online (and updating their directory info) than before.

You can compare by turning your dirport to 0. Unless you did that already?

Nope, but that sounds about right - I definitely am running the directory service.

...

...
Sep 18 05:30:43.000 [warn] Your system clock just jumped 101 seconds forward; assuming established circuits no longer work.

Shouldn't Tor be shedding memory if it closes all its circuits, not acquiring more?

Ah -- that log message about established circuits is talking about client circuits. It doesn't close circuits that it's relaying for other people.

(In fact, it doesn't even close all the circuits which it had created. It leaves open the ones that have streams on them, in hope that the streams will still work.)

Thanks, Roger. I'm still not sure what finally caused the OOM-killer crash this morning after almost a couple weeks (?) of uptime. I was also seeing additional clock jump messages but didn't have time to diagnose it. The Pi does not have a battery-backed RTC so it requires a clock set at each start, accomplished by 'ntpdate' and kept in time with, in my case, 'openntpd'. But, 'openntpd' wasn't complaining at the time of the clock jumps that were reported by Tor, and AFAIK 'ntpdate' is not scheduled to run periodically, so I don't know what's causing it yet.

I'll continue to work on the Pi relay tuning, watch it closely, and report back. It is working better than ever with the latest 0.2.4.x but starting to hit CPU limits around 2Mbps (with little to no intensive optimization, I might add).

Best, -Gordon M.

Andy Isaacson

1:28 a.m.

On Wed, Sep 18, 2013 at 05:43:13PM -0700, Gordon Morehouse wrote:

...

Thanks, Roger. I'm still not sure what finally caused the OOM-killer crash this morning after almost a couple weeks (?) of uptime. I was also seeing additional clock jump messages but didn't have time to diagnose it. The Pi does not have a battery-backed RTC so it requires a clock set at each start, accomplished by 'ntpdate' and kept in time with, in my case, 'openntpd'. But, 'openntpd' wasn't complaining at the time of the clock jumps that were reported by Tor, and AFAIK 'ntpdate' is not scheduled to run periodically, so I don't know what's causing it yet.

The "Your system clock just jumped 100 seconds forward" messages are unlikely to be due to NTP. Much more likely the Tor daemon was blocked for a significant time period, due to swapping or similar.

What does top show? In particular the "Mem" and "Swap" lines, and the process line for the Tor process. Here's a large Xeon server running 4 Tor daemons:

top - 18:26:40 up 16 days, 18:18, 1 user, load average: 3.79, 3.95, 3.96 Tasks: 99 total, 4 running, 95 sleeping, 0 stopped, 0 zombie Cpu(s): 69.0%us, 5.1%sy, 0.0%ni, 19.4%id, 0.0%wa, 0.0%hi, 6.5%si, 0.0%st Mem: 8176824k total, 6361748k used, 1815076k free, 36336k buffers Swap: 0k total, 0k used, 0k free, 183736k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 14567 debian-t 20 0 2003m 1.8g 21m R 109 22.7 4477:00 tor 30171 debian-t 20 0 2390m 1.8g 20m R 100 23.6 13544:10 tor 18891 debian-t 20 0 1800m 1.5g 23m R 86 19.8 3134:31 tor 8798 debian-t 20 0 324m 149m 31m S 21 1.9 36:48.05 tor

A similar picture from your RPi might shed some light on the situation.

-andy

Gordon Morehouse

10:39 p.m.

On Wed, 18 Sep 2013 18:28:18 -0700, Andy Isaacson adi@hexapodia.org wrote:

...

On Wed, Sep 18, 2013 at 05:43:13PM -0700, Gordon Morehouse wrote:

...
Thanks, Roger. I'm still not sure what finally caused the OOM-killer crash this morning after almost a couple weeks (?) of uptime. I was also seeing additional clock jump messages but didn't have time to diagnose it. The Pi does not have a battery-backed RTC so it requires a clock set at each start, accomplished by 'ntpdate' and kept in time with, in my case, 'openntpd'. But, 'openntpd' wasn't complaining at the time of the clock jumps that were reported by Tor, and AFAIK 'ntpdate' is not scheduled to run periodically, so I don't know what's causing it yet.

The "Your system clock just jumped 100 seconds forward" messages are unlikely to be due to NTP. Much more likely the Tor daemon was blocked for a significant time period, due to swapping or similar.

I figured it might have been a really bad clock, but that seemed unlikely. I wouldn't remotely be surprised if what you are saying is true as the Pi is pretty memory-constrained. I'll try to correlate/reproduce it.

When the Pi goes to swap, it's getting desperate - it's swapping to a Class 10 microSD card. :/

...

What does top show? In particular the "Mem" and "Swap" lines, and the process line for the Tor process. Here's a large Xeon server running 4 Tor daemons:

top - 18:26:40 up 16 days, 18:18, 1 user, load average: 3.79, 3.95, 3.96 Tasks: 99 total, 4 running, 95 sleeping, 0 stopped, 0 zombie Cpu(s): 69.0%us, 5.1%sy, 0.0%ni, 19.4%id, 0.0%wa, 0.0%hi, 6.5%si, 0.0%st Mem: 8176824k total, 6361748k used, 1815076k free, 36336k buffers Swap: 0k total, 0k used, 0k free, 183736k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 14567 debian-t 20 0 2003m 1.8g 21m R 109 22.7 4477:00 tor 30171 debian-t 20 0 2390m 1.8g 20m R 100 23.6 13544:10 tor 18891 debian-t 20 0 1800m 1.5g 23m R 86 19.8 3134:31 tor 8798 debian-t 20 0 324m 149m 31m S 21 1.9 36:48.05 tor

A similar picture from your RPi might shed some light on the situation.

*Typically,* tor consumes about 25-40% of all available RAM to handle 300-800 circuits and 0.5-2Mbps of traffic (it boots with 485MB physical available after all is said and done). It always seems to get OOM-killed when I'm sleeping.

Best, -Gordon M.

4067

Age (days ago)

4068

Last active (days ago)

tor-relays@lists.torproject.org

8 comments

4 participants

tags (0)

participants (4)

Andy Isaacson
Gordon Morehouse
josh
Roger Dingledine