Date: Wed, 25 Mar 2015 17:21:50 +0800 From: Vincent Yu v@v-yu.com
On Wed, Mar 25, 2015 at 4:26 PM, skyhighatrist skyhighatrist@tfwno.gf wrote:
I am wondering if anyone has had their relay randomly crash in the past week or so. Three of mine (I run 6, non-exits) have fallen over: one of them ~5 days ago, one ~4 days ago, and one earlier today.
This also happened to my two relays about three weeks ago. They are:
https://globe.torproject.org/#/relay/C309A31AD772FFDD0805C9FECB6D4748A7CBF68... https://globe.torproject.org/#/relay/18BE989663CF3351F73D33C672BB1C985E0EA5D...
They are both middle/guard relays (about 200 Mbps each) on the 0.2.6 branch and are on the same dedicated server. They went down at different times, and as far as I can remember, there was nothing notable in the Tor and system logs. No issues prior to this over the past 12 months. I haven't had time to investigate this.
This also happened to my EC2 relay a few weeks ago, but I never had time to investigate. It's quite possible it was a bad nightly tor version, or some sort of misconfiguration on my end, but the timing matches. I didn't manage to get it to come back up in the limited time I had to try and fix it.
https://globe.torproject.org/#/relay/425E55A8FA145ACFC01FA58CD4E6F46DD7762AA... (No record, it's been down for a while.)
teor
teor2345 at gmail dot com pgp 0xABFED1AC https://gist.github.com/teor2345/d033b8ce0a99adbc89c5
teor at blah dot im OTR C3C57B23 349825DE 929A1DEF C3531C25 A32287ED
Hmmm. "NationalCrimeAgency" and "SnowCrash" fell over again sometime last night it would appear. I have now enabled debug logging on all hosts, and updated them to the latest available (in the repositories those hosts are using) version of Tor.
I also note, as others have mentioned (namely in the "CircuitStorm" issue in the tracker), that the relay which also runs a HS has been coming under incredibly high load - CPU maxed out at 100%. Given my HS is rarely used, such load is ... odd. Other relay and HS operators I have spoken with privately have mentioned that their HS nodes have also locked up at 100% recently.
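(If anyone else wants to check where that CPU is going, a per-thread view usually shows quickly whether it's tor's main thread or the cpuworkers that are pegged. A rough sketch, assuming a Linux host running a single tor instance:

top -H -p $(pidof tor)
perf top -p $(pidof tor)

The first gives per-thread CPU use; the second - only if perf happens to be installed - shows which functions are hot.)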
I'll provide debug logs if this happens again after the updates have been applied, which might help with getting to the bottom of this mess.
.d
Ok, another one went down. Here are the logs - the last 30 lines of each - pasted into the email, unredacted. Given I just got back from the fucking pub, if someone else spots the weirdness in this... I'll buy them a pint the next time I meet them. Now, keep a close eye on the timestamps.
You will note that tor "restarts" or "shuts down" about 2 hours before it seemingly crashes. The last line of the debug log was incomplete - it didn't even have a newline, just *DEAD*. I have restarted the node in the hope that I can capture more data.
If anyone recognises this behaviour, please let me know.
root@nationalcrime:/var/log/tor# tail -n 30 log
Mar 26 11:16:30.000 [notice] Bootstrapped 85%: Finishing handshake with first hop.
Mar 26 11:16:30.000 [notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
Mar 26 11:16:31.000 [notice] Bootstrapped 90%: Establishing a Tor circuit.
Mar 26 11:16:32.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Mar 26 11:16:32.000 [notice] Bootstrapped 100%: Done.
Mar 26 11:16:33.000 [notice] Self-testing indicates your DirPort is reachable from the outside. Excellent.
Mar 26 11:16:34.000 [notice] Performing bandwidth self-test...done.
Mar 26 11:20:57.000 [notice] Interrupt: we have stopped accepting new connections, and will shut down in 30 seconds. Interrupt again to exit now.
Mar 26 11:21:27.000 [notice] Clean shutdown finished. Exiting.
Mar 26 11:21:46.000 [notice] Tor 0.2.4.26 (git-0b9fcb34f1996a74) opening log file.
Mar 26 11:21:46.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Mar 26 11:21:46.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Mar 26 11:21:46.000 [notice] Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.
Mar 26 11:21:47.000 [warn] I have no descriptor for the router named "ChickenLiver" in my declared family; I'll use the nickname as is, but this may confuse clients.
Mar 26 11:21:47.000 [warn] I have no descriptor for the router named "Neuromancer" in my declared family; I'll use the nickname as is, but this may confuse clients.
Mar 26 11:21:47.000 [warn] I have no descriptor for the router named "Cyberia" in my declared family; I'll use the nickname as is, but this may confuse clients.
Mar 26 11:21:47.000 [warn] I have no descriptor for the router named "necronomicon" in my declared family; I'll use the nickname as is, but this may confuse clients.
Mar 26 11:21:47.000 [warn] I have no descriptor for the router named "SnowCrash" in my declared family; I'll use the nickname as is, but this may confuse clients.
Mar 26 11:21:47.000 [notice] Your Tor server's identity key fingerprint is 'NationalCrimeAgency AC9803701F9EE18194D40B38E47CE4C68CF2F567'
Mar 26 11:21:53.000 [notice] We now have enough directory information to build circuits.
Mar 26 11:21:53.000 [notice] Bootstrapped 80%: Connecting to the Tor network.
Mar 26 11:21:53.000 [notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
Mar 26 11:21:53.000 [notice] Bootstrapped 85%: Finishing handshake with first hop.
Mar 26 11:21:54.000 [notice] Bootstrapped 90%: Establishing a Tor circuit.
Mar 26 11:21:55.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Mar 26 11:21:55.000 [notice] Bootstrapped 100%: Done.
Mar 26 11:22:55.000 [notice] Self-testing indicates your DirPort is reachable from the outside. Excellent.
Mar 26 11:22:56.000 [notice] Performing bandwidth self-test...done.
Mar 26 11:24:34.000 [notice] Interrupt: we have stopped accepting new connections, and will shut down in 30 seconds. Interrupt again to exit now.
Mar 26 11:25:04.000 [notice] Clean shutdown finished. Exiting.
DEBUG LOG BEGINS....
root@nationalcrime:/var/log/tor# tail -n 30 debug.log
Mar 26 13:03:01.000 [debug] circuit_receive_relay_cell(): Passing on unrecognized cell.
Mar 26 13:03:01.000 [debug] connection_or_process_cells_from_inbuf(): 553: starting, inbuf_datalen 1576 (0 pending in tls object).
Mar 26 13:03:01.000 [debug] channel_queue_cell(): Directly handling incoming cell_t 0x7fff332939b0 for channel 0x7f238f3dbc80 (global ID 572)
Mar 26 13:03:01.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f238fc15550 for circ_id 2147484893, channel ID 572 (0x7f238f3dbc80)
Mar 26 13:03:01.000 [debug] circuit_receive_relay_cell(): Passing on unrecognized cell.
Mar 26 13:03:01.000 [debug] connection_or_process_cells_from_inbuf(): 553: starting, inbuf_datalen 1062 (0 pending in tls object).
Mar 26 13:03:01.000 [debug] channel_queue_cell(): Directly handling incoming cell_t 0x7fff332939b0 for channel 0x7f238f3dbc80 (global ID 572)
Mar 26 13:03:01.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f238fc15550 for circ_id 2147484893, channel ID 572 (0x7f238f3dbc80)
Mar 26 13:03:01.000 [debug] circuit_receive_relay_cell(): Passing on unrecognized cell.
Mar 26 13:03:01.000 [debug] connection_or_process_cells_from_inbuf(): 553: starting, inbuf_datalen 548 (0 pending in tls object).
Mar 26 13:03:01.000 [debug] channel_queue_cell(): Directly handling incoming cell_t 0x7fff332939b0 for channel 0x7f238f3dbc80 (global ID 572)
Mar 26 13:03:01.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f238fc15550 for circ_id 2147484893, channel ID 572 (0x7f238f3dbc80)
Mar 26 13:03:01.000 [debug] circuit_receive_relay_cell(): Passing on unrecognized cell.
Mar 26 13:03:01.000 [debug] connection_or_process_cells_from_inbuf(): 553: starting, inbuf_datalen 34 (0 pending in tls object).
Mar 26 13:03:01.000 [debug] conn_write_callback(): socket 102 wants to write.
Mar 26 13:03:01.000 [debug] flush_chunk_tls(): flushed 4064 bytes, 48 ready to flush, 48 remain.
Mar 26 13:03:01.000 [debug] flush_chunk_tls(): flushed 48 bytes, 0 ready to flush, 0 remain.
Mar 26 13:03:01.000 [debug] connection_handle_write_impl(): After TLS write of 4112: 0 read, 4170 written
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f2393112f38 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f23931314f8 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f239312baf0 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f239311d070 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f23931277f0 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f23931180e0 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f23931260e8 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f239312e2b8 to channel 0x7f239059ef50 with global ID 88
Mar 26 13:03:01.000 [debug] channel_flush_from_first_active_circuit(): Made a circuit inactive.
Mar 26 13:03:01.000 [debug] conn_read_callback(): socket 301 wants to read.
Mar 26 13:03:01.000 [debug] connection_read_to_buf(): 301: starting, inbuf_datalen 0 (0 pending in tls object). at_most 16448.
Mar 26 13:03:01.000 [debug] connection_read_to_buf(): Aft
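For what it's worth, those "Interrupt: we have stopped accepting new connections" lines in the notice log are what tor prints when it receives SIGINT (a controller issuing SIGNAL SHUTDOWN behaves the same way), so something on the box is asking it to stop rather than tor dying on its own. If anyone wants to catch whatever is sending that signal, an auditd rule roughly like the following should record the sending PID - a sketch only, assuming auditd is installed on these hosts; the key name is arbitrary:

auditctl -a always,exit -F arch=b64 -S kill -F a1=2 -k tor_sigint
ausearch -k tor_sigint -i

After the next shutdown, the ausearch output should show the pid and comm of whatever issued the SIGINT (signal 2 is the second argument to the kill syscall, hence the a1=2 filter).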
On 26 Mar 2015, at 12:38, teor teor2345@gmail.com wrote:
This also happened to my EC2 relay a few weeks ago, but I never had time to investigate. It's quite possible it was a bad nightly tor version, or some sort of misconfiguration on my end, but the timing matches. I didn't manage to get it to come back up in the limited time I had to try and fix it.
https://globe.torproject.org/#/relay/425E55A8FA145ACFC01FA58CD4E6F46DD7762AA... (No record, it's been down for a while.)
As a follow-up, Amazon has now informed me that the hardware the instance was on has undergone "irreparable failure". So it appears likely that the tor crashes were hardware-related.
teor
teor2345 at gmail dot com pgp 0xABFED1AC https://gist.github.com/teor2345/d033b8ce0a99adbc89c5
teor at blah dot im OTR D5BE4EC2 255D7585 F3874930 DB130265 7C9EBBC7