Out of 9900 possible two-hop Tor circuits among the top 100 Tor relays, only 935 circuit builds succeeded. This is far worse than the last report I sent, six months ago during the Montreal Tor dev meeting.
Here's the scanner I use:
https://github.com/david415/tor_partition_scanner
(I was planning on improving this testing methodology in collaboration with Katharina Kohls but was unable to travel to Bochum University because of visa limitations. It was either go to tor-dev meeting or Bochum but not both.)
Here's the gist of my simple testing methodology:
https://gist.github.com/david415/9875821652018431dd6d6c4407bb90c0#file-detec...
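The core of the methodology is enumerating every ordered (first_hop, second_hop) pair and probing them in shuffled order. A minimal sketch of that pair-enumeration step (a hypothetical helper, not the scanner's actual code):

```python
import itertools
import random

def circuit_pairs(relays, seed=None):
    """Enumerate all ordered (first_hop, second_hop) pairs of distinct
    relays, then shuffle so probes are not clustered on any one relay."""
    pairs = list(itertools.permutations(relays, 2))
    random.Random(seed).shuffle(pairs)
    return pairs

# 100 relays yield 100 * 99 = 9900 ordered two-hop circuits.
relays = ["relay%02d" % i for i in range(100)]
pairs = circuit_pairs(relays, seed=42)
print(len(pairs))  # 9900
```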
Here's exactly how I performed the scan to get those results:
wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2018-0...
./helpers/query_fingerprints_from_consensus_file.py 2018-03-13-01-00-00-consensus > top100.relays
detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
  --relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
  --build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100
echo "select first_hop, second_hop from scan_log where status = 'failure';" | sqlite3 scan1.db | wc -l
8942
echo "select first_hop, second_hop from scan_log where status = 'timeout';" | sqlite3 scan1.db | wc -l
23
echo "select first_hop, second_hop from scan_log where status = 'success';" | sqlite3 scan1.db | wc -l
935
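The three counts above can also be pulled in one pass with a GROUP BY instead of three separate `wc -l` pipelines. A sketch against an in-memory database, assuming a minimal `scan_log` schema (the real table has more columns):

```python
import sqlite3

# Assumed minimal schema; the real scan_log has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scan_log (first_hop TEXT, second_hop TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO scan_log VALUES (?, ?, ?)",
    [("A", "B", "failure"), ("A", "C", "success"),
     ("B", "C", "timeout"), ("B", "A", "failure")],
)

# One query instead of one pipeline per status value.
for status, n in conn.execute(
        "SELECT status, COUNT(*) FROM scan_log GROUP BY status ORDER BY status"):
    print(status, n)
```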
On 13 Mar 2018, at 03:55, dawuud dawuud@riseup.net wrote:
Out of 9900 possible two-hop Tor circuits among the top 100 Tor relays, only 935 circuit builds succeeded. This is far worse than the last report I sent, six months ago during the Montreal Tor dev meeting.
How much worse?
And where did you scan *from*? (It's hard to interpret the results without the latency and quality of your client connection.)
Also, we have just deployed defences to exactly this kind of rapid circuit or connection building by a single client. I wonder if your client triggered those defences. The circuit defences would likely cause timeouts, and the connection defences would likely cause failures.
I also wonder if your client triggered custom defences on some relays.
Here's the scanner I use:
https://github.com/david415/tor_partition_scanner
…
Here's the gist of my simple testing methodology:
https://gist.github.com/david415/9875821652018431dd6d6c4407bb90c0#file-detec...
Here's exactly how I performed the scan to get those results:
wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2018-0...
./helpers/query_fingerprints_from_consensus_file.py 2018-03-13-01-00-00-consensus > top100.relays
detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
  --relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
  --build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100
You might get better results if you scan more slowly. Try to stay under 1 circuit every 3 seconds to each relay from your IP address. Try to stay under 50 connections to the same relay from your IP address.
I'm going from memory, check the Tor man page, dir-spec, and the consensus for the latest DDoS parameter values.
T
How much worse?
During the Montreal tor dev meeting I counted 1947 circuit build failures. https://lists.torproject.org/pipermail/tor-project/2017-October/001492.html
And where did you scan *from*?
I scanned from a server in the Netherlands.
(It's hard to interpret the results without the latency and quality of your client connection.)
I can record latency. What do you mean by quality? I mean... I'm not using these circuits to actually send and receive stuff.
Also, we have just deployed defences to exactly this kind of rapid circuit or connection building by a single client. I wonder if your client triggered those defences. The circuit defences would likely cause timeouts, and the connection defences would likely cause failures.
aha! That might explain the terrible results, hopefully it's not that network health has gotten worse in the last six months.
I also wonder if your client triggered custom defences on some relays.
I doubt it. I am not making sequential circuits to the same relays. The relays chosen for each circuit build are selected from a shuffle.
You might get better results if you scan more slowly. Try to stay under 1 circuit every 3 seconds to each relay from
OK. I will try this. The scan will take longer but hopefully produce more accurate and useful results.
your IP address. Try to stay under 50 connections to the same relay from your IP address.
hmm OK. I can limit the number of concurrent circuits being built, but I do not believe txtorcon lets me control the number of "connections" that little-t tor makes.
I'm going from memory, check the Tor man page, dir-spec, and the consensus for the latest DDoS parameter values.
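Teor's pacing advice (no more than one circuit every 3 seconds through any one relay) can be sketched as client-side bookkeeping. This is a hypothetical helper, not part of the scanner; the actual circuit building would live elsewhere:

```python
import time
from collections import defaultdict

class RelayPacer:
    """Track when a circuit last launched through each relay and compute
    how long to wait so no relay sees more than one circuit per
    `min_interval` seconds from this client. Bookkeeping only."""

    def __init__(self, min_interval=3.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_launch = defaultdict(lambda: float("-inf"))

    def delay_for(self, *relays):
        """Seconds to sleep before a circuit touching `relays` may launch."""
        now = self.clock()
        soonest = max(self.last_launch[r] + self.min_interval for r in relays)
        return max(0.0, soonest - now)

    def record_launch(self, *relays):
        now = self.clock()
        for r in relays:
            self.last_launch[r] = now
```

A scan loop would call `delay_for` on both hops of the next pair, sleep that long, then `record_launch` before building the circuit.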
dawuud dawuud@riseup.net writes:
your IP address. Try to stay under 50 connections to the same relay from your IP address.
hmm OK. I can limit the number of concurrent circuits being built, but I do not believe txtorcon lets me control the number of "connections" that little-t tor makes.
I *think* they should be equivalent? Controllers can't control everything Tor does, though (for example, Tor can decide to set up circuits to fetch things or do its own measurements).
Related to this might be my own scanner; I keep 20 circuits in-flight at any one time and am using random guards so it's "very unlikely" I'd even have two connections to the same first hop at the same time. However, I don't do anything about timing -- maybe we can take up this discussion in an IRC channel?
And where did you scan *from*? (It's hard to interpret the results without the latency and quality of your client connection.)
It turns out I am recording circuit build latency. It is unclear to me exactly what you'd like me to do with this information, but here are some simple queries:
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 50 AND duration < 60;" | sqlite3 scan1.db
55.2818120117187
51.7696379394531
59.9406301269531
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 40 AND duration < 50;" | sqlite3 scan1.db
41.0546398925781
40.1456608886719
48.2474660644531
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 30 AND duration < 40;" | sqlite3 scan1.db
31.6949631347656
34.8123491210938
37.0733110351563
36.2936791992188
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 20 AND duration < 30;" | sqlite3 scan1.db
29.2628620605469
28.2720109863281
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 10 AND duration < 20;" | sqlite3 scan1.db
13.4959392089844
14.6635520019531
19.32987109375
14.2355910644531
13.9277241210937
13.3795317382812
12.9024929199219
12.3480061035156
11.711751953125
10.2423110351563
11.0780610351562
18.3046040039062
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 3 AND duration < 10;" | sqlite3 scan1.db
8.98835498046875
3.93438012695312
4.10946020507812
9.21181396484375
8.1195078125
6.78396508789062
5.28444775390625
3.59763989257813
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 1 AND duration < 3;" | sqlite3 scan1.db
2.05169384765625
1.69050805664062
1.86933813476563
2.22057397460937
1.82368383789063
2.53436987304688
1.80827685546875
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration < 1;" | sqlite3 scan1.db | wc -l
9837
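The per-range queries above can be collapsed into a single bucketed query. A sketch against an in-memory database with an assumed schema (timestamps in milliseconds, as the queries above divide by 1000 to get seconds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scan_log (start_time REAL, end_time REAL, status TEXT)")
# Timestamps in milliseconds; sample rows for illustration only.
conn.executemany("INSERT INTO scan_log VALUES (?, ?, ?)",
                 [(0, 500, "failure"), (0, 2100, "success"),
                  (0, 15000, "success"), (0, 55000, "timeout")])

# Bucket durations (seconds) into 10-second bins instead of one query per range.
rows = conn.execute("""
    SELECT CAST((end_time - start_time) / 10000 AS INTEGER) AS bucket,
           COUNT(*)
    FROM scan_log GROUP BY bucket ORDER BY bucket
""").fetchall()
for bucket, n in rows:
    print("%2d-%2ds: %d" % (bucket * 10, bucket * 10 + 10, n))
```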
teor teor2345@gmail.com writes:
And where did you scan *from*? (It's hard to interpret the results without the latency and quality of your client connection.)
If I correctly understand what David's scanner is doing, so long as "a" connection can make it to the first hop properly any other failure is "the Tor network's fault", isn't it? (I mean, unless the first-hop connection is so crappy it sometimes just times out or drops).
To me the important thing here would be to do the scans consistently from the same network-vantage point and then at least subsequent scans can be compared more consistently (right?).
For my scans (which are 3-hop) I re-try failing combinations up to 5 times before completely giving up -- but still fail to scan a bunch of relays. These tests *do* fetch real data, though, so there's a lot more opportunity for "bad things" to happen which aren't a problem of "the Tor network" necessarily.
On Tue, Mar 13, 2018 at 02:55:12AM +0000, dawuud wrote:
Out of 9900 possible two-hop Tor circuits among the top 100 Tor relays, only 935 circuit builds succeeded. This is far worse than the last report I sent, six months ago during the Montreal Tor dev meeting.
The next step here would be to try to debug your results, to understand if it's actually an issue with the Tor network (in which case, what exactly is the issue), or if it's a bug in your scripts.
Teor asked some good questions.
Other questions I'd want to investigate:
(A) Are the failures consistent, or intermittent? That is, does a failed link always fail, or only sometimes?
(B) Are you really sure that it failed? I would guess that 'failed' is different from 'timeout' because it got an explicit destroy back? If so, don't destroy cells have 'reason' components? Which reasons are happening most commonly?
(C) We should find a link that is failing between two relays that we both control, and look at each one more closely to see if there are any hints. For example, is there anything in the logs? If we turn up the logging, do we get any hints then?
(D) ...which leads to: we should run this same tool on the test network that teor and dgoulet et al run, and look for failures there. Assuming we find some, since there are no users on the test network, we can investigate much more thoroughly.
(E) I wonder if there's a correlation between the failed links and whether a TLS connection is already established on that link. That is, when there is no connection already, there are many more steps that need to be taken to extend the circuit, and those steps could lead to increased failure rates, either due to the extra time that is needed, or because part of tor's link handshake (NETINFO, etc) is going wrong.
And a last point: this tool, and these investigations, are exactly in scope for the "network health" topic that the network team has been discussing as one of the key open areas that need more attention.
--Roger
Other questions I'd want to investigate:
(A) Are the failures consistent, or intermittent? That is, does a failed link always fail, or only sometimes?
Yes this is what our new testing methodology should support. My current scanner is not sufficient. We want to improve it.
(B) Are you really sure that it failed? I would guess that 'failed' is different from 'timeout' because it got an explicit destroy back? If so, don't destroy cells have 'reason' components? Which reasons are happening most commonly?
Yes I am sure it failed. It would be cool if txtorcon can expose the 'reason' but I think that it cannot. I suppose it will show up in the tor log file if I set it to debug logging.
(C) We should find a link that is failing between two relays that we both control, and look at each one more closely to see if there are any hints. For example, is there anything in the logs? If we turn up the logging, do we get any hints then?
Sounds good. I would certainly be willing to collaborate with Teor or anyone else who might like to help with this.
(D) ...which leads to: we should run this same tool on the test network that teor and dgoulet et al run, and look for failures there. Assuming we find some, since there are no users on the test network, we can investigate much more thoroughly.
Sounds good. Let me know if there is anything I can do to help with this.
(E) I wonder if there's a correlation between the failed links and whether a TLS connection is already established on that link. That is, when there is no connection already, there are many more steps that need to be taken to extend the circuit, and those steps could lead to increased failure rates, either due to the extra time that is needed, or because part of tor's link handshake (NETINFO, etc) is going wrong.
Ah yes this is another good question for which I currently do not have an answer.
dawuud dawuud@riseup.net writes:
Yes I am sure it failed. It would be cool if txtorcon can expose the 'reason' but I think that it cannot. I suppose it will show up in the tor log file if I set it to debug logging.
txtorcon does expose both the 'reason' and the 'remote_reason' flags returned by the failure messages. In fact, it returns all flags that Tor sent during stream or circuit failures.
The **kwargs in stream_closed, circuit_closed or circuit_failed notifications should all include "REASON" and many times will also include "REMOTE_REASON" (e.g. if the "other" relay closed the connection). For convenience, txtorcon also includes lower-cased versions of all the flags.
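The flags meejah describes originate as key=value fields on the control-port CIRC event (per tor's control-spec). A sketch of pulling them out of a raw event line, independent of txtorcon (the sample event line is illustrative):

```python
def circ_event_flags(line):
    """Extract key=value fields (REASON, REMOTE_REASON, ...) from a raw
    control-port CIRC event line, per tor's control-spec."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
            fields[key.lower()] = value  # txtorcon-style lowercase aliases
    return fields

event = ("650 CIRC 7 FAILED $AAAA...,$BBBB... "
         "PURPOSE=GENERAL REASON=DESTROYED REMOTE_REASON=CHANNEL_CLOSED")
flags = circ_event_flags(event)
print(flags["REASON"], flags.get("remote_reason"))  # DESTROYED CHANNEL_CLOSED
```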
(C) We should find a link that is failing between two relays that we both control, and look at each one more closely to see if there are any hints. For example, is there anything in the logs? If we turn up the logging, do we get any hints then?
Sounds good. I would certainly be willing to collaborate with Teor or anyone else who might like to help with this.
I'm +1 here too. I'd like to better understand the failures I see in my scanner as well.
(E) I wonder if there's a correlation between the failed links and whether a TLS connection is already established on that link. That is, when there is no connection already, there are many more steps that need to be taken to extend the circuit, and those steps could lead to increased failure rates, either due to the extra time that is needed, or because part of tor's link handshake (NETINFO, etc) is going wrong.
Ah yes this is another good question for which I currently do not have an answer.
Would it be better, then, to pick one first hop and scan (sequentially) every second-hop using that first hop? (And maybe have say 5 or 10 such things going on at once?)
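Meejah's suggestion above can be sketched as grouping the pair list by first hop and dealing the first hops across a handful of sequential worker queues (hypothetical helper, not either scanner's actual code):

```python
from collections import defaultdict
from itertools import permutations

def queues_by_first_hop(relays, workers=10):
    """Group all ordered pairs by first hop, then deal the first hops
    round-robin across `workers` sequential queues, so each queue walks
    every second hop for one first hop at a time."""
    by_first = defaultdict(list)
    for first, second in permutations(relays, 2):
        by_first[first].append((first, second))
    queues = [[] for _ in range(workers)]
    for i, first in enumerate(sorted(by_first)):
        queues[i % workers].extend(by_first[first])
    return queues

queues = queues_by_first_hop(["r%02d" % i for i in range(100)], workers=10)
print(len(queues), sum(len(q) for q in queues))  # 10 9900
```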
I did another scan, this time with 3 seconds between each circuit build and the max connections set to 50, with results similar to yesterday's:
9354 failure
2 timeout
544 success
most of the circuit build failures happened in under a second:
echo "select (end_time - start_time) / 1000 as duration from scan_log where duration < 1 AND status = 'failure';" | sqlite3 scan1.db | wc -l
9344
txtorcon does expose both the 'reason' and the 'remote_reason' flags returned by the failure messages. In fact, it returns all flags that Tor sent during stream or circuit failures.
The **kwargs in stream_closed, circuit_closed or circuit_failed notifications should all include "REASON" and many times will also include "REMOTE_REASON" (e.g. if the "other" relay closed the connection). For convenience, txtorcon also includes lower-cased versions of all the flags.
ah ok! I will take a look at this. I'd like to do another scan while collecting this additional information.
Would it be better, then, to pick one first hop and scan (sequentially) every second-hop using that first hop? (And maybe have say 5 or 10 such things going on at once?)
Maybe it's ok to make 7,000+ tor circuits sequentially from the same relay if it's done very slowly?
Greetings,
( Meejah and I made txtorcon report the reason for circuit build failures here: https://github.com/meejah/txtorcon/pull/299 My scanner now uses this txtorcon feature: https://github.com/david415/tor_partition_scanner )
I used a collector consensus file: 2018-04-27-19-00-00-consensus
wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2018-0...
and extracted the 100 relays with the highest consensus weights that have both the Stable and Fast flags.
./helpers/query_fingerprints_from_consensus_file.py 2018-04-27-19-00-00-consensus > top100.relays
and then performed the scan, building 9900 2-hop tor circuits:
detect_partitions.py --tor-control unix:/var/run/tor/control --log-dir ./ --status-log ./status_log \
  --relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
  --build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100
This resulted in only 307 circuit build failures:
echo "select reason from scan_log where status = 'failure';" | sqlite3 scan1.db | wc -l
307
And for 301 of these failures, the circuit build failure REASON reported by little-t tor was TIMEOUT:
echo "select reason from scan_log where status = 'failure';" | sqlite3 scan1.db | grep -i timeout | wc -l
301
Here's the non-timeout REASONs for these circuit build failures:
echo "select reason from scan_log where status = 'failure';" | sqlite3 scan1.db | grep -vi timeout
DESTROYED, FINISHED
DESTROYED, FINISHED
DESTROYED, CHANNEL_CLOSED
DESTROYED, CHANNEL_CLOSED
DESTROYED, CHANNEL_CLOSED
DESTROYED, CHANNEL_CLOSED
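With a larger scan, these reason strings are easier to tally than to eyeball; a trivial sketch using the six values above:

```python
from collections import Counter

# The six non-timeout reasons reported above.
reasons = [
    "DESTROYED, FINISHED", "DESTROYED, FINISHED",
    "DESTROYED, CHANNEL_CLOSED", "DESTROYED, CHANNEL_CLOSED",
    "DESTROYED, CHANNEL_CLOSED", "DESTROYED, CHANNEL_CLOSED",
]
for reason, n in Counter(reasons).most_common():
    print(n, reason)
```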
I'm curious to try this scan at different times of day to see if results vary.
Cheers,
David
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
I think that many of my previous scans were not useful and showed inaccurate results because the IP address I was scanning from might have gotten blacklisted by dir-auths? Or perhaps blocked by many relays by the anti-denial-of-service mechanisms in tor? I got rid of that virtual server and lost use of its IP address... so we'll never know.
Katharina and I are interested in doing lots more thorough scans of the Tor network rather than this limited methodology I've been using.
What are the guidelines to avoid getting blocked by the tor network? Is it possible to check the consensus to see if a client IP has been blocked?
On Fri, Apr 27, 2018 at 09:12:59PM +0000, dawuud wrote:
I think that many of my previous scans were not useful and showed inaccurate
I'm glad it turned out that these previous results might have been inaccurate (because the results would have been scary if accurate)
results because the IP address I was scanning from might have gotten blacklisted by dir-auths?
I don't see how dir auths could blacklist specific client IP addresses (tor clients use fallbackdirs)
or perhaps blocked by many relays by the anti-denial-of-service mechanisms in tor?
Can you let me know the start and end date of the scan (2018-03-12?) so I can check how many of the relays you scanned (the top 100 relays by cw? at the time) had a tor version with anti-DDoS features at the time?
During your first scans (2017) there were no anti-DoS features.
I got rid of that virtual server and lost use of its IP address... so we'll never know.
Katharina and I are interested in doing lots more thorough scans of the Tor network rather than this limited methodology i've been using.
I'm excited to hear that.
What are the guidelines to avoid getting blocked by the tor network?
stay under the public thresholds? https://www.torproject.org/docs/tor-manual-dev.html.en#_denial_of_service_mi...
Is it possible to check the consensus to see if a client IP has been blocked?
the consensus holds information about relays not about tor client IP addresses, but I assume you know that and I misunderstood your question?
can you let me know the start and end date of the scan (2018-03-12?) so I can check how many of the relays you scanned (the top 100 relays by cw? at the time)
that scan only took an hour or so to perform and I posted the e-mail minutes after the scan, so you can refer to the date in the e-mail header ;-)
During your first scans (2017) there were no anti-DoS features.
ah yeah, that's true, and I think we'll see lots of partitions in the Tor network if we continue to scan. Although my latest results show that at least the top 100 Tor relays are OK, we might find that relays with a lower consensus measurement value are getting more traffic than they can handle, which in turn causes those relays to drop new circuit builds. Just a theory. The new scan was done from a server in the US... so, I mean, we'll see what happens when we perform scans from different locations repeatedly at different times of day.
What are the guidelines to avoid getting blocked by the tor network?
stay under the public thresholds? https://www.torproject.org/docs/tor-manual-dev.html.en#_denial_of_service_mi...
ah thanks!
Is it possible to check the consensus to see if a client IP has been blocked?
the consensus holds information about relays not about tor client IP addresses, but I assume you know that and I misunderstood your question?
hmm, I was thinking that there could be a limited blacklist of client IPs, but I guess there isn't one. Never mind then.
On 3 May 2018, at 06:41, nusenu nusenu-lists@riseup.net wrote:
What are the guidelines to avoid getting blocked by the tor network?
stay under the public thresholds? https://www.torproject.org/docs/tor-manual-dev.html.en#_denial_of_service_mi...
Those are the defaults.
You'll need to stay under the current thresholds in the consensus: https://consensus-health.torproject.org/#consensusparams
T
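The current thresholds teor points at live on the consensus "params" line (per dir-spec), a space-separated list of key=value entries. A sketch of filtering out the DoS-related ones; the parameter names and values in the sample line are illustrative, not current network values:

```python
def dos_params(params_line):
    """Pull DoS* entries out of a consensus "params" line (see dir-spec's
    section on network parameters)."""
    entries = params_line.split()[1:]  # drop the leading "params" keyword
    out = {}
    for entry in entries:
        key, _, value = entry.partition("=")
        if key.startswith("DoS"):
            out[key] = int(value)
    return out

# Sample line with assumed parameter names/values, for illustration only.
line = ("params CircuitPriorityHalflifeMsec=30000 "
        "DoSCircuitCreationEnabled=1 DoSCircuitCreationRate=3 "
        "DoSConnectionMaxConcurrentCount=50")
print(dos_params(line))
```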