Good day.
Is there any chance that torpy (https://github.com/torpyorg/torpy) triggered this issue: https://gitlab.torproject.org/tpo/core/tor/-/issues/33018 ?
Some worrying facts:
- Torpy uses the old-style full consensus (not microdescriptors).
- When the consensus is not present in the cache (first-time usage), it downloads the consensus from random directory authorities only.
- Before August 2020 it used plain HTTP requests to the DirAuths. Now it creates "CREATE_FAST" circuits to the DirAuths (is that the right way, by the way?)
On the other side:
- Torpy stores the consensus on disk (so when the client restarts it does not have to download the full consensus again).
- It will only try to download a new consensus after the time set by the valid_time field from the consensus, which is more than an hour away (so it's not that often).
- Torpy tries to get the consensus via the "diff" feature (to minimize traffic).
Still, maybe some of these features are not working well under some conditions, which could have caused a lot of consensus downloads in Jan 2020... Or maybe you know more about this situation?
Do you have any recommendations for Tor client implementations? Can you explain in a few paragraphs what the behavior of the original Tor client is? As far as I understand, the first time the original tor starts it tries to download the consensus from the fallback dirs, not from the DAs? Is that the key point?
There is one more issue, https://gitlab.torproject.org/tpo/core/tor/-/issues/40239, which I don't fully understand. Let's imagine it's the first run of a Tor client and that time coincides with DA voting. Does that mean the client will not be able to download the consensus? That seems like a strange decision. Or do you mean clients must download the consensus from fallback dirs, which are never in the "voting" process?
On 11. Jan 2021, at 23:20, James jbrown299@yandex.com wrote:
Hi there,
thanks for the message. I think it is very likely that torpy is responsible for at least part of the increased load we're seeing on the dirauths. I have taken a (very!) quick look at the source, and it appears that there are some problems. Please excuse any inaccuracies; I am not that strong in Python, nor have I done much Tor development recently:
First, I found this string in the code: "Hardcoded into each Tor client is the information about 10 beefy Tor nodes run by trusted volunteers". The word beefy is definitely wrong here. The nodes are not particularly powerful, which is why we have the fallback dir design for bootstrapping.
The code counts Serge as a directory authority which signs the consensus, and checks that over half of the dirauths signed it. But Serge is only the bridge authority and never signs the consensus, so torpy will reject some consensuses that are indeed valid. Once this happens, torpy goes into a deathly loop of "consensus invalid, trying again". There are no timeouts, backoffs, or failures noted.
The code frequently throws exceptions, but when an exception occurs it just continues doing what it was doing before. It has absolutely no regard for constraining its resource usage when using the Tor network.
The logic that is supposed to reuse an already-downloaded network_status document rather than downloading a new one does not work. I have a network_status document, but the dirauths are contacted anyway. Perhaps descriptors are not cached to disk and are downloaded on every new start of the application?
New consensuses never seem to be downloaded from guards, only from dirauths.
If my analysis above is at least mostly correct, then if even a few people are running a scraper using torpy and call the binary in a loop, they will quickly overload the dirauths, causing exactly the trouble we're seeing. The effects compound, because torpy is relentless in trying again. In particular, a scraper that calls torpy in a loop would just see that a single file failed to download and move on to the next, once again creating load on all the dirauths.
There are probably more suboptimal things that I missed here. Generally, I think torpy needs to implement the following quickly if it wants to stop hurting the network. This is in order of priority, but I think _ALL_ of these (maybe more) are needed before torpy stops being an abuser of the network:
- Stop automatically retrying on failure, without backoff.
- Cache failures to disk to ensure a newly started torpy_cli does not request the same resources again that the previous instance failed to get.
- Fix consensus validation logic to work the same way as the tor CLI (maybe as easy as removing Serge).
- Use microdescs/consensus, cache descriptors.
I wonder if we can actively defend against network abuse like this in a sensible way. Perhaps you have some ideas, too? I think torpy has the ability to also quickly overwhelm fallback dirs in its current implementation, so simply switching to them from dirauths is not a solution here. Defenses are probably necessary to implement even if torpy can be fixed very quickly, because the older versions of torpy are out there and I assume will continue to be used. Hopefully that point is wrong?
Thanks Sebastian
Sebastian, thank you for the comments.
First of all, sorry if torpy hurt the Tor network in some way. It was unintentional.
In any case, it seems to me that a high-level description of the official Tor client's logic would be very useful.
> First, I found this string in the code: "Hardcoded into each Tor client is the information about 10 beefy Tor nodes run by trusted volunteers". The word beefy is definitely wrong here. The nodes are not particularly powerful, which is why we have the fallback dir design for bootstrapping.
At first glance, it seemed that the AuthDirs were the most trusted and reliable place for obtaining the consensus. Now I understand more.
> The code counts Serge as a directory authority which signs the consensus, and checks that over half of the dirauths signed it. But Serge is only the bridge authority and never signs the consensus, so torpy will reject some consensuses that are indeed valid.
Yep, you're right here. Thanks for pointing that out.
> Once this happens, torpy goes into a deathly loop of "consensus invalid, trying again". There are no timeouts, backoffs, or failures noted.
Not really: torpy only retries 3 times to get the consensus. But you are probably right that user code can retry by calling torpy in a loop, so it will keep trying to download the network_status... If you have some statistics about the increased traffic, we could compare them with the times when the consensus was signed by 4 signers, which is enough for tor but not enough for torpy.
> The code frequently throws exceptions, but when an exception occurs it just continues doing what it was doing before. It has absolutely no regard for constraining its resource usage when using the Tor network.
What kind of constraints would you advise?
> The logic that is supposed to reuse an already-downloaded network_status document rather than downloading a new one does not work.
It works, but probably not in an optimal way. It caches the network_status only.
> I have a network_status document, but the dirauths are contacted anyway. Perhaps descriptors are not cached to disk and are downloaded on every new start of the application?
Exactly. The descriptors, and the network_status diff every hour, were always requested from the AuthDirs.
> New consensuses never seem to be downloaded from guards, only from dirauths.
Thanks for pointing that out. I looked more deeply into the tor client sources. Basically, if we have a network_status we can use guard nodes to request the network_status and descriptors from them; otherwise we use fallback dirs to download the network_status. I've implemented such logic in the last commit.
> There are probably more suboptimal things that I missed here.
If you find more, please let me know. It's really helpful.
> Generally, I think torpy needs to implement the following quickly if it wants to stop hurting the network. This is in order of priority, but I think _ALL_ of these (maybe more) are needed before torpy stops being an abuser of the network:
> - Stop automatically retrying on failure, without backoff.
I've added delays and backoff between retries.
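For illustration, a minimal version of such a retry helper might look like this (the function name, attempt limit, and delay values are made up for the sketch, not necessarily torpy's):

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=3, base_delay=3.0):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch is any zero-argument callable that raises on failure (e.g. a
    consensus download). Gives up after max_attempts instead of looping
    forever.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # surface the failure instead of spinning forever
            # Exponential backoff with jitter, so many clients started at
            # the same moment do not retry in lockstep: ~3s, ~6s, ~12s...
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```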
> - Cache failures to disk to ensure a newly started torpy_cli does not request the same resources again that the previous instance failed to get.
That will be on the list. But even if there is a loop at the level above, without this feature but with backoff, the delays will look like: 3 sec, 5, 7, 9; then 3, 5, 7, 9 again. Seems OK?
> - Fix consensus validation logic to work the same way as the tor CLI (maybe as easy as removing Serge).
Done. Only auth dirs with the V3_DIRINFO flag will be counted. It wasn't obvious =(
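A sketch of what the corrected check might look like, assuming each hardcoded authority entry carries a set of flags (the names here are illustrative): only V3_DIRINFO authorities count toward the majority, so the bridge authority no longer inflates the denominator.

```python
def consensus_has_enough_signatures(signer_names, authorities):
    """Return True if more than half of the consensus-signing authorities
    signed. Entries without the V3_DIRINFO flag (e.g. Serge, the bridge
    authority) are excluded from the count, matching tor's behavior.

    signer_names: set of authority names whose signatures verified.
    authorities:  iterable of objects with .name and .flags attributes.
    """
    v3_names = {a.name for a in authorities if 'V3_DIRINFO' in a.flags}
    return len(signer_names & v3_names) > len(v3_names) // 2
```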
> - Use microdescs/consensus, cache descriptors.
On the list.
Moreover, I've switched to using fallback dirs instead of auth dirs, and to guards if torpy has a "reasonable" live consensus.
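Roughly, the selection order described in this thread could be expressed like this (the method names are invented for the sketch):

```python
def pick_directory_sources(state):
    """Choose where to fetch directory documents, cheapest for the
    network first:
      1. directory guards, once we have a live consensus and guards;
      2. hardcoded fallback directories for bootstrapping;
      3. directory authorities only as a last resort.
    """
    if state.has_live_consensus() and state.guards:
        return state.guards
    if state.fallback_dirs:
        return state.fallback_dirs
    return state.dir_authorities  # scarce and rate-limited; avoid if possible
```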
> Defenses are probably necessary to implement even if torpy can be fixed very quickly, because the older versions of torpy are out there and I assume will continue to be used. Hopefully that point is wrong?
I believe that old versions don't work any more because they cannot connect to the auth dirs. Users get 503 many times, so they will update the client. I hope.
Thank you very much. And sorry again.
Hi James,
thanks for already working on patches for these issues! I will reply inline some more.
On 15. Jan 2021, at 23:56, James jbrown299@yandex.com wrote:
> First of all, sorry if torpy hurt the Tor network in some way. It was unintentional.
I believe you :)
> In any case, it seems to me that a high-level description of the official Tor client's logic would be very useful.
Indeed. The more people work on alternative clients etc, the more we can learn here. Perhaps you can help point out places where documentation could help or something was not easy to understand.
>> First, I found this string in the code: "Hardcoded into each Tor client is the information about 10 beefy Tor nodes run by trusted volunteers". The word beefy is definitely wrong here. The nodes are not particularly powerful, which is why we have the fallback dir design for bootstrapping.
> At first glance, it seemed that the AuthDirs were the most trusted and reliable place for obtaining the consensus. Now I understand more.
The consensus is signed, so all the places to get it from are equally trusted. That's the beauty of the consensus system :) The dirauths are just trusted to create it; it doesn't matter who spreads it.
>> Once this happens, torpy goes into a deathly loop of "consensus invalid, trying again". There are no timeouts, backoffs, or failures noted.
> Not really: torpy only retries 3 times to get the consensus. But you are probably right that user code can retry by calling torpy in a loop, so it will keep trying to download the network_status... If you have some statistics about the increased traffic, we could compare them with the times when the consensus was signed by 4 signers, which is enough for tor but not enough for torpy.
Interesting, I ran torpy and on the console it seemed to try more often. Perhaps it made some progress and then failed on a different thing, which it then retried.
To your second point, something like this can probably be done using https://metrics.torproject.org. But I am not doing the analysis here at the moment for personal reasons, sorry. Maybe someone else wants to look at it.
>> The code frequently throws exceptions, but when an exception occurs it just continues doing what it was doing before. It has absolutely no regard for constraining its resource usage when using the Tor network.
> What kind of constraints would you advise?
I think instead of throwing an exception and continuing, you should give clear error messages and consider whether you need to stop execution. For example, if you downloaded a consensus and it is invalid, you're likely not going to get a valid one by trying again immediately. Instead, it would be better to declare who gave you the invalid one and log a sensible error.
In addition, properly using already downloaded directory information would be a much more considerate use of resources.
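For instance, a sketch of that advice — name the source of the invalid document and re-raise instead of silently retrying (the helper and its names are illustrative, not an existing API):

```python
import logging

log = logging.getLogger('dirclient')


def handle_consensus(raw_consensus, source, validate):
    """Validate a downloaded consensus. On failure, report which
    directory served it and re-raise, rather than quietly retrying:
    an invalid document will not become valid by re-downloading it
    immediately.
    """
    try:
        return validate(raw_consensus)
    except ValueError as exc:
        log.error('Invalid consensus from %s: %s', source, exc)
        raise
```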
>> The logic that is supposed to reuse an already-downloaded network_status document rather than downloading a new one does not work.
> It works, but probably not in an optimal way. It caches the network_status only.
I may have confused it with asking for the diff. But that should not be necessary at all if you already have the latest one, so don't ask for a diff in this case.
>> I have a network_status document, but the dirauths are contacted anyway. Perhaps descriptors are not cached to disk and are downloaded on every new start of the application?
> Exactly. The descriptors, and the network_status diff every hour, were always requested from the AuthDirs.
Please cache descriptors.
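A minimal on-disk cache keyed by descriptor digest could be as simple as this (the class name and layout are made up for the sketch):

```python
import os


class DescriptorCache:
    """Store router descriptors on disk keyed by hex digest, so a
    restarted client re-reads them instead of re-downloading."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, digest_hex):
        return os.path.join(self.directory, digest_hex)

    def get(self, digest_hex):
        """Return the cached descriptor bytes, or None on a cache miss."""
        try:
            with open(self._path(digest_hex), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None

    def put(self, digest_hex, raw_descriptor):
        with open(self._path(digest_hex), 'wb') as f:
            f.write(raw_descriptor)
```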
>> New consensuses never seem to be downloaded from guards, only from dirauths.
> Thanks for pointing that out. I looked more deeply into the tor client sources. Basically, if we have a network_status we can use guard nodes to request the network_status and descriptors from them; otherwise we use fallback dirs to download the network_status. I've implemented such logic in the last commit.
Cool!
>> - Stop automatically retrying on failure, without backoff.
> I've added delays and backoff between retries.
>> - Cache failures to disk to ensure a newly started torpy_cli does not request the same resources again that the previous instance failed to get.
> That will be on the list. But even if there is a loop at the level above, without this feature but with backoff, the delays will look like: 3 sec, 5, 7, 9; then 3, 5, 7, 9 again. Seems OK?
Well, the problem is that if I run torpy_cli 100 times in parallel, we will still send many requests per second. From dirauth access patterns, we can see that some people indeed have such access patterns. So I think the backoff is a great start (the tor client uses exponential backoff, I think) but it definitely is not enough. If you couldn't get something this hour and you tried a few times, you need to stop trying again for this hour.
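One way to make "stop trying for this hour" survive process restarts is to persist a small failure budget keyed to the current hour (the file name and limit below are arbitrary choices for the sketch):

```python
import json
import time

FAILURE_FILE = 'dir_failures.json'   # any writable path would do
MAX_FAILURES_PER_HOUR = 3


def _state():
    """Load this hour's failure count; stale entries reset to zero."""
    hour = int(time.time() // 3600)
    try:
        with open(FAILURE_FILE) as f:
            data = json.load(f)
    except (FileNotFoundError, ValueError):
        data = {}
    if data.get('hour') != hour:
        data = {'hour': hour, 'count': 0}
    return data


def may_fetch():
    """False once this hour's budget is spent, even in a new process."""
    return _state()['count'] < MAX_FAILURES_PER_HOUR


def record_failure():
    data = _state()
    data['count'] += 1
    with open(FAILURE_FILE, 'w') as f:
        json.dump(data, f)
```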
>> Defenses are probably necessary to implement even if torpy can be fixed very quickly, because the older versions of torpy are out there and I assume will continue to be used. Hopefully that point is wrong?
> I believe that old versions don't work any more because they cannot connect to the auth dirs. Users get 503 many times, so they will update the client. I hope.
Would be nice. We'll see!
Thanks Sebastian
On Sat, Jan 16, 2021 at 01:56:02AM +0300, James wrote:
> In any case, it seems to me that a high-level description of the official Tor client's logic would be very useful.
Hi James! Thanks for starting this discussion.
While I was looking at moria1's directory activity during the overload, I did say to myself "wow that's a lot of microdescriptor downloads".
So hearing that torpy isn't caching microdescriptors yet makes me think that it's a good bet for explaining our overload last weekend.
I agree that we should have clearer docs for "how to be nice to the Tor network." We actually have an open ticket for that goal but nobody has worked on it in a while: https://gitlab.torproject.org/tpo/core/tor/-/issues/7106
Quoting from that ticket:
"""Second, it's easy to make client-side decisions that harm the Tor network. For examples, you can hold your TLS connections open too long, or do too many TLS connections, or make circuits too often, or ask the directory authorities for everything. We need to write up a spec to clarify how well-behaving Tor clients should do things. Maybe that means we write up some principles along the way, or maybe we just identify every design point that matters and say what to do for each of them."""
And in fact, since Nick has been working a lot on Arti lately: https://gitlab.torproject.org/tpo/core/arti/ it might be a perfect time for him to help document the current Tor behavior and the current Arti behavior, and we can think about where there is room for improvement.
> If you have some statistics about the increased traffic, we could compare them
Here's the most interesting graph so far: https://metrics.torproject.org/dirbytes.html
So from that graph, the number of bytes handled by the directory authorities doesn't go up a lot, because they were already rate limited (instead, they just failed more often).
But the number of bytes handled by directory mirrors (including fallbackdirs) shot up a huge amount. For context, if we imagine that the normal Tor network handles between 2M and 8M daily users, then that added dir mirror load would imply an extra 4M to 16M daily users if they follow Tor's directory update habits. I'm guessing that the torpy users weren't following Tor's directory update habits, and so a much smaller set of users accounted for a much larger fraction of the load.
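To make the proportionality explicit (all numbers are the rough ranges above, and the 2x multiple is inferred from them, not measured):

```python
# If 2M-8M well-behaved daily users generate the baseline dir-mirror
# traffic, then extra traffic of roughly twice the baseline corresponds
# to an implied 4M-16M extra "users" following the same update habits.
baseline_users = (2_000_000, 8_000_000)
extra_load_multiple = 2.0  # assumed from the 4M-16M figure above
implied_extra = tuple(int(u * extra_load_multiple) for u in baseline_users)
print(implied_extra)  # (4000000, 16000000)
```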
>> The logic that is supposed to reuse an already-downloaded network_status document rather than downloading a new one does not work.
> It works, but probably not in an optimal way. It caches the network_status only.
Here's my first start at three principles we should all follow when writing Tor clients:
(1) Reduce redundant interactions. For examples:
- Cache as much as possible of the directory information you fetch (consensus documents, microdescriptors, certs)
- If a directory fetch failed, don't just relaunch a duplicate request right after (because it will probably fail too).
- If your setup involves running multiple Tors locally, consider using a shared directory cache, so only one of them needs to fetch new directory info and then all of them can use it.
(2) Reduce impact of interactions. For examples:
- Always use the "If-Modified-Since" header on consensus updates, so they don't send you a consensus that you already have (see the request sketch after this list).
- Try to use the consensus diff system, so if you have an existing consensus you aren't fetching an entire new consensus.
- Ask for compression, to save overall bandwidth in the network.
- Move load off of directory authorities, and then off of fallback directories, as soon as possible. That is, if you have a list of fallbackdirs, ask them instead of directory authorities. And once you have a consensus and you've chosen your directory guards, ask them instead of the fallbackdirs.
(3) Plan ahead for what your current code will do in a few years when the world is different.
- To start here, check out the "slow zombies and fast zombies" discussion in Proposal 266: https://gitweb.torproject.org/torspec.git/tree/proposals/266-removing-curren...
- Specifically, think about how your code handles failures, and design your interactions with the Tor network so that if many people are running your code in the future, and it's failing for example because it is asking directory questions in an old format or because the directory servers have started rate limiting differently, it will back off rather than become more aggressive.
- When possible, look for ways to recognize when your code is asking old questions, so it can warn the user and stop interacting with the network.
...What else should be on the list?
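As a concrete illustration of principle (2), a single consensus request can combine all three savings. The header names below come from dir-spec.txt; the URL path and helper are simplified, and a real client would send this over a begindir circuit rather than plain HTTP:

```python
import urllib.request


def build_consensus_request(host, have_consensus_sha3_hex, have_date_http):
    """Ask for a fresh microdesc consensus as cheaply as possible:
    If-Modified-Since lets the server answer 304 if nothing changed,
    X-Or-Diff-From-Consensus asks for a small diff against the consensus
    we already hold, and Accept-Encoding asks for compressed transfer.
    """
    url = 'http://%s/tor/status-vote/current/consensus-microdesc' % host
    req = urllib.request.Request(url)
    req.add_header('If-Modified-Since', have_date_http)
    req.add_header('X-Or-Diff-From-Consensus', have_consensus_sha3_hex)
    req.add_header('Accept-Encoding', 'deflate, gzip')
    return req
```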
Thanks! --Roger
On 18. Jan 2021, at 18:00, Roger Dingledine arma@torproject.org wrote:
> While I was looking at moria1's directory activity during the overload, I did say to myself "wow that's a lot of microdescriptor downloads".
> So hearing that torpy isn't caching microdescriptors yet makes me think that it's a good bet for explaining our overload last weekend.
The fact that torpy doesn't use microdescriptors makes me think there's at least some other party involved here. Hopefully they can also improve their software, but it makes me wonder what that software is :/
Cheers Sebastian
I have an unrelated question... where could I go with similar minds so that I may ask, or would it be appropriate and acceptable to do that here? Thanks.