Hello,
One of my relays (guard, not exit) started to report being overloaded about one week ago, for the first time in its life.
The consensus weight and advertised bandwidth are what they should be, considering the relay's configuration. More than that, they have not changed in years. So, I started to look at it more closely.
Apparently the overload is triggered every 5-6 days by flooding the relay with circuit creation requests. All I can see in tor.log is:
[warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [68382 similar message(s) suppressed in last 482700 seconds]
[warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [7882 similar message(s) suppressed in last 60 seconds]
This message is logged around 4-6 times, with a 1 minute (60 second) gap between consecutive warn entries.
After that, the relay is back to normal, so it feels like it is being probed or something similar. CPU usage is at 65%, RAM is under 45%, and the SSD and bandwidth show no problems.
Metrics port says:
tor_relay_load_tcp_exhaustion_total 0
tor_relay_load_onionskins_total{type="tap",action="processed"} 52073
tor_relay_load_onionskins_total{type="tap",action="dropped"} 0
tor_relay_load_onionskins_total{type="fast",action="processed"} 0
tor_relay_load_onionskins_total{type="fast",action="dropped"} 0
tor_relay_load_onionskins_total{type="ntor",action="processed"} 8069522
tor_relay_load_onionskins_total{type="ntor",action="dropped"} 273275
So if we weigh the dropped ntor requests against the processed ntor requests, we end up with a reasonable percentage (more than 8 million processed vs. fewer than 300k dropped).
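For anyone who wants to reproduce this check on their own relay, here is a minimal sketch. The MetricsPort address is just an example (adjust it to your torrc), and the threshold constant is only the 15-20% figure I float further down, not anything tor itself uses:

#!/usr/bin/env python3
# Sketch: compute the ntor dropped/processed ratio from tor's MetricsPort.
# The URL and the threshold are assumptions for illustration only.
import re
import urllib.request

METRICS_URL = "http://127.0.0.1:9035/metrics"  # example MetricsPort address
THRESHOLD = 0.20                               # proposed value, not a tor default

def onionskin_counts(text, key_type):
    """Return (processed, dropped) counters for the given onionskin type."""
    counts = {}
    for action in ("processed", "dropped"):
        pattern = (r'tor_relay_load_onionskins_total\{type="%s",action="%s"\} (\d+)'
                   % (key_type, action))
        match = re.search(pattern, text)
        counts[action] = int(match.group(1)) if match else 0
    return counts["processed"], counts["dropped"]

metrics = urllib.request.urlopen(METRICS_URL).read().decode()
processed, dropped = onionskin_counts(metrics, "ntor")
ratio = dropped / processed if processed else 0.0
print("ntor: processed=%d dropped=%d ratio=%.2f%%" % (processed, dropped, ratio * 100))
if ratio > THRESHOLD:
    print("dropped/processed ratio is above the proposed threshold")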
So the question here is: does the computed consensus weight of a relay change if that relay keeps reporting to the directory authorities that it is overloaded? If yes, could this be triggered by an attacker in order to arbitrarily decrease a relay's consensus weight even when it is not really overloaded (and thereby perhaps increase the consensus weights of other malicious relays that we don't know about)?
Also, as a side note, I think that if the dropped/processed ratio is not over 15% or 20% a relay should not consider itself overloaded. Would this be a good idea?
Sending this to tor-relays@ for now; if any of you think it is worth wider discussion, we can open a thread about it on tor-dev@ - please let me know if I should do that.
Replying to myself:
s7r wrote: [SNIP]
Metrics port says:
tor_relay_load_tcp_exhaustion_total 0
tor_relay_load_onionskins_total{type="tap",action="processed"} 52073
tor_relay_load_onionskins_total{type="tap",action="dropped"} 0
tor_relay_load_onionskins_total{type="fast",action="processed"} 0
tor_relay_load_onionskins_total{type="fast",action="dropped"} 0
tor_relay_load_onionskins_total{type="ntor",action="processed"} 8069522
tor_relay_load_onionskins_total{type="ntor",action="dropped"} 273275
So if we weigh the dropped ntor requests against the processed ntor requests, we end up with a reasonable percentage (more than 8 million processed vs. fewer than 300k dropped).
So the question here is: does the computed consensus weight of a relay change if that relay keeps reporting to the directory authorities that it is overloaded? If yes, could this be triggered by an attacker in order to arbitrarily decrease a relay's consensus weight even when it is not really overloaded (and thereby perhaps increase the consensus weights of other malicious relays that we don't know about)?
Also, as a side note, I think that if the dropped/processed ratio is not over 15% or 20% a relay should not consider itself overloaded. Would this be a good idea?
Sending this to tor-relays@ for now; if any of you think it is worth wider discussion, we can open a thread about it on tor-dev@ - please let me know if I should do that.
I am now positive that this particular relay is actively being probed: it is overloaded for just a few minutes every 2-4 days and performs just fine the rest of the time, with under 70% CPU usage and under 50% usage for RAM, SSD and bandwidth.
I also confirm that after this latest overload report, my consensus weight and advertised bandwidth decreased. So my concern stands: triggering this arbitrarily has a network-wide effect on path selection probability and might well suit someone's purpose.
I don't know what the gain is here or who is triggering this, nor whether other Guard relays are experiencing the same (maybe we can analyze onionoo datasets and find out), but until then I am switching to OverloadStatistics 0.
Here are today's Metrics Port results:
tor_relay_load_tcp_exhaustion_total 0
tor_relay_load_onionskins_total{type="tap",action="processed"} 62857
tor_relay_load_onionskins_total{type="tap",action="dropped"} 0
tor_relay_load_onionskins_total{type="fast",action="processed"} 0
tor_relay_load_onionskins_total{type="fast",action="dropped"} 0
tor_relay_load_onionskins_total{type="ntor",action="processed"} 10923543
tor_relay_load_onionskins_total{type="ntor",action="dropped"} 819524
As you can see, like in the first message of this thread, the calculated percentage of dropped vs. processed ntor requests is not a concern (over 10 million processed, under 900,000 dropped).
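(Working that out explicitly: 819524 / 10923543 ≈ 0.075, so roughly 7.5% of ntor requests were dropped relative to those processed since startup, still well below the 15-20% threshold I suggested earlier.)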
Other relevant log messages that support my suspicion: the following appeared while the relay was being hammered intentionally. As you can see, this overload lasted only 7 minutes; the previous one lasted 5 minutes and the one before that 6 minutes.
I think the attacker is saving resources, since it gets the same result by overloading the relay for 5 minutes as it would by overloading it 24x7.
Jan 03 07:14:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [2004 similar message(s) suppressed in last 213900 seconds]
Jan 03 07:15:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [52050 similar message(s) suppressed in last 60 seconds]
Jan 03 07:16:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [92831 similar message(s) suppressed in last 60 seconds]
Jan 03 07:17:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [89226 similar message(s) suppressed in last 60 seconds]
Jan 03 07:18:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [74832 similar message(s) suppressed in last 60 seconds]
Jan 03 07:19:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [79933 similar message(s) suppressed in last 60 seconds]
Jan 03 07:20:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [68678 similar message(s) suppressed in last 60 seconds]
Jan 03 07:21:42.000 [warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [76461 similar message(s) suppressed in last 60 seconds]
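For anyone who wants to quantify these bursts from their own logs, a small sketch like this (the log path is only an example; adjust it to your setup) sums the suppressed-warning counters for the 60-second windows of a burst:

#!/usr/bin/env python3
# Sketch: estimate how many circuit-creation warnings a burst generated by
# summing the "similar message(s) suppressed" counters in tor.log.
import re

LOG_PATH = "/var/log/tor/tor.log"   # example path, adjust as needed

warn_re = re.compile(
    r"\[warn\] Your computer is too slow to handle this many circuit creation "
    r"requests!.*\[(\d+) similar message\(s\) suppressed in last (\d+) seconds\]")

total = 0
burst_minutes = 0
with open(LOG_PATH) as log:
    for line in log:
        match = warn_re.search(line)
        if not match:
            continue
        suppressed, window = int(match.group(1)), int(match.group(2))
        # Only count the 60-second windows, i.e. consecutive minutes of a burst;
        # the first warn after a quiet period covers a much longer window.
        if window == 60:
            total += suppressed
            burst_minutes += 1

print("%d burst minute(s), ~%d suppressed warnings in total" % (burst_minutes, total))

Applied to the excerpt above, it reports 7 burst minutes and roughly half a million suppressed warnings.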
Other stats from log file:
14154 circuits open ; I've received 358682 connections on IPv4 and 14198 on IPv6. I've made 185294 connections with IPv4 and 51900 with IPv6.
[notice] Heartbeat: DoS mitigation since startup: 1 circuits killed with too many cells, 27881 circuits rejected, 2 marked addresses, 0 same address concurrent connections rejected, 0 connections rejected, 0 single hop clients refused, 0 INTRODUCE2 rejected.
[notice] Since our last heartbeat, 2878 circuits were closed because of unrecognized cells while we were the last hop. On average, each one was alive for 653.767547 seconds, and had 1.000000 unrecognized cells.
I have only started seeing this last message recently, but I see it quite heavily (at every heartbeat); do any of you see it too?
My gut feeling, which has never let me down, tells me there is something here worth looking into. I want to analyze onionoo datasets to see whether the percentage of Guards reporting overload increased in the last month, and to open an issue on GitLab to patch Tor so that it only reports overload when the dropped/processed ratio is over 20%.
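As a starting point for the onionoo part, something along these lines should give a current snapshot. This only covers the present moment (a month-over-month comparison would need archived descriptors), and I am assuming the details documents expose the overload flag as overload_general_timestamp - please check the Onionoo protocol spec for the exact field name:

#!/usr/bin/env python3
# Sketch: count how many running Guard relays currently report overload-general
# via Onionoo. The overload field name is an assumption; verify it in the spec.
import json
import urllib.request

URL = "https://onionoo.torproject.org/details?flag=Guard&running=true"

with urllib.request.urlopen(URL) as response:
    data = json.load(response)

relays = data.get("relays", [])
overloaded = [r for r in relays if r.get("overload_general_timestamp")]
print("%d of %d running Guard relays report overload-general"
      % (len(overloaded), len(relays)))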
On 01 Jan (21:12:38), s7r wrote:
Hello,
Hi s7r!
Sorry for the delay, some vacationing happened for most of us eheh :).
One of my relays (guard, not exit) started to report being overloaded about one week ago, for the first time in its life.
The consensus weight and advertised bandwidth are what they should be, considering the relay's configuration. More than that, they have not changed in years. So, I started to look at it more closely.
Apparently the overload is triggered every 5-6 days by flooding the relay with circuit creation requests. All I can see in tor.log is:
[warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [68382 similar message(s) suppressed in last 482700 seconds]
[warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [7882 similar message(s) suppressed in last 60 seconds]
This message is logged around 4-6 times, with a 1 minute (60 second) gap between consecutive warn entries.
After that, the relay is back to normal, so it feels like it is being probed or something similar. CPU usage is at 65%, RAM is under 45%, and the SSD and bandwidth show no problems.
Very plausible theory, especially given such a "burst" of traffic; we can rule out that your relay has all of a sudden become the facebook.onion guard.
Metrics port says:
tor_relay_load_tcp_exhaustion_total 0
tor_relay_load_onionskins_total{type="tap",action="processed"} 52073
tor_relay_load_onionskins_total{type="tap",action="dropped"} 0
tor_relay_load_onionskins_total{type="fast",action="processed"} 0
tor_relay_load_onionskins_total{type="fast",action="dropped"} 0
tor_relay_load_onionskins_total{type="ntor",action="processed"} 8069522
tor_relay_load_onionskins_total{type="ntor",action="dropped"} 273275
So if we weigh the dropped ntor requests against the processed ntor requests, we end up with a reasonable percentage (more than 8 million processed vs. fewer than 300k dropped).
Yeah, so this is a ~3.38% drop, which immediately triggers the overload signal.
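(Spelling out the arithmetic: 273275 dropped / 8069522 processed ≈ 0.034, i.e. roughly 3.4% of ntor requests dropped relative to those processed.)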
So the question here is: does the computed consensus weight of a relay change if that relay keeps reporting to the directory authorities that it is overloaded? If yes, could this be triggered by an attacker in order to arbitrarily decrease a relay's consensus weight even when it is not really overloaded (and thereby perhaps increase the consensus weights of other malicious relays that we don't know about)?
Correct, this is a possibility indeed. I'm not entirely certain that this is the case at the moment as sbws (bandwidth authority software) might not be downgrading the bandwidth weights just yet.
But regardless, that is where we are heading. We have control over this, so now is a good time to notice these problems and act.
I'll try to get back to you asap after talking with the network team.
Also, as a side note, I think that if the dropped/processed ratio is not over 15% or 20% a relay should not consider itself overloaded. Would this be a good idea?
Plausible that it could be a better idea! It is unclear what the optimal percentage is but, personally, I'm leaning towards a higher threshold, so that the overload signal is not triggered under normal circumstances.
But I think that if we raise this to, let's say, 20%, it might not stop an attacker from triggering it; it might just take a bit longer.
Thanks for your feedback! We'll get back to this thread asap.
David
On 1/12/22 5:36 PM, David Goulet wrote:
On 01 Jan (21:12:38), s7r wrote:
One of my relays (guard, not exit) started to report being overloaded about one week ago, for the first time in its life.
The consensus weight and advertised bandwidth are what they should be, considering the relay's configuration. More than that, they have not changed in years. So, I started to look at it more closely.
Apparently the overload is triggered every 5-6 days by flooding the relay with circuit creation requests. All I can see in tor.log is:
[warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [68382 similar message(s) suppressed in last 482700 seconds]
[warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [7882 similar message(s) suppressed in last 60 seconds]
This message is logged around 4-6 times, with a 1 minute (60 second) gap between consecutive warn entries.
After that, the relay is back to normal, so it feels like it is being probed or something similar. CPU usage is at 65%, RAM is under 45%, and the SSD and bandwidth show no problems.
Very plausible theory, especially given such a "burst" of traffic; we can rule out that your relay has all of a sudden become the facebook.onion guard.
Metrics port says:
tor_relay_load_tcp_exhaustion_total 0
tor_relay_load_onionskins_total{type="tap",action="processed"} 52073
tor_relay_load_onionskins_total{type="tap",action="dropped"} 0
tor_relay_load_onionskins_total{type="fast",action="processed"} 0
tor_relay_load_onionskins_total{type="fast",action="dropped"} 0
tor_relay_load_onionskins_total{type="ntor",action="processed"} 8069522
tor_relay_load_onionskins_total{type="ntor",action="dropped"} 273275
So if we weigh the dropped ntor requests against the processed ntor requests, we end up with a reasonable percentage (more than 8 million processed vs. fewer than 300k dropped).
Yeah, so this is a ~3.38% drop, which immediately triggers the overload signal.
So the question here is: does the computed consensus weight of a relay change if that relay keeps reporting to the directory authorities that it is overloaded? If yes, could this be triggered by an attacker in order to arbitrarily decrease a relay's consensus weight even when it is not really overloaded (and thereby perhaps increase the consensus weights of other malicious relays that we don't know about)?
Correct, this is a possibility indeed. I'm not entirely certain that this is the case at the moment as sbws (bandwidth authority software) might not be downgrading the bandwidth weights just yet.
But regardless, that is where we are heading. We have control over this, so now is a good time to notice these problems and act.
I'll try to get back to you asap after talking with the network team.
My thinking is that sbws would avoid reducing the weight of a relay that is overloaded until it sees a series of these overload lines, with fresh timestamps. For example, just one with a timestamp that never updates again could be tracked but not reacted to, until the timestamp changes N times.
We can (and should) also have logic that prevents sbws from demoting the capacity of a Guard relay so much that it loses the Guard flag, so DoS attacks can't easily cause clients to abandon a Guard, unless it goes entirely down.
Both of these things can be done on the sbws side. This would not stop short blips of overload from still being reported on the metrics portal, but maybe we want to keep that property.
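To make the idea concrete, here is a rough sketch of the bookkeeping I have in mind - not actual sbws code, and all the names and numbers are made up for illustration:

# Rough sketch, not sbws code: only react to an overload report after its
# timestamp has been refreshed N times, and never cut a Guard's weight below
# the level needed to keep the Guard flag.

OVERLOAD_EVENTS_BEFORE_ACTING = 3   # the "N" above; value is arbitrary here
OVERLOAD_WEIGHT_FRACTION = 0.5      # arbitrary reduction factor for illustration

class OverloadTracker:
    def __init__(self):
        self.last_timestamp = None  # last overload-general timestamp seen
        self.fresh_updates = 0      # how many times that timestamp changed

    def record(self, overload_timestamp):
        """Call once per descriptor fetch with the relay's overload timestamp."""
        if overload_timestamp is None:
            return
        if overload_timestamp != self.last_timestamp:
            self.last_timestamp = overload_timestamp
            self.fresh_updates += 1

    def adjusted_weight(self, measured_weight, guard_flag_cutoff):
        """Reduce the weight only for a sustained overload, and never push a
        relay that currently qualifies for Guard below the Guard cutoff."""
        if self.fresh_updates < OVERLOAD_EVENTS_BEFORE_ACTING:
            return measured_weight
        reduced = measured_weight * OVERLOAD_WEIGHT_FRACTION
        if measured_weight >= guard_flag_cutoff:
            reduced = max(reduced, guard_flag_cutoff)
        return reduced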
Also, as a side note, I think that if the dropped/processed ratio is not over 15% or 20% a relay should not consider itself overloaded. Would this be a good idea?
Plausible that it could be a better idea! It is unclear what the optimal percentage is but, personally, I'm leaning towards a higher threshold, so that the overload signal is not triggered under normal circumstances.
But I think that if we raise this to, let's say, 20%, it might not stop an attacker from triggering it; it might just take a bit longer.
Hrmm. Parameterizing this threshold as a consensus parameter might be a good idea. I think that if an attack has to be "severe" and "ongoing" long enough that a relay has lost capacity and/or the ability to complete circuits, and the relay can't do anything about it, then that relay unfortunately should not be used as much. It's not like circuits through it would be likely to succeed, or be fast enough to use, in that case anyway.
We need better DoS defenses generally :/
Mike Perry wrote:
Correct, this is a possibility indeed. I'm not entirely certain that this is the case at the moment as sbws (bandwidth authority software) might not be downgrading the bandwidth weights just yet.
But regardless, that is where we are heading. We have control over this, so now is a good time to notice these problems and act.
I'll try to get back to you asap after talking with the network team.
My thinking is that sbws would avoid reducing the weight of a relay that is overloaded until it sees a series of these overload lines, with fresh timestamps. For example, just one with a timestamp that never updates again could be tracked but not reacted to, until the timestamp changes N times.
We can (and should) also have logic that prevents sbws from demoting the capacity of a Guard relay so much that it loses the Guard flag, so DoS attacks can't easily cause clients to abandon a Guard, unless it goes entirely down.
Both of these things can be done on the sbws side. This would not stop short blips of overload from still being reported on the metrics portal, but maybe we want to keep that property.
I agree with this - sbws should see the overload reports often and with some kind of continuity before reducing the weight of a relay, not just "bursts" like the ones I was experiencing (maximum 3-5 minutes of overload every 2-3 days).
After switching to OverloadStatistics 0, the consensus weight is back to normal (to what the relay can handle with no effort), and the overload bursts are rare now (just one in 17 days).
The metrics port values are looking good as well: the percentage of dropped ntors as opposed to processed ntors is very good (under 5%).
I am not sure what the best approach for the metrics portal is, but I think it is easier there to document what to look out for, and when the relay operator should consider this a problem and when not.
Also, as a side note, I think that if the dropped/processed ratio is not over 15% or 20% a relay should not consider itself overloaded. Would this be a good idea?
Plausible that it could be a better idea! It is unclear what the optimal percentage is but, personally, I'm leaning towards a higher threshold, so that the overload signal is not triggered under normal circumstances.
But I think that if we raise this to, let's say, 20%, it might not stop an attacker from triggering it; it might just take a bit longer.
Hrmm. Parameterizing this threshold as a consensus parameter might be a good idea. I think that if an attack has to be "severe" and "ongoing" long enough that a relay has lost capacity and/or the ability to complete circuits, and the relay can't do anything about it, then that relay unfortunately should not be used as much. It's not like circuits through it would be likely to succeed, or be fast enough to use, in that case anyway.
We need better DoS defenses generally :/
Of course we need better defenses; DoS is never actually fixed, no matter what we do. It's just an arms race the way I see it. But if we reduce the consensus weight, or assume at the network level that relay X should be used less because of a super tiny percentage of dropped circuits, we could end up wasting network resources on one side and, on the other, maybe granting evil relays that we have not yet discovered a better chance of grabbing circuits. A consensus parameter is of course appropriate here; maybe 20% is too big a threshold and it should be less, but right now even 0.1% is reported and treated as overload, which IMO is not acceptable.
Thanks for looking into this David & Mike.
On 1/23/22 5:28 PM, s7r wrote:
Mike Perry wrote:
We need better DoS defenses generally :/
Of course we need better defenses; DoS is never actually fixed, no matter what we do. It's just an arms race the way I see it.
Well, I am extremely optimistic about https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/327-pow...
Despite random internet hate on PoW, in the context of onion service DoS (which is our main cause of overall network DoS and actual overload), it looks very likely to be an effective option.
Prop327 describes how we can use PoW to build a system that auto-tunes itself such that only circuits with a sufficient level of PoW succeed. This means that we won't use PoW at all unless it is needed (i.e. only require the level of PoW needed to jump the queue of DoS requests, and advertise this level in the service descriptor).
In this way, an attack can be made considerably more expensive (and most importantly: less profitable) for an attacker, with the only cost being that clients wait a bit longer to connect (to solve the PoW), and only while an attack is ongoing. If this system is effective enough, DoS-for-ransom attacks should vanish due to low profitability. And because the system auto-tunes, we should end up not needing PoW at all once it is deployed, because of this deterrent effect. Win-win-win.
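To give a flavour of the auto-tuning idea (this is only a toy illustration of the gist, not the scheme actually specified in the proposal; all numbers are made up):

# Toy illustration, not the prop 327 algorithm: serve introduction requests
# highest-PoW-effort first, and advertise a suggested effort based on what is
# currently needed to get near the front of the queue, so clients only pay
# for PoW while an attack is actually ongoing.
import heapq

class IntroQueue:
    def __init__(self):
        self._heap = []     # max-heap on effort, via negated keys
        self._counter = 0   # tie-breaker keeping FIFO order within one effort level

    def add_request(self, effort, request):
        heapq.heappush(self._heap, (-effort, self._counter, request))
        self._counter += 1

    def pop_best(self):
        """Serve the pending request with the highest proof-of-work effort."""
        neg_effort, _, request = heapq.heappop(self._heap)
        return -neg_effort, request

    def suggested_effort(self, backlog_limit=100):
        """Effort to advertise in the descriptor: zero while the queue is short
        (no attack), otherwise roughly the effort needed to be near the front.
        The backlog limit is an arbitrary number for this toy example."""
        if len(self._heap) < backlog_limit:
            return 0
        top = sorted((-e for e, _, _ in self._heap), reverse=True)[:backlog_limit]
        return top[-1]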
Jamie Harper did an implementation of this proposal: https://github.com/jmhrpr/tor-prop-327
David, asn, and I did a prelim code review of that branch, and while it needs more unit tests, and we need to reduce some of its external dependencies, it was of surprisingly high code quality. Unfortunately, Jamie graduated and moved on to other things.
But if we reduce the consensus weight, or assume at the network level that relay X should be used less because of a super tiny percentage of dropped circuits, we could end up wasting network resources on one side and, on the other, maybe granting evil relays that we have not yet discovered a better chance of grabbing circuits. A consensus parameter is of course appropriate here; maybe 20% is too big a threshold and it should be less, but right now even 0.1% is reported and treated as overload, which IMO is not acceptable.
I generally agree here. For this, I filed: https://gitlab.torproject.org/tpo/core/tor/-/issues/40560
An important detail here is that ntor drops *are also* a form of that same kind of bias away from attacked relays, and toward evil relays.
If a relay is under such heavy DoS that it is already dropping X% of ntors, it is *already* being forced to not carry X% of circuits. So, at minimum, it is reasonable to give it X% less traffic. However, this reduction must also be subject to the limit of not stripping the Guard flag away just for overload/DoS, as I mentioned in my previous post.
At any rate, before we even consider using overload-general for relay weighting, we first need to transition the bwauths to use congestion control and monitor how that system behaves wrt reducing overload. So it will be a while, and we will have time to think about and analyze this stuff further.
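Schematically (just to make the arithmetic explicit, not a worked-out design): new_weight = measured_weight * (1 - dropped / (dropped + processed)), floored so that the relay keeps the Guard flag, per the earlier point.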