Total firing alerts: 1
Total resolved alerts: 0
## Firing Alerts
-----
Time: 2023-02-22 00:47:51.034543026 +0000 UTC
Summary: Too many bridgestrap failures
Description: The percent of functional bridges is 43.0%.
-----
## Resolved Alerts
Quoting alertmanager@hetzner-nbg1-02.torproject.org (2023-02-22 01:48:35)
Time: 2023-02-22 00:47:51.034543026 +0000 UTC
Summary: Too many bridgestrap failures
Description: The percent of functional bridges is 43.0%.
I'm investigating this. It looks like when bridgestrap tests only a few bridges its ratio of functional bridges is pretty high, but when it tests 25 bridges (the most common request from rdsys) only 20-30% of them come back functional. I hope we don't have a regression back to: https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/7
bridgestrap is currently using the tor binary installed on polyanthum, which is 0.4.7.13 and should have a fix for this issue.
Quoting meskio (2023-02-22 16:49:44)
bridgestrap is recovering; it now reports 85% of bridges as functional. I don't know what the source of the problem was. Maybe we had some network issues on polyanthum?
On Thu, Feb 23, 2023 at 07:27:56PM +0100, meskio wrote:
bridgestrap is recovering; it now reports 85% of bridges as functional. I don't know what the source of the problem was. Maybe we had some network issues on polyanthum?
There were overload issues around that time on the metal that runs various VMs like check.tpo and bridges.tpo.
So, it would seem that bridgestrap has some bugs where if the network goes away, or if the disk or cpu becomes too loaded and things stall, it calls a lot of bridges down.
How to make things more robust? Hm. One answer might be running two bridgestraps in different places and ignoring one if it says a lot of bridges went down but the other doesn't agree.
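A minimal sketch of the two-instance idea, assuming each bridgestrap instance reports a mapping from bridge ID to a functional flag. The function names, the result shape, and the 0.25 disagreement threshold are all hypothetical; nothing here is existing bridgestrap or rdsys code.

```python
def functional_ratio(results):
    """Fraction of bridges marked functional in a {bridge_id: bool} mapping."""
    if not results:
        return 0.0
    return sum(1 for ok in results.values() if ok) / len(results)


def reconcile(primary, secondary, max_gap=0.25):
    """Merge results from two independent testers (hypothetical helper).

    If the functional ratios differ by more than max_gap, keep only the
    instance that sees more bridges up, assuming the other one is having
    a local outage. Otherwise a bridge counts as up if either instance
    reached it.
    """
    gap = functional_ratio(primary) - functional_ratio(secondary)
    if gap > max_gap:
        return dict(primary)
    if gap < -max_gap:
        return dict(secondary)
    bridges = set(primary) | set(secondary)
    return {b: primary.get(b, False) or secondary.get(b, False) for b in bridges}
```

Preferring the more optimistic instance reflects the failure mode described in this thread, where an overloaded host marks healthy bridges as down; it would not catch an instance that wrongly reports bridges as up.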
I was originally thinking to have a handful of bridges that we *know* are usually mostly up, like the built-in bridges, and if all of those are suddenly down, we stop believing bridgestrap's answer. But then we end up in the situation where all we know is that we don't know.
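The canary idea could look roughly like the sketch below. The canary list, the result shape, and the function name are assumptions made for illustration, not part of bridgestrap.

```python
def canaries_look_sane(results, canary_ids, min_up=1):
    """Decide whether a test run is believable (hypothetical helper).

    results maps bridge IDs to a functional flag; canary_ids lists bridges
    that are expected to be up almost all the time, such as built-in ones.
    Returns False when fewer than min_up canaries tested as functional,
    which suggests the tester, not the bridges, is the problem.
    """
    if not canary_ids:
        # No canaries configured: there is nothing to contradict the run.
        return True
    up = sum(1 for c in canary_ids if results.get(c, False))
    return up >= min_up
```

A False result only says the run is untrustworthy; it gives no information about which bridges are actually up.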
I guess a third idea would be to ignore it since it doesn't happen *that* often (though it seems to happen more often in our current age of DDoS attacks).
Hopefully there are better ideas out there for how to best handle or tolerate or work around an overload on the underlying bridgestrap server. :)
--Roger
Quoting Roger Dingledine (2023-03-08 23:25:49)
rdsys already ignores bridgestrap results if the percentage of functional bridges is lower than 50% (which is when this alert is raised). In that situation rdsys distributes all bridges, regardless of whether bridgestrap says they are functional. I think this is good enough for now: as you say, it doesn't happen that often, and we've been discussing whether bridgestrap will be replaced by onbasca or something arti-based.
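The fallback described above can be sketched as follows. This is an illustration of the behaviour only, not rdsys code (rdsys itself is written in Go); the function name and the result shape are hypothetical, and only the 50% figure comes from this thread.

```python
FUNCTIONAL_THRESHOLD = 0.5  # the 50% figure mentioned in this thread


def bridges_to_distribute(results, threshold=FUNCTIONAL_THRESHOLD):
    """Pick bridges to hand out from a {bridge_id: bool} test result.

    When the functional ratio is below the threshold, the whole test run
    is treated as unreliable and every bridge is distributed regardless
    of its individual result.
    """
    if not results:
        return []
    ratio = sum(1 for ok in results.values() if ok) / len(results)
    if ratio < threshold:
        return sorted(results)
    return sorted(b for b, ok in results.items() if ok)
```

With a run like the 43.0% one in this alert, the function would return every bridge rather than only those marked functional.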
anti-censorship-alerts@lists.torproject.org