Hi,
the upcoming relayor release will contain new Prometheus alert rules that should help operators detect, mitigate and prevent some common operational issues.
One of these rules will alert on high DNS timeout rates [1] on exit relays. High timeout rates should be investigated and resolved to improve the Tor Browser experience for users.
To choose a sensible default alert threshold, it would help to know the current timeout rates seen by multiple exit operators.
So if you feel up to it, you can send me (off-list) your current timeout rates (as a 30-day graph).
If you run tor exits with relayor and MetricsPort enabled you can use these Prometheus queries:

timeout percentage by server:
(sum by (instance)(rate(tor_relay_exit_dns_error_total{reason="tor_timeout"}[15m])))/(sum by (instance)(rate(tor_relay_exit_dns_query_total[15m])))*100

DNS query rate:
sum by (instance)(rate(tor_relay_exit_dns_query_total[15m]))

If you run exits without relayor you can use these queries:

timeout percentage by job:
(sum by (job)(rate(tor_relay_exit_dns_error_total{reason="tor_timeout"}[15m])))/(sum by (job)(rate(tor_relay_exit_dns_query_total[15m])))*100

DNS query rate (total):
sum(rate(tor_relay_exit_dns_query_total[15m]))

kind regards,
nusenu

[1] https://github.com/nusenu/ansible-relayor/commit/9b6937563ec0b41e6abb0217ed6...
tor-relays@lists.torproject.org