Hello David,
again, thanks for your work on adding more metrics to tor's MetricsPort! Many relay operators will love this and documentation will be useful [1].
I reported https://gitlab.torproject.org/tpo/core/tor/-/issues/40699 which got closed yesterday, but there was likely a misunderstanding and the changes did not solve the underlying issue.
The core issue is: The metric called tor_relay_connections(_total) [2][3] contains a mix of different types (gauge, counter).
When mixing types in a single metric, the TYPE definition will always be wrong for one or the other value. For example grafana will show this if you use a counter metric without rate(): "Selected metric is a counter. Consider calculating rate of counter by adding rate()."
It is a best practice to avoid mixing different types in a single metric.
From the prometheus best practices [4]:
"As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not."
An idea to address the underlying issue: One connection metric for counter and one for gauge:
- tor_relay_connections_total for counters, like the current label state="created" - tor_relay_connections for gauge metrics, like the current label state="opened". "rejected" also appears to be a gauge metric.
Another nice feature of these metrics would be to have a label for what type of system is connecting (src="relay", src="non-relay") - more on that in yesterday's email. A tool by toralf [4] also shows these and uses the source IP but tor itself does not need to look at the source IP to determine the type, something discussed in last week's relay operator meetup.
best regards, nat
[1] https://gitlab.torproject.org/tpo/web/support/-/issues/312 [2] https://gitlab.torproject.org/tpo/core/tor/-/commit/06a26f18727d3831339c138c... [3] https://gitlab.torproject.org/tpo/core/tor/-/commit/6d40e980fb149549bbef5d9e... [4] https://prometheus.io/docs/practices/naming/
On 28 Oct (11:04:09), nat@danwin1210.de wrote:
Hello David,
again, thanks for your work on adding more metrics to tor's MetricsPort! Many relay operators will love this and documentation will be useful [1].
I reported https://gitlab.torproject.org/tpo/core/tor/-/issues/40699 which got closed yesterday, but there was likely a misunderstanding and the changes did not solve the underlying issue.
The core issue is: The metric called tor_relay_connections(_total) [2][3] contains a mix of different types (gauge, counter).
The patch I got in makes all the tor_relay_connections_total{} metrics "gauges" because in effect, some can go up and down and some might only go up but I figured even the one that only accumulate can also be gauges.
Is that a problem to your knowledge from a Prometheus standpoint?
Cheers! David
When mixing types in a single metric, the TYPE definition will always be wrong for one or the other value. For example grafana will show this if you use a counter metric without rate(): "Selected metric is a counter. Consider calculating rate of counter by adding rate()."
It is a best practice to avoid mixing different types in a single metric. From the prometheus best practices [4]: "As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not."
An idea to address the underlying issue: One connection metric for counter and one for gauge:
- tor_relay_connections_total for counters, like the current label
state="created"
- tor_relay_connections for gauge metrics, like the current label
state="opened". "rejected" also appears to be a gauge metric.
Another nice feature of these metrics would be to have a label for what type of system is connecting (src="relay", src="non-relay") - more on that in yesterday's email. A tool by toralf [4] also shows these and uses the source IP but tor itself does not need to look at the source IP to determine the type, something discussed in last week's relay operator meetup.
best regards, nat
[1] https://gitlab.torproject.org/tpo/web/support/-/issues/312 [2] https://gitlab.torproject.org/tpo/core/tor/-/commit/06a26f18727d3831339c138c... [3] https://gitlab.torproject.org/tpo/core/tor/-/commit/6d40e980fb149549bbef5d9e... [4] https://prometheus.io/docs/practices/naming/
to remind ourselves about the prometheus types and their differences and when to use what: https://prometheus.io/docs/concepts/metric_types/
The patch I got in makes all the tor_relay_connections_total{} metrics "gauges" because in effect, some can go up and down and some might only go up but I figured even the one that only accumulate can also be gauges.
you are right that prometheus will scrape it anyway even if the type is incorrectly defined but by defining a counter as gauge (or vice versa) you tell people what it is and how to make sense of it, for example the type implies "do (not) use rate() on this".
By looking at
# TYPE tor_relay_connections_total gauge [...]
Anyone familiar with prometheus would expect it to be a gauge - so for example would not use rate() on it - but that is not correct for all labels. For example creating graphs for state="created" without using rate() would produce boring graphs because it is a counter. They might also wonder why a gauge ends with .._total, because that is used for accumulating count [1].
Is that a problem to your knowledge from a Prometheus standpoint?
It is a problem because prometheus users/tools are used to consume that TYPE line to learn what type it is. They will get confused if they treat it according to the current type definition. graphing counters as if they were gauges will result in questions like "oh why is my rate of x always increasing?"
before: # TYPE tor_relay_connections counter tor_relay_connections{type="OR",direction="initiated",state="opened"}
but that value is not a counter, it is a gauge. It can go down.
after yesterday's change:
# TYPE tor_relay_connections_total gauge tor_relay_connections_total{type="OR",direction="received",state="created"}
but that is a counter, it can never go down.
To prevent this and to follow the usual prometheus practices is best to have the current metric split into one for counters and another one for gauges.
thanks for working on this!
best regards, nat
tor-relays@lists.torproject.org