Hi,
every now and then I'm in contact with relay operators about the "health" of their relays. Following these 1:1 discussions and the discussion on tor-relays@ I'd like to raise two issues with you (the developers), with the goal of helping improve relay operations and the end-user experience in the long term:
1) DNS (exits only)
2) tor relay health data
1) DNS
------

Current situation: Arthur Edelstein provides public measurements to tor exit relay operators via his page at: https://arthuredelstein.net/exits/ This page is updated once daily.
The process to use that data looks like this:
- first they watch Arthur's measurement results
- if their failure rate is non-zero they try to tweak/improve/change their setup
- wait for another 24 hours (next measurement)
This is a slow and somewhat suboptimal feedback loop, and the data is probably also less accurate and less valuable than what the tor process itself could provide.
Suggestion for improvement:
Expose the following DNS status information via tor's controlport to help debug and detect DNS issues on exit relays:
(total numbers since startup)
- number of DNS queries sent to the resolver
- number of DNS queries sent to the resolver due to a RESOLVE request
- number of DNS queries sent to the resolver due to a reverse RESOLVE request
- number of queries that did not result in any answer from the resolver
- breakdown of the number of responses by response code (RCODE)
  https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-par...
- maximum number of DNS queries sent per circuit
If this causes a significant performance impact this feature should be disabled by default.
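If counters like these were exposed, a monitoring script could consume them in a simple var=value form over the controlport. A minimal client-side sketch of the parsing step, assuming hypothetical stats/dns/* keys (tor does not currently expose such counters):

```python
def parse_getinfo_stats(reply):
    """Parse a var=value style stats reply into integer counters.

    The key names used in the sample below are hypothetical, chosen
    only to illustrate what a consumer of such data would look like.
    """
    stats = {}
    for line in reply.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        stats[key] = int(value)
    return stats

sample = "stats/dns/queries=15320\nstats/dns/timeouts=42\nstats/dns/rcode/NXDOMAIN=910"
counters = parse_getinfo_stats(sample)
```

A munin or prometheus exporter would then only need to map these keys onto graph fields.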
2) general relay health metrics
--------------------------------
Compared to other server daemons (webservers, DNS servers, ..) tor provides little data for operators to detect operational issues and anomalies.
I'd suggest providing the following stats via the control port (most of them are already written to logfiles by default but, as far as I've seen, not accessible via the controlport):
- total amount of memory used by the tor process
- number of currently open circuits
- circuit handshake stats (TAP / NTor)

DoS mitigation stats:
- number of circuits killed with too many cells
- number of circuits rejected
- marked addresses
- number of connections closed
- number of single hop clients refused
- number of closed/failed circuits broken down by their reason value
  https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt#n1402
  https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n1994
- number of closed/failed OR connections broken down by their reason value
  https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n2205
If this causes a significant performance impact this feature should be disabled by default.
cell stats:
- extra info cell stats as defined in:
  https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1072
This data should be useful to answer the following questions:
High level question: Is the tor relay healthy?
- is it hitting any resource limits?
- is the tor process under unusual load?
- why is tor using more memory?
- is it slower than usual at handling circuits?
- can the DNS resolver handle the amount of DNS queries tor is sending it?
This data could help prevent errors from occurring or provide additional data when trying to narrow down issues.
When it comes to the question: **Is it "safe" to make this data accessible via the controlport?**
I assume it is safe for all information that current versions of tor already write to logfiles or even publish as part of their extra-info descriptors.
Should tor provide this or similar data, I'm planning to write scripts for operators to make use of it (for example a munin plugin that connects to tor's controlport).
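For context, a munin plugin just prints "field.value N" lines when fetched and label lines when called with "config". A skeleton of such a plugin, with hard-coded placeholder counters standing in for the controlport stats proposed above:

```python
#!/usr/bin/env python3
"""Skeleton of a munin plugin for tor relay stats.

The counters below are placeholders; a real plugin would fetch them
from tor's controlport once such stats exist.
"""
import sys

def fetch_stats():
    # Placeholder values standing in for future controlport counters.
    return {"circuits_open": 812, "dns_queries": 15320}

def config_lines(stats):
    # "config" output tells munin how to title the graph and label fields.
    lines = ["graph_title Tor relay health"]
    lines += [f"{key}.label {key}" for key in sorted(stats)]
    return lines

def value_lines(stats):
    # Normal invocations print one "field.value N" line per counter.
    return [f"{key}.value {value}" for key, value in sorted(stats.items())]

if __name__ == "__main__":
    stats = fetch_stats()
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    print("\n".join(config_lines(stats) if mode == "config" else value_lines(stats)))
```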
I'm happy to help write updates for control-spec should these features seem reasonable to you.
Looking forward to hearing your feedback. nusenu
All sorts of statistical counters could be useful to graph from some API... a control port stats dump in something like var=value or a BSD sysctl text format, even up to a proper SNMP port, which many graphers already speak.
Logging isn't really the right place for such things, unless they've reached some preset or unusual threshold, thus becoming reportable.
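The var=value dump suggested above could be as simple as one counter per line. A sketch, with made-up counter names:

```python
def render_stats(counters):
    """Render a dict of counters as a var=value text dump,
    one counter per line (names here are illustrative only)."""
    return "\n".join(f"{key}={value}" for key, value in sorted(counters.items()))

dump = render_stats({"circuits/open": 812, "mem/bytes": 268435456})
# Produces:
# circuits/open=812
# mem/bytes=268435456
```

A format like this is trivially parseable by munin, collectd, or a textfile-based prometheus exporter.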
On 03 Feb (00:24:00), nusenu wrote:
Hi,
Hello nusenu,
Thanks for this email. Exporting more metrics on the control port is a great idea. I've wanted to have that for a while, so I'm quite happy you started the ball rolling :).
There are safety questions to ask ourselves here before blindly exporting many stats. I know the metrics team also has opinions on that; I had a talk very recently with irl about this.
Exporting many stats to the control port unfortunately means that any relay operator can create fancy graphs and make them public which, depending on the stat, can be harmful.
Furthermore, graphing stats can also mean that over time the relay operator stores historical data of everything that happened within the relay, and that can be used in many ways to pull off attacks (ex: a subpoena by LE to access such a database).
The Heartbeat log has a minimum period of 30 minutes and a default of 6 hours. Whatever stats we end up exporting, I strongly think that keeping delays like that is a strong requirement, because we would then "bin" those aggregated stats into a "long enough period" instead of providing a very fine-grained stream of stats that would make it trivial to spot spikes down to the minute.
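The binning David describes amounts to mapping timestamps onto period boundaries, so that only per-period aggregates are ever exported. A sketch using the 6-hour heartbeat default (the period lengths are examples, not a decided policy):

```python
BIN_SECONDS = 6 * 3600  # heartbeat-style default period; 30 min would be the minimum

def bin_start(ts, bin_seconds=BIN_SECONDS):
    """Return the start of the aggregation bin containing timestamp ts,
    so exported stats reveal only per-period totals, never a
    fine-grained stream that exposes minute-level spikes."""
    return ts - (ts % bin_seconds)

bin_start(1_000_000)  # -> 993600 (start of the 6-hour bin containing t=1000000)
```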
Some of the stats below are safe in my opinion, like memory usage, but most of them need to be looked at in terms of safety, both from the standpoint of having very fine-grained precision and of what happens when that data becomes historical data.
I'll stop for now, but I will follow up on this once I have thought a bit more about it, so I don't say too many stupid things right now :).
Cheers! David
Thanks for this email. Exporting more metrics on the control port is a great idea. I've wanted to have that for a while
Great to hear that so we have a realistic chance it gets actually implemented :)
There are safety questions to ask ourselves here before blindly exporting many stats.
Sure.
Exporting many stats to the control port unfortunately means that any relay operator can create fancy graphs
making non-public graphs and alerts is the goal
and make them public
public graphs should result in the rejection of affected relays. I'll be submitting a few to bad-relays@ soon, since enn.lu apparently does not care when asked to remove their public stats and XML data.
which, depending on the stat, can be harmful.
Furthermore, graphing stats can also mean that over time the relay operator stores historical data of everything that happened within the relay, and that can be used in many ways to pull off attacks (ex: a subpoena by LE to access such a database).
yes, acceptable / unacceptable retention times and granularity should be defined and documented. I'd propose a max. retention time of two weeks.
The Heartbeat log has a minimum of 30 minutes period but a default of 6 hours.
current tor has no restrictions on Heartbeat granularity; you can ask tor to write the data to the logs every other second by issuing "SIGNAL HEARTBEAT" on the control port.
Whatever stats we would end up exporting, I strongly think that keeping delays like that is a strong requirement because we would sort of "bin" those aggregated stats by a "long enough period" instead of having a very fine grained stream of stats that would make it trivial to spot spikes down to the minute.
30 or 60 minutes granularity seems reasonable
Some of the stats below are safe in my opinion like the memory usage but most of them need to be looked at in terms of safety
yes please
Here's another design that preserves user privacy:
* add noise to every logged statistic (to protect usage in the current period)
* round every logged statistic (to protect average usage over multiple periods)
If we add enough noise to protect most users, then we will have privacy by design.
We should still teach operators why detailed stats are bad for users, and have rules about retention periods. But these rules won't be as critical as they are now, because they will only be needed for edge cases. (Like a single client that uses most of a relay, which we can't hide very well no matter what we do.)
Adding noise will be easier once PrivCount is implemented. Until then, we'll need to rely on the retention rules you are suggesting.
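Until PrivCount lands, the noise-plus-rounding idea could look like the sketch below. It draws Laplace noise via the inverse CDF using only the stdlib; the noise scale and bin size are arbitrary placeholders, not calibrated privacy parameters:

```python
import math
import random

def privatize(count, noise_scale=10.0, bin_size=8, rng=random):
    """Add Laplace noise then round a counter to a bin size.

    noise_scale and bin_size are illustrative placeholders; real values
    would have to be derived from a privacy analysis (e.g. PrivCount).
    """
    # Sample Laplace(0, noise_scale) by inverting its CDF.
    u = rng.random() - 0.5
    noise = -noise_scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    noisy = count + noise
    # Round to the nearest multiple of bin_size, clamped at zero.
    return max(0, bin_size * round(noisy / bin_size))
```

Noise protects a single period, while rounding keeps repeated observations from averaging the noise away.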
T
--
teor
Hi All,
On 04/02/2019 06:35, teor wrote:
If we add enough noise to protect most users, then we will have privacy by design.
I would argue that noise does not help here, as we would have to add enough noise to protect against a guard discovery attack, which is too much noise for the stats to be useful.
I only learned that these stats have such high resolution last week and I'm very concerned about this.
Regarding limiting retention time: if I'm trying to pull off a guard discovery attack, then I'm probably only interested in the timeframe that relates to my attack. Retention periods aren't going to help here, and may in fact make things worse: if LE suspects that the data would disappear after a given time period, it may issue an emergency order that is even more restrictive or carries heavier sanctions for non-compliance.
Are the statistics in the extra-info descriptor really not useful for the purpose of graphing to monitor health? If they are not then we should come up with ways of addressing this but if they are then we should not be retaining any more data than that which is already public.
If we think that the 6-hour statistics are safe to collect (which we previously decided they were not when we changed the granularity of the bandwidth stats) then we could add them to extra-info descriptors.
I am worried that exposing/retaining statistics without a proper review of the attacks they enable, even with the best guidelines in the world, is dangerous. If we have retention guidelines we also have no way to enforce those and this could introduce a systemic weakness in the network.
I have filed #29344 to consider these things.
Thanks, Iain.
nusenu:
Hi,
every now and then I'm in contact with relay operators about the "health" of their relays. Following these 1:1 discussions and the discussion on tor-relays@ I'd like to raise two issues with you (the developers), with the goal of helping improve relay operations and the end-user experience in the long term:
- DNS (exits only)
tracked as: https://trac.torproject.org/projects/tor/ticket/31290
- tor relay health data
tracked as: https://trac.torproject.org/projects/tor/ticket/31291
I've put this thread into a new ticket:
"provide relay health prometheus metrics via MetricsPort/MetricsSocket" https://gitlab.torproject.org/tpo/core/tor/-/issues/40194
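For reference, the MetricsPort that came out of that ticket ships in recent tor versions (0.4.5.x and later) and exposes relay counters in Prometheus text format. A minimal torrc fragment might look like the following (check the tor manual for the exact options supported by your version):

```
## Expose relay metrics to a local Prometheus scraper only.
## Never open this port to the public internet.
MetricsPort 127.0.0.1:9035 prometheus
MetricsPortPolicy accept 127.0.0.1
```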