Hi,
every now and then I'm in contact with relay operators about the "health" of their relays. Following these 1:1 discussions and the discussion on tor-relays@ I'd like to raise two issues with you (the developers), with the goal of helping improve relay operations and the end-user experience in the long term:
1) DNS (exits only)
2) tor relay health data
1) DNS
------

Current situation: Arthur Edelstein provides public measurements to tor exit relay operators via his page at: https://arthuredelstein.net/exits/ This page is updated once daily.
The process to use that data looks like this:
- first they watch Arthur's measurement results
- if their failure rate is non-zero they try to tweak/improve/change their setup
- wait for another 24 hours (next measurement)
This is a slow and somewhat suboptimal feedback loop, and the data is probably also less accurate and less valuable than what the tor process itself could provide.
Suggestion for improvement:
Expose the following DNS status information via tor's controlport to help debug and detect DNS issues on exit relays:
(total numbers since startup)
- number of DNS queries sent to the resolver
- number of DNS queries sent to the resolver due to a RESOLVE request
- number of DNS queries sent to the resolver due to a reverse RESOLVE request
- number of queries that did not result in any answer from the resolver
- breakdown of the number of responses by response code (RCODE)
  https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-par...
- maximum number of DNS queries sent per circuit
If this causes a significant performance impact this feature should be disabled by default.
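If counters like these were exposed, a monitoring script could consume them in a simple var=value form over the controlport. A minimal client-side sketch of the parsing step, assuming hypothetical stats/dns/* keys (tor does not currently expose such counters):

```python
def parse_getinfo_stats(reply):
    """Parse a var=value style stats reply into integer counters.

    The key names used in the sample below are hypothetical, chosen
    only to illustrate what a consumer of such data would look like.
    """
    stats = {}
    for line in reply.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        stats[key] = int(value)
    return stats

sample = "stats/dns/queries=15320\nstats/dns/timeouts=42\nstats/dns/rcode/NXDOMAIN=910"
counters = parse_getinfo_stats(sample)
```

A munin or prometheus exporter would then only need to map these keys onto graph fields.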
2) general relay health metrics
--------------------------------
Compared to other server daemons (webservers, DNS servers, ..) tor provides little data for operators to detect operational issues and anomalies.
I'd suggest providing the following stats via the control port (most of them are already written to logfiles by default but, as far as I've seen, not accessible via the controlport):
- total amount of memory used by the tor process
- number of currently open circuits
- circuit handshake stats (TAP / NTor)

DoS mitigation stats:
- number of circuits killed with too many cells
- number of circuits rejected
- marked addresses
- number of connections closed
- number of single hop clients refused
- number of closed/failed circuits broken down by their reason value
  https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt#n1402
  https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n1994
- number of closed/failed OR connections broken down by their reason value
  https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n2205
If this causes a significant performance impact this feature should be disabled by default.
cell stats:
- extra info cell stats as defined in:
  https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1072
This data should be useful to answer the following questions:
High level question: Is the tor relay healthy?
- is it hitting any resource limits?
- is the tor process under unusual load?
- why is tor using more memory?
- is it slower than usual at handling circuits?
- can the DNS resolver handle the amount of DNS queries tor is sending it?
This data could help prevent errors from occurring or provide additional data when trying to narrow down issues.
When it comes to the question: **Is it "safe" to make this data accessible via the controlport?**
I assume it is safe for all information that current versions of tor already write to logfiles or even publish as part of their extra-info descriptors.
Should tor provide this or similar data, I'm planning to write scripts for operators to make use of it (for example a munin plugin that connects to tor's controlport).
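For context, a munin plugin just prints "field.value N" lines when fetched and label lines when called with "config". A skeleton of such a plugin, with hard-coded placeholder counters standing in for the controlport stats proposed above:

```python
#!/usr/bin/env python3
"""Skeleton of a munin plugin for tor relay stats.

The counters below are placeholders; a real plugin would fetch them
from tor's controlport once such stats exist.
"""
import sys

def fetch_stats():
    # Placeholder values standing in for future controlport counters.
    return {"circuits_open": 812, "dns_queries": 15320}

def config_lines(stats):
    # "config" output tells munin how to title the graph and label fields.
    lines = ["graph_title Tor relay health"]
    lines += [f"{key}.label {key}" for key in sorted(stats)]
    return lines

def value_lines(stats):
    # Normal invocations print one "field.value N" line per counter.
    return [f"{key}.value {value}" for key, value in sorted(stats.items())]

if __name__ == "__main__":
    stats = fetch_stats()
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    print("\n".join(config_lines(stats) if mode == "config" else value_lines(stats)))
```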
I'm happy to help write updates for control-spec should these features seem reasonable to you.
Looking forward to hearing your feedback. nusenu
All sorts of statistical counters could be useful to graph from some API... a control port stats dump in something like var=value or a BSD sysctl text format, even up to a proper SNMP port, which many graphers already speak.
Logging isn't really the right place for such things, unless they've reached some preset or unusual threshold, thus becoming reportable.
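The var=value dump suggested above could be as simple as one counter per line. A sketch, with made-up counter names:

```python
def render_stats(counters):
    """Render a dict of counters as a var=value text dump,
    one counter per line (names here are illustrative only)."""
    return "\n".join(f"{key}={value}" for key, value in sorted(counters.items()))

dump = render_stats({"circuits/open": 812, "mem/bytes": 268435456})
# Produces:
# circuits/open=812
# mem/bytes=268435456
```

A format like this is trivially parseable by munin, collectd, or a textfile-based prometheus exporter.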
On 03 Feb (00:24:00), nusenu wrote:
Hi,
Hello nusenu,
Thanks for this email. Exporting more metrics on the control port is a great idea. I've wanted to have that for a while, so I'm quite happy you started the ball rolling :).
There are safety questions to ask ourselves here before blindly exporting many stats. I know the metrics team also has opinions on that; I had a talk very recently with irl about this.
Exporting many stats to the control port unfortunately means that any relay operator can create fancy graphs and make them public which, depending on the stat, can be harmful.
Furthermore, graphing stats can also mean that over time the relay operator stores historical data of everything that happened within the relay, and that can be used in many ways to pull off attacks (ex: a subpoena by LE to access such a database).
The Heartbeat log has a minimum period of 30 minutes and a default of 6 hours. Whatever stats we end up exporting, I strongly think that keeping delays like that is a strong requirement, because we would then "bin" those aggregated stats into a "long enough period" instead of providing a very fine-grained stream of stats that would make it trivial to spot spikes down to the minute.
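The binning David describes amounts to mapping timestamps onto period boundaries, so that only per-period aggregates are ever exported. A sketch using the 6-hour heartbeat default (the period lengths are examples, not a decided policy):

```python
BIN_SECONDS = 6 * 3600  # heartbeat-style default period; 30 min would be the minimum

def bin_start(ts, bin_seconds=BIN_SECONDS):
    """Return the start of the aggregation bin containing timestamp ts,
    so exported stats reveal only per-period totals, never a
    fine-grained stream that exposes minute-level spikes."""
    return ts - (ts % bin_seconds)

bin_start(1_000_000)  # -> 993600 (start of the 6-hour bin containing t=1000000)
```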
Some of the stats below are safe in my opinion, like memory usage, but most of them need to be looked at in terms of safety, both from the standpoint of having very fine-grained precision and of what happens when that data becomes historical data.
I'll stop for now, but I will follow up on this once I have thought a bit more about it, so I don't say too many stupid things right now :).
Cheers! David
Thanks for this email. Exporting more metrics on the control port is a great idea. I've wanted to have that for a while
Great to hear that so we have a realistic chance it gets actually implemented :)
There are safety questions to ask ourselves here before blindly exporting many stats.
Sure.
Exporting many stats to the control port unfortunately means that any relay operator can create fancy graphs
making non-public graphs and alerts is the goal
and make them public
public graphs should result in the rejection of affected relays. I'll be submitting a few to bad-relays@ soon, since enn.lu apparently does not care when asked to remove their public stats and XML data.
which, depending on the stat, can be harmful.
Furthermore, graphing stats can also mean that over time the relay operator stores historical data of everything that happened within the relay, and that can be used in many ways to pull off attacks (ex: a subpoena by LE to access such a database).
yes, acceptable / unacceptable retention times and granularity should be defined and documented. I'd propose a max. retention time of two weeks.
The Heartbeat log has a minimum of 30 minutes period but a default of 6 hours.
current tor has no restrictions on Heartbeat granularity; you can ask tor to write the data to the logs every other second by issuing "SIGNAL HEARTBEAT" on the control port.
Whatever stats we would end up exporting, I strongly think that keeping delays like that is a strong requirement because we would sort of "bin" those aggregated stats by a "long enough period" instead of having a very fine grained stream of stats that would make it trivial to spot spikes down to the minute.
30 or 60 minutes granularity seems reasonable
Some of the stats below are safe in my opinion like the memory usage but most of them need to be looked at in terms of safety
yes please
Here's another design that preserves user privacy:
* add noise to every logged statistic (to protect usage in the current period)
* round every logged statistic (to protect average usage over multiple periods)
If we add enough noise to protect most users, then we will have privacy by design.
We should still teach operators why detailed stats are bad for users, and have rules about retention periods. But these rules won't be as critical as they are now, because they will only be needed for edge cases. (Like a single client that uses most of a relay, which we can't hide very well no matter what we do.)
Adding noise will be easier once PrivCount is implemented. Until then, we'll need to rely on the retention rules you are suggesting.
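Until PrivCount lands, the noise-plus-rounding idea could look like the sketch below. It draws Laplace noise via the inverse CDF using only the stdlib; the noise scale and bin size are arbitrary placeholders, not calibrated privacy parameters:

```python
import math
import random

def privatize(count, noise_scale=10.0, bin_size=8, rng=random):
    """Add Laplace noise then round a counter to a bin size.

    noise_scale and bin_size are illustrative placeholders; real values
    would have to be derived from a privacy analysis (e.g. PrivCount).
    """
    # Sample Laplace(0, noise_scale) by inverting its CDF.
    u = rng.random() - 0.5
    noise = -noise_scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    noisy = count + noise
    # Round to the nearest multiple of bin_size, clamped at zero.
    return max(0, bin_size * round(noisy / bin_size))
```

Noise protects a single period, while rounding keeps repeated observations from averaging the noise away.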
T
--
teor
Hi All,
On 04/02/2019 06:35, teor wrote:
If we add enough noise to protect most users, then we will have privacy by design.
I would argue that noise does not help here, as we would have to add enough noise to protect against a guard discovery attack, which is too much noise for the stats to be useful.
I only learned that these stats have such high resolution last week and I'm very concerned about this.
Regarding limiting retention time: if I'm trying to pull off a guard discovery attack, then I'm probably only interested in the timeframe that relates to my attack. Retention periods aren't going to help here, and may in fact make things worse: if LE suspects that the data would disappear after a given time period, it may issue an emergency order that is even more restrictive or carries heavier sanctions for non-compliance.
Are the statistics in the extra-info descriptor really not useful for the purpose of graphing to monitor health? If they are not then we should come up with ways of addressing this but if they are then we should not be retaining any more data than that which is already public.
If we think that the 6-hour statistics are safe to collect (which we previously decided they were not when we changed the granularity of the bandwidth stats) then we could add them to extra-info descriptors.
I am worried that exposing/retaining statistics without a proper review of the attacks they enable, even with the best guidelines in the world, is dangerous. If we have retention guidelines we also have no way to enforce those and this could introduce a systemic weakness in the network.
I have filed #29344 to consider these things.
Thanks, Iain.
nusenu:
Hi,
every now and then I'm in contact with relay operators about the "health" of their relays. Following these 1:1 discussions and the discussion on tor-relays@ I'd like to raise two issues with you (the developers), with the goal of helping improve relay operations and the end-user experience in the long term:
- DNS (exits only)
tracked as: https://trac.torproject.org/projects/tor/ticket/31290
- tor relay health data
tracked as: https://trac.torproject.org/projects/tor/ticket/31291
I've put this thread into a new ticket:
"provide relay health prometheus metrics via MetricsPort/MetricsSocket" https://gitlab.torproject.org/tpo/core/tor/-/issues/40194
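For reference, the MetricsPort that came out of that ticket ships in recent tor versions (0.4.5.x and later) and exposes relay counters in Prometheus text format. A minimal torrc fragment might look like the following (check the tor manual for the exact options supported by your version):

```
## Expose relay metrics to a local Prometheus scraper only.
## Never open this port to the public internet.
MetricsPort 127.0.0.1:9035 prometheus
MetricsPortPolicy accept 127.0.0.1
```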