Vasilis:
Hi,
Reopening thread after IRC discussion.
Bottom-posted, instead of more sensibly posting inline.
DaKnOb:
It depends on what you consider “professional” monitoring. Do you mean the information collected, or how it is collected?
By professional monitoring I mean a way to find out, in a short time-span, why a relay has suddenly disconnected from the Tor network, is running an outdated version of tor, is performing badly on the Tor network, is running an outdated OS version, or is missing security updates or other crucial software fixes that may compromise the relay and subsequently the Tor network.
Some important properties of this monitoring system:
- Hardware issues: RAID/HD/hardware failures, kernel panic/OOM states
- Software issues: OS updates, tor updates, security updates
- Network issues: RBLs, IP blocking, upstream network issues
- Abuse issues: monitoring of abuse emails per relay/network; a sort of ticketing system for operators who are unwilling, don't know how, or lack the capacity to track and respond to abuse emails (which most of the time are automated and just need a 'foo' response back)
- Legal issues: initiating a canary-like mechanism for relay operators who would like someone to reach out when they stop providing updates. I suspect this will have many false positives, but better safe than sorry (quite often you are not allowed to speak openly about a legal issue until it is settled; here, potential organizations could reach out to help operators). A rough sketch of such a check follows this list.
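To make the canary idea concrete, here is a minimal sketch of what a checker could look like, assuming the operator publishes a clearsigned, dated statement at a known URL. The URL, the 'Date:' line format, and the signing key are all hypothetical, and date -d is GNU-specific:

    #!/bin/sh
    # Hypothetical canary check: alert if the operator's signed statement
    # is missing, fails verification, or has gone stale.
    URL=https://relay.example.org/canary.txt   # placeholder URL
    MAX_AGE_DAYS=35
    TMP=$(mktemp)
    curl -sf "$URL" -o "$TMP" || { echo "canary fetch failed"; exit 1; }
    gpg --verify "$TMP" 2>/dev/null || echo "canary signature invalid"
    # assumes a line like "Date: 2017-01-01" inside the statement
    stamp=$(grep -m1 '^Date:' "$TMP" | cut -d' ' -f2)
    age=$(( ( $(date +%s) - $(date -d "$stamp" +%s) ) / 86400 ))
    [ "$age" -gt "$MAX_AGE_DAYS" ] && echo "canary stale: $age days old"
    rm -f "$TMP"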
Is measuring something from the tor process using bash scripts and cron professional? Is measuring network traffic using Prometheus and plotting it in Grafana professional?
My "professional point of view" will be a system -preferably agent-less- that could ping operators via email and provide alert notifications on an IRC channel.
For a few nodes I control / controlled I measured lots of network info such as:
- Network Traffic in / out (b/s)
- Network Packets in / out (p/s)
- Network Flows in / out (f/s)
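For what it's worth, on Linux the traffic and packet counters can be sampled straight from /proc/net/dev with a couple of lines of shell. The interface name is a placeholder; flows/s would need conntrack or a flow exporter such as softflowd, which this doesn't cover:

    #!/bin/sh
    # Rough in/out rates from a one-second sample of /proc/net/dev.
    IFACE=eth0   # placeholder interface
    # fields after the colon: rx bytes, rx packets, ..., tx bytes, tx packets
    sample() { awk -v i="$IFACE" 'sub(":"," ") && $1==i {print $2,$3,$10,$11}' /proc/net/dev; }
    set -- $(sample); rxb1=$1 rxp1=$2 txb1=$3 txp1=$4
    sleep 1
    set -- $(sample); rxb2=$1 rxp2=$2 txb2=$3 txp2=$4
    echo "in:  $(( (rxb2 - rxb1) * 8 )) b/s, $(( rxp2 - rxp1 )) p/s"
    echo "out: $(( (txb2 - txb1) * 8 )) b/s, $(( txp2 - txp1 )) p/s"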
And I always run a local resolver, so DNS info too:
- Query Responses / Second
- Query Latency
- SERVFAILs / Second
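If the local resolver is unbound, those three can be read from its counters. A minimal sketch, assuming unbound-control is already configured (other resolvers would need different plumbing):

    #!/bin/sh
    # Pull cumulative DNS counters from a local unbound resolver.
    STATS=$(unbound-control stats_noreset)
    q=$(echo "$STATS" | awk -F= '$1 == "total.num.queries" {print $2}')
    sf=$(echo "$STATS" | awk -F= '$1 == "num.answer.rcode.SERVFAIL" {print $2}')
    lat=$(echo "$STATS" | awk -F= '$1 == "total.recursion.time.avg" {print $2}')
    echo "queries=$q servfail=$sf avg_recursion_time=${lat}s"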
The DNS info was gathered on only one node, as an experiment, and only for a limited amount of time, since I wasn't sure whether it could leak information.
I share the same concerns, so I'm not really interested in measuring DNS responses or collecting long-term stats that may leak sensitive information or potentially be used to de-anonymize or otherwise compromise the Tor network (including in ways we don't know yet).
From a quick read-through, it seems there is a case for different tools instead of a "one size fits all" solution.
I would divide the functions into three categories and therefore three tools:
1. remote checks of server responsiveness
Almost any tool could work here. I'm a long-time fan of sysmon (https://puck.nether.net/sysmon/). It's light, configurable, and very modular, with clean syntax, and it provides a good array of checks with email alerts. Most importantly for the stated purposes, there's no need for a local agent running on the target systems.
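And if sysmon feels like too much for a handful of boxes, even a crude cron job gets you email alerts for free, since cron mails any output to the owner. Host and port below are placeholders:

    #!/bin/sh
    # Agentless reachability check: ICMP plus a TCP probe of the ORPort.
    HOST=relay.example.org   # placeholder relay
    ORPORT=9001              # placeholder ORPort
    ping -c 3 -q "$HOST" >/dev/null 2>&1 || echo "$HOST: no ICMP reply"
    nc -z -w 5 "$HOST" "$ORPORT" 2>/dev/null \
        || echo "$HOST: ORPort $ORPORT unreachable"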
2. local system checks
The BSDs do daily/weekly/monthly emails by default, with RAID health checks and other tasks available or easily added with a little shell scripting.
Or, opting for some shell scripts to check the RAID health, etc., would be fine.
This doesn't scale well, obviously, when you're talking about a daily email per system. But in that case, the shell-script option or configuration management should work. Think about what you want, i.e. "is the raid array dead", and go for it.
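A sketch of what that could look like on a Linux box with md RAID, under the assumption of a Debian-style package manager (the BSDs cover the same ground in their daily scripts):

    #!/bin/sh
    # Daily "is the raid array dead" check, plus a pending-updates count.
    # A failed md device shows up as an underscore in the [UU] status.
    grep -q '\[.*_.*\]' /proc/mdstat 2>/dev/null \
        && echo "md: degraded array detected"
    # simulated upgrade run; counts packages that would be installed
    n=$(apt-get -s upgrade 2>/dev/null | grep -c '^Inst')
    [ "$n" -gt 0 ] && echo "$n packages awaiting upgrade"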
3. remote checks of Tor-related statistics
Off the top of my head, checking how Tor is operating can be done in a couple of ways.
If you want periodic checks of consensus weight, or anything else available through Onionoo (https://metrics.torproject.org/onionoo.html), pulling the JSON and working it into some email output might make sense.
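A minimal sketch of that, using curl and jq against the Onionoo details endpoint (the fingerprint is a placeholder; run it from cron and the output becomes the email):

    #!/bin/sh
    # Poll Onionoo for one relay and complain about anything off-nominal.
    FP=FINGERPRINT   # placeholder: your relay's fingerprint
    JSON=$(curl -s "https://onionoo.torproject.org/details?lookup=$FP")
    running=$(echo "$JSON" | jq -r '.relays[0].running')
    vstat=$(echo "$JSON" | jq -r '.relays[0].version_status')
    cw=$(echo "$JSON" | jq -r '.relays[0].consensus_weight')
    [ "$running" = "true" ]      || echo "$FP not running per consensus"
    [ "$vstat" = "recommended" ] || echo "tor version status: $vstat"
    echo "consensus weight: $cw"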
g