Hi Damian,
we briefly discussed Stem's new descriptor fetching module and how we could extend the existing simple monitors [0] towards a replacement of the Java consensus-health checker [1].
Moving this discussion to this list with your permission.
So, you asked what exactly the consensus-health checker a.k.a. DocTor looks for. Let me try to give you a quick overview of the different parts [2]. Going through the Java source files in an order that hopefully explains best how everything works together:
- Warning.java is an enum of all different warnings that DocTor can emit. Each warning contains a little documentation string saying what it means. If these are ambiguous, let me know, and I can probably explain them better.
- Checker.java contains the various checks that are performed on previously downloaded consensuses and votes. For example, checkMissingConsensuses goes through the (hard-coded) list of known directory authorities and emits a ConsensusDownloadTimeout warning if the consensus download failed for at least one of them. As you see, there are plenty more check* methods.
- StatusFileReport.java writes the warnings from Checker to two output files: one containing all warnings, the other containing only new warnings. Each warning has a severity, which can be ERROR, WARNING, or NOTICE. Also, each warning defines a time after which we consider the exact same warning string new again even though the warning hasn't changed. The latter is useful to rate-limit warnings. For example, the fact that a certificate is going to expire two months from now doesn't have to be repeated every hour.
- MetricsWebsiteReport.java is the second output of DocTor. It's the website available at [3]. The idea is that the website gives more information about warnings received on IRC or via email. It's actually a hack that this website is presented on metrics. In a rewrite, PyDoctor would have its own little webserver to present consensus-health details. Once it's in place and we shut down DocTor, I'm going to replace the website on metrics with a static page linking to PyDoctor.
- DownloadStatistics.java keeps statistics about consensus download times which are displayed on the website.
- Downloader.java is a wrapper for metrics-lib's descriptor downloader.
- Main.java puts everything together. It first downloads everything, then writes the status files containing warnings, and then generates the website output.
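To make the Checker and StatusFileReport parts a bit more concrete, here is a rough Python sketch of how a check and the rate-limited "new warnings" output could fit together. This is not how DocTor is actually written: the Alert record, the placeholder authority names, the WARNING severity, and the three-hour repeat interval are all assumptions made for illustration, and downloading (via Stem or otherwise) is left out entirely.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical record type; the field names are illustrative, not DocTor's.
Alert = namedtuple('Alert', ['kind', 'severity', 'message', 'repeat_after'])

# Placeholder nicknames; the real checker has its own hard-coded list.
KNOWN_AUTHORITIES = ('auth1', 'auth2', 'auth3')


def check_missing_consensuses(downloaded_from):
    """Return an alert if any known authority did not give us a consensus.

    downloaded_from is the set of authority nicknames that we successfully
    downloaded a consensus from.
    """
    missing = [a for a in KNOWN_AUTHORITIES if a not in downloaded_from]
    if not missing:
        return []
    # Severity and repeat interval are assumed values for this sketch.
    return [Alert('ConsensusDownloadTimeout', 'WARNING',
                  'No consensus downloaded from: ' + ', '.join(missing),
                  timedelta(hours=3))]


class StatusReport(object):
    """Splits alerts into 'all' and 'new' lines, rate-limiting repeats."""

    def __init__(self):
        self._last_sent = {}  # warning line -> time it was last reported

    def split(self, alerts, now):
        all_lines, new_lines = [], []
        for alert in alerts:
            line = '%s: %s' % (alert.severity, alert.message)
            all_lines.append(line)
            last = self._last_sent.get(line)
            # The exact same string counts as new again only once
            # repeat_after has passed since we last reported it.
            if last is None or now - last >= alert.repeat_after:
                self._last_sent[line] = now
                new_lines.append(line)
        return all_lines, new_lines


if __name__ == '__main__':
    report = StatusReport()
    alerts = check_missing_consensuses({'auth1', 'auth3'})
    print(report.split(alerts, datetime(2013, 7, 1, 12, 0)))
```

In a real deployment the last-reported times would have to be persisted between runs, since the checker is started periodically rather than kept running.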
So, that's what DocTor does right now. Here are two more things that would be great to have in DocTor or PyDoctor:
- Warn if directory authorities assign flags to unusually few or many relays [4]. This enhancement has the potential of generating lots of warnings, because the directory authorities currently vote *very* differently on certain flags. As a result, directory authority operators will get nagged a lot. Just saying, you should be prepared for that when deploying this!
- Ignore certain known warnings [5]. This will reduce a lot of noise on the consensus-health mailing list. The less noise there is, the more people will pay attention to actually valid warnings. In theory.
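Both ideas could look roughly like the following Python sketch. Everything in it is an assumption for the sake of illustration: comparing each authority's count against the median across authorities, the 50% tolerance, the message format, and the regular-expression ignore list are just one possible way to do it, not what the tickets prescribe.

```python
import re
from statistics import median


def check_flag_counts(flag_counts, tolerance=0.5):
    """Warn about authorities whose flag counts are far from the median.

    flag_counts maps flag name -> {authority nickname: number of relays
    that authority assigned the flag to}. The tolerance is an assumed
    threshold: deviations above tolerance * median trigger a warning.
    """
    warnings = []
    for flag in sorted(flag_counts):
        counts = flag_counts[flag]
        mid = median(counts.values())
        for authority in sorted(counts):
            n = counts[authority]
            if mid > 0 and abs(n - mid) > tolerance * mid:
                direction = 'few' if n < mid else 'many'
                warnings.append(
                    '%s assigns the %s flag to unusually %s relays '
                    '(%d, median %d)' % (authority, flag, direction, n, mid))
    return warnings


def filter_ignored(warnings, ignore_patterns):
    """Drop warnings that match any of the given regular expressions."""
    compiled = [re.compile(p) for p in ignore_patterns]
    return [w for w in warnings
            if not any(p.search(w) for p in compiled)]


if __name__ == '__main__':
    # Made-up numbers, only to show the shape of the input.
    counts = {'Guard': {'auth1': 1000, 'auth2': 1050, 'auth3': 200}}
    for line in filter_ignored(check_flag_counts(counts), [r'^auth1 ']):
        print(line)
```

A fixed tolerance like this would probably need tuning per flag, given how differently the authorities vote today.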
Hope that makes sense. Happy to provide more input or review code. Just let me know!
All the best, Karsten
[0] https://lists.torproject.org/pipermail/tor-dev/2013-July/005209.html
[1] https://www.torproject.org/getinvolved/volunteer#metrics-pyDoctor
[2] https://gitweb.torproject.org/doctor.git/tree/HEAD:/src/org/torproject/docto...
[3] https://metrics.torproject.org/consensus-health.html