Hi Sebastian,
We discussed that you could become the fallback metrics service operator and take over operations whenever I have to go on a four-week round-the-world cruise, or in the unlikely case that my application for the one-way trip to Mars [0] goes through. This is awesome! Thanks a lot!
Moving this discussion to tor-dev@ in case this is interesting to others.
So, you asked what the actual tasks are and how much time I think they'll take.
Let's start by taking a look at this overview [1] of the tasks performed by the metrics services. I'll go through the tool names given in brackets there, starting with the ones that I think should have the highest priority:
- metrics-db-* is where we collect all the fine Tor network data. The cronjob behind metrics-db-R has to run once per hour, or our archives will be missing the consensuses and votes published in that hour. The other metrics-db-* parts tolerate short downtimes better than metrics-db-R does. Be warned that this is not the best code you'll ever read.
- metrics-web-* is the metrics website, including the necessary PostgreSQL and Tomcat parts. I'm not spending much time on this codebase anymore, because we should really move toward something like what Thomas is doing in his Visionion project. But until that exists, metrics-web is all we have, and I think it has quite a few visitors. So if it breaks, we should fix it ASAP.
- task-6498, task-8462, and task-2718 are additions to metrics-web which should be properly integrated into metrics-web.git. I should do this, maybe together with you.
- Onionoo powers Atlas, Compass, and a few other tools. It has become quite complex over its 1.5 years of existence. It consists of two parts, a Java backend and a Tomcat frontend. It's actually one of the few codebases I maintain with pretty high test coverage. I'm trying hard to keep this service running 24/7, and I'm impressed by how fast people report it if it breaks for just half an hour.
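(Aside: if you want a quick way to convince yourself that the Onionoo frontend is still answering, a few lines of Python like the ones below should do. This is just a sketch I'm typing into this mail, not part of our tooling; it assumes the public instance at https://onionoo.torproject.org and uses the documented /summary endpoint with a "limit" parameter.)

    import json
    import urllib.error
    import urllib.request

    ONIONOO_SUMMARY = "https://onionoo.torproject.org/summary?limit=1"

    def onionoo_responds(url=ONIONOO_SUMMARY, timeout=30):
        """Return True if Onionoo answers with a parsable summary document."""
        try:
            reply = urllib.request.urlopen(url, timeout=timeout)
            document = json.loads(reply.read().decode("utf-8"))
        except (urllib.error.URLError, ValueError):
            # Network trouble or unparsable JSON both count as "not healthy".
            return False
        # Every Onionoo document carries a relays_published timestamp.
        return "relays_published" in document

    if __name__ == "__main__":
        print("Onionoo responds:", onionoo_responds())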
Those are the highest-priority ones. Moving on to the lower priorities:
- tor itself produces the descriptors, and the statistics in its extra-info descriptors, that we collect. The only thing I do here is nag the directory authority operators and the bridge authority operator if something's wrong, so that our descriptor archives will be as complete as possible. (There's a small Stem sketch after this list for spot-checking what ended up in the archive.)
- Torperf is our performance measurement tool. You know it from a few years ago when you wrote a Python controller for it. It hasn't changed much since then. I started rewriting it in Twisted, but that's still work in progress. I'm afraid that if Torperf breaks, we'll have to fix it; people care about the performance data we gather.
- TorDNSEL produces exit lists that we download and archive. If TorDNSEL breaks, metrics-db starts complaining, so we learn that it needs to be fixed. It's always unclear who's going to fix it, but in the end there's always somebody who does.
- BridgeDB exports bridge pool assignments that we sanitize and archive. There's currently a bug in BridgeDB that makes these files less useful than they could be (#9264). Bugging the BridgeDB maintainer(s) about such things is part of the metrics service operator job, because we care about the archives being complete.
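About spot-checking the descriptor archives I mentioned in the tor/extra-info item above: if you want to convince yourself that a freshly archived file looks sane, a few lines of Stem should do. Again, this is only a sketch, not something we run anywhere; the file path is made up, and I'm assuming Stem's parse_file() together with the "extra-info 1.0" descriptor type annotation:

    from stem.descriptor import parse_file

    # Hypothetical path to one archived extra-info file on the metrics host.
    ARCHIVE_FILE = "/srv/metrics/archive/extra-infos-2013-08"

    def latest_published(path=ARCHIVE_FILE):
        """Return the most recent 'published' time among the extra-info
        descriptors in one archived file."""
        newest = None
        for desc in parse_file(path, descriptor_type="extra-info 1.0"):
            if newest is None or desc.published > newest:
                newest = desc.published
        return newest

    if __name__ == "__main__":
        print("Most recent extra-info descriptor:", latest_published())

If the newest timestamp lags behind by more than a few hours, that's a hint that either collection stopped or relays aren't publishing.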
And finally, here are the services that I'd say are not part of the fallback metrics service operator job:
- Atlas has been co-maintained by Philipp since the dev meeting, and he's been doing a great job there so far.
- Compass isn't really something I operate. I don't know enough about the web part behind it to fix it if it breaks horribly. I just deploy patches after a quick plausibility check, and if they break something, I revert them. I'd say that if it breaks and I'm not around, leave it broken until somebody else fixes it.
- DocTor may soon be replaced by a new tool that Damian is writing, which will use the new descriptor-fetching code in Stem. Ideally, we'll shut down DocTor two months from now, so that's nothing to worry about from a metrics service operation point of view.
What did I miss?
So, how do we estimate how much time it will take you to get started? That depends on how certain you want to be that you'll be able to handle things if they break.
In a perfect world, I'd say that you should set up each service on a different machine and run it for a while, and when we get new hardware in a month or two, we should move the services together. I'm aware that this requires a lot of work on both your side and mine to understand the code, make it easier to install, and document everything better. But it would also improve the tools a lot.
In a maybe more realistic world, I'd say that you should take a look at the codebases and at the service directories on yatei, and then we discuss why things are set up the way they are. This should also put you in a good position to rescue broken services, though you might have to do more research when that happens.
Maybe there are more approaches? What do you prefer?
Oh, and to be honest, I didn't apply for the Mars thing. ;)
All the best, Karsten
[0] http://newsfeed.time.com/2013/05/09/78000-people-apply-for-one-way-trip-to-m...