It's that time of the month again! We met, and here are our minutes.
TL;DR: we'll upgrade to Mailman 3, not self-hosted Discourse; Icinga vs Prometheus is still up in the air; request templates are coming.
# Roll call: who's there and emergencies
materculae is having OOM errors: see [tpo/tpa/team#40750][]. anarcat is looking into it; no other emergencies.
[tpo/tpa/team#40750]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40750
present: anarcat, gaba, kez, lavamind
# product deployment workflow question
Gaba created an issue to provide feedback from the community team in [tpo/tpa/team#40746][]:
Something that came up from one of the project retrospectives this month is about having a space in TPI for testing new tools. We need a space where we can **quickly** test things if needed. It could be a policy of getting the service/tool/testing automatically destroyed after a specific amount of time.
[tpo/tpa/team#40746]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40746
Prior art: out of the [Brussels meeting][] came many tasks about server lifecycle, see in particular [tpo/tpa/team#29398][] (a template for requesting resources) and [tpo/tpa/team#29379][] (automatic shutdown).
[Brussels meeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsA...
[tpo/tpa/team#29398]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29398
[tpo/tpa/team#29379]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29379
We acknowledge that it was hard to communicate with TPA during the cdr.link testing. The [cdr.link issue][] took 9 days to complete between opening and closing, but once requirements were clarified and we agreed on deployment, it took less than 24 hours to actually set up the machine.
[cdr.link issue]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40578
In general, our turnaround time for new VMs is currently one business day. That's part of our OKRs for this quarter, and so far that's typically how long it takes to provision a VM. It can take longer, especially when we are asked for unusual services we do not understand or that overlap with existing services.
We're looking at setting up templates to improve communication when setting up new resources, inspired by the [service cookbooks][] idea. The idea behind this mechanism is that the template helps answer common questions we have when people ask for services, but it's also a good way to identify friction points. For example, if we get a *lot* of requests for VMs and those take a long time, then we can focus on automating that service. At first the template serves as an input for a manual operation, but eventually it could be a way to automate the creation and destruction of resources as well.
Issue [tpo/tpa/team#29398][] was put back in the backlog to start working on this. One of the problems is that, to have issue templates, we need a Git repository in the project and, right now, the tpo/tpa/team project deliberately doesn't have one so that it "looks" like a wiki. But maybe we can just bite that bullet and move the wiki-replica in there.
[service cookbooks]: https://lethain.com/service-cookbooks/
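To make the idea concrete, here's a minimal sketch of what such a template could look like once the project has a Git repository: GitLab picks up description templates from `.gitlab/issue_templates/`, but the file name and the questions below are only illustrative, not an agreed-upon design.

```sh
# sketch only: GitLab reads issue templates from .gitlab/issue_templates/ in the
# project's Git repository (which tpo/tpa/team doesn't have yet); the file name
# and questions are hypothetical examples, not a settled template
mkdir -p .gitlab/issue_templates
cat > .gitlab/issue_templates/new-vm.md <<'EOF'
## Resource request

- What is the service/tool for?
- Who will maintain it (service admin)?
- Expected resources (CPU, RAM, disk)?
- How long should it live (expiry date for automatic teardown)?
- Does it overlap with an existing service?
EOF
```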
# bullseye upgrade: phase 3
A quick update on the phase 2 progress ([tpo/tpa/team#40692][]): it's slower than phase 1, because those servers are more complicated. We had to deprecate Python 2 (see [TPA-RFC-27][]); so far network health and TPA are affected, and both were able to quickly port their scripts to Python 3. We also had difficulties with the PostgreSQL upgrade (see the materculae issue above).
[TPA-RFC-27]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/33949
[tpo/tpa/team#40692]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40692
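For illustration, a quick port of this kind is often largely mechanical; here is a hedged sketch using the stock `2to3` tool (the file name is made up, and real scripts may still need manual follow-up):

```sh
# rewrite a Python 2 script in place; 2to3 handles the mechanical changes
# (print statements, iteritems(), etc.) -- file name is illustrative
2to3 --write --nobackups metrics-report.py
```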
Let's talk about the difficult problems left in [TPA-RFC-20][]: bullseye upgrades.
[TPA-RFC-20]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-20-bullsey...
Extract from the RFC, with each item discussed individually:
- `alberti`: `userdir-ldap` is, in general, risky and needs special attention, but should be moderately safe to upgrade, see ticket [tpo/tpa/team#40693][]
A tricky server that we need to be very careful with, but there's no controversy around it.
- `eugeni`: messy server, with lots of moving parts (e.g. Schleuder, Mailman), Mailman 2 EOL, needs to decide whether to migrate to Mailman 3 or replace with Discourse (and self-host), see [tpo/tpa/team#40471][], followup in [tpo/tpa/team#40694][], Schleuder discussion in [tpo/tpa/team#40564][]
[tpo/tpa/team#40564]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40564
One of the ideas behind the Discourse setup was that we would eventually mirror many lists to Discourse. If we want to use Discourse, we need to start adding a Discourse category for *each* mailing list.
The [Mailman 3 upgrade procedure][], that said, is not that complicated: each list is migrated by hand, but the migration is pretty transparent for users. But if we switch to Discourse, it would be a major change: people would need to register, all archive links would break, etc.
[Mailman 3 upgrade procedure]: https://docs.mailman3.org/en/latest/migration.html
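As a rough illustration, the documented procedure boils down to a few commands per list; the list name and file paths below are illustrative, and the HyperKitty archive import may need extra Django settings flags depending on how the new server is set up.

```sh
# per-list Mailman 2 -> 3 migration sketch (list name and paths are illustrative)
mailman create tor-dev@lists.torproject.org
mailman import21 tor-dev@lists.torproject.org \
    /var/lib/mailman/lists/tor-dev/config.pck
# archives are imported separately into HyperKitty from the old mbox file
django-admin hyperkitty_import -l tor-dev@lists.torproject.org \
    /var/lib/mailman/archives/private/tor-dev.mbox/tor-dev.mbox
```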
We don't hear a lot of enthusiasm around migrating from Mailman to Discourse at this point. We will therefore upgrade from Mailman 2 to Mailman 3, instead of migrating everything to Discourse.
As an aside, anarcat would rather avoid self-hosting Discourse unless it allows us to replace another service, as Discourse is a complex piece of software that would take a lot of work to maintain (just like Mailman 3). There are currently no plans to self-host Discourse inside TPA.
There was at least one vote for removing Schleuder. It seems people are having problems both using and managing it, but it's possible that finding new maintainers for the service could help.
- `pauli`: Puppet packages are severely out of date in Debian, and Puppet 5 is EOL (with Puppet 6 soon to be). This doesn't necessarily block the upgrade, but we should deal with this problem sooner rather than later, see [tpo/tpa/team#33588][], followup in [tpo/tpa/team#40696][]
Lavamind made a new Puppet agent 7 package that should eventually land in Debian experimental. He will look into the Puppet server and PuppetDB packages with the Clojure team this weekend, and has a good feeling that we should be able to use Puppet 7 in Debian bookworm. We still need to decide what to do with the current server with regards to bullseye.
Options:
1. use upstream Puppet 7 packages in bullseye, and for bookworm move back to Debian packages
2. use our in-house Puppet 7 packages before upgrading to bookworm
3. stick with Puppet 5 for bullseye, upgrade the server to bookworm and Puppet server 7 when we need to (say after the summer), and follow the Puppet agent to 7 as we jump in the bookworm freeze
Lavamind will see if it's possible to use Puppet agent 7 in bullseye, which would make it possible to upgrade only the Puppet server to bookworm and keep upgrading the fleet to bullseye progressively (option 3 above, the favorite for now).
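A hedged sketch of what that test could look like on a single bullseye host, assuming the upstream apt.puppet.com packages and our current Puppet server; the package URL, server name, and flags are assumptions, not an agreed procedure:

```sh
# install the upstream (AIO) Puppet 7 agent on one bullseye test host
wget https://apt.puppet.com/puppet7-release-bullseye.deb
dpkg -i puppet7-release-bullseye.deb
apt update && apt install puppet-agent
# dry run against the existing Puppet 5 server to check agent/server compatibility
# (server name is an assumption)
/opt/puppetlabs/bin/puppet agent --test --noop --server pauli.torproject.org
```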
- `hetzner-hel1-01`: Nagios AKA Icinga 1 is end-of-life and needs to be migrated to Icinga 2, which involves fixing our git hooks to generate Icinga 2 configuration (unlikely), or rebuilding an Icinga 2 server, or replacing it with Prometheus (see [tpo/tpa/team#29864][]), followup in [tpo/tpa/team#40695][]
Anarcat proposed not to upgrade Icinga and instead replace it with Prometheus and Alertmanager. We had a debate here: on the one hand, lavamind believes that Alertmanager doesn't have all the bells and whistles that Icinga 2 provides. Icinga 2 has alert history and a nice, intuitive dashboard where you can ack alerts and see everything, while Alertmanager is just a dispatcher and doesn't actually come with a UI.
Anarcat, however, feels that upgrading to Icinga 2 will be a lot of work: we would need to hook up all the services in Puppet. This is already all done in Prometheus: the node exporter is deployed on all machines, and service-specific exporters are deployed for many services; Apache, BIND and PostgreSQL (partially) are all monitored. Plus, service admins have widely adopted the second Prometheus server and are already using it for alerting.
We have a service duplication here, so we need to make a decision on which service we are going to retire: either Alertmanager or Icinga 2. The discussion is to be continued.
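For context on what "just a dispatcher" means, here is a minimal sketch of the Prometheus-side setup, with an alerting rule evaluated by Prometheus and a route in Alertmanager; the job label, thresholds, and addresses are illustrative, not our actual configuration.

```yaml
# Prometheus alerting rule (sketch): Prometheus itself evaluates the rule
groups:
  - name: tpa-example
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0   # assumes the node exporter job is called "node"
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 10 minutes"
```

```yaml
# alertmanager.yml (sketch): Alertmanager only groups, deduplicates and routes
# notifications; it keeps no history and has no dashboard of its own
global:
  smtp_smarthost: localhost:25          # illustrative
  smtp_from: alertmanager@example.org   # illustrative
route:
  receiver: tpa
  group_by: [alertname, instance]
receivers:
  - name: tpa
    email_configs:
      - to: tpa-alerts@example.org      # illustrative
```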
Other major upgrade tasks remaining (informative), to be done progressively in May:
* upgrades, batch 2: tpo/tpa/team#40692 (probably done by this point?)
* gnt-fsn upgrade: tpo/tpa/team#40689 (involves an upgrade to backports, then bullseye)
* sunet site move: tpo/tpa/team#40684 (involves rebuilding 3 machines)
# Dashboard review
Skipped for lack of time.
* https://gitlab.torproject.org/tpo/tpa/team/-/boards/117
* https://gitlab.torproject.org/groups/tpo/web/-/boards
* https://gitlab.torproject.org/groups/tpo/tpa/-/boards
# Holidays planning
Skipped for lack of time, followup by email.
# Other discussions
We need to review the dashboards at the next checkin, possibly discuss the Icinga vs Prometheus proposal again.
# Next meeting
Next meeting should be on Monday June 6th.
# Metrics of the month
* hosts in Puppet: 93, LDAP: 93, Prometheus exporters: 154
* number of Apache servers monitored: 27, hits per second: 295
* number of self-hosted nameservers: 6, mail servers: 8
* pending upgrades: 0, reboots: 0
* average load: 0.64, memory available: 4.67 TiB/5.83 TiB, running processes: 718
* disk free/total: 34.14 TiB/88.48 TiB
* bytes sent: 400.82 MB/s, received: 266.83 MB/s
* planned bullseye upgrades completion date: 2022-12-05
* [GitLab tickets][]: 178 tickets including...
  * open: 0
  * icebox: 153
  * backlog: 10
  * next: 4
  * doing: 6
  * needs information: 2
  * needs review: 3
  * (closed: 2732)
[Gitlab tickets]: https://gitlab.torproject.org/tpo/tpa/team/-/boards
Upgrade prediction graph lives at https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/upgrades/bullseye/
Now also available as the main Grafana dashboard. Head to https://grafana.torproject.org/, change the time period to 30 days, and wait a while for results to render.
# Number of the month
4 issues. We have somehow managed to bring the number of tickets in the icebox down from 157 to 153, a gain of 4 issues! It's the first time since we started tracking those numbers that we managed to get that number to go down *at all*, so this is really motivating.
We also closed a whopping 53 tickets since the last report: not quite a record, but certainly on the high end.
Also: we managed to bring the estimated bullseye upgrades completion date back *two years*, to a more reasonable date. This year, even! We still hope to complete most upgrades by this summer, so hopefully that number will keep going down as we continue the upgrades.
Another fun fact: we now have more Debian bullseye (54) than buster (39) machines.