It's that time of the month again! We met, and here are our minutes.
TL;DR: we'll upgrade to Mailman 3, not self-hosted Discourse, Icinga vs Prometheus still up in the air, request templates coming.
# Roll call: who's there and emergencies
materculae is running into OOM errors: see [tpo/tpa/team#40750][]. anarcat is looking into it; no other emergencies.
[tpo/tpa/team#40750]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40750
present: anarcat, gaba, kez, lavamind
# product deployment workflow question
Gaba created an issue to provide feedback from the community team in [tpo/tpa/team#40746][]:
Something that came up from one of the project's retrospectives this month is about having a space in TPI for testing new tools. We need a space where we can **quickly** test things if needed. It could be a policy of getting the service/tool/testing automatically destroyed after a specific amount of time.
[tpo/tpa/team#40746]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40746
Prior art: many tasks about server lifecycle came out of the [Brussels meeting][]; see in particular [tpo/tpa/team#29398][] (a template for requesting resources) and [tpo/tpa/team#29379][] (automatic shutdown).
[Brussels meeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsA...
[tpo/tpa/team#29398]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29398
[tpo/tpa/team#29379]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29379
We acknowledge that it was hard to communicate with TPA during the cdr.link testing. The [cdr.link issue][] took 9 days to complete between open and close, but once requirements were clarified and we agreed on deployment, it took less than 24 hours to actually set up the machine.
[cdr.link issue]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40578
In general, our turnaround time for new VMs is currently one business day. That is actually part of our OKRs for this quarter, and so far it's typically how long it takes to provision a VM. It can take longer, especially when we are asked for unusual services we do not understand or that overlap with existing services.
We're looking at setting up templates to improve communication when setting up new resources, inspired by the [service cookbooks][] idea. The idea behind this mechanism is that the template helps answer the common questions we have when people ask for services, but it's also a good way to identify friction points. For example, if we get a *lot* of requests for VMs and those take a long time, then we can focus on automating that service. At first the template serves as an input for a manual operation, but eventually it could be a way to automate the creation and destruction of resources as well.
Issue [tpo/tpa/team#29398][] was put back in the backlog to start working on this. One of the problems is that, to have issue templates, we need a Git repository in the project, and right now the tpo/tpa/team project deliberately doesn't have one so that it "looks" like a wiki. But maybe we can just bite that bullet and move the wiki-replica in there. A sketch of what such a template could look like is included below.
[service cookbooks]: https://lethain.com/service-cookbooks/
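As a rough illustration (not an agreed-on design), such a request template would just be a Markdown file checked into the project's Git repository under `.gitlab/issue_templates/`; the file name and questions below are hypothetical:

```markdown
<!-- hypothetical file: .gitlab/issue_templates/new-service.md -->
## What do you need?

* service name and one-line description:
* expected lifetime (e.g. "2 week test", "permanent"):
* who will admin the service (LDAP usernames):
* rough disk/RAM/CPU estimates:
* does an existing TPA service already cover this? if unsure, say so:

## TPA checklist

* [ ] request reviewed and resources approved
* [ ] VM or other resource provisioned
* [ ] teardown/expiry date recorded (for tests)
```

GitLab picks these templates up from the project's default branch, which is why the missing Git repository is currently a blocker.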
# bullseye upgrade: phase 3
A quick update on the phase 2 progress ([tpo/tpa/team#40692][]): it is going slower than phase 1, because those servers are more complicated. We had to deprecate Python 2 (see [TPA-RFC-27][]); so far network health and TPA are affected, and both were able to quickly port their scripts to Python 3. We also had difficulties with the PostgreSQL upgrade (see the materculae issue above).
[TPA-RFC-27]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/33949
[tpo/tpa/team#40692]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40692
Let's talk about the difficult problems left in [TPA-RFC-20][]: bullseye upgrades.
[TPA-RFC-20]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-20-bullsey...
Here is an extract from the RFC; we discussed each item individually:
- `alberti`: `userdir-ldap` is, in general, risky and needs special attention, but should be moderately safe to upgrade, see ticket [tpo/tpa/team#40693][]
A tricky server we need to be very careful with, but there's no controversy around upgrading it.
- `eugeni`: messy server, with lots of moving parts (e.g. Schleuder, Mailman), Mailman 2 EOL, we need to decide whether to migrate to Mailman 3 or replace it with Discourse (and self-host), see [tpo/tpa/team#40471][], followup in [tpo/tpa/team#40694][], Schleuder discussion in [tpo/tpa/team#40564][]
[tpo/tpa/team#40564]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40564
One of the ideas behind the Discourse setup was that we would eventually mirror many lists to Discourse. If we want to use Discourse, we need to start adding a Discourse category for *each* mailing list.
That said, the [Mailman 3 upgrade procedure][] is not that complicated: each list is migrated by hand, but the migration is pretty transparent for users (a rough sketch of the per-list steps is included after this discussion). If we switch to Discourse, on the other hand, it would be a major change: people would need to register, all archive links would break, etc.
[Mailman 3 upgrade procedure]: https://docs.mailman3.org/en/latest/migration.html
We don't hear a lot of enthusiasm around migrating from Mailman to Discourse at this point. We will therefore upgrade from Mailman 2 to Mailman 3, instead of migrating everything to Discourse.
As an aside, anarcat would rather avoid self-hosting Discourse unless it allows us to replace another service, as Discourse is a complex piece of software that would take a lot of work to maintain (just like Mailman 3). There are currently no plans to self-host Discourse inside TPA.
There was at least one vote for removing Schleuder. It seems people are having problems both using and managing it, but it's possible that finding new maintainers for the service could help.
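For reference, here is a rough sketch of the per-list steps described in the Mailman 3 documentation, assuming the standard Mailman 3 suite (core + HyperKitty) and using tor-dev@ as a stand-in list; exact paths and the `manage.py` wrapper depend on how the Debian packages are set up:

```sh
# per-list Mailman 2 -> 3 migration sketch (paths are illustrative only)

# 1. create the list in Mailman 3 core
mailman create tor-dev@lists.torproject.org

# 2. import the Mailman 2.1 list configuration and membership
mailman import21 tor-dev@lists.torproject.org \
    /var/lib/mailman/lists/tor-dev/config.pck

# 3. import the old archives into HyperKitty
python3 manage.py hyperkitty_import -l tor-dev@lists.torproject.org \
    /var/lib/mailman/archives/private/tor-dev.mbox/tor-dev.mbox
```

The list addresses stay the same, which is why the migration is mostly transparent to subscribers; the archives do move from Pipermail to HyperKitty, so old archive links would likely need redirects.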
- `pauli`: Puppet packages are severely out of date in Debian, and Puppet 5 is EOL (with Puppet 6 soon to be). This doesn't necessarily block the upgrade, but we should deal with this problem sooner rather than later, see [tpo/tpa/team#33588][], followup in [tpo/tpa/team#40696][]
Lavamind made a new Puppet agent 7 package that should eventually land in Debian experimental. He will look into the Puppet server and PuppetDB packages with the Clojure team this weekend, and has a good feeling that we should be able to use Puppet 7 in Debian bookworm. We still need to decide what to do with the current server WRT bullseye.
Options:
1. use upstream Puppet 7 packages in bullseye, then move back to Debian packages for bookworm
2. use our in-house Puppet 7 packages before upgrading to bookworm
3. stick with Puppet 5 for bullseye, upgrade the server to bookworm and Puppet server 7 when we need to (say after the summer), and follow the Puppet agent to 7 as we jump into the bookworm freeze
Lavamind will see if it's possible to use the Puppet 7 agent in bullseye, which would make it possible to upgrade only the Puppet server to bookworm and move the rest of the fleet to bookworm progressively (option 3 above, our favorite for now). A rough sketch of what the upstream-packages route would look like is included below.
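For illustration only: one way to get a Puppet 7 agent onto a bullseye host is Puppet's upstream apt repository, which is essentially what option 1 describes. This is a sketch, not something we have decided to do:

```sh
# sketch: install the upstream Puppet 7 agent on a bullseye host (not decided)
wget https://apt.puppet.com/puppet7-release-bullseye.deb
dpkg -i puppet7-release-bullseye.deb
apt update
apt install puppet-agent

# moving back to Debian's own packages for bookworm would then mean dropping
# the apt.puppet.com source and reinstalling the distribution's puppet-agent
```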
- `hetzner-hel1-01`: Nagios AKA Icinga 1 is end-of-life and needs to be migrated to Icinga 2, which involves either fixing our git hooks to generate Icinga 2 configuration (unlikely), rebuilding an Icinga 2 server, or replacing it with Prometheus (see [tpo/tpa/team#29864][]), followup in [tpo/tpa/team#40695][]
Anarcat proposed not upgrading Icinga at all and instead replacing it with Prometheus and Alertmanager. We had a debate here: on the one hand, lavamind believes that Alertmanager doesn't have all the bells and whistles that Icinga 2 provides. Icinga 2 has alert history and a nice, intuitive dashboard where you can acknowledge alerts and see everything, while Alertmanager is just a dispatcher and doesn't actually come with a UI.
Anarcat, however, feels that upgrading to Icinga 2 would be a lot of work: we would need to hook up all the services in Puppet. This is already all done in Prometheus: the node exporter is deployed on all machines, and service-specific exporters are deployed for many services: Apache, Bind and PostgreSQL (partially) are all monitored. Plus, service admins have widely adopted the second Prometheus server and are already using it for alerting.
We have a service duplication here, so we need to make a decision on which service we are going to retire: either Alertmanager or Icinga 2. The discussion is to be continued; for context, a rough sketch of the Prometheus side of alerting is included below.
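To make the comparison a bit more concrete, here is a minimal, hypothetical sketch of Prometheus-side alerting: an alerting rule evaluated by Prometheus, plus an Alertmanager configuration that merely routes and dispatches notifications. The rule name, thresholds and addresses are made up; this is not our actual configuration:

```yaml
# alert rule file loaded by Prometheus (hypothetical example)
groups:
  - name: tpa-basics
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has not reported metrics for 5 minutes"

# alertmanager.yml (separate file): Alertmanager only routes and dispatches,
# it has no dashboard or acknowledgement UI of its own
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
route:
  receiver: tpa-email
receivers:
  - name: tpa-email
    email_configs:
      - to: 'sysadmin-alerts@example.org'
```

This illustrates lavamind's point: the dashboard, history and acknowledgement features would have to come from other tools (e.g. Grafana) rather than being built in, as they are in Icinga 2.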
Other major upgrade tasks remaining (informative, to be done progressively in May):
* upgrades, batch 2: tpo/tpa/team#40692 (probably done by this point?)
* gnt-fsn upgrade: tpo/tpa/team#40689 (involves an upgrade to backports, then bullseye)
* sunet site move: tpo/tpa/team#40684 (involves rebuilding 3 machines)
# Dashboard review
Skipped for lack of time.
* https://gitlab.torproject.org/tpo/tpa/team/-/boards/117
* https://gitlab.torproject.org/groups/tpo/web/-/boards
* https://gitlab.torproject.org/groups/tpo/tpa/-/boards
# Holidays planning
Skipped for lack of time, followup by email.
# Other discussions
We need to review the dashboards at the next checkin, possibly discuss the Icinga vs Prometheus proposal again.
# Next meeting
Next meeting should be on Monday June 6th.
# Metrics of the month
* hosts in Puppet: 93, LDAP: 93, Prometheus exporters: 154
* number of Apache servers monitored: 27, hits per second: 295
* number of self-hosted nameservers: 6, mail servers: 8
* pending upgrades: 0, reboots: 0
* average load: 0.64, memory available: 4.67 TiB/5.83 TiB, running processes: 718
* disk free/total: 34.14 TiB/88.48 TiB
* bytes sent: 400.82 MB/s, received: 266.83 MB/s
* planned bullseye upgrades completion date: 2022-12-05
* [GitLab tickets][]: 178 tickets including...
  * open: 0
  * icebox: 153
  * backlog: 10
  * next: 4
  * doing: 6
  * needs information: 2
  * needs review: 3
  * (closed: 2732)
[GitLab tickets]: https://gitlab.torproject.org/tpo/tpa/team/-/boards
Upgrade prediction graph lives at https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/upgrades/bullseye/
Now also available as the main Grafana dashboard. Head to https://grafana.torproject.org/, change the time period to 30 days, and wait a while for results to render.
# Number of the month
4 issues. We have somehow managed to bring the number of tickets in the icebox down from 157 to 153: that's a 4-issue gain! It's the first time since we started tracking those numbers that we managed to get that number to go down *at all*, so this is really motivating.
We also closed a whopping 53 tickets since the last report; not quite a record, but certainly at the high end.
Also: we managed to bring the estimated bullseye upgrades completion date back *two years*, to a more reasonable date. This year, even! We still hope to complete most upgrades by this summer, so hopefully that number will keep going down as we continue the upgrades.
Another fun fact: we now have more Debian bullseye (54) than buster (39) machines.