Hi everyone,
We had our monthly sysadmin meeting (a bit late for various reasons) for February, and here are the minutes.
# Roll call: who's there and emergencies
No emergencies: we have an upcoming maintenance on chi-san-01 which will require a server shutdown at the end of the meeting.
Present: anarcat, gaba, kez, lavamind
# Storage brainstorm
The idea was simply to throw around ideas for this ticket:
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478
anarcat explained the broad strokes of the current storage problems (lack of space, performance issues) and the solutions we're looking for (specific to some services, but ideally also applicable everywhere without creating new tools to learn).
We specifically focused on the storage problems on gitlab-02, naturally, since that's where the problem is most manifest.
lavamind suggested that there were basically two things we could do:
1. go through each project one at a time to see how changing certain options would affect retention (e.g. "keep latest artifacts")
2. delete all artifacts older than 30 or 60 days, regardless of retention policies (e.g. "keep latest"), which may or may not also include job logs (a rough sketch of what this could look like follows below)
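To make option two a bit more concrete, here is a rough, untested sketch of how such a cleanup could be scripted against the GitLab REST API. The token, 60-day cutoff, and the decision to keep job logs are all placeholders for the brainstorm, not a reviewed tool; a real run would obviously want a dry-run mode and some rate limiting first.

```python
#!/usr/bin/env python3
"""Sketch: delete artifacts of CI jobs older than a cutoff.

Brainstorm illustration only: token, cutoff and error handling
would need review before running this against the real instance.
"""

import datetime

import requests

API = "https://gitlab.torproject.org/api/v4"
HEADERS = {"PRIVATE-TOKEN": "REDACTED"}  # admin token, placeholder
CUTOFF = datetime.datetime.utcnow() - datetime.timedelta(days=60)


def paginate(url, **params):
    """Yield all items from a paginated GitLab API endpoint."""
    params = dict(params, per_page=100, page=1)
    while True:
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            return
        yield from page
        params["page"] += 1


for project in paginate(f"{API}/projects"):
    for job in paginate(f"{API}/projects/{project['id']}/jobs"):
        created = datetime.datetime.strptime(
            job["created_at"][:19], "%Y-%m-%dT%H:%M:%S"
        )
        if created < CUTOFF and job.get("artifacts"):
            # DELETE /projects/:id/jobs/:job_id/artifacts removes the
            # job's artifacts but keeps the job itself and its log
            requests.delete(
                f"{API}/projects/{project['id']}/jobs/{job['id']}/artifacts",
                headers=HEADERS,
            ).raise_for_status()
            print(f"deleted artifacts of job {job['id']} "
                  f"in {project['path_with_namespace']}")
```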
other things we need to do:
* encourage people to "please delete stale branches if you do have that box checked"
* talk with jim and mike about the 45GB of old artifacts
* draft a new RFC on artifact retention, proposing the deletion of old artifacts and old jobs (option two above)
We also considered unchecking the "keep latest artifacts" box at the admin level, but this would disable the feature in all projects with no option to opt-in, so it's not really an option.
We considered the following technologies for the broader problem:
* S3 object storage for gitlab
* ceph block storage for ganeti
* filesystem snapshots for gitlab / metrics servers backups
A first prototype could be the image/cache storage on the CI runners, which is a safer test because it's easy to roll back. Concretely, we'll look at setting up a VM with minio for testing and point the runners' image/cache storage backends at it first, since those can easily be rebuilt or migrated if we want to drop that test.
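As a very rough illustration of what that test could look like, here is a hypothetical smoke test against such a minio VM using boto3. The endpoint hostname, credentials and bucket name are all made up for the example; the real test would be wiring the runners' cache/image storage configuration to the service rather than this toy round-trip.

```python
#!/usr/bin/env python3
# Hypothetical smoke test against a minio test VM: the endpoint,
# credentials and bucket name below are placeholders, not a deployed
# service.

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio-01.torproject.org:9000",  # made-up host
    aws_access_key_id="runner-cache",                      # placeholder
    aws_secret_access_key="REDACTED",
)

# create a bucket for the runner cache and do a write/read round-trip
s3.create_bucket(Bucket="runner-cache")
s3.put_object(Bucket="runner-cache", Key="smoke-test", Body=b"hello minio")
obj = s3.get_object(Bucket="runner-cache", Key="smoke-test")
assert obj["Body"].read() == b"hello minio"
print("minio S3 API round-trip OK")
```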
This would disregard the block storage problem, but we could pretend this will be solved at the service level eventually (e.g. redesign the metrics storage, split up the gitlab server). In any case, migrating from DRBD to Ceph is a major undertaking that would require a lot of work. It would also be part of the larger "[trusted high performance cluster][]" work that we recently de-prioritized.
[trusted high performance cluster]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/2
# Other discussions
We should process the pending TPA-RFCs, particularly TPA-RFC-16, about the i18n lektor plugin rewrite.
# Next meeting
Our regular schedule would bring us to March 7th, 18:00 UTC.
# Metrics of the month
* hosts in Puppet: 88, LDAP: 88, Prometheus exporters: 143
* number of Apache servers monitored: 25, hits per second: 253
* number of self-hosted nameservers: 6, mail servers: 8
* pending upgrades: 0, reboots: 0
* average load: 2.10, memory available: 3.98 TiB/5.07 TiB, running processes: 722
* disk free/total: 35.81 TiB/83.21 TiB
* bytes sent: 296.17 MB/s, received: 182.11 MB/s
* planned bullseye upgrades completion date: 2024-12-01
* [GitLab tickets][]: 166 tickets including...
  * open: 1
  * icebox: 149
  * needs information: 2
  * backlog: 7
  * next: 5
  * doing: 2
  * (closed: 2613)

[GitLab tickets]: https://gitlab.torproject.org/tpo/tpa/team/-/boards
Upgrade prediction graph lives at:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/upgrades/bullseye/
# Number of the month
-3 months. Since the last report, our predicted bullseye upgrade completion date was pushed back by three months, from 2024-09-07 to 2024-12-01. That's because we haven't started yet, but it's interesting that it seems to be moving back faster than time itself... We'll look at deploying a perpetual motion time machine on top of this contraption at the next meeting.