On Mon, Sep 04, 2023 at 01:30:29AM -0600, David Fifield wrote:
I was having a look at the graphs and I have an explanation. The snowflake-02 bridge was, for some reason, not processing any connections during this time. The times in the logs match up exactly with the client polls graph. It stopped working at 2023-08-27 17:45 and began working again at 2023-08-30 12:30.
2023/08/27 17:45:33 reading token: read tcp [scrubbed]->[scrubbed]: read: connection timed out
(766 more "connection timed out" lines)
2023/08/27 17:47:28 reading token: read tcp [scrubbed]->[scrubbed]: read: connection timed out
2023/08/28 11:03:26 in the past 86400 s, 105514/105990 connections had client_ip
2023/08/29 11:03:26 in the past 86400 s, 0/0 connections had client_ip
2023/08/30 11:03:26 in the past 86400 s, 0/0 connections had client_ip
2023/08/30 12:30:14 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
(working again)
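For anyone wanting to spot this kind of gap mechanically, here is a rough sketch that picks out the daily summary lines reporting zero connections. The function name idle_days and the regular expression are my own, written against the line format in the excerpt above; they are not part of the Snowflake code, and the real log may contain variations this does not handle.

```python
import re

# Pattern written against the daily summary lines in the excerpt above.
# This is an assumption about the format, not taken from the server code.
SUMMARY = re.compile(
    r"(\d{4}/\d{2}/\d{2}) \d{2}:\d{2}:\d{2} in the past (\d+) s, "
    r"(\d+)/(\d+) connections had client_ip"
)

def idle_days(log_text):
    """Return the dates whose summary line reports zero total connections."""
    return [m.group(1) for m in SUMMARY.finditer(log_text) if int(m.group(4)) == 0]

# The three summary lines from the excerpt, used as sample input.
sample = """\
2023/08/28 11:03:26 in the past 86400 s, 105514/105990 connections had client_ip
2023/08/29 11:03:26 in the past 86400 s, 0/0 connections had client_ip
2023/08/30 11:03:26 in the past 86400 s, 0/0 connections had client_ip
"""

print(idle_days(sample))
```

On the sample input this reports 2023/08/29 and 2023/08/30, the two full days inside the outage window.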
Clients that randomly selected snowflake-02 as their bridge would have timed out and had to re-rendezvous until they happened to select snowflake-01. Meanwhile, snowflake-01 was likely overloaded, because it alone cannot handle all existing clients.
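To get a feel for the retry cost, here is a small simulation sketch. It assumes each rendezvous picks one of the two bridges uniformly and independently of earlier attempts, which is a simplification; the real client's selection and retry behavior may differ. The function names are hypothetical and only for illustration.

```python
import random

def attempts_until_working(bridges, down, rng):
    """Count rendezvous attempts until a bridge that is not down is picked."""
    attempts = 1
    while rng.choice(bridges) in down:
        attempts += 1
    return attempts

def mean_attempts(trials=100_000, seed=0):
    """Average attempts over many simulated clients (assumed uniform choice)."""
    rng = random.Random(seed)
    bridges = ["snowflake-01", "snowflake-02"]
    down = {"snowflake-02"}  # the bridge that was unreachable
    total = sum(attempts_until_working(bridges, down, rng) for _ in range(trials))
    return total / trials

print(mean_attempts())
```

Under these assumptions the number of attempts is geometric with success probability 1/2, so the mean is 2 attempts per client, and each failed attempt also costs a timeout.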
I don't know what went wrong with snowflake-02. The server did not reboot, and as far as I can tell all the processes kept running.
The cause of the outage was a disconnection at the university campus where the bridge is hosted.
https://www.insidehighered.com/news/tech-innovation/teaching-learning/2023/0...
https://web.archive.org/web/20230905230409/https://umich.edu/announcements/