On 24th May 2024, Atuin Cloud was down for approximately 49 minutes, from 20:26 BST until 21:15 BST.
Cause
Atuin’s API was responding with a 503. On closer inspection, the pods were crashlooping, as they could not access PostgreSQL.
I could not access the instance, as it was offline. Attempts to boot it failed. After contacting Hetzner support, I learned that this was due to a neighbouring machine blowing a fuse in the rack. Their response:
"Unfortunately, your server encountered a power outage caused by a neighboring server that triggered the electrical fuse in your rack. We have since resolved the issue and restored power to your server."
I’ve never needed to contact Hetzner support with urgency before, and was happy with how fast they responded to me on a Friday evening.
Impact
Impact was fairly minimal. Atuin’s client is designed to handle going offline without any issues, beyond temporarily being unable to sync. Sync requests will fail, and will retry in the future - with changes reconciled when the user is next online. No data is lost.
We also have solid backups, with both regular full snapshots, and WAL kept encrypted in S3.
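The offline behaviour described above (failed syncs are retried later and nothing is lost) can be sketched roughly as follows. This is an illustration only, not Atuin's actual client code: `OfflineTolerantSync`, `ServerUnavailable`, and the `upload` callback are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


class ServerUnavailable(Exception):
    """Hypothetical error raised by `upload` when the API is down (e.g. a 503)."""


@dataclass
class OfflineTolerantSync:
    # Entries written locally but not yet confirmed by the server.
    pending: list = field(default_factory=list)

    def record(self, entry):
        # Local writes always succeed; syncing is a separate step.
        self.pending.append(entry)

    def sync(self, upload):
        """Upload pending entries in order; return how many were sent."""
        sent = 0
        while self.pending:
            try:
                upload(self.pending[0])
            except ServerUnavailable:
                # Keep the entry queued so a later sync can retry it.
                break
            self.pending.pop(0)
            sent += 1
        return sent
```

Because an entry is only removed from `pending` after a successful upload, an outage delays delivery but does not drop data in this sketch.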
Detection
I was paged within a minute of the API going offline, though was unable to do much about it. I’d just gotten off a 15hr flight, and was in the taxi back to my house. Talk about timing! This added a 20 minute delay.
Remediation
While we do have a hot standby PostgreSQL instance, we do not have automated failover. This is an intentional decision, made to keep the infrastructure simple.
This does delay incident response; however, the simplicity is worth the cost for the time being. Note that in the last 90 days, we have been down for a total of 51 minutes (see https://status.atuin.sh).
For the time being, I’ll be keeping failover as a manual intervention.
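To put the 51 minutes of downtime over 90 days in context, here is a quick sketch of the arithmetic. The helper name `availability_percent` is made up for this example, and it assumes the window is exactly 90 full days.

```python
def availability_percent(downtime_minutes, window_days):
    """Percentage of the window during which the service was up."""
    window_minutes = window_days * 24 * 60  # 90 days -> 129,600 minutes
    return 100.0 * (1 - downtime_minutes / window_minutes)


print(f"{availability_percent(51, 90):.3f}%")  # prints 99.961%
```

That works out to roughly 99.96% availability for the period, without automated failover.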
Prevention
There are many upsides to building Atuin Cloud on Hetzner. It provides fantastic performance at a very low cost.
The downside is that you are dealing with real, physical hardware. Occasionally, issues like this will happen. However, in the years I’ve been building on Hetzner, they have been infrequent.