Whoever commonly visits these forums may have noticed a few days of downtime...
So, here's what happened -- the short version, because I really don't feel like giving all the details, since I've just spent 4 days working on this, it's 1:50 AM and I'm tired.
On May 5th, 12:55:24 UTC, one of the hard drives failed. Fortunately, it was in RAID 1, and thus no data was lost ...
... until literally 4 minutes and 4 seconds later, when the other HDD started failing.
The initial point of failure was one of the MySQL database files, which made the daemon do an emergency exit. That's when I noticed something was wrong -- the websites couldn't connect to the MySQL server, and simply restarting the MySQL server didn't work (it failed with a read error).
Fortunately, I've been doing daily database backups (off-site), so that wasn't a disaster.
The rest of the data on the failing drive was still okay, so I could back that up before both drives were replaced (merely as a convenience -- it was not critical).
I reinstalled the system after that, restored the relevant configuration and started up the servers again. It took a lot of time.
I hope the database backup I restored from is old enough to predate any unnoticed corruption; I took the one from May 4th (I do have backups for every day, but simply picked the latest one that was "probably fine").
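For anyone curious, the selection rule amounts to "the newest daily backup dated before the day of the failure". Here is a rough sketch of that idea in Python. It is not the script used on the server: the function name is made up for illustration, and the year is an arbitrary placeholder since only the day and month matter here.

```python
from datetime import date, datetime, timezone

def latest_backup_before(backup_dates, failure_time):
    """Return the newest backup date strictly before the failure's UTC calendar day.

    backup_dates: iterable of datetime.date, one per daily backup (hypothetical input).
    failure_time: timezone-aware datetime of the first drive failure.
    Returns None if no backup predates the failure day.
    """
    failure_day = failure_time.astimezone(timezone.utc).date()
    candidates = [d for d in backup_dates if d < failure_day]
    return max(candidates) if candidates else None

# Placeholder year: the actual year is not part of this sketch.
YEAR = 2000
backups = [date(YEAR, 5, day) for day in (2, 3, 4, 5)]
failure = datetime(YEAR, 5, 5, 12, 55, 24, tzinfo=timezone.utc)
chosen = latest_backup_before(backups, failure)
```

With a failure on May 5th, this picks the May 4th backup and skips any backup taken on the day of the failure itself, since that one may already contain damaged data.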
tl;dr: Both HDDs (mirrored) on the server failed a few minutes apart. I had backups.
Downtime explanation
Re: Downtime explanation
Hi Tim.
I did notice, and I did notice your continued dedication to fix 'er up.
Thank you Mr. Sir.