Downtime explanation

Post by **Tim** » Thu May 08, 2014 11:53 pm

Anyone who visits these forums regularly may have noticed a few days of downtime...

So, here's what happened -- the short version, because I really don't feel like giving all the details, since I've just spent 4 days working on this, it's 1:50 AM and I'm tired.

On May 5th, 12:55:24 UTC, one of the hard drives failed. Fortunately, it was in RAID 1, and thus no data was lost ...
... until literally 4 minutes and 4 seconds later, when the other HDD started failing.

The initial point of failure was one of the MySQL database files, which made the daemon do an emergency exit. That's when I noticed something was wrong -- the websites couldn't connect to the MySQL server, and just trying to restart the MySQL server didn't work (it failed with a read error).

Fortunately, I've been doing daily database backups (off-site), so that wasn't a disaster.
The rest of the data on the failing drive was still okay, so I could back that up before both drives were replaced (merely as a convenience -- it was not critical).
I reinstalled the system after that, restored relevant configuration and started up the servers again. It took a lot of time.

I hope the database backup I restored from is old enough that it doesn't contain any incorrect data from some sort of unnoticed corruption; I basically just took the one from May 4th (I do have backups for every day, but just took the latest "probably fine" one).
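
As a rough illustration of that selection rule (take the newest daily backup from before the day of the first failure), here is a minimal Python sketch. It is not the script actually used; the function name `latest_backup_before` and the list of backup dates are made up for the example, and only the failure timestamp comes from the post.

```python
from datetime import date, datetime, timezone

# Time of the first drive failure, as stated in the post (UTC).
FAILURE = datetime(2014, 5, 5, 12, 55, 24, tzinfo=timezone.utc)


def latest_backup_before(backup_days, failure):
    """Return the newest backup day strictly before the failure's
    calendar day, or None if no such backup exists."""
    candidates = [d for d in backup_days if d < failure.date()]
    return max(candidates) if candidates else None


# Hypothetical set of daily backups: one per day, May 1st through May 8th.
backups = [date(2014, 5, day) for day in range(1, 9)]

chosen = latest_backup_before(backups, FAILURE)
print(chosen)  # 2014-05-04
```

Excluding the backup from the failure day itself is a deliberately conservative choice: a dump taken on May 5th could already have been written from a drive that was starting to fail.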

tl;dr: Both HDDs (mirrored) on the server failed a few minutes apart. I had backups.

Post by **R.Flagg** » Fri May 09, 2014 3:09 am

Hi Tim.

I did notice, and I did notice your continued dedication to fix 'er up.

Thank you Mr. Sir.
