May 28th, 2009
Earlier today, May 27, at approximately 10:15EST, one of the Unfuddle servers experienced a hardware failure with its attached storage (an Amazon EBS volume).
Immediately upon failure, we contacted the Amazon support team and began the process of diagnosing the problem. At approximately 20:00EST, the hardware failure was remedied, the volume was restored and all Unfuddle accounts on that server were available as normal.
Why did we take so long to respond? Unfuddle keeps hourly snapshots of all customer data, so it would have been possible from the very moment of the outage to revert to a saved snapshot. However, doing so would have caused everyone on the server to lose approximately one hour of activity on their account – a situation we clearly wanted to avoid. As we worked with Amazon throughout the day, it was looking probable that the data on the volume would be recoverable, avoiding any data loss. Unfortunately, only in the early evening was it actually guaranteed to us by Amazon the volume was intact and had been recovered successfully.
As many of you know, we have been with Amazon EC2 since the beginning of this year and this is the first significant outage we have experienced since then. Our current data partitioning and snapshotting scheme has been excellent at mitigating risk for our customers. Even today, only about 7% of all Unfuddle accounts were affected. However, we do not consider this outage to be acceptable, and in hindsight we should have probably not waited for the volume to be rebuilt, but rather restored directly from the last viable snapshot.
This morning’s events have given us some very practical ideas as to how we can even further improve upon our snapshotting strategy so that this kind of hardware failure is even less likely to affect our customers in the future. We are already working on implementing these changes.
We apologize for the disruption that this outage has caused you and your teams. As a software development team ourselves, we truly understand the kind of problems that this has caused.