Monday, August 24, 2015

Hard Times and Hard Drives

Ever had a hard drive crash and burn? Backblaze, an online backup company, sees a little over 6 a day. That's a small fraction of the 35,000 drives they have up and running right now, but it means they are constantly logging performance and diagnostic metrics for their hard disk arrays. Conveniently, they have also decided to release this data to the public as a set of CSV files, with a script to import them into an SQL database.

Taking a quick look at the data, I noticed something really odd about a few of the drives - their failure rates were off the charts! And these weren't just oddball drives in the rotation - a line of 3 terabyte Seagate Barracuda drives had a failure rate of over 40 percent, and these were the third most used drives in Backblaze's server farm!


Taking a closer look at the data, I started to notice a pattern emerge - the worst drives tended to have capacities that divide cleanly by 3, such as 1.5 TB (1,500 GB) and 3 TB. Here's a different perspective on the failure rates:

For an ordinary person, a hard drive crash every 5 years would be absolutely catastrophic. But with those 3 TB Barracuda drives, Backblaze would lose close to half of them to drive failures every year. And while Seagate drives (model numbers starting with ST) seemed to be performing the worst, it looked like anything with a size of 1.5 or 3 terabytes was suspect. In fact, the 4 TB Seagate drives seemed to be only a little worse than normal, while a 3 TB Western Digital drive had a 10% failure rate!
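If you want to reproduce yearly numbers like these, here's a rough sketch of one way to do it. It assumes each CSV row describes one drive on one day, with a `model` column and a `failure` column that is 1 on the day a drive dies and 0 otherwise. The function name and the sample rows are made up for illustration, and this is my own back-of-the-envelope calculation, not necessarily the exact method Backblaze uses.

```python
from collections import defaultdict


def annualized_failure_rates(rows):
    """Estimate failures per drive-year for each model.

    `rows` is any iterable of dicts with 'model' and 'failure' keys,
    one dict per drive per day (an assumption about the data layout).
    """
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        model = row["model"]
        drive_days[model] += 1                   # one row = one day of service
        failures[model] += int(row["failure"])   # 1 only on the day of failure
    # Scale failures per drive-day up to failures per drive-year.
    return {m: failures[m] * 365 / drive_days[m] for m in drive_days}


# Synthetic example (not real Backblaze data): 730 drive-days, one failure.
sample = [{"model": "ST3000DM001", "failure": "0"}] * 729
sample.append({"model": "ST3000DM001", "failure": "1"})
print(annualized_failure_rates(sample))
```

To run it on real data, you could feed it the rows from `csv.DictReader` for each daily file. Counting drive-days rather than drives matters here, because drives get added and retired throughout the year.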

It was at this point that I remembered a distant rumor I once heard on the internet - that 3-platter drives were less reliable than 2-platter ones (a hard drive is made up of a series of magnetic "platters" that hold data). Sure enough, the ST3000DM001 appears to have 3 platters. The ST31500341AS, #2 on the list, has four. But surely that wouldn't be enough to cause the massive attrition rates that we've seen on those bad hard drives...

It turns out the answer has a lot to do with weather in Southeast Asia and a worldwide scramble for hard drive space. In a blog post from this April, Backblaze noted the severe reliability problems with this line of hard drives, but explained that they had no choice. 2011 was a difficult year for storage companies everywhere, as catastrophic floods in Thailand led to a severe worldwide hard drive shortage. Backblaze responded by buying whatever hard drives they could find and hoping that their RAID setup would provide enough redundancy to cover whatever problems those drives had. Evidently, their ability to compensate for failures was put to the test.

What does this mean for the rest of us? Personally, I'm reassured that if I buy a hard drive, I have a good chance of having it last for at least a few years. It looks like some drives are clearly better than others (and as a data-driven person, I'm very reassured by seeing empirical results of this). And while 2-platter drives are more expensive per gigabyte, the proof of their reliability is very clear to see.

That is, if we're not all using solid-state drives in 2 years anyway.
Until next time.