The Hidden Costs of Unreliable Disk Drives; The Dirty Little Secrets of Managing RAID Storage

Organizations often start to focus on the reliability of SATA HDDs only after they deploy storage systems that use them. While they are unlikely to lose data regardless of whose system they deploy, the time their IT staff spends managing the replacement of failed SATA HDDs, and the resulting operational expense, will vary widely according to which storage system they select. Only by selecting SATA storage systems that specifically account for the idiosyncrasies of SATA HDDs can organizations both protect their data and avoid the headaches of constantly replacing failed SATA HDDs.

Organizations are storing more data than ever on disk. Archives, backups, DR and video surveillance data, along with unstructured data, account for much of the explosive growth of data in organizations. Yet one of the dirty little secrets of managing the large disk farms needed to store all of this data is managing the replacement of failed SATA hard disk drives (HDDs) in these disk farms. While current RAID technologies do an adequate job of protecting against data loss in most of these environments, when a SATA HDD fails, someone still has to replace it.

Replacing failed SATA HDDs may be no big deal in smaller environments. But when you start to consider how potentially unreliable some SATA HDDs are and the time involved with managing their replacement in large disk farms, the process becomes much more complicated. Here are just some of the steps that I had to follow when I worked at a Fortune 500 data center and had to replace a failed HDD (SATA or otherwise):

  • Open a trouble ticket in my organization’s change control system
  • Open a trouble ticket with the vendor to replace the disk drive
  • Determine the urgency of replacing the failed disk drive. (For example, was it part of a RAID 1 array on a storage system with no hot spare used by a mission critical application, or was it in a RAID 5 group on a storage system with a hot spare used for backup?)
  • Schedule a time for the HDD replacement. Depending on what applications the failed HDD supported, I might schedule the replacement for as soon as a new drive arrived, or I might wait until the middle of the night so it could be done during normal maintenance hours.
  • Notify the affected application, server, change control and security teams. They needed to be notified so they could monitor for any performance impact to their applications or servers, and so they could be on the lookout for the technician who needed access to the data center floor to remove the failed hard drive and install the new one.
  • Verify the new drive was successfully installed and close out the open trouble tickets.
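
The urgency decision in the steps above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool I used: the class, field names and urgency labels are all assumptions made up for the example, and a real data center would weigh more factors than these.

```python
from dataclasses import dataclass


@dataclass
class FailedDrive:
    """Hypothetical record describing the context of a failed HDD."""
    raid_level: int          # e.g. 1 or 5; recorded for the ticket only
    has_hot_spare: bool      # can the system rebuild without a person?
    mission_critical: bool   # does a mission critical application use it?


def replacement_urgency(drive: FailedDrive) -> str:
    """Return an assumed urgency label for replacing a failed drive.

    The sketch does not branch on raid_level because both examples in
    the text (RAID 1 and RAID 5) tolerate a single drive failure. What
    differs is whether a hot spare exists and how critical the data is.
    """
    if not drive.has_hot_spare and drive.mission_critical:
        # Array stays degraded until someone swaps the drive.
        return "immediate"
    if not drive.has_hot_spare or drive.mission_critical:
        # Either still degraded or important, but not both.
        return "as soon as the new drive arrives"
    # Hot spare has taken over and the data is lower priority.
    return "next maintenance window"


if __name__ == "__main__":
    print(replacement_urgency(FailedDrive(1, False, True)))
    print(replacement_urgency(FailedDrive(5, True, False)))
```

The two calls at the bottom correspond to the two cases in the list: the RAID 1 array with no hot spare behind a mission critical application, and the RAID 5 group with a hot spare used for backup.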

While not every organization has to go through all of these steps to replace failed SATA HDDs, regularly replacing failed HDDs becomes a cost and a risk to any company. And let’s face it, organizations are using disk for more applications than ever before, and SATA HDDs are the ones increasingly tapped for these functions.

So while organizations obviously do not want to lose data on their SATA storage systems (and probably will not because the data is protected with RAID), they also do not want to dedicate a full time person to the task of replacing failed HDDs in their growing disk farm. Storage administrators who manage large disk farms often complain about exactly this burden, so they are thinking about the issue ahead of time and looking to buy storage systems that mitigate it.

Organizations should therefore look for storage systems where replacing failed SATA HDDs remains an occasional hassle and does not become a full time job. While this should not be considered a complete list, here are some features that organizations should look for in SATA storage systems to ensure high reliability of the SATA HDDs:

  • Select storage system manufacturers that have a history (5+ years) of working with SATA. Storage system manufacturers who started in the early days of ATA (pre-SATA) know how poorly those ATA drives were manufactured so they built mechanisms into their storage systems to account for those deficiencies. While today’s SATA HDDs are much more reliable, they still have their own little idiosyncrasies that these manufacturers are more likely to know about and account for in the design of their storage system.
  • Manufacturer only uses enterprise SATA HDDs. Enterprise SATA HDDs are measured by different standards than SATA HDDs intended for the desktop. Enterprise SATA HDDs are tested using different workloads, have longer burn-in cycles, are rated for longer mean times between failures and include 5-year warranties. Desktop SATA HDDs have none of these features so they are more prone to failure.
  • Manufacturer stress tests the HDDs before deploying them in the system. Even enterprise SATA HDDs are not immune to failures, so ask how the storage system manufacturer stress tests the HDDs before it puts them into the field. For example, when the storage system is built and tested at the manufacturer’s site, the manufacturer should ideally have software built into its storage systems that weeds out faulty drives. One test it can run is to perform random I/Os on individual HDDs and measure how long they take to complete. Response times that are too long indicate that the HDD is prone to failure.
  • Manufacturer can manage HDDs when they are spun down. Since data stored on SATA HDDs is frequently not accessed after it is stored, newer enterprise SATA HDDs have the capability to spin down and go to ‘sleep’ without powering off. However, the storage system needs to recognize and manage this behavior so it does not keep trying to spin the HDD up or incorrectly label it as a failed HDD.
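
To make the random I/O test in the list above more concrete, here is a minimal Python sketch of the idea. It is not any manufacturer’s actual screening software. The function names, the number of reads, the block size and the 95th-percentile cutoff are all assumptions chosen for illustration, and it relies on `os.pread`, which is available on Unix-like systems only. Reads that are served from the operating system’s cache will look much faster than the drive really is, so treat the timings as a rough signal.

```python
import os
import random
import time


def time_random_reads(path, reads=200, block_size=4096, seed=0):
    """Time random block-aligned reads from a file or device.

    Returns a list with the elapsed seconds for each read.
    """
    rng = random.Random(seed)
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size of regular files and,
        # on Linux, of block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        blocks = max(size // block_size, 1)
        for _ in range(reads):
            offset = rng.randrange(blocks) * block_size
            start = time.perf_counter()
            os.pread(fd, block_size, offset)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def is_suspect(latencies, max_p95_seconds):
    """Flag a drive whose 95th-percentile latency exceeds the limit.

    The limit is an assumed, site-specific value; the article does not
    say what response time counts as "too long".
    """
    if not latencies:
        raise ValueError("no latency samples to evaluate")
    ordered = sorted(latencies)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 > max_p95_seconds
```

In use, one would call `time_random_reads` against each drive (or a large test file on it) and pass the result to `is_suspect` with whatever limit suits the environment. Using a percentile rather than the average keeps a handful of very slow reads from hiding behind many fast ones.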

Once organizations know about some of the finer steps that SATA storage system manufacturers take (or do not take) to ensure the reliability of the SATA HDDs within their systems, it becomes easier to justify choosing one system over another on the basis of these hardware benefits. For instance, Nexsan Technologies is a prime example of an organization that has a long history of working with SATA HDDs (10+ years) and has taken all of these steps and more to ensure the reliability of SATA HDDs in its many products, which include SATABoy and SATABeast. Nexsan’s CTO Gary Watson says, “Our customers frequently tell us that their Tier 1 storage systems often have higher HDD failure rates than their Nexsan systems with SATA HDDs.”

Most organizations say that their primary concern when contemplating the use of SATA HDDs is reliability. In truth, most are initially more concerned about the protection and recoverability of their data, a fear that most SATA storage system manufacturers address through the use of RAID. But RAID only addresses concerns about data reliability, not hardware reliability. As customers can find out after the fact, reliable SATA HDDs have a value that organizations may only appreciate and understand after they purchase an unreliable storage system.
