SATA Bit Error Rate Column Puts Some RAID-Based Storage Systems Vendors on the Defensive

The Computerworld column I wrote a few weeks ago, “A Bit of a Flaw with SATA disk drives,” sparked quite a bit of debate about just how safe data is on today’s RAID-based storage systems that use SATA disk drives. A series of comments appeared on Computerworld’s site, where the column ran, as well as on a forum at Nabble’s web site. At least one storage system vendor also felt obligated to send me its white paper explaining how its RAID-based storage system accounts for the bit error rate problem on SATA disk drives.
Now before everyone starts worrying that data on their current storage systems is in imminent danger, relax. I have managed hundreds of TBs and about three dozen storage systems from multiple storage vendors in enterprise environments. In that time and since, I have never experienced, nor talked to anyone who has been impacted by, the type of bit error on SATA drives that I described in the aforementioned column (though I suspect someone out there has been adversely affected). My personal experience, anecdotal evidence and general user feedback indicate to me that companies do not need to worry unduly about the integrity of their data. But as data stores grow, companies need to be aware of and educate themselves about this important topic.
I believe it is fair to say that storage system providers themselves do not always know for sure just how reliable their RAID configurations really are in your environment. Maybe they do in their test labs and in their own data centers, but beyond that they are just making educated guesses, based on ideal conditions, about how well those configurations will perform. However, in my many years as an end-user and analyst, I have encountered very few companies that run their data centers under ideal conditions. One only needs to look at one of the pictures posted on DCIG Analyst Tim Anderson’s blog entry to understand that.
Another case in point: remember for how many years manufacturers and disk drive vendors told us that SATA disk drives only failed once every million hours (or about once every 114 years)? Then all of a sudden, about a year ago, a study comes out of Carnegie Mellon’s Computer Science Department that confirms what many of us users have felt in our guts for years: that the manufacturers’ posted disk drive failure rates, when applied to the real world environments that the rest of us operate in, are at best suspect.
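To put the million-hour figure in perspective, here is a minimal sketch of the arithmetic behind it. It assumes a constant failure rate (an exponential model), which is the usual simplification behind a datasheet MTBF and not something measured in the field; the function names and the 1,000-drive example are mine, chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_to_years(mtbf_hours):
    """Convert a datasheet MTBF in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return -math.expm1(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours, drive_count):
    """Expected drive failures per year across a population of drives."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


print(f"MTBF in years: {mtbf_to_years(1_000_000):.1f}")
print(f"Annualized failure rate: {annualized_failure_rate(1_000_000):.2%}")
print(f"Expected failures per year, 1,000 drives: "
      f"{expected_failures_per_year(1_000_000, 1000):.2f}")
```

The point of the sketch is that a 114-year MTBF does not mean any one drive lasts 114 years. It is a population statistic, and under this model a shop running a thousand drives should still expect several failures every year even if the datasheet number holds.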
So now what is happening? Storage systems vendors and their RAID technologies are under the gun and they are feeling a little heat from the Computerworld column that I posted. While they are quick to defend their technologies in forums and white papers, what other assurances or guarantees do users really have? Disk drive manufacturers have disclosed for some time the bit error rate on their respective FC and SATA drives. So is it time for similar disclosures on the reliability of RAID-based storage systems?
As systems scale into the hundreds of TBs or even PBs, as Permabit’s Enterprise Archive does (though Enterprise Archive is a grid storage system, not RAID-based), these types of disclosures become more important. Is the error rate one in 100 trillion, like some SATA disk drive manufacturers claim? Is it one in 100 quadrillion? Or, when put to the test, will these systems also fail up to 4% of the time in some real world situations, and do they have a mechanism to recover from those levels of failure?
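To see why the difference between those two error rates matters as capacity grows, here is a rough sketch of the chance of hitting at least one unrecoverable bit error while reading a given amount of data, for example during a RAID rebuild. It assumes bit errors are independent and occur at exactly the quoted rate, which real drives may not honor; the function name and the 12 TB read size are hypothetical values picked for illustration, not figures from any vendor.

```python
import math


def prob_read_error(bytes_read, bits_per_error):
    """Probability of at least one bit error when reading bytes_read bytes.

    bits_per_error is the quoted rate expressed as "one error per N bits",
    e.g. 1e14 for one in 100 trillion. Assumes independent bit errors.
    """
    bits = bytes_read * 8
    per_bit_error = 1.0 / bits_per_error
    # 1 - (1 - p)^bits, computed with log1p/expm1 to keep precision for tiny p
    return -math.expm1(bits * math.log1p(-per_bit_error))


TB = 10**12  # decimal terabyte, as drive vendors count it
read_size = 12 * TB  # hypothetical amount of data read during a rebuild

for label, rate in (("1 in 100 trillion", 1e14), ("1 in 100 quadrillion", 1e17)):
    print(f"{label}: {prob_read_error(read_size, rate):.4%}")
```

Under these assumptions the one-in-100-trillion rate makes an error during a 12 TB read more likely than not, while the one-in-100-quadrillion rate keeps it well under one percent. That gap is why disclosure of the rate a RAID system actually achieves becomes more important as capacities climb.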
No matter what the manufacturers say, their RAID systems are probably more prone to failure in the real world than even they realize or can document. Because of this, users are wise to factor in some margin of error between how well vendors’ RAID systems really perform and recover from disk drive failures and what the vendors claim. This becomes more important as these storage systems scale into the tens and hundreds of TBs and manage more disk drives with ever more capacity. In these circumstances, new grid storage architectures that don’t rely on RAID might be a better fit for your environment.
Now I don’t profess to have all of the answers, but don’t assume your storage system vendor does either. There is no more reason now than there has been in the past for you to place blind trust in the RAID architecture included with your storage system. As storage systems scale to manage ever more disk drives with ever larger capacities, and then deduplicate the data stored on them, you are placing more than your faith in these storage systems; you are betting your company’s future on them.
So query your vendors to see if they can provide satisfactory answers about how well their systems manage your data and, more importantly, how they can recover it when you need it. Then match their responses to your internally documented application SLAs and user expectations. If you find your vendors’ answers don’t match your reality, it might be an indication that it is time to evaluate new storage technologies that better match your high capacity, long term data storage needs.


