Deduplication of Databases on Primary Storage Just Rubs Me the Wrong Way

First, let me set the record straight: I have nothing against deduplication. Deduplication is a proven technology with real benefits in the IT data center, providing improved data protection and reducing the costs associated with storage acquisition.

But let’s remember, deduplication is a technology that reduces or eliminates redundant data, leaving behind a unique data set. And if you have been keeping up with your Net reading, deduplication, which in the past has been used primarily for archive storage, is starting to come up in conversations about primary storage. Now I can’t speak for the hundreds of different applications running out there, but deduplication for applications that sit on primary storage with a database backend just rubs me the wrong way.

From a DBA purist’s perspective, database design, and specifically database normalization, is undertaken to eliminate redundancies and facilitate the quick retrieval of information. Place this against the backdrop of deduplication and you might soon begin to wonder about the effectiveness of your database design.
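
To make that concrete, here is a minimal sketch of normalization at work, using SQLite and hypothetical table names chosen purely for illustration: customer details are stored once and referenced by key instead of being repeated on every order row.

```python
# A minimal normalization sketch; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: every order row repeats the customer's name and city.
cur.execute("""
    CREATE TABLE orders_flat (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_city TEXT,
        amount        REAL
    )""")

# Normalized: customer details live in one row, referenced by key,
# so the redundancy never exists in the first place.
cur.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    )""")
cur.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL
    )""")

conn.close()
```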

Sure, there are instances where duplicate data exists in a database and its reduction would be beneficial; instances such as tables that keep track of male/female, city/state, or yes/no data. There are even features to help manage redundant or duplicate data. For instance, Oracle offers CLUSTERs, the DEDUPLICATE option for SecureFiles LOB storage, and bitmap indexes to help manage duplicate data. But these options are not the norm and serve very narrow purposes. The key point here is that databases, by definition, are meant to help eliminate redundancies, not produce or propagate them.
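
As a rough illustration of two of those features, the sketch below issues the corresponding DDL through the python-oracledb driver. The connection details, table names, column names, and index name are all placeholders, and the statements follow Oracle’s documented syntax for bitmap indexes and SecureFiles LOB deduplication rather than anything from a specific system.

```python
# Hedged sketch: placeholder credentials/DSN and object names; running it
# requires an Oracle database and the python-oracledb driver.
import oracledb

conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Bitmap index: suited to low-cardinality columns (male/female, yes/no)
# where the same few values repeat across many rows.
cur.execute("CREATE BITMAP INDEX emp_gender_bix ON employees (gender)")

# SecureFiles LOB storage with DEDUPLICATE: Oracle keeps a single copy of
# identical LOB content instead of one copy per row.
cur.execute("""
    CREATE TABLE contracts (
        contract_id NUMBER PRIMARY KEY,
        document    BLOB
    )
    LOB (document) STORE AS SECUREFILE (DEDUPLICATE)
""")

conn.close()
```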

Personally, if someone told me they could reduce my database storage footprint by 50%, I would begin to worry about the quality of the data within my database. Reducing database storage requirements through deduplication just puts a Band-Aid on this problem and doesn’t address the real issues.

Real data quality and elimination of duplicate data are achieved by starting with a pristine data model and quality data, and by validating that the application manipulates the data properly. Data modelers, along with DBAs, are responsible for validating schema design and ensuring the right indexes, relationships, and constraints are in place to improve data quality, ultimately helping to eliminate duplicate data.
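
Here is a minimal sketch of that idea, again with hypothetical names and SQLite standing in for a real database: uniqueness and referential constraints reject duplicate data at insert time, which is where duplicate elimination actually belongs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")    # enforce referential integrity
cur = conn.cursor()

cur.execute("""
    CREATE TABLE states (
        state_code TEXT PRIMARY KEY,        -- 'NE', 'IA', ...
        state_name TEXT NOT NULL UNIQUE     -- duplicate names are rejected
    )""")
cur.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE,   -- no duplicate customer records
        state_code  TEXT NOT NULL REFERENCES states (state_code)
    )""")

cur.execute("INSERT INTO states VALUES ('NE', 'Nebraska')")
try:
    cur.execute("INSERT INTO states VALUES ('NE', 'Nebraska')")  # duplicate row
except sqlite3.IntegrityError as exc:
    print("Duplicate rejected by the schema:", exc)

conn.close()
```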

Maybe if we spent more time on data quality, IT would not be facing the high rate of data warehouse project failures that Gartner reports. And while Ted Friedman, principal analyst at Gartner, clearly states that the reason for a lack of data quality is that most companies focus only on the “identifying, extracting, and loading of data into the warehouse but do not take the time to assess quality,” I am certain that these extract, transform, and load (ETL) processes are also jeopardized by poor design.

Until a new form of deduplication technology comes along, deduplication on primary storage and database environments just don’t mix. Deduplication is still a viable technology for archival and backup, but only if you understand your data. For instance, you might want to use deduplication for more static table pages during backup, but you wouldn’t think of using it on Oracle archived redo logs, since redo logs contain the real-time block changes to data. Granted, this could change with new technology that recognizes large, hot data blocks with small changes, but we just aren’t there yet.
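
For a back-of-the-envelope sense of whether a given file would benefit, here is a rough sketch of how fixed-block deduplication sees data: carve the file into 4 KB chunks, fingerprint each one, and count how many are unique. Largely static table pages tend to show repeated fingerprints; a stream of real-time block changes, such as archived redo, mostly does not. The chunk size and command-line usage are arbitrary choices for illustration.

```python
import hashlib
import sys

CHUNK_SIZE = 4096  # fixed block size, in bytes

def estimated_dedupe_ratio(path: str) -> float:
    """Return total chunks divided by unique chunks for one file."""
    total = 0
    unique = set()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            total += 1
            unique.add(hashlib.sha256(chunk).hexdigest())
    return total / len(unique) if unique else 1.0

if __name__ == "__main__":
    print(f"Estimated dedupe ratio: {estimated_dedupe_ratio(sys.argv[1]):.2f}x")
```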

A more viable way to reduce and stabilize storage acquisition within database environments today is to deploy thin provisioning. Using thin provisioning from a vendor such as 3PAR, databases can be allocated just-enough, just-in-time storage, relieving IT from having to watch capacity and then add or remove storage. Thin provisioning is the one data reduction technology developed for primary storage applications, and it therefore addresses the desire for capacity efficiency without the performance impact associated with today’s storage deduplication technology, which was really developed for archive storage applications.

Storage management has always been the bane of DBAs, so deduplication will probably move up the storage stack and end up in primary storage someday, but that day is not today. In the meantime, using software like 3PAR’s thin provisioning, DBAs can provision once and pay only for what they use when they use it, which eliminates the tedious, manual, and error-prone tasks associated with database storage management while still keeping their data stores under control.
