In a recent blog post, "Deduplication of Databases on Primary Storage Just Rubs Me the Wrong Way," I received some great comments, questions, and even a ding. Given the nature and depth of those comments and questions, I felt it only appropriate to produce a follow-up post and explain a few things they brought up. In particular, I wanted to address questions raised by Matt Simmons and Mike Dutch.
Matt Simmons said: Please correct me if I’m wrong, but isn’t there a performance hit for doing random access on deduplicated data? I don’t see how that couldn’t be the case, since every data request would have to be looked up in a table. Even if the entire data store is stored in cache, that’s still a lot of latency just from lookups. Of course, I could be misinformed as to how dedupe works.
Koopmann says: Intuitively, we all know that adding additional code, instructions, or hardware will add processing overhead. What we all want to know is, "How much?" More often than not, we ask ourselves that question or simply look for a bigger box that will hide the performance hit.
Since deduplication on primary storage is relatively new, performance numbers and real-life scenarios are hard to come by, and a number of deduplication vendors are forthright in saying their deduplication is intended for backup environments. I think we can safely say, though, that deduplicating primary storage is much more difficult than deduplicating backup data. So I'd like to leave a few quotes on what some of the experts in the field say:
- Regarding deduplication of primary storage for backups, I'd like to direct you to a quote from Larry Freeman, Senior Marketing Manager of Storage Efficiency Solutions at NetApp and Dr. Dedup himself. On June 18, 2009, Dr. Dedup said: "If your application or system is extremely performance-sensitive, don't run dedupe." I personally wouldn't take such a hard and fast stance, and I'd encourage you to read the rest of the thread, as it contains some good information on how NetApp's deduplication works along with performance data.
- In a Storage Switzerland, LLC LAB REPORT: Deduplication of Primary Storage, Senior Analyst George Crump states that “As I have stated in other articles, deduplication by itself has limited value on primary storage” and “works against active online and near line data (like databases), where the occurrence of duplicate data is unlikely. This is in contrast to backup data, where the same full backup runs every week and the chances of duplication are fairly high”.
- In a SearchStorage.com article, Users turn data reduction focus to primary storage, Senior News Director Dave Raffo asked the question, "Is data deduplication a good fit for primary storage?" Data Domain CEO Frank Slootman answered, "We try to use the term 'primary storage' carefully," adding, "If you have data that is really hot, has an extremely high change rate, like transactional data, there's no sense deduplicating it."
The problem here isn't really how much of a performance impact you will see but rather, "Can you really measure the performance impact?" and "Does anyone really care or perceive it?" I've often found, in the database shops I have worked in, that very few people know how to relate storage performance to database performance. Just ask your DBA if they know how many IOPS they are getting and whether they require additional spindles to help improve SQL performance. (More than likely they will roll their eyes.)
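To make the latency question from Matt's comment concrete, here is a minimal sketch of the indirection a deduplicated read path adds. This is a hypothetical toy, not any vendor's implementation: every logical block address must first be resolved through a fingerprint index before the physical block can be fetched, and that extra lookup is exactly the overhead being debated.

```python
# Hypothetical sketch of a deduplicated block store: data is stored once
# per unique fingerprint, and reads pay an extra index lookup per request.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}   # fingerprint -> physical block data (one copy)
        self.index = {}    # logical block address -> fingerprint

    def write(self, lba, data):
        fp = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(fp, data)   # store the bytes only once
        self.index[lba] = fp               # map the address to that copy

    def read(self, lba):
        # Two lookups instead of one -- the added latency Matt asks about.
        return self.blocks[self.index[lba]]

store = DedupStore()
store.write(0, b"customer row v1")
store.write(1, b"customer row v1")   # duplicate: no new block stored
assert store.read(1) == b"customer row v1"
assert len(store.blocks) == 1        # two addresses, one physical block
```

Whether that indirection is perceptible depends, as the post argues, on whether anyone is measuring IOPS and latency at the storage layer in the first place.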
Mike Dutch said: Isn’t database normalization/record linkage essentially concerned with “information deduplication” focused on ensuring “good” query results rather than “data deduplication” which is focused on improving storage and bandwidth consumption? I tend to think of “data deduplication” as being part of the data storage process as it is similar in many ways to “just another file system”… “go here to get the next set of bytes”.
Since most data deduplication procedures store data compressed, it must be uncompressed, and network bandwidth/latency issues may result in undesirable performance impacts for some data. However, this doesn't mean you shouldn't use data dedupe on some databases (for example, copies of a database used for test purposes) or on other types of primary storage data that are not overly performance-sensitive.
Koopmann says: Yes, database normalization is concerned with developing structures that not only help ensure data quality but also help ensure that an application can manipulate data easily. While a byproduct of normalization is the reduction of redundant data, let's stop right there and also realize that normalization does not eliminate all redundant data.
Again, in a purist sense, data modeling, when done effectively (or compulsively, depending on your perspective), will push all redundant data down to lookup tables, leaving UIDs as pointers to that lookup data. At that point duplicate data is essentially eliminated; what remains, aside from the system-generated UIDs, is so little that you would be hard pressed to realize any storage gains through deduplication.
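A toy illustration of the point above, with made-up names and values: normalization pushes repeated values into a lookup table and replaces them with small surrogate UIDs, so the row data itself carries little duplication left for a dedup engine to find.

```python
# Toy normalization sketch: repeated city names are pushed into a lookup
# table and replaced by surrogate UIDs in the fact rows.
rows = [("Alice", "Chicago"), ("Bob", "Chicago"), ("Carol", "Denver")]

city_lookup = {}                       # city name -> surrogate UID
normalized = []
for name, city in rows:
    uid = city_lookup.setdefault(city, len(city_lookup) + 1)
    normalized.append((name, uid))     # store the tiny UID, not the string

print(city_lookup)   # {'Chicago': 1, 'Denver': 2}
print(normalized)    # [('Alice', 1), ('Bob', 1), ('Carol', 2)]
```

The duplicate string "Chicago" now exists once; the only repeated values left are the UIDs themselves, which are too small to matter to a deduplication engine.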
Granted, most databases are not modeled this way, so looking at how your particular database is architected and works is key to breaking the deduplication mystery (with vendor help). For Oracle, database structures (table and index designs) will share the same blocks, but the actual data/records/rows stored in those data blocks are likely to be different.
It does depend, though, on the data model and transaction patterns (INSERTs, UPDATEs, DELETEs). This means deduplication at the file level surely won't work, and deduplication on individual blocks probably won't find duplicates. That leaves deduplication at the byte level, within and across all data blocks in all SEGMENTs of data files. Depending on your dedupe vendor this may be beneficial, but you will certainly need to verify the solution.
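Here is a small sketch, under simplified assumptions (tiny 8-byte "blocks", contrived file contents), of why fixed-block deduplication struggles with database files: two files can contain largely the same rows, but unless those rows land in identically aligned blocks, the per-block fingerprints never match.

```python
# Sketch: the same row data shifted by a few bytes produces entirely
# different fixed-size block fingerprints, so block-level dedup finds nothing.
import hashlib

def block_fingerprints(data, block_size=8):
    return {hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}

file_a = b"row1;row2;row3;row4;"
file_b = b"hdr;" + b"row1;row2;row3;row4;"   # same rows, shifted by a header

shared = block_fingerprints(file_a) & block_fingerprints(file_b)
print(len(shared))   # 0 -- a 4-byte shift defeats every block match
```

Real Oracle blocks compound this: block headers, SCNs, and free-space maps differ block to block even when row contents are similar, which is why byte-level (variable-chunk) approaches are the only plausible fit, and why the vendor's approach needs verification.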
Surely you can use deduplication on some databases, as Mike suggests, and hit it square on the nose. The key term Mike uses here is "copies of". Many databases have copies of data for backup, test, historical, or query-offloading purposes. If your database has many of these "copies of" data, you could look at the benefits of deduplication, or perhaps consider non-duplicative "thin" snapshots. Non-duplicative snapshots mean that changed data is never duplicated across a group of snapshots, as opposed to copying changed data into every snapshot. This particular technology is well proven in primary storage applications.
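The non-duplicative snapshot idea can be sketched as follows. This is a minimal toy model, not 3PAR's actual implementation: snapshots copy block *pointers*, not block data, so an overwritten block's old contents exist exactly once no matter how many snapshots reference them.

```python
# Hedged sketch of non-duplicative "thin" snapshots via redirect-on-write:
# a snapshot is a copy of the pointer map, and writes allocate new physical
# blocks, so old data is shared by all snapshots rather than copied to each.
class ThinVolume:
    def __init__(self, blocks):
        self.store = {}               # physical id -> block data
        self.next_id = 0
        self.live = {bn: self._alloc(d) for bn, d in blocks.items()}

    def _alloc(self, data):
        pid, self.next_id = self.next_id, self.next_id + 1
        self.store[pid] = data
        return pid

    def snapshot(self):
        return dict(self.live)        # copies pointers only, no data

    def write(self, blkno, data):
        self.live[blkno] = self._alloc(data)   # redirect-on-write

    def read_snapshot(self, snap, blkno):
        return self.store[snap[blkno]]

vol = ThinVolume({0: b"A", 1: b"B"})
snap1, snap2 = vol.snapshot(), vol.snapshot()
vol.write(0, b"A2")                   # old block b"A" is never re-copied
assert vol.read_snapshot(snap1, 0) == b"A"
assert vol.read_snapshot(snap2, 0) == b"A"
assert len(vol.store) == 3            # A, B, A2: one physical copy each
```

Two snapshots of the same volume here cost only a pointer map apiece, which is the "copies of" scenario where thin technology beats deduplicating full copies after the fact.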
Again, I have nothing against deduplication when it is used appropriately and other factors are uncontrollable. But in a properly architected database I am still skeptical about the fit. Databases are just too dynamic, with temporary sort, rollback, and redo areas and transaction rates so high that I question what there could possibly be to dedupe in the first place.
Secondly, many database practitioners are already so far removed from the performance implications of storage that they would certainly not be able to handle an additional layer of abstraction. So for now, I hold steady on my statement: a more viable way to reduce and stabilize storage acquisitions within database environments is to deploy thin provisioning.
Using proven, industry-leading thin provisioning and non-duplicative thin copy technologies from a vendor such as 3PAR allows databases to allocate just-enough and just-in-time storage, relieving IT from having to watch, and then add or remove, storage. Thin provisioning and thin copy are data reduction technologies developed for primary storage applications, and they therefore address the desire for capacity efficiency without the performance impact and deployment mystery associated with today's storage deduplication technology.
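The thin provisioning concept itself can be sketched in a few lines. This is a hypothetical model, not any vendor's product: the volume advertises a large logical size but consumes physical pages only when a page is first written, which is what "just-enough and just-in-time" allocation means in practice.

```python
# Hedged sketch of thin provisioning: logical capacity is promised up
# front, physical pages are allocated only on first write.
class ThinProvisionedVolume:
    def __init__(self, logical_pages):
        self.logical_pages = logical_pages
        self.pages = {}                # only written pages exist physically

    def write(self, page_no, data):
        if not 0 <= page_no < self.logical_pages:
            raise IndexError("beyond advertised logical size")
        self.pages[page_no] = data     # physical allocation on first write

    def physical_usage(self):
        return len(self.pages)

vol = ThinProvisionedVolume(logical_pages=1_000_000)  # large promise
vol.write(0, b"datafile header")
vol.write(7, b"first extent")
assert vol.physical_usage() == 2       # only 2 pages actually consumed
```

The database believes it owns the full logical size from day one, while the array consumes capacity only as data actually lands, with no fingerprinting or lookup indirection on the read path.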