Deciphering Application Metadata is Data Deduplication’s Next Frontier

Dedupe is an easy concept to grasp. At its most basic level it reduces storage requirements and promises faster backup and recovery times. It looks like a “win-win” scenario and, for the most part, it is. But dedupe is still in its infancy and is being continually fine-tuned and changed, so we should not become complacent in how we perceive this technology.

ADMAD: Application-Driven Metadata Aware De-duplication Archival Storage System, a paper co-written by authors from Tsinghua University in Beijing, China, and the University of Minnesota, reinforces this belief. Regarding dedupe, it states, “There are still research challenges in current approaches and storage systems, such as: how to chunk the files more efficiently and better leverage potential similarity and identity among dedicated applications.”

We encourage our readers to read the paper. It proposes Application-Driven Metadata Aware De-duplication (ADMAD), which interrogates metadata at the application layer to identify and dynamically define meaningful data chunks and so maximize the effect of the dedupe process.

The problems that introducing application awareness into deduplication solves can be succinctly summarized. Every backup application inserts its own metadata into the backup data stream so that it knows how to manage the data. But this can result in problems later on if the target device ingesting the data does not understand this metadata as it deduplicates the data.

If the deduplicating target cannot decipher this metadata, it may chop up the data completely differently the second time than it did the first. Even though the incoming data in the second stream has not changed, the application sending it may tag it differently simply because of how the application chooses to manipulate the data. In other words, the data is the same but the tag is different. Because this tag now resides inside the chunk of data instead of where the target is looking for it, the chunk no longer matches the copy already stored and the data will not be optimally deduplicated.
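The effect is easy to reproduce in a toy model. The Python sketch below is an illustration only, not how any particular backup application or deduplicating target works. The tag layout, block size and chunk size are all hypothetical, chosen so that every fixed-size chunk contains one tag.

```python
import hashlib

CHUNK_SIZE = 64   # fixed-size chunking, chosen only for illustration
BLOCK_SIZE = 48   # hypothetical payload block size used by the backup application
TAG_SIZE = 16     # hypothetical size of the tag the application puts before each block


def wrap_with_tags(payload: bytes, job_id: int) -> bytes:
    """Hypothetical backup format: a job-specific tag in front of every payload block."""
    tag = f"JOB{job_id:05d}".encode().ljust(TAG_SIZE, b"\x00")
    out = bytearray()
    for i in range(0, len(payload), BLOCK_SIZE):
        out += tag + payload[i:i + BLOCK_SIZE]
    return bytes(out)


def chunk_hashes(stream: bytes) -> list:
    """Cut the stream into fixed-size chunks and fingerprint each one."""
    return [hashlib.sha256(stream[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(stream), CHUNK_SIZE)]


def new_chunks(first: bytes, second: bytes) -> int:
    """Count chunks in `second` that were not already stored from `first`."""
    seen = set(chunk_hashes(first))
    return sum(1 for h in chunk_hashes(second) if h not in seen)


# 4,800 bytes of deterministic, non-repeating sample data (100 blocks of 48 bytes).
payload = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(150))

# Same payload backed up twice; only the tag differs between the two jobs.
tagged_new = new_chunks(wrap_with_tags(payload, 1), wrap_with_tags(payload, 2))

# Same payload with the tags kept out of the chunks.
stripped_new = new_chunks(payload, payload)

print(f"new chunks with tags inside the chunks: {tagged_new}")
print(f"new chunks with tags stripped out:      {stripped_new}")
```

In this sketch, every chunk of the second backup looks new to the target when the tag sits inside it, while none does once the tag is kept out, even though the underlying data is identical in both cases.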

Deduplicating storage systems like the NEC HYDRAstor may not be doing everything as prescribed by ADMAD, but they are making progress in this important area of incorporating application awareness within deduplication algorithms. About three months ago, NEC announced the addition of application-aware deduplication to its HYDRAstor grid storage platform. The HYDRAstor can now deduplicate data more efficiently for specific applications by analyzing incoming data streams and filtering out application-level metadata, which eliminates the negative impact that metadata has on data deduplication ratios.

Once an application type has been associated with a file system, HYDRAstor automatically filters the metadata away from the backup data payload, which lets its data deduplication process work exclusively on the incoming data. Different applications insert their own metadata into the data stream to manage ‘their’ data. Once that metadata is split out, it is stored independently along with its corresponding offsets. This ensures the data is reconstituted in its original form during reads and helps ensure that much higher deduplication rates can be achieved.
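The general idea of splitting metadata out with its offsets and putting it back on reads can be sketched in a few lines of Python. This is not NEC's implementation and does not reflect the actual CommVault or NetBackup stream formats. The record layout (a `META` marker, a job ID and a payload length ahead of each block) and the function names are assumptions made up for the illustration.

```python
import struct

MAGIC = b"META"
# Hypothetical header layout: 4-byte marker, 4-byte job ID, 4-byte payload length.
HEADER = struct.Struct(">4sII")


def build_stream(blocks, job_id):
    """Simulate a backup application wrapping each data block in its own header."""
    return b"".join(HEADER.pack(MAGIC, job_id, len(b)) + b for b in blocks)


def split_stream(stream):
    """Separate headers from payload, remembering each header's original offset."""
    payload = bytearray()
    metadata = []  # list of (offset_in_original_stream, header_bytes)
    pos = 0
    while pos < len(stream):
        header = stream[pos:pos + HEADER.size]
        magic, _job_id, length = HEADER.unpack(header)
        if magic != MAGIC:
            raise ValueError(f"unexpected bytes at offset {pos}")
        metadata.append((pos, header))
        start = pos + HEADER.size
        payload += stream[start:start + length]
        pos = start + length
    return bytes(payload), metadata


def reconstitute(payload, metadata):
    """Rebuild the original stream by re-inserting each header at its offset."""
    out = bytearray()
    consumed = 0
    for offset, header in metadata:
        gap = offset - len(out)  # payload bytes that preceded this header
        out += payload[consumed:consumed + gap]
        consumed += gap
        out += header
    out += payload[consumed:]
    return bytes(out)


blocks = [b"alpha" * 10, b"bravo" * 20, b"charlie" * 5]
first_backup = build_stream(blocks, job_id=1)
second_backup = build_stream(blocks, job_id=2)

payload_1, meta_1 = split_stream(first_backup)
payload_2, meta_2 = split_stream(second_backup)

print(first_backup == second_backup)                   # streams differ only in headers
print(payload_1 == payload_2)                          # filtered payloads are identical
print(reconstitute(payload_2, meta_2) == second_backup)  # reads return the original stream
```

The two raw streams differ because of the job ID in each header, but the filtered payloads are byte-for-byte identical, so a deduplication engine working only on the payload would see nothing new in the second backup. The stored offsets are what allow the original stream to be rebuilt exactly.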

With its initial implementation of application-aware deduplication, HYDRAstor deployed solutions for CommVault® Simpana® and Symantec NetBackup with early results showing impressive gains in data deduplication ratios for both of these applications.

NEC has seen a 2-3x improvement in data deduplication ratios for CommVault over a full weekly backup cycle consisting of one full backup, followed by six incremental backups, and then another full backup. When implemented in a customer’s environment, the CommVault solution improved the deduplication ratio by more than 4x compared to deduplicating the data generically, without application awareness. That result was measured over a span of four weeks, which represented the customer’s retention period.

Realize that your mileage may vary as data patterns in organizations vary. However, the ADMAD study suggests that gains of 50% or more in data deduplication ratios can be achieved when stripping out metadata as compared to using application-agnostic deduplication approaches. The fact that the HYDRAstor showed such notable gains reinforces the conclusions of the ADMAD study, and it stands to reason that organizations should expect to see similar improvements in their deduplication ratios when using the HYDRAstor in their environments in conjunction with either of these two backup software platforms.

The inclusion of application-aware data deduplication in the HYDRAstor is a smart move on NEC’s part. Not only is it a competitive differentiator, but current and new HYDRAstor customers get this feature for free as part of the base code that is installed on their HYDRAstor. Given these gains in data reduction ratios, it is only logical for NEC to look to expand the number of backup software applications that support this feature going forward, as it should prove to be a factor that organizations weigh when deciding between NEC and a competing product.

