
Cherry Garcia and Deduplication

Deduplication (data optimization) is a hot topic these days because its promise to reduce storage consumption, and the costs that go with it, is compelling to businesses facing rapid data growth. The impact on IT budgets, both CAPEX and OPEX, from applying dedupe in primary and/or secondary storage tiers can be substantial!

Lost in all the hype is the fact that there are as many flavors of dedupe as Ben & Jerry’s has flavors of ice cream! Each is different and delivers different results. A few vectors anchor the discussion: file vs. sub-file granularity, memory efficiency, and scalability. Each is a piece of the puzzle vendors assemble to reduce the amount of data, and the storage space, consumed. Together, these pieces determine the effectiveness of the dedupe offering, separating a high-performance solution from a minimal-capability one. It’s like comparing a sports car to a VW Beetle: both will get you there; it’s a matter of the results!

For each characteristic, the impact and yield really make a difference:

* File vs. sub-file – as data is analyzed, the smaller the chunk size, the higher the deduplication yield. Deduping at the file level yields the least benefit, while deduping at a smaller (sub-file) chunk size (4K is optimal) yields the largest.

* Memory efficiency is critical, since the more hash keys you can keep in memory, the better. The closer you can operate to 100% ‘in memory,’ the better the overall performance will be!

* Not far behind memory efficiency is the ability to sample the largest possible amount of data. Again, the more efficient the index, the higher the scalability.
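A small Python sketch (illustrative only; real products use far more sophisticated fingerprinting and indexing) shows how chunk size drives dedupe yield. The data here is synthetic: every 128 KB segment reuses the same eight 4 KB blocks in a different order, so large chunks look unique while small chunks deduplicate heavily.

```python
import hashlib
import random

def unique_fraction(data, chunk_size):
    """Fraction of fixed-size chunks that are unique (lower = more dedupe)."""
    seen = set()   # in-memory index of chunk fingerprints
    total = 0
    for i in range(0, len(data), chunk_size):
        seen.add(hashlib.sha256(data[i:i + chunk_size]).digest())
        total += 1
    return len(seen) / total

random.seed(0)
blocks = [bytes([b]) * 4096 for b in range(8)]   # eight distinct 4 KB blocks
segments = []
for _ in range(16):
    order = blocks[:]
    random.shuffle(order)                        # same blocks, new order
    segments.append(b"".join(order * 4))         # 8 blocks x 4 KB x 4 = 128 KB
data = b"".join(segments)                        # 2 MB of synthetic data

print(unique_fraction(data, 128 * 1024))  # typically ~1.0: 128 KB chunks rarely repeat
print(unique_fraction(data, 4 * 1024))    # 0.015625: only 8 unique 4 KB chunks
```

The same 2 MB of data deduplicates to eight unique chunks at 4 KB granularity but finds almost nothing at 128 KB granularity, which is the yield effect the bullet above describes.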

Here’s a case in point. Permabit’s Albireo High Performance Data Optimization solution recently changed the deduplication market as it applies to primary storage data optimization (dedupe) and has received substantial press and analyst recognition. Contrast Albireo with the OpenZFS deduplication technology, which has served as the base for some storage optimization solutions. Let’s compare the two:

* ZFS uses a 128K chunk size vs. Albireo, which uses chunks as small as 4K (it’s configurable). Albireo will find substantially more duplicates and reduce storage consumption dramatically compared to ZFS.

* ZFS uses a table-entry approach for data identification at 200 bytes per entry vs. Albireo’s index, which represents each chunk with only 4 bytes. Albireo’s index will typically reside in memory 99.5% of the time. ZFS, on the other hand, will page out to disk for table lookups more than 50% of the time (a major performance slowdown) as a result of its 200-byte entries (50 times the size of Albireo’s).

* ZFS can scale to 20TB using 32GB of RAM vs. Albireo, which can scale to 640TB using only 16GB of RAM (64 times greater scalability per GB of RAM)!
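Some back-of-the-envelope arithmetic (a sketch using the figures quoted above, which are this post’s numbers rather than vendor-verified specs) shows how per-entry size and chunk size translate into how much data an in-RAM index can cover:

```python
def addressable_tb(ram_gb, entry_bytes, chunk_kb):
    """TB of data an in-RAM index can cover, given bytes per index
    entry and the dedupe chunk size each entry describes."""
    entries = ram_gb * 2**30 / entry_bytes        # entries that fit in RAM
    return entries * chunk_kb * 2**10 / 2**40     # data those entries cover

# Figures from the comparison above (illustrative, not official specs):
print(addressable_tb(32, 200, 128))   # ZFS-style: ~20.48 TB in 32 GB
print(addressable_tb(16, 4, 4))       # 4 B/entry, every 4 KB chunk: 16.0 TB
```

The ZFS-style figure works out to roughly 20 TB, matching the number quoted above. Indexing every 4 KB chunk at 4 bytes apiece covers only 16 TB in 16 GB, so the claimed 640 TB would additionally depend on sampling the data stream rather than indexing every chunk, consistent with the sampling point made earlier.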

Albireo will be more effective at finding duplicates, more efficient at processing data, and more scalable. The contrast matters because businesses run on the data being deduped, and the impact on overall storage costs will be substantial. The higher-performing, more efficient, and more scalable one flavor of dedupe is over another, the better! What Cherry Garcia is to all those other flavors, Albireo is to the rest of the dedupe offerings!

