As DCIG and SMB Research put together the DCIG 2011 Virtual Server Backup Software Buyer’s Guide, we became aware that Arkeia was up to something special with its deduplication technology. In late 2009 Arkeia had acquired Kadena and was in the process of absorbing and incorporating its technology into its core Arkeia Network Backup solution. Now that Arkeia v9 has officially been released, Arkeia takes the best of what deduplication has to offer and improves on it in the form of Progressive Deduplication™ technology.
Progressive Deduplication eschews both of the traditional approaches to deduplication (fixed-length block and variable-length block) in favor of a new approach: overlapping fixed-length blocks where a file’s block size is optimized based on file type.
With fixed-length block deduplication, data is divided into blocks of a known and consistent size, usually ranging from 32 KB to 1 MB. This technique offers fair data compression at a high processing speed but sacrifices optimal deduplication, because it cannot tolerate changes where data is inserted into or deleted from files: such a change shifts every later block boundary, so the blocks that follow no longer match those already stored.
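The weakness of fixed-length blocks can be seen in a minimal sketch. It is a toy illustration, not Arkeia's code: the helper names `fixed_blocks` and `matching_blocks`, the 64-byte block size, and the random test data are all assumptions chosen to keep the example small.

```python
import hashlib
import random

def fixed_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks of block_size bytes (the last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def matching_blocks(old: bytes, new: bytes, block_size: int) -> int:
    """Count blocks of `new` whose hash already appears among the blocks of `old`."""
    known = {hashlib.sha256(b).digest() for b in fixed_blocks(old, block_size)}
    return sum(hashlib.sha256(b).digest() in known
               for b in fixed_blocks(new, block_size))

original = random.Random(0).randbytes(4096)  # stand-in for a backed-up file
appended = original + b"tail"                # change at the end of the file
inserted = b"X" + original                   # one byte inserted at the front

print(matching_blocks(original, appended, 64))  # all 64 original blocks still match
print(matching_blocks(original, inserted, 64))  # every boundary shifted: no matches
```

Appending data leaves the existing block boundaries alone, while a single inserted byte at the front moves all of them, which is the case the paragraph above describes.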
Conversely, variable-length block deduplication divides data into blocks of different sizes, with block boundaries established wherever “magic numbers” or “anchors” are identified in the data. This approach incurs a higher processing load but, because it tolerates data insertions and deletions, results in higher deduplication ratios.
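A rough sketch of anchor-based chunking follows. It assumes an anchor is declared wherever a fingerprint of the last `WINDOW` bytes has its low bits equal to zero; the constants, the function name `chunks`, and the use of BLAKE2 as the fingerprint are illustrative choices, not taken from any particular product. Hashing every window from scratch is deliberately naive and hints at why this approach costs more CPU.

```python
import hashlib
import random

WINDOW = 16          # bytes examined when looking for an anchor (assumed value)
MASK = (1 << 6) - 1  # anchor when the low 6 bits are zero: roughly 64-byte chunks

def chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined anchors instead of fixed offsets."""
    out, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        window = data[end - WINDOW:end]
        digest = hashlib.blake2b(window, digest_size=4).digest()
        if int.from_bytes(digest, "big") & MASK == 0:
            out.append(data[start:end])  # boundary depends only on local content
            start = end
    if start < len(data):
        out.append(data[start:])         # trailing bytes after the last anchor
    return out

original = random.Random(0).randbytes(4096)
inserted = b"X" + original               # one byte inserted at the front

old_chunks = chunks(original)
known = set(old_chunks)
shared = sum(c in known for c in chunks(inserted))
print(f"{shared} of {len(old_chunks)} original chunks are found again")
```

Because each boundary depends only on nearby bytes, the insertion disturbs the chunk that contains it and the later chunks line up again.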
Progressive Deduplication gets its name from a technique (“progressive matching”) that permits use of a sliding window without incurring the high cost of calculating a hash at each byte offset. Sliding windows are not new to compression, but progressive matching is. Even with progressive matching, the computational load of the sliding window is greater than that of fixed-block deduplication.
Arkeia reduces this computational load when data has been previously encountered. This known data is deduplicated at fixed-block speeds because blocks are generally non-overlapping and sliding is unnecessary. The window can simply jump forward to the next block. Only new data must be surveyed with a “sliding window” to determine if the data matches known blocks.
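The jump-or-slide behavior can be sketched as follows. The internals of progressive matching are not described here, so this example substitutes a cheap rolling checksum as a first filter before the full hash, an idea familiar from rsync. The names `weak`, `strong`, `build_index`, and `scan`, and the 64-byte block size, are hypothetical and do not represent Arkeia's implementation.

```python
import hashlib
import random

def weak(block: bytes) -> int:
    """Cheap checksum that can be updated in O(1) as the window slides."""
    return sum(block) & 0xFFFF

def strong(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def build_index(data: bytes, size: int) -> dict[int, set[bytes]]:
    """Index the non-overlapping blocks of already-stored data."""
    index: dict[int, set[bytes]] = {}
    for i in range(0, len(data) - size + 1, size):
        block = data[i:i + size]
        index.setdefault(weak(block), set()).add(strong(block))
    return index

def scan(data: bytes, index: dict[int, set[bytes]], size: int) -> tuple[int, int]:
    """Return (matched_blocks, one_byte_slides) for a new stream."""
    pos = matched = slides = 0
    w = weak(data[:size])
    while pos + size <= len(data):
        if w in index and strong(data[pos:pos + size]) in index[w]:
            matched += 1
            pos += size                          # known block: jump a whole block
            if pos + size <= len(data):
                w = weak(data[pos:pos + size])
        else:
            slides += 1
            if pos + size < len(data):           # roll the checksum by one byte
                w = (w - data[pos] + data[pos + size]) & 0xFFFF
            pos += 1                             # unknown data: slide one byte
    return matched, slides

original = random.Random(0).randbytes(4096)
index = build_index(original, 64)
print(scan(original, index, 64))          # unchanged data: jumps only
print(scan(b"X" + original, index, 64))   # one slide past the insert, then jumps
```

In this sketch the full hash is computed only when the cheap checksum finds a candidate, and the window slides only across data that is not already known.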
However, another aspect of Arkeia’s solution intrigued us, though it has received less attention than the intricacies of Arkeia’s sliding window implementation: Arkeia’s work to optimize deduplication for different file types. To gain further insight into this capability, DCIG and SMB Research recently spent some time with Arkeia’s CEO Bill Evans.
Arkeia recognized that smaller block sizes generally deliver better deduplication rates, though at the cost of more computational load. At the limit, if blocks become too small, the overhead of “pointer metadata” overwhelms the benefit of block deduplication. So, the question becomes “What block size delivers optimal compression?”
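A back-of-the-envelope model makes the trade-off concrete. It assumes every block, duplicate or not, costs one fixed-size pointer and that only unique blocks store their payload; the 32-byte pointer size and the duplicate fractions are invented for illustration, not measured values.

```python
def stored_bytes(total_bytes: int, block_size: int,
                 duplicate_fraction: float, pointer_bytes: int = 32) -> int:
    """Estimate bytes kept after deduplication, including pointer metadata."""
    blocks = -(-total_bytes // block_size)            # ceiling division
    unique = round(blocks * (1 - duplicate_fraction))
    return unique * block_size + blocks * pointer_bytes

MIB = 2 ** 20
# Tiny blocks find more duplicates (50% here), but pointers eat all the savings.
print(stored_bytes(MIB, 64, 0.5))     # 1048576, no smaller than the input
# Larger blocks find fewer duplicates (25% here) yet store less overall.
print(stored_bytes(MIB, 4096, 0.25))  # 794624
```

Under these assumed numbers the 64-byte blocks save nothing at all, even though they detect twice as many duplicates.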
To figure this out, Arkeia went through an extensive exercise involving hundreds of file types and millions of customer files to determine the block size for each file type at which files of that type are maximally compressed.
Arkeia delivered a purpose-built application to its customers that deduplicated each customer’s files six times using different block sizes (1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB). The application then aggregated the resulting compression rates by file type and these results were returned to Arkeia. Arkeia then aggregated these results across customers to determine the optimal block size for each specific file type.
Arkeia effectively teaches its compression solution to deduplicate different types of files with blocks of different sizes. Smaller block sizes are more costly, so Arkeia does not use them if the benefit does not exceed the computational cost. Knowing the optimal block size for each file type gives Arkeia another leg up on competitive deduplication offerings.
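The survey described above could be approximated along these lines. This is a simplified sketch rather than Arkeia's tool: it uses plain fixed-block deduplication, charges an assumed 32 bytes of pointer metadata per block, and runs on two invented file types ("log" and "iso") with synthetic contents.

```python
import hashlib
import random

BLOCK_SIZES = [1024, 2048, 4096, 8192, 16384, 32768]  # the six sizes in the article
POINTER_BYTES = 32                                    # assumed metadata per block

def deduplicated_size(files: list[bytes], block_size: int) -> int:
    """Bytes stored for a set of files: unique block payloads plus pointers."""
    seen: set[bytes] = set()
    total = 0
    for data in files:
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            total += POINTER_BYTES
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                total += len(block)
    return total

def best_block_sizes(corpus: dict[str, list[bytes]]) -> dict[str, int]:
    """For each file type, pick the block size that stores the fewest bytes."""
    return {ftype: min(BLOCK_SIZES, key=lambda s: deduplicated_size(files, s))
            for ftype, files in corpus.items()}

rng = random.Random(0)
records = [rng.randbytes(1024) for _ in range(32)]
image = rng.randbytes(32768)
corpus = {
    "log": [b"".join(records), b"".join(reversed(records))],  # same records, reordered
    "iso": [image, image],                                    # two identical copies
}
print(best_block_sizes(corpus))  # {'log': 1024, 'iso': 32768}
```

The reordered records deduplicate only at 1 KB granularity, whereas the identical copies deduplicate at any size, so the largest block wins there by carrying the least metadata.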
So what do Arkeia’s innovations in deduplication mean for end-users? Here are some items to consider:
1. Better deduplication rates mean lower bandwidth requirements and lower telecom costs. Today, offsite backup via WAN makes sense only for smallish data volumes, though that threshold is rising. Better deduplication techniques such as Arkeia’s will accelerate the upward movement of this threshold. In Arkeia’s case, they can facilitate the movement of large volumes of data to and from the cloud.
2. Arkeia sells backup solutions to which deduplication is integral, but other product categories could potentially benefit from its deduplication technology. Other areas where Arkeia’s deduplication technology could potentially be used include WAN replication, NAS and NAS cache. The efficiency and speed at which Arkeia’s Progressive Deduplication deduplicates data, coupled with the increases in CPU speed and the growing push to store and retrieve data from the cloud, could result in Progressive Deduplication being used to deduplicate production data.
3. It creates the opportunity for each organization to achieve an optimal deduplication ratio for its environment. Using Arkeia’s Progressive Deduplication, an organization can potentially deduplicate all of its data at a block size that is optimal for each file. While this will incur extra CPU cycles, it is within the realm of possibility as successive generations of CPUs grow more powerful and have cycles to spare.
As has been highlighted in this and other forums, not only is there an ongoing explosion of data within the enterprise, there is also a growing number of file types within organizations. This is creating new demand for deduplication solutions that deliver performance while optimizing storage capacity.
Arkeia’s Progressive Deduplication represents the next generation of deduplication technologies. By mitigating the performance overhead associated with variable-length deduplication and using the most appropriate block size when deduplicating each file type, Arkeia delivers both the performance and the capacity-optimized storage that organizations need today while positioning them for the business challenges that they are going to face tomorrow.
Bob Eastman of SMB Research also contributed to this blog entry.