Depending on whose numbers you believe, enterprise organizations may achieve deduplication ratios that range anywhere from as low as 4x to as high as 500x. Yet the highest of these ratios mainly make for good headlines and are seen only in rare circumstances in real-world enterprise environments. Further, these numbers are of little use when enterprises are trying to set realistic deduplication expectations.
Enterprises need to make business decisions based on a deduplication ratio that they can realistically achieve in their environment. Knowing this ratio is critical both to justifying a deduplication solution and to quantifying the benefits they can expect to realize once it is deployed.
Setting realistic expectations about deduplication ratios is not rocket science, but it does require some legwork. Five critical pieces of information that enterprises need to gather from their environment are:
- The type of data that is backed up
- Backup frequency
- The type of backups performed (differential, full, incremental)
- Data change rate
- Backup job retention
While some enterprises may inwardly groan at the thought of having to gather that data in order to arrive at a realistic deduplication ratio, this information is more accessible than they may realize. One storage consultant told me that when he helps his clients justify deploying deduplication solutions, he uses their existing backup software as the source for this information.
By looking at how the backup jobs are configured, he can tell how frequently backups occur, what types of backups are performed and how long backup jobs are retained. Finding out what data is included in each backup job is a little more difficult, but he is still able to do it.
He finds that backup administrators generally know what applications are running on what servers (physical or virtual). As such, they usually know which backup jobs represent application servers, database servers and file servers.
So he picks a representative sample of their servers (~5%) and examines what types of data reside in these backup jobs. This gives him a reasonably good idea as to what data they have in their environment without having to spend a great deal of time or money doing a full assessment.
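As a rough illustration of that sampling step, here is a minimal Python sketch. It is not the consultant's actual tooling: the function names, the server inventory and the per-server size breakdown are all hypothetical, and plain random sampling stands in for whatever judgment he applies when picking representative servers.

```python
import random
from collections import Counter


def sample_servers(servers, fraction=0.05, seed=None):
    """Pick a random sample of roughly `fraction` of the servers (at least one)."""
    if not servers:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    count = max(1, round(len(servers) * fraction))
    return rng.sample(servers, min(count, len(servers)))


def data_type_mix(sampled, sizes_by_server):
    """Return each data type's share of the backed-up gigabytes in the sample.

    `sizes_by_server` is a hypothetical mapping of server name to
    {data type: gigabytes}, e.g. compiled from backup job reports.
    """
    totals = Counter()
    for server in sampled:
        for data_type, gigabytes in sizes_by_server[server].items():
            totals[data_type] += gigabytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {data_type: gb / grand_total for data_type, gb in totals.items()}
```

In practice it may make more sense to sample separately within each server role (application, database, file) so that a small sample does not miss a whole class of data.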
Once enterprises have these five data points, they can then start to arrive at realistic conclusions as to what sort of deduplication ratios they might achieve. These are the types of expectations they should have depending on what type of data they have in their environment:
- Low expectations. Many organizations have some audio, image, video and even encrypted files in their environment. In these circumstances, they should only realistically expect to achieve deduplication ratios of about 4x for these files. The only way they might achieve higher deduplication ratios is if the exact same audio, image, video or encrypted file resides on multiple servers and full backups of the file are continually done. Otherwise, organizations that have files in one of these categories may want to forgo deduplication and simply back up these files to raw disk.
- Realistic expectations. In enterprises, the real end game of deduplication is not necessarily to reach the highest deduplication ratio possible. Rather, it is to drive the price point of disk down to where it becomes as affordable as tape, or more so. To do this, most find that the deduplication ratio that they need to achieve is 10 – 20x.
The good news is that most enterprises are hitting at least these numbers in their environments. Their backups are generally a combination of daily incremental and weekly full backups, with data in their environment changing on the order of 3-7% daily. This data typically meets the classification of “unstructured data,” which includes email and files.
- High expectations. Deduplication ratios in the range of 30 – 60x make the justification for implementing deduplication almost a no-brainer. Environments that achieve them generally have very low data change rates and do full backups of their data every time.
Organizations that are achieving these very high deduplication ratios are doing so by performing full database backups or full backups of VM images. By way of example, a university administrator I spoke with in late December confirmed that he was hitting a deduplication ratio of over 30x when backing up his VM images.
- Stratospheric but not impossible expectations. Few enterprises should expect to achieve deduplication ratios of 60x or greater in their environment, but it is not impossible. As more enterprises look to virtualize corporate desktops with technologies like VMware View, achieving these very high deduplication ratios when backing up desktops is conceivable, since desktops tend to have very small data change rates and all desktop images are pretty much the same.
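To see why backup type, change rate and retention matter so much, it can help to run the numbers through a deliberately crude model. The Python sketch below is my own simplification, not a vendor formula or the consultant's method. It assumes every incremental backup copies exactly the data that changed that day, every changed byte is brand new, and nothing is gained from compression or from duplicates across servers. The function name and its parameters are hypothetical. Because of those assumptions it will not reproduce the ranges quoted above; it only shows the direction in which each factor pushes the ratio.

```python
def estimate_dedup_ratio(daily_change_rate, retention_days, full_every_days=7):
    """Estimate logical-to-stored ratio, with sizes normalized to one full backup.

    daily_change_rate: fraction of the data that changes per day (0.05 = 5%)
    retention_days:    number of daily backups kept
    full_every_days:   1 = full backup every day, 7 = weekly full plus
                       daily incrementals
    """
    if not 0.0 <= daily_change_rate <= 1.0:
        raise ValueError("daily_change_rate must be between 0 and 1")
    if retention_days < 1 or full_every_days < 1:
        raise ValueError("retention_days and full_every_days must be >= 1")

    # Logical data: what the backup software sends over the retention period.
    logical = 0.0
    for day in range(retention_days):
        is_full = day % full_every_days == 0
        logical += 1.0 if is_full else daily_change_rate

    # Stored data: the first full backup plus each later day's new data
    # (assumption: all changed data is unique and incompressible).
    stored = 1.0 + daily_change_rate * (retention_days - 1)
    return logical / stored


# Compare a weekly-full schedule with a full-every-day schedule.
weekly_mix = estimate_dedup_ratio(0.05, retention_days=90, full_every_days=7)
daily_fulls = estimate_dedup_ratio(0.01, retention_days=90, full_every_days=1)
print(f"weekly fulls, 5% change: {weekly_mix:.1f}x")
print(f"daily fulls, 1% change:  {daily_fulls:.1f}x")
```

Even in this simplified form, daily full backups of slowly changing data come out far ahead of a weekly full plus incrementals, which is consistent with the pattern described for database and VM image backups.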
Setting realistic expectations as to what deduplication ratios an enterprise can achieve can be difficult when reported ratios range from 4x to 500x. Doing so depends on the enterprise first sampling the data in its environment, which can often be accomplished by looking at information that is available in its backup software.
It is only when this assessment is complete that enterprises can begin to set realistic expectations as to what sort of deduplication ratio they can achieve. Most will find that they can expect deduplication ratios that fall between 10 and 20x, while those with high numbers of audio, image, video and encrypted files should expect to hit deduplication ratios in the range of 4x for these files. It is only those that are well down the path of server or even desktop virtualization who can set higher expectations that range from 30 to 60x or more for this type of data.