What is the Best Data Deduplication Software?

The market for disk-based backup storage is measured in billions of dollars. Many well-known companies operate in it, and their products are recognized worldwide: EMC Data Domain, Symantec NetBackup, HP StoreOnce, IBM ProtecTIER, ExaGrid, and others.

How did this market begin, in what technological direction is it developing now, and how can different deduplication software products and devices be compared with each other?

How It Was

The first storage systems with deduplication appeared in the early 2000s. They were created to solve the problem of backing up exponentially growing data. As data in companies' production systems grew, backups to tape took so long that full backups no longer fit in the backup window, while the disk storage systems of the time were hard to use as backup storage because of their insufficient capacity.

As a result, backups could fail either for lack of time (with tapes) or for lack of space (with disks). The disk space problem could be solved by purchasing high-capacity storage, but that brought high storage costs.

Backup software products were originally designed on the assumption that the backup storage is a tape drive and that backups follow the grandfather-father-son rotation scheme:

  • “father” (full backup once a week);
  • “son” (incremental copies six days a week);
  • “grandfather” (old full backup, usually sent to offsite storage).

This approach generated a fairly large amount of backup data. It was relatively inexpensive with tapes, but with disks its cost increased significantly.
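
To see why the volume grows quickly, the weekly cycle can be put into numbers. This is a minimal Python sketch; the function name, the 10 TB full backup, the 200 GB daily incremental, and the four-week retention are illustrative assumptions, not figures from any product.

```python
def rotation_volume_gb(full_gb: int, incremental_gb: int, weeks: int) -> int:
    """Backup data kept for `weeks` weekly cycles without deduplication.

    Each cycle is assumed to be one full backup plus six daily incrementals.
    """
    fulls = weeks * full_gb
    incrementals = weeks * 6 * incremental_gb
    return fulls + incrementals


# Hypothetical example: 10 TB full backup, 200 GB changed per day, 4 weeks kept.
total = rotation_volume_gb(full_gb=10_000, incremental_gb=200, weeks=4)
print(total)  # 44800 GB, about 4.5 times the size of the protected data
```

Most of that volume comes from the repeated full backups, which largely contain the same data from week to week.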

In those days, only a few backup software products provided built-in deduplication of backup data. Storage systems with built-in deduplication appeared precisely to solve this problem: to reduce the cost of storing data on disk (eventually down to the level of tape).

A key success factor for these new devices was the fact that storage deduplication worked transparently and did not require any modifications to existing backup software.

How It Is Now

Since then, however, almost all backup software products have acquired built-in deduplication, and the cost of disks (the original obstacle to disk-based backup storage) has decreased significantly.

Moreover, many backup products can now deduplicate on the source side, that is, backup data is deduplicated before it is transferred to the backup repository. This reduces the load on the network, speeds up backups, and shortens the backup window. For this reason, many disk storage systems now include functions for integrating with such software products.
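
The idea of source-side deduplication can be sketched as follows. This is a simplified Python illustration, assuming fixed-size chunks identified by SHA-256 digests; the Repository class, the backup function, and the 4 KiB chunk size are hypothetical and do not reflect any particular product's protocol.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes


class Repository:
    """Hypothetical backup repository that stores chunks by digest."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def missing(self, digests):
        """Digests the repository does not hold yet."""
        return {d for d in digests if d not in self.chunks}

    def store(self, digest, chunk):
        self.chunks[digest] = chunk


def backup(data: bytes, repo: Repository):
    """Send only unknown chunks; return (recipe of digests, bytes sent)."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]
    needed = repo.missing(digests)  # only digests travel in this step
    sent = 0
    for digest, chunk in zip(digests, chunks):
        if digest in needed:
            repo.store(digest, chunk)
            needed.discard(digest)  # transfer each unique chunk only once
            sent += len(chunk)
    return digests, sent


# Four distinct chunks, then the same data with one chunk changed.
first = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
second = first[:2 * CHUNK_SIZE] + bytes([9]) * CHUNK_SIZE + first[3 * CHUNK_SIZE:]

repo = Repository()
_, sent_first = backup(first, repo)
recipe, sent_second = backup(second, repo)
print(sent_first, sent_second)  # 16384 4096
```

The second backup transfers a single chunk because the source already knows, from the digest exchange, that the repository holds the other three.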

Storage systems positioned as backup storage now also face competitive pressure from storage systems designed as primary storage for production environments (Primary Storage), since these often include deduplication at no extra cost.

A logical question arises: why do we still need specialized Backup Target storage systems, and how should they be used correctly? Summarizing the information from various manufacturers of such systems, they use the following three strategies:

  1. They claim that (under certain conditions) deduplication on Backup Target storage systems has advantages over deduplication built into backup products;
  2. They position their storage systems not only as a place to store a backup repository but also as a possible place to store an organization’s electronic document archive;
  3. They include backup software with their storage systems, or simply integrate their storage systems with backup software products (including those of other manufacturers).

Let’s consider each item in more detail.

Strategy #1. What is the best deduplication?

Take the deduplication factor, for example. Here we need to define precisely what we measure and how.

Some vendors advertise “30 to 1” deduplication for their products, while others advertise “3 to 1”. Does this mean that the first group's products are better than the second's? No: the figures were calculated on different data sets, which is why the results differ so much. A “deduplication factor” quoted as a constant is therefore more of a marketing term: each manufacturer measures it on different data, so products cannot be compared on that basis.
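
The dependence on the data set is easy to reproduce. The Python sketch below runs one and the same toy deduplicator over two synthetic data sets; the fixed 4 KiB chunking, the helper names, and the data sets themselves are assumptions chosen for illustration, not a model of any vendor's algorithm.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes


def dedup_ratio(data: bytes) -> float:
    """Logical size divided by the size of the unique chunks that must be stored."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks}
    return len(data) / sum(unique.values())


def block(n: int) -> bytes:
    """A deterministic 4 KiB block whose content depends only on n."""
    return n.to_bytes(4, "big") * (CHUNK_SIZE // 4)


# Data set A: the same block written 30 times.
data_a = block(0) * 30
# Data set B: 10 different blocks, each written 3 times.
data_b = b"".join(block(n) for n in range(10)) * 3

print(dedup_ratio(data_a))  # 30.0
print(dedup_ratio(data_b))  # 3.0
```

The algorithm is identical in both runs; only the input changes, and with it the ratio.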

Equivalent storage capacity and equivalent backup performance are likewise purely marketing criteria, since both are derived from the deduplication factor.

Unfortunately, there is currently no industry standard (not even a de facto one) for assessing the deduplication ratio, because the backup process itself differs between products (data volume, frequency, and so on). Only the transfer rate of the original data can be considered an objective metric.

At the same time, the most meaningful criterion for comparing deduplication tools may ultimately be the disk space saved in the backup repository over a certain period (rather than the deduplication ratio). Unfortunately, this usually cannot be determined in advance, before buying.
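
The difference between a ratio and the space actually saved can be shown with simple arithmetic. In this Python sketch the function name and both scenarios are hypothetical; the point is only that a high ratio on a small data set may save far less disk space than a modest ratio on a large one.

```python
def space_saved_gb(logical_gb: float, ratio: float) -> float:
    """Disk space saved when `logical_gb` of backups is stored at `ratio`:1."""
    return logical_gb - logical_gb / ratio


# Hypothetical scenarios over the same period.
small_high_ratio = space_saved_gb(logical_gb=1_000, ratio=30)   # 1 TB at 30:1
large_low_ratio = space_saved_gb(logical_gb=100_000, ratio=3)   # 100 TB at 3:1

print(round(small_high_ratio), round(large_low_ratio))  # 967 66667
```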

Strategy #2. Backup Target storage as an electronic archive

Repositioning Backup Target storage as storage that can be used not only to store a backup repository but also to store an organization’s electronic archives is a good idea. However, the requirements for storage systems in these two cases differ significantly.

Archives, unlike backups, by their very nature rarely contain duplicate information. Archives should provide the ability to quickly search for individual items, while access to backups is relatively rare.

These differences in requirements mean that storage systems for the two tasks still need different architectures. Manufacturers are taking steps in this direction, for example by changing the file system architecture of their storage systems. In doing so, however, they are essentially moving towards a universal file system and a universal storage system (and competition with universal storage systems has already been mentioned above).

Strategy #3. Integration of storage systems with backup software products

As for the idea of integrating backup software products with storage systems, it looks very reasonable, provided the integration exists not just in marketing materials but at the technological level.

For example, storage systems take hardware snapshots of their disks as efficiently as possible (achieving the lowest possible RPO in practice, since any software implementation from a third-party vendor will most likely be slower).

At the same time, backup software products perform other important backup functions well: building a repository and organizing long-term storage of backups, performing backup testing procedures, and quickly restoring data in case of failure (minimizing RTO).

Such a technological “symbiosis” between manufacturers of backup software products and hardware storage systems makes it possible to obtain the most effective solutions for the user.

Conclusion

Over the past 10 years, the market for products and devices with deduplication has evolved technologically: they have begun to complement each other in functionality. There has been a shift from deduplication at the backup repository to deduplication on the source side, or to a combination of the two.

There is no point in comparing deduplication effectiveness by “deduplication ratios” and metrics derived from them, since they depend strongly on the source data, the nature of its daily changes, network bandwidth, and other environmental factors.

At present, when designing a backup infrastructure, it is best to consider not “hardware storage systems with deduplication” and “backup software products” separately, but integrated, complementary bundles of software and storage.
