In this multi-part post, I'm exploring two specific space-saving technologies, deduplication and compression, and EMC's recent entry into the market with the Celerra Data Deduplication product.
There are three commonly used deduplication technologies, plus compression, and the storage industry is now beginning to take advantage of them.
The end goal: storing more for less.
The three varieties of deduplication are fixed block, variable block and file level. They're all remarkably similar in terms of what they do; they single-instance some level of data. Variable block I see as a variety of deduplication akin to file level, so I'm going to hand-wave it away right now and focus on the file level and fixed block varieties.
First, let's clear up one point. Are compression and deduplication synonymous? Here's what one company thinks, in an EMC whitepaper, Achieving Storage Efficiency through EMC Celerra Data Deduplication, Jan 2009 (more of which in part 2).
- Compression is often considered to be different from deduplication. However, compression can be described as infinitely variable, bit-level, intra-object deduplication.
To be honest, I think that description is so wrong it's not even wrong. It might even have Shannon spinning in his grave.
- Deduplication is a form of delta encoding. In other words, only the differences between objects are stored; in the NetApp case, only different and unique blocks in the same volume.
- Compression is the process of encoding information using fewer bits than the original data would use. There's no delta and no deduplication.
I know why EMC are saying this. There are major differences between compression and deduplication in cost and outcome, particularly with respect to performance and capacity, but the industry buzz around the deduplication word is so attractive, it makes good marketing sense.
Yes, I know that EMC have a product called Celerra Data Deduplication, and that it deduplicates data. But it should really be called Celerra Data Compression And A Bit Of Deduplication, because compression is where the savings come from.
And the downsides, of which more in part 2.
File Level Deduplication
This works only on objects like files; the major difference between it and fixed block deduplication is the size of the object. First, it's variable length, and second, it's an all or nothing test for duplicates. It's often described as SIS -- Single Instance Storage.
If two files are identical, it's possible to store only one and have two directory entries point at the same file. UNIX has been supporting this for years with hard links; the links are aliases of the data that makes up the file. But that's an expensive active management process for the savings involved, and file level deduplication brings the advantages of automation and accuracy.
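To make the "two directory entries, one file" idea concrete, here's a minimal sketch using a hard link through Python's os.link. The file names and the demo_hard_link helper are made up for illustration, and it assumes a file system that supports hard links.

```python
import os
import tempfile

def demo_hard_link():
    """Create one file plus a second directory entry pointing at the same data."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "report.txt")       # hypothetical name
        alias = os.path.join(d, "report-copy.txt")     # hypothetical name
        with open(original, "wb") as f:
            f.write(b"quarterly figures\n" * 1000)
        os.link(original, alias)                       # hard link: no data is copied
        same = os.path.samefile(original, alias)       # both names resolve to one file
        link_count = os.stat(original).st_nlink        # directory entries for that file
        return same, link_count

same, link_count = demo_hard_link()
print(same, link_count)
```

Doing this by hand for every duplicate is exactly the active management chore that automated file level deduplication takes away.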
At a minimum, there's a metadata (data describing data) overhead of counting the number of times a file is pointed to, and some hash of its contents so that files can be matched.
And file level deduplication depends on lots of duplicate files; files have to match for length and content exactly. One further limitation is that files must normally reside in the same file system. That can limit the benefit, and ratios will depend on how file systems are arranged and their contents.
Writing to files stored like this also brings a downside; the file must be copied. That's twice the space, and there's the overhead of the IO to read the original and write the copy too.
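As a rough illustration of the bookkeeping described above, here's a toy in-memory single-instance store. It's a sketch under my own assumptions rather than any vendor's implementation: the SingleInstanceStore class, its method names and the choice of SHA-256 as the content hash are all hypothetical.

```python
import hashlib

class SingleInstanceStore:
    """Toy file-level deduplicating store, held entirely in memory."""

    def __init__(self):
        self.blobs = {}      # content hash -> file contents, stored once
        self.refcount = {}   # content hash -> number of names pointing at it
        self.names = {}      # file name -> content hash

    def _release(self, name):
        """Drop a name's reference; free the contents when nothing points at them."""
        old = self.names.pop(name, None)
        if old is None:
            return
        self.refcount[old] -= 1
        if self.refcount[old] == 0:
            del self.refcount[old]
            del self.blobs[old]

    def write(self, name, data):
        """Store a whole file; identical contents are kept only once."""
        digest = hashlib.sha256(data).hexdigest()
        self._release(name)
        if digest not in self.blobs:
            self.blobs[digest] = data
            self.refcount[digest] = 0
        self.refcount[digest] += 1
        self.names[name] = digest

    def read(self, name):
        return self.blobs[self.names[name]]

    def stored_bytes(self):
        return sum(len(b) for b in self.blobs.values())

store = SingleInstanceStore()
doc = b"x" * 1000
store.write("a.doc", doc)
store.write("b.doc", doc)              # exact duplicate: no extra space used
before = store.stored_bytes()
store.write("b.doc", doc + b"!")       # one byte appended: a full second copy
after = store.stored_bytes()
print(before, after)
```

The last write shows the downside: change a single byte of one name and the all or nothing match fails, so the whole file is stored again.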
How much file-level deduplication can be achieved? That depends, and I've not seen much research around this issue. Anecdotally, 10 to 20% seems reasonable. A reminder; if it's higher than that, perhaps you need to do some housekeeping and just delete all those duplicates...
Fixed Block Deduplication
NetApp continues to be the only vendor with deduplication on primary storage. It's a simple technology in principle; for every block written to a volume, check if there's been an identical block written before. If the answer is yes, then release the block to the free block pool, and point at the duplicate block.
This is based on deduplicating 4K blocks, and it works on both SAN block-based and NAS file-based data. And because NetApp systems use pointers to blocks, reading and writing data in deduplicated volumes is simple. Read a block; get the block pointed to. Update a block or write a new block; write it, point to it, and the system will deduplicate at leisure later.
There's metadata associated with this technique; we need to save a unique fingerprint for each block.
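The principle can be sketched in a few lines. This is a toy model, not how a NetApp system is implemented: the DedupVolume class is hypothetical, SHA-256 stands in for whatever fingerprint a real system uses, the sketch trusts a fingerprint match without comparing the blocks, it deduplicates as data is written rather than at leisure later, and it never frees blocks when a file is overwritten.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as in the text

class DedupVolume:
    """Toy fixed-block deduplicating volume, held in memory."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block data, stored once
        self.files = {}    # file name -> list of fingerprints (the "pointers")

    def write(self, name, data):
        pointers = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            fp = hashlib.sha256(block).digest()
            self.blocks.setdefault(fp, block)   # keep only the first copy seen
            pointers.append(fp)
        self.files[name] = pointers

    def read(self, name):
        # Reading just follows the pointers, duplicate or not.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def logical_blocks(self):
        return sum(len(p) for p in self.files.values())

    def physical_blocks(self):
        return len(self.blocks)

vol = DedupVolume()
vm1 = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
vm2 = b"A" * BLOCK_SIZE * 3 + b"C" * BLOCK_SIZE
vol.write("vm1", vm1)
vol.write("vm2", vm2)
print(vol.logical_blocks(), vol.physical_blocks())
```

Two four-block files that differ only in their last block need three physical blocks between them, because the repeated "A" block is stored once however often it appears.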
It's all covered in this NetApp Technical Report, so I shan't repeat much more. But it's worth drawing out one of the more interesting tables, which indicates the levels of deduplication achievable:
| Data Types | Typical Space Savings | Range |
|---|---|---|
| Backup Data | 90% | 85-95% |
| VMware VMs | 70% | 50-90% |
| Geoseismic | 55% | 40-70% |
| Database Backups | 55% | 40-70% |
| Home Directories | 35% | 20-50% |
| CIFS Shares | 35% | 20-50% |
| E-mail PSTs | 30% | 20-40% |
| Mixed Enterprise Data | 30% | 20-40% |
| Document Archives | 25% | 20-30% |
| Engineering Archives | 25% | 20-30% |
For many classes of data, this is a considerable saving.
Compression
One thing that NetApp doesn't do is compression on its primary storage system (FAS and V Series). It's a feature of our VTL backup device, for reasons that will become apparent.
Compression is a well understood technology with a wide range of applications. Good compression depends on the type of data being compressed; the more random the data, the less likely it is to compress. For the kind of application data you might find on the average NAS box, as much as 50% should be achievable. That is a saving worth making.
There are downsides to compression. There's the not inconsiderable CPU load of compressing and inflating (decompressing) the data, although some systems (like NetApp's VTL) use specialised hardware to offload the task and avoid the processing impact.
More importantly, there are two things you can't do easily once the data is compressed: reading and writing.
Reading means metadata; you can't read compressed data without inflating it, and that normally means reading the whole object that was compressed, decompressing it and saving it in its original inflated form.
To get round this problem, systems such as NetApp's VTL retain a certain amount of metadata, so that the system can find blocks of compressed data, and just inflate these blocks.
An example: if I want to read bytes 10000 through 20000 of a compressed file, I look up the metadata to get the blocks that contain this range, read just them, and inflate in memory before passing back to the requestor. There's no need to reinflate the data on disk at all; it can stay in its compressed format.
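Here's a sketch of that range read, under assumptions of my own: data is compressed in independent 4096-byte chunks with zlib, and the metadata is a simple index of where each compressed chunk starts and how long it is. The function names and chunk size are illustrative and say nothing about how the VTL actually lays out its data.

```python
import zlib

CHUNK = 4096  # uncompressed bytes per chunk (an assumed value)

def compress_chunked(data):
    """Compress each chunk on its own; return the blob plus the index metadata."""
    blob = bytearray()
    index = []  # one (offset in blob, compressed length) entry per chunk
    for i in range(0, len(data), CHUNK):
        compressed = zlib.compress(data[i:i + CHUNK])
        index.append((len(blob), len(compressed)))
        blob += compressed
    return bytes(blob), index

def read_range(blob, index, start, end):
    """Return uncompressed bytes start..end (end exclusive), inflating only
    the chunks that overlap the range."""
    if end <= start:
        return b""
    first, last = start // CHUNK, (end - 1) // CHUNK
    parts = [zlib.decompress(blob[off:off + length])
             for off, length in index[first:last + 1]]
    skip = start - first * CHUNK
    return b"".join(parts)[skip:skip + (end - start)]

data = b"".join(b"record %06d\n" % i for i in range(8000))
blob, index = compress_chunked(data)
piece = read_range(blob, index, 10000, 20000)
print(len(data), len(blob), len(index), len(piece))
```

Only the chunks overlapping the requested range are inflated, and only in memory; the blob itself stays compressed. The price is the index, plus slightly worse compression than treating the file as one stream.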
Writing means inflation. Compressing data produces variable rates of savings, so it's not possible to simply update a section of a file by compressing it and getting it to exactly fit in the space available. The file has to be inflated first; again, the compression hardware can do this and avoid the CPU impact, but there's no avoiding the fact that the data is now occupying more space.
And there's additional IO to account for; the compressed file needs to be read, and the inflated file written back to disk, which can cause considerable delays on doing that first write.
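A quick way to see why an update can't simply be squeezed back into the old slot: two sections of identical uncompressed size can compress to very different sizes. This is a minimal sketch using zlib, with made-up contents for the two sections.

```python
import random
import zlib

def compressed_size(section):
    return len(zlib.compress(section))

rng = random.Random(0)                                    # fixed seed, repeatable
original = b"status=OK;" * 400                            # 4000 bytes, very repetitive
updated = bytes(rng.getrandbits(8) for _ in range(4000))  # 4000 bytes, random-looking

slot = compressed_size(original)     # space the section occupies on disk today
needed = compressed_size(updated)    # space the rewritten section would need
fits_in_place = needed <= slot
print(slot, needed, fits_in_place)
```

Same 4000 bytes before compression, but the updated section needs far more room than the slot it's meant to replace, so it can't be written back in place.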
(Worth noting that, as NetApp VTLs are archive and backup devices, they don't need to do updates like this; so the compression plus metadata approach works well.)
Compression levels of up to 50% are achievable for some, but not all, types of data. The data we store tends to be non-random in nature. Text files compress very well, as do certain Microsoft file formats such as PowerPoint. The notable exceptions are already-compressed file formats such as MPEGs and ZIPs.
And if you're wondering why you can't get any benefit by compressing compressed data, the pigeon-hole principle explains why.
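The counting argument fits in a few lines. A lossless compressor has to map different inputs to different outputs, but there are 2^n bit strings of length n and only 2^n - 1 strings that are strictly shorter, so at least one input can't shrink. The sketch below checks that count and then runs zlib twice over some made-up, word-salad text; the vocabulary and helper names are just for illustration.

```python
import random
import zlib

def inputs_of_length(n_bits):
    """Number of distinct bit strings exactly n_bits long."""
    return 2 ** n_bits

def strictly_shorter_outputs(n_bits):
    """Number of distinct bit strings shorter than n_bits (lengths 0..n_bits-1)."""
    return sum(2 ** k for k in range(n_bits))

rng = random.Random(0)  # fixed seed so the run is repeatable
words = [b"storage", b"block", b"file", b"volume",
         b"backup", b"data", b"pointer", b"hash"]
text = b" ".join(rng.choice(words) for _ in range(20000))

once = zlib.compress(text)    # first pass removes the redundancy
twice = zlib.compress(once)   # second pass has next to nothing left to remove
print(inputs_of_length(8), strictly_shorter_outputs(8))
print(len(text), len(once), len(twice))
```

The first pass shrinks the text considerably; what comes out looks close to random, which is why a second pass gains little or nothing.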
More in part 2.