I must confess - I was quite surprised when this deduplication (“dedupe”) related patent was granted barely two years after we filed it! That too in a field littered with patent applications – every other company seems to have filed one (and in some cases, many more).
Ever since I began looking at dedupe technology in 2007, it has fascinated me that none of these patent applications, and the numerous academic papers dealing with dedupe, have been able to get out of the strong gravitational pull of Ross Williams’ original patent on this subject, granted in 1999. Williams’ patent introduced the concept of chunking and deduping by storing chunks in a dictionary and searching for matching chunks. Chunking, of course, was an outgrowth of Rabin’s work on fingerprinting and its subsequent application to document similarity by Manber and Broder.
A large portion of my effort in late 2007 and early 2008 went towards studying the chunking method and trying to understand why the numerous inventors who seem to have been running in circles around the concept of dedupe made some of the design decisions inherent in their patents. As I tried to better understand how well the chunking method could scale, some of the questions I found myself asking were:
- Why do chunks need to be of variable length?
- Why use large chunk sizes?
- Why must fingerprint be calculated at each byte?
- Why must strong hashes be used?
This analysis led me to the realization that the technique has one fundamental flaw: it was designed for systems and networks of the early 90’s! Specifically, it was targeted to a single-core CPU system with hard disk storage, connected to the external world by a narrow pipe. It became clear that no amount of tinkering would enable it to scale to multi-gig rates, let alone tens and hundreds of gigs. A new approach was needed to dedupe network traffic running at 10 Gbps speeds (and higher), with dedupe performance at least as good as that of the chunking method.
The search for a new approach was torturous and highly frustrating. There were no bread crumbs to follow like in the chunking world. I even studied some efforts that borrow tricks from the compression world such as differential or delta compression, but they were either too complex to scale or too probabilistic to provide any deterministic estimates for how good the resulting dedupe might be.
The first promising approach took off from chunking but rectified its weaknesses. It was based on very small fixed-length chunks and could scale; memory bandwidth was the bottleneck but appeared feasible based on the DDR3 technologies emerging at that time. Unfortunately, the euphoria evaporated once I talked to programmable logic vendors: – the memory bandwidth required was completely out of reach! On the other hand, given processor roadmaps available in early 2008, I couldn’t see a way to implement the algorithm in software and reach the desired throughput. This led to another round of intense searching, retaining fixed-length dictionary entries as the foundation to build on. Several options were looked at and had to be rejected due to some insurmountable bottleneck – most often, memory bandwidth. This, in turn, led to a frantic search for ways to drastically cut back on the memory bandwidth requirement while concurrently looking for ways to increase the bandwidth availability. The realization that it is not essential to look at each byte of the input during the first phase (where the input packet is aligned with the dictionary) finally brought the goal within reach by reducing the bandwidth needed in that phase to only about a tenth of its prior value.
However, we kept at it, chiseling away at the problem. The outlines of the work reported in the patent took shape in June 2008, with many details coming from some of the previously rejected variations. Two patent applications (the first of which has just been granted) were filed based on these ideas. A companion patent application is being reviewed by the USPTO.
I am pleased at this positive culmination of a lot of hard work. I can only hope that this patent spurs more ideas and innovation in the field of deduplication, similar to Ross Williams’s and Rabin’s seminal work.
Tags: DeduplicationVendor News
Back to all News