| 1 |
David G. Korn, AT&T Labs |
| 2 |
Joshua P. MacDonald, UC Berkeley |
| 3 |
Jeffrey C. Mogul, Compaq WRL |
| 4 |
Internet-Draft Kiem-Phong Vo, AT&T Labs |
| 5 |
Expires: 09 November 2002 09 November 2001 |
| 6 |
|
| 7 |
|
| 8 |
The VCDIFF Generic Differencing and Compression Data Format |
| 9 |
|
| 10 |
draft-korn-vcdiff-06.txt |
| 11 |
|
| 12 |
|
| 13 |
|
| 14 |
Status of this Memo |
| 15 |
|
| 16 |
This document is an Internet-Draft and is in full conformance |
| 17 |
with all provisions of Section 10 of RFC2026. |
| 18 |
|
| 19 |
Internet-Drafts are working documents of the Internet Engineering |
| 20 |
Task Force (IETF), its areas, and its working groups. Note that |
| 21 |
other groups may also distribute working documents as |
| 22 |
Internet-Drafts. |
| 23 |
|
| 24 |
Internet-Drafts are draft documents valid for a maximum of six |
| 25 |
months and may be updated, replaced, or obsoleted by other |
| 26 |
documents at any time. It is inappropriate to use Internet- |
| 27 |
Drafts as reference material or to cite them other than as |
| 28 |
"work in progress." |
| 29 |
|
| 30 |
The list of current Internet-Drafts can be accessed at |
| 31 |
http://www.ietf.org/ietf/1id-abstracts.txt |
| 32 |
|
| 33 |
The list of Internet-Draft Shadow Directories can be accessed at |
| 34 |
http://www.ietf.org/shadow.html. |
| 35 |
|
| 36 |
|
| 37 |
Abstract |
| 38 |
|
| 39 |
This memo describes a general, efficient and portable data format |
| 40 |
suitable for encoding compressed and/or differencing data so that |
| 41 |
they can be easily transported among computers. |
| 42 |
|
| 43 |
|
| 44 |
Table of Contents: |
| 45 |
|
| 46 |
1. EXECUTIVE SUMMARY ............................................ 2 |
| 47 |
2. CONVENTIONS .................................................. 3 |
| 48 |
3. DELTA INSTRUCTIONS ........................................... 4 |
| 49 |
4. DELTA FILE ORGANIZATION ...................................... 5 |
| 50 |
5. DELTA INSTRUCTION ENCODING ................................... 9 |
| 51 |
6. DECODING A TARGET WINDOW ..................................... 14 |
| 52 |
7. APPLICATION-DEFINED CODE TABLES .............................. 16 |
| 53 |
8. PERFORMANCE .................................................. 16 |
| 54 |
9. FURTHER ISSUES ............................................... 17 |
| 55 |
10. SUMMARY ...................................................... 18 |
| 56 |
11. ACKNOWLEDGEMENTS ............................................. 18 |
| 57 |
12. SECURITY CONSIDERATIONS ...................................... 18 |
| 58 |
13. SOURCE CODE AVAILABILITY ..................................... 18 |
| 59 |
14. INTELLECTUAL PROPERTY RIGHTS ................................. 18 |
| 60 |
15. IANA CONSIDERATIONS .......................................... 19 |
| 61 |
16. REFERENCES ................................................... 19 |
| 62 |
17. AUTHOR'S ADDRESS ............................................. 20 |
| 63 |
|
| 64 |
|
| 65 |
1. EXECUTIVE SUMMARY |
| 66 |
|
| 67 |
Compression and differencing techniques can greatly improve storage |
| 68 |
and transmission of files and file versions. Since files are often |
| 69 |
transported across machines with distinct architectures and performance |
| 70 |
characteristics, such data should be encoded in a form that is portable |
| 71 |
and can be decoded with little or no knowledge of the encoders. |
| 72 |
This document describes Vcdiff, a compact portable encoding format |
| 73 |
designed for these purposes. |
| 74 |
|
| 75 |
Data differencing is the process of computing a compact and invertible |
| 76 |
encoding of a "target file" given a "source file". Data compression |
| 77 |
is similar but without the use of source data. The UNIX utilities diff, |
| 78 |
compress, and gzip are well-known examples of data differencing and |
| 79 |
compression tools. For data differencing, the computed encoding is |
| 80 |
called a "delta file", and, for data compression, it is called |
| 81 |
a "compressed file". Delta and compressed files are good for storage |
| 82 |
and transmission as they are often smaller than the originals. |
| 83 |
|
| 84 |
Data differencing and data compression are traditionally treated |
| 85 |
as distinct types of data processing. However, as shown in the Vdelta |
| 86 |
technique by Korn and Vo [1], compression can be thought of as a special |
| 87 |
case of differencing in which the source data is empty. The basic idea |
| 88 |
is to unify the string parsing scheme used in the Lempel-Ziv'77 style |
| 89 |
compressors [2], and the block-move technique of Tichy [3]. Loosely |
| 90 |
speaking, this works as follows: |
| 91 |
|
| 92 |
a. Concatenate source and target data. |
| 93 |
b. Parse the data from left to right as in LZ'77 but |
| 94 |
make sure that a parsed segment starts the target data. |
| 95 |
c. Start to output when reaching target data. |
| 96 |
|
| 97 |
Parsing is based on string matching algorithms such as suffix trees [4] |
| 98 |
or hashing with different time and space performance characteristics. |
| 99 |
Vdelta uses a fast string matching algorithm that requires less memory |
| 100 |
than other techniques [5,6]. However, even with this algorithm, the |
| 101 |
memory requirement can still be prohibitive for large files. A common |
| 102 |
way to deal with memory limitation is to partition an input file into |
| 103 |
chunks called "windows" and process them separately. Here, except for |
| 104 |
unpublished work by Vo, little has been done on designing effective |
| 105 |
windowing schemes. Current techniques, including Vdelta, simply use |
| 106 |
source and target windows with corresponding addresses across source |
| 107 |
and target files. |
| 108 |
|
| 109 |
String matching and windowing algorithms have large influence on the |
| 110 |
compression rate of delta and compressed files. However, it is desirable |
| 111 |
to have a portable encoding format that is independent of such algorithms. |
| 112 |
This enables construction of client-server applications in which a server |
| 113 |
may serve clients with unknown computing characteristics. Unfortunately, |
| 114 |
all current differencing and compressing tools, including Vdelta, fall |
| 115 |
short in this respect. Their storage formats are closely intertwined |
| 116 |
with the implemented string matching and/or windowing algorithms. |
| 117 |
|
| 118 |
The encoding format Vcdiff proposed here addresses the above issues. |
| 119 |
Vcdiff achieves the below characteristics: |
| 120 |
|
| 121 |
Output compactness: |
| 122 |
The basic encoding format compactly represents compressed or delta |
| 123 |
files. Applications can further extend the basic encoding format |
| 124 |
with "secondary encoders" to achieve more compression. |
| 125 |
|
| 126 |
Data portability: |
| 127 |
The basic encoding format is free from machine byte order and |
| 128 |
word size issues. This allows data to be encoded on one machine |
| 129 |
and decoded on a different machine with different architecture. |
| 130 |
|
| 131 |
Algorithm genericity: |
| 132 |
The decoding algorithm is independent from string matching and |
| 133 |
windowing algorithms. This allows competition among implementations |
| 134 |
of the encoder while keeping the same decoder. |
| 135 |
|
| 136 |
Decoding efficiency: |
| 137 |
Except for secondary encoder issues, the decoding algorithm runs |
| 138 |
in time proportional to the size of the target file and uses space |
| 139 |
proportional to the maximal window size. Vcdiff differs from more |
| 140 |
conventional compressors in that it uses only byte-aligned |
| 141 |
data, thus avoiding bit-level operations, which improves |
| 142 |
decoding speed at the slight cost of compression efficiency. |
| 143 |
|
| 144 |
The Vcdiff data format and the algorithms for decoding data shall be |
| 145 |
described next. Since Vcdiff treats compression as a special case of |
| 146 |
differencing, we shall use the term "delta file" to indicate the |
| 147 |
compressed output for both cases. |
| 148 |
|
| 149 |
|
| 150 |
2. CONVENTIONS |
| 151 |
|
| 152 |
The basic data unit is a byte. For portability, Vcdiff shall limit |
| 153 |
a byte to its lower eight bits even on machines with larger bytes. |
| 154 |
The bits in a byte are ordered from right to left so that the least |
| 155 |
significant bit (LSB) has value 1, and the most significant bit (MSB), |
| 156 |
has value 128. |
| 157 |
|
| 158 |
For purposes of exposition in this document, we adopt the convention |
| 159 |
that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers |
| 160 |
never appear in the encoded format itself. |
| 161 |
|
| 162 |
Vcdiff encodes unsigned integer values using a portable variable-sized |
| 163 |
format (originally introduced in the Sfio library [7]). This encoding |
| 164 |
treats an integer as a number in base 128. Then, each digit in this |
| 165 |
representation is encoded in the lower seven bits of a byte. Except for |
| 166 |
the least significant byte, other bytes have their most significant bit |
| 167 |
turned on to indicate that there are still more digits in the encoding. |
| 168 |
The two key properties of this integer encoding that are beneficial |
| 169 |
to a data compression format are: |
| 170 |
|
| 171 |
a. The encoding is portable among systems using 8-bit bytes, and |
| 172 |
b. Small values are encoded compactly. |
| 173 |
|
| 174 |
For example, consider the value 123456789 which can be represented with |
| 175 |
four 7-bit digits whose values are 58, 111, 26, 21 in order from most |
| 176 |
to least significant. Below is the 8-bit byte encoding of these digits. |
| 177 |
Note that the MSBs of 58, 111 and 26 are on. |
| 178 |
|
| 179 |
+-------------------------------------------+ |
| 180 |
| 10111010 | 11101111 | 10011010 | 00010101 | |
| 181 |
+-------------------------------------------+ |
| 182 |
MSB+58 MSB+111 MSB+26 0+21 |
| 183 |
|
| 184 |
|
| 185 |
Henceforth, the terms "byte" and "integer" will refer to a byte and an |
| 186 |
unsigned integer as described. |
| 187 |
|
| 188 |
|
| 189 |
From time to time, algorithms are exhibited to clarify the descriptions |
| 190 |
of parts of the Vcdiff format. On such occasions, the C language will be |
| 191 |
used to make precise the algorithms. The C code shown in this |
| 192 |
document is meant for clarification only, and is not part of the |
| 193 |
actual specification of the Vcdiff format. |
| 194 |
|
| 195 |
In this specification, the key words "MUST", "MUST NOT", |
| 196 |
"SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as |
| 197 |
described in RFC2119 [12]. |
| 198 |
|
| 199 |
|
| 200 |
3. DELTA INSTRUCTIONS |
| 201 |
|
| 202 |
A large target file is partitioned into non-overlapping sections |
| 203 |
called "target windows". These target windows are processed separately |
| 204 |
and sequentially based on their order in the target file. |
| 205 |
|
| 206 |
A target window T of length t may be compared against some source data |
| 207 |
segment S of length s. By construction, this source data segment S |
| 208 |
comes either from the source file, if one is used, or from a part of |
| 209 |
the target file earlier than T. In this way, during decoding, S is |
| 210 |
completely known when T is being decoded. |
| 211 |
|
| 212 |
The choices of T, t, S and s are made by some window selection algorithm |
| 213 |
which can greatly affect the size of the encoding. However, as seen later, |
| 214 |
these choices are encoded so that no knowledge of the window selection |
| 215 |
algorithm is needed during decoding. |
| 216 |
|
| 217 |
Assume that S[j] represents the jth byte in S, and T[k] represents |
| 218 |
the kth byte in T. Then, for the delta instructions, we treat the data |
| 219 |
windows S and T as substrings of a superstring U formed by concatenating |
| 220 |
them like this: |
| 221 |
|
| 222 |
S[0]S[1]...S[s-1]T[0]T[1]...T[t-1] |
| 223 |
|
| 224 |
The "address" of a byte in S or T is referred to by its location in U. |
| 225 |
For example, the address of T[k] is s+k. |
| 226 |
|
| 227 |
The instructions to encode and direct the reconstruction of a target |
| 228 |
window are called delta instructions. There are three types: |
| 229 |
|
| 230 |
ADD: This instruction has two arguments, a size x and a sequence of |
| 231 |
x bytes to be copied. |
| 232 |
COPY: This instruction has two arguments, a size x and an address p |
| 233 |
in the string U. The arguments specify the substring of U that |
| 234 |
must be copied. We shall assert that such a substring must be |
| 235 |
entirely contained in either S or T. |
| 236 |
RUN: This instruction has two arguments, a size x and a byte b that |
| 237 |
will be repeated x times. |
| 238 |
|
| 239 |
Below are example source and target windows and the delta instructions |
| 240 |
that encode the target window in terms of the source window. |
| 241 |
|
| 242 |
a b c d e f g h i j k l m n o p |
| 243 |
a b c d w x y z e f g h e f g h e f g h e f g h z z z z |
| 244 |
|
| 245 |
COPY 4, 0 |
| 246 |
ADD 4, w x y z |
| 247 |
COPY 4, 4 |
| 248 |
COPY 12, 24 |
| 249 |
RUN 4, z |
| 250 |
|
| 251 |
|
| 252 |
Thus, the first letter 'a' in the target window is at location 16 |
| 253 |
in the superstring. Note that the fourth instruction, "COPY 12, 24", |
| 254 |
copies data from T itself since address 24 is position 8 in T. |
| 255 |
This instruction also shows that it is fine to overlap the data to be |
| 256 |
copied with the data being copied from as long as the latter starts |
| 257 |
earlier. This enables efficient encoding of periodic sequences, |
| 258 |
i.e., sequences with regularly repeated subsequences. The RUN instruction |
| 259 |
is a compact way to encode a sequence repeating the same byte even though |
| 260 |
such a sequence can be thought of as a periodic sequence with period 1. |
| 261 |
|
| 262 |
To reconstruct the target window, one simply processes one delta |
| 263 |
instruction at a time and copy the data either from the source window |
| 264 |
or the being reconstructed target window based on the type of the |
| 265 |
instruction and the associated address, if any. |
| 266 |
|
| 267 |
|
| 268 |
4. DELTA FILE ORGANIZATION |
| 269 |
|
| 270 |
A Vcdiff delta file starts with a Header section followed by a sequence |
| 271 |
of Window sections. The Header section includes magic bytes to identify |
| 272 |
the file type, and information concerning data processing beyond the |
| 273 |
basic encoding format. The Window sections encode the target windows. |
| 274 |
|
| 275 |
Below is the overall organization of a delta file. The indented items |
| 276 |
refine the ones immediately above them. An item in square brackets may |
| 277 |
or may not be present in the file depending on the information encoded |
| 278 |
in the Indicator byte above it. |
| 279 |
|
| 280 |
Header |
| 281 |
Header1 - byte |
| 282 |
Header2 - byte |
| 283 |
Header3 - byte |
| 284 |
Header4 - byte |
| 285 |
Hdr_Indicator - byte |
| 286 |
[Secondary compressor ID] - byte |
| 287 |
|
| 288 |
[@@@ Why is compressor ID not an integer? ] |
| 289 |
[@@@ If we aren't defining any secondary compressors yet, then it seems |
| 290 |
that defining the [Secondary compressor ID] and the corresponding |
| 291 |
VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value. An |
| 292 |
implementation of this specification won't be able to decode a VCDIFF |
| 293 |
encoded with this option if it doesn't know about any secondary |
| 294 |
compressors. It seems that you should specify the bits related to |
| 295 |
secondary compressors once you have defined the first a secondary |
| 296 |
compressor. I can imagine a secondary-compressor might want to supply |
| 297 |
extra information, such as a dictionary of some kind, in which case |
| 298 |
this speculative treatment wouldn't go far enough.] |
| 299 |
|
| 300 |
[Length of code table data] - integer |
| 301 |
[Code table data] |
| 302 |
Size of near cache - byte |
| 303 |
Size of same cache - byte |
| 304 |
Compressed code table data |
| 305 |
Window1 |
| 306 |
Win_Indicator - byte |
| 307 |
[Source segment size] - integer |
| 308 |
[Source segment position] - integer |
| 309 |
The delta encoding of the target window |
| 310 |
Length of the delta encoding - integer |
| 311 |
The delta encoding |
| 312 |
Size of the target window - integer |
| 313 |
Delta_Indicator - byte |
| 314 |
Length of data for ADDs and RUNs - integer |
| 315 |
Length of instructions and sizes - integer |
| 316 |
Length of addresses for COPYs - integer |
| 317 |
Data section for ADDs and RUNs - array of bytes |
| 318 |
Instructions and sizes section - array of bytes |
| 319 |
Addresses section for COPYs - array of bytes |
| 320 |
Window2 |
| 321 |
... |
| 322 |
|
| 323 |
|
| 324 |
|
| 325 |
4.1 The Header Section |
| 326 |
|
| 327 |
Each delta file starts with a header section organized as below. |
| 328 |
Note the convention that square-brackets enclose optional items. |
| 329 |
|
| 330 |
Header1 - byte = 0xE6 |
| 331 |
Header2 - byte = 0xD3 |
| 332 |
Header3 - byte = 0xD4 |
| 333 |
|
| 334 |
HMMM |
| 335 |
|
| 336 |
0xD6 |
| 337 |
0xC3 |
| 338 |
0xC4 |
| 339 |
|
| 340 |
Header4 - byte |
| 341 |
Hdr_Indicator - byte |
| 342 |
[Secondary compressor ID] - byte |
| 343 |
[Length of code table data] - integer |
| 344 |
[Code table data] |
| 345 |
|
| 346 |
The first three Header bytes are the ASCII characters 'V', 'C' and 'D' |
| 347 |
with their most significant bits turned on (in hexadecimal, the values |
| 348 |
are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to |
| 349 |
zero. In the future, it might be used to indicate the version of Vcdiff. |
| 350 |
|
| 351 |
The Hdr_Indicator byte shows if there are any initialization data |
| 352 |
required to aid in the reconstruction of data in the Window sections. |
| 353 |
This byte MAY have non-zero values for either, both, or neither of |
| 354 |
the two bits VCD_DECOMPRESS and VCD_CODETABLE below: |
| 355 |
|
| 356 |
7 6 5 4 3 2 1 0 |
| 357 |
+-+-+-+-+-+-+-+-+ |
| 358 |
| | | | | | | | | |
| 359 |
+-+-+-+-+-+-+-+-+ |
| 360 |
^ ^ |
| 361 |
| | |
| 362 |
| +-- VCD_DECOMPRESS |
| 363 |
+---- VCD_CODETABLE |
| 364 |
|
| 365 |
If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary |
| 366 |
compressor may have been used to further compress certain parts of the |
| 367 |
delta encoding data as described in Sections 4.3 and 6. In that case, |
| 368 |
the ID of the secondary compressor is given next. If this bit is zero, |
| 369 |
the compressor ID byte is not included. |
| 370 |
|
| 371 |
[@@@ If we aren't defining any secondary compressors yet, then it seems |
| 372 |
this bit has no real value yet..] |
| 373 |
|
| 374 |
If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an |
| 375 |
application-defined code table is to be used for decoding the delta |
| 376 |
instructions. This table itself is compressed. The length of the data |
| 377 |
comprising this compressed code table and the data follow next. Section 7 |
| 378 |
discusses application-defined code tables. If this bit is zero, the code |
| 379 |
table data length and the code table data are not included. |
| 380 |
|
| 381 |
If both bits are set, then the compressor ID byte is included |
| 382 |
before the code table data length and the code table data. |
| 383 |
|
| 384 |
|
| 385 |
4.2 The Format of a Window Section |
| 386 |
|
| 387 |
Each Window section is organized as follows: |
| 388 |
|
| 389 |
Win_Indicator - byte |
| 390 |
[Source segment length] - integer |
| 391 |
[Source segment position] - integer |
| 392 |
The delta encoding of the target window |
| 393 |
|
| 394 |
|
| 395 |
Below are the detail of the various items: |
| 396 |
|
| 397 |
[@@@ Here, I want to replace the Win_Indicator with a source-count, |
| 398 |
followed by source-count length/position pairs?] |
| 399 |
|
| 400 |
Win_Indicator: |
| 401 |
This byte is a set of bits, as shown: |
| 402 |
|
| 403 |
7 6 5 4 3 2 1 0 |
| 404 |
+-+-+-+-+-+-+-+-+ |
| 405 |
| | | | | | | | | |
| 406 |
+-+-+-+-+-+-+-+-+ |
| 407 |
^ ^ |
| 408 |
| | |
| 409 |
| +-- VCD_SOURCE |
| 410 |
+---- VCD_TARGET |
| 411 |
|
| 412 |
|
| 413 |
If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment |
| 414 |
of data from the "source" file was used as the corresponding |
| 415 |
source window of data to encode the target window. The decoder |
| 416 |
will use this same source data segment to decode the target window. |
| 417 |
|
| 418 |
If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment |
| 419 |
of data from the "target" file was used as the corresponding |
| 420 |
source window of data to encode the target window. As above, this |
| 421 |
same source data segment is used to decode the target window. |
| 422 |
|
| 423 |
The Win_Indicator byte MUST NOT have more than one of the bits |
| 424 |
set (non-zero). It MAY have none of these bits set. |
| 425 |
|
| 426 |
If one of these bits is set, the byte is followed by two |
| 427 |
integers to indicate respectively the length and position of |
| 428 |
the source data segment in the relevant file. If the |
| 429 |
indicator byte is zero, the target window was compressed |
| 430 |
by itself without comparing against another data segment, |
| 431 |
and these two integers are not included. |
| 432 |
|
| 433 |
The delta encoding of the target window: |
| 434 |
This contains the delta encoding of the target window either |
| 435 |
in terms of the source data segment (i.e., VCD_SOURCE |
| 436 |
or VCD_TARGET was set) or by itself if no source window |
| 437 |
is specified. This data format is discussed next. |
| 438 |
|
| 439 |
|
| 440 |
4.3 The Delta Encoding of a Target Window |
| 441 |
|
| 442 |
The delta encoding of a target window is organized as follows: |
| 443 |
|
| 444 |
Length of the delta encoding - integer |
| 445 |
The delta encoding |
| 446 |
Length of the target window - integer |
| 447 |
Delta_Indicator - byte |
| 448 |
Length of data for ADDs and RUNs - integer |
| 449 |
Length of instructions section - integer |
| 450 |
Length of addresses for COPYs - integer |
| 451 |
Data section for ADDs and RUNs - array of bytes |
| 452 |
Instructions and sizes section - array of bytes |
| 453 |
Addresses section for COPYs - array of bytes |
| 454 |
|
| 455 |
|
| 456 |
Length of the delta encoding: |
| 457 |
This integer gives the total number of remaining bytes that |
| 458 |
comprise data of the delta encoding for this target window. |
| 459 |
|
| 460 |
The delta encoding: |
| 461 |
This contains the data representing the delta encoding which |
| 462 |
is described next. |
| 463 |
|
| 464 |
Length of the target window: |
| 465 |
This integer indicates the actual size of the target window |
| 466 |
after decompression. A decoder can use this value to allocate |
| 467 |
memory to store the uncompressed data. |
| 468 |
|
| 469 |
Delta_Indicator: |
| 470 |
This byte is a set of bits, as shown: |
| 471 |
|
| 472 |
7 6 5 4 3 2 1 0 |
| 473 |
+-+-+-+-+-+-+-+-+ |
| 474 |
| | | | | | | | | |
| 475 |
+-+-+-+-+-+-+-+-+ |
| 476 |
^ ^ ^ |
| 477 |
| | | |
| 478 |
| | +-- VCD_DATACOMP |
| 479 |
| +---- VCD_INSTCOMP |
| 480 |
+------ VCD_ADDRCOMP |
| 481 |
|
| 482 |
VCD_DATACOMP: bit value 1. |
| 483 |
VCD_INSTCOMP: bit value 2. |
| 484 |
VCD_ADDRCOMP: bit value 4. |
| 485 |
|
| 486 |
As discussed, the delta encoding consists of COPY, ADD and RUN |
| 487 |
instructions. The ADD and RUN instructions have accompanying |
| 488 |
unmatched data (that is, data that does not specifically match |
| 489 |
any data in the source window or in some earlier part of the |
| 490 |
target window) and the COPY instructions have addresses of where |
| 491 |
the matches occur. OPTIONALLY, these types of data MAY be further |
| 492 |
compressed using a secondary compressor. Thus, Vcdiff separates |
| 493 |
the encoding of the delta instructions into three parts: |
| 494 |
|
| 495 |
a. The unmatched data in the ADD and RUN instructions, |
| 496 |
b. The delta instructions and accompanying sizes, and |
| 497 |
c. The addresses of the COPY instructions. |
| 498 |
|
| 499 |
If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these |
| 500 |
sections may have been compressed using the specified secondary |
| 501 |
compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP), |
| 502 |
and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that |
| 503 |
the corresponding parts are compressed. Then, these parts MUST |
| 504 |
be decompressed before decoding the delta instructions. |
| 505 |
|
| 506 |
Length of data for ADDs and RUNs: |
| 507 |
This is the length (in bytes) of the section of data storing |
| 508 |
the unmatched data accompanying the ADD and RUN instructions. |
| 509 |
|
| 510 |
Length of instructions section: |
| 511 |
This is the length (in bytes) of the delta instructions and |
| 512 |
accompanying sizes. |
| 513 |
|
| 514 |
Length of addresses for COPYs: |
| 515 |
This is the length (in bytes) of the section storing |
| 516 |
the addresses of the COPY instructions. |
| 517 |
|
| 518 |
Data section for ADDs and RUNs: |
| 519 |
This sequence of bytes encodes the unmatched data for the ADD |
| 520 |
and RUN instructions. |
| 521 |
|
| 522 |
Instructions and sizes section: |
| 523 |
This sequence of bytes encodes the instructions and their sizes. |
| 524 |
|
| 525 |
Addresses section for COPYs: |
| 526 |
This sequence of bytes encodes the addresses of the COPY |
| 527 |
instructions. |
| 528 |
|
| 529 |
|
| 530 |
5. DELTA INSTRUCTION ENCODING |
| 531 |
|
| 532 |
The delta instructions described in Section 3 represent the results of |
| 533 |
string matching. For many data differencing applications in which the |
| 534 |
changes between source and target data are small, any straightforward |
| 535 |
representation of these instructions would be adequate. However, for |
| 536 |
applications including data compression, it is important to encode |
| 537 |
these instructions well to achieve good compression rates. From our |
| 538 |
experience, the following observations can be made: |
| 539 |
|
| 540 |
a. The addresses in COPY instructions are locations of matches and |
| 541 |
often occur close by or even exactly equal to one another. This is |
| 542 |
because data in local regions are often replicated with minor changes. |
| 543 |
In turn, this means that coding a newly matched address against some |
| 544 |
set of recently matched addresses can be beneficial. |
| 545 |
|
| 546 |
b. The matches are often short in length and separated by small amounts |
| 547 |
of unmatched data. That is, the lengths of COPY and ADD instructions |
| 548 |
are often small. This is particularly true of binary data such as |
| 549 |
executable files or structured data such as HTML or XML. In such cases, |
| 550 |
compression can be improved by combining the encoding of the sizes |
| 551 |
and the instruction types as well as combining the encoding of adjacent |
| 552 |
delta instructions with sufficiently small data sizes. |
| 553 |
|
| 554 |
The below subsections discuss how the Vcdiff data format provides |
| 555 |
mechanisms enabling encoders to use the above observations to improve |
| 556 |
compression rates. |
| 557 |
|
| 558 |
|
| 559 |
5.1 Address Encoding Modes of COPY Instructions |
| 560 |
|
| 561 |
As mentioned earlier, addresses of COPY instructions often occur close |
| 562 |
to one another or are exactly equal. To take advantage of this phenomenon |
| 563 |
and encode addresses of COPY instructions more efficiently, the Vcdiff |
| 564 |
data format supports the use of two different types of address caches. |
| 565 |
Both the encoder and decoder maintain these caches, so that decoder's |
| 566 |
caches remain synchronized with the encoder's caches. |
| 567 |
|
| 568 |
a. A "near" cache is an array with "s_near" slots, each containing an |
| 569 |
address used for encoding addresses nearby to previously encoded |
| 570 |
addresses (in the positive direction only). The near cache also |
| 571 |
maintains a "next_slot" index to the near cache. New entries to the |
| 572 |
near cache are always inserted in the next_slot index, which maintains |
| 573 |
a circular buffer of the s_near most recent addresses. |
| 574 |
|
| 575 |
b. A "same" cache is an array with "s_same" multiple of 256 slots, each |
| 576 |
containing an address. The same cache maintains a hash table of recent |
| 577 |
addresses used for repeated encoding of the exact same address. |
| 578 |
|
| 579 |
|
| 580 |
By default, the parameters s_near and s_same are respectively set to 4 |
| 581 |
and 3. An encoder MAY modify these values, but then it MUST encode the |
| 582 |
new values in the encoding itself, as discussed in Section 7, so that |
| 583 |
the decoder can properly set up its own caches. |
| 584 |
|
| 585 |
At the start of processing a target window, an implementation |
| 586 |
(encoder or decoder) initializes all of the slots in both caches |
| 587 |
to zero. The next_slot pointer of the near cache is set |
| 588 |
to point to slot zero. |
| 589 |
|
| 590 |
Each time a COPY instruction is processed by the encoder or |
| 591 |
decoder, the implementation's caches are updated as follows, where |
| 592 |
"addr" is the address in the COPY instruction. |
| 593 |
|
| 594 |
a. The slot in the near cache referenced by the next_slot |
| 595 |
index is set to addr. The next_slot index is then incremented |
| 596 |
modulo s_near. |
| 597 |
|
| 598 |
b. The slot in the same cache whose index is addr%(s_same*256) |
| 599 |
is set to addr. [We use the C notations of % for modulo and |
| 600 |
* for multiplication.] |
| 601 |
|
| 602 |
|
| 603 |
5.2 Example code for maintaining caches |
| 604 |
|
| 605 |
To make clear the above description, below are example cache data |
| 606 |
structures and algorithms to initialize and update them: |
| 607 |
|
| 608 |
typedef struct _cache_s |
| 609 |
{ |
| 610 |
int* near; /* array of size s_near */ |
| 611 |
int s_near; |
| 612 |
int next_slot; /* the circular index for near */ |
| 613 |
int* same; /* array of size s_same*256 */ |
| 614 |
int s_same; |
| 615 |
} Cache_t; |
| 616 |
|
| 617 |
cache_init(Cache_t* ka) |
| 618 |
{ |
| 619 |
int i; |
| 620 |
|
| 621 |
ka->next_slot = 0; |
| 622 |
for(i = 0; i < ka->s_near; ++i) |
| 623 |
ka->near[i] = 0; |
| 624 |
|
| 625 |
for(i = 0; i < ka->s_same*256; ++i) |
| 626 |
ka->same[i] = 0; |
| 627 |
} |
| 628 |
|
| 629 |
cache_update(Cache_t* ka, int addr) |
| 630 |
{ |
| 631 |
if(ka->s_near > 0) |
| 632 |
{ ka->near[ka->next_slot] = addr; |
| 633 |
ka->next_slot = (ka->next_slot + 1) % ka->s_near; |
| 634 |
} |
| 635 |
|
| 636 |
if(ka->s_same > 0) |
| 637 |
ka->same[addr % (ka->s_same*256)] = addr; |
| 638 |
} |
| 639 |
|
| 640 |
|
| 641 |
5.3 Encoding of COPY instruction addresses |
| 642 |
|
| 643 |
The address of a COPY instruction is encoded using different modes |
| 644 |
depending on the type of cached address used, if any. |
| 645 |
|
| 646 |
Let "addr" be the address of a COPY instruction to be decoded and "here" |
| 647 |
be the current location in the target data (i.e., the start of the data |
| 648 |
about to be encoded or decoded). Let near[j] be the jth element in |
| 649 |
the near cache, and same[k] be the kth element in the same cache. |
| 650 |
Below are the possible address modes: |
| 651 |
|
| 652 |
VCD_SELF: This mode has value 0. The address was encoded by itself |
| 653 |
as an integer. |
| 654 |
|
| 655 |
VCD_HERE: This mode has value 1. The address was encoded as |
| 656 |
the integer value "here - addr". |
| 657 |
|
| 658 |
Near modes: The "near modes" are in the range [2,s_near+1]. Let m |
| 659 |
be the mode of the address encoding. The address was encoded |
| 660 |
as the integer value "addr - near[m-2]". |
| 661 |
|
| 662 |
Same modes: The "same modes" are in the range |
| 663 |
[s_near+2,s_near+s_same+1]. Let m be the mode of the encoding. |
| 664 |
The address was encoded as a single byte b such that |
| 665 |
"addr == same[(m - (s_near+2))*256 + b]". |
| 666 |
|
| 667 |
|
| 668 |
5.3 Example code for encoding and decoding of COPY instruction addresses |
| 669 |
|
| 670 |
We show example algorithms below to demonstrate use of address modes more |
| 671 |
clearly. The encoder has freedom to choose address modes, the sample |
| 672 |
addr_encode() algorithm merely shows one way of picking the address |
| 673 |
mode. The decoding algorithm addr_decode() will uniquely decode addresses |
| 674 |
regardless of the encoder's algorithm choice. |
| 675 |
|
| 676 |
Note that the address caches are updated immediately after an address is |
| 677 |
encoded or decoded. In this way, the decoder is always synchronized with |
| 678 |
the encoder. |
| 679 |
|
| 680 |
int addr_encode(Cache_t* ka, int addr, int here, int* mode) |
| 681 |
{ |
| 682 |
int i, d, bestd, bestm; |
| 683 |
|
| 684 |
/* Attempt to find the address mode that yields the |
| 685 |
* smallest integer value for "d", the encoded address |
| 686 |
* value, thereby minimizing the encoded size of the |
| 687 |
* address. */ |
| 688 |
|
| 689 |
bestd = addr; bestm = VCD_SELF; /* VCD_SELF == 0 */ |
| 690 |
|
| 691 |
if((d = here-addr) < bestd) |
| 692 |
{ bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */ |
| 693 |
|
| 694 |
for(i = 0; i < ka->s_near; ++i) |
| 695 |
if((d = addr - ka->near[i]) >= 0 && d < bestd) |
| 696 |
{ bestd = d; bestm = i+2; } |
| 697 |
|
| 698 |
if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr) |
| 699 |
{ bestd = d%256; bestm = ka->s_near + 2 + d/256; } |
| 700 |
|
| 701 |
cache_update(ka,addr); |
| 702 |
|
| 703 |
*mode = bestm; /* this returns the address encoding mode */ |
| 704 |
return bestd; /* this returns the encoded address */ |
| 705 |
} |
| 706 |
|
| 707 |
Note that the addr_encode() algorithm chooses the best address mode using a |
| 708 |
local optimization, but that may not lead to the best encoding efficiency |
| 709 |
because different modes lead to different instruction encodings, as described below. |
| 710 |
|
| 711 |
The functions addrint() and addrbyte() used in addr_decode() obtain from |
| 712 |
the "Addresses section for COPYs" (Section 4.3) an integer or a byte, |
| 713 |
respectively. These utilities will not be described here. We simply |
| 714 |
recall that an integer is represented as a compact variable-sized string |
| 715 |
of bytes as described in Section 2 (i.e., base 128). |
| 716 |
|
| 717 |
int addr_decode(Cache_t* ka, int here, int mode) |
| 718 |
{ int addr, m; |
| 719 |
|
| 720 |
if(mode == VCD_SELF) |
| 721 |
addr = addrint(); |
| 722 |
else if(mode == VCD_HERE) |
| 723 |
addr = here - addrint(); |
| 724 |
else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */ |
| 725 |
addr = ka->near[m] + addrint(); |
| 726 |
else /* same cache */ |
| 727 |
{ m = mode - (2 + ka->s_near); |
| 728 |
addr = ka->same[m*256 + addrbyte()]; |
| 729 |
} |
| 730 |
|
| 731 |
cache_update(ka, addr); |
| 732 |
|
| 733 |
return addr; |
| 734 |
} |
| 735 |
|
| 736 |
|
| 737 |
5.4 Instruction Codes |
| 738 |
|
| 739 |
As noted, the data sizes associated with delta instructions are often |
| 740 |
small. Thus, compression efficiency can be improved by combining the sizes |
| 741 |
and instruction types in a single encoding, as well by combining certain |
| 742 |
pairs of adjacent delta instructions. Effective choices of when to perform |
| 743 |
such combinations depend on many factors including the data being processed |
| 744 |
and the string matching algorithm in use. For example, if many COPY |
| 745 |
instructions have the same data sizes, it may be worth to encode these |
| 746 |
instructions more compactly than others. |
| 747 |
|
| 748 |
The Vcdiff data format is designed so that a decoder does not need to be |
| 749 |
aware of the choices made in encoding algorithms. This is achieved with the |
| 750 |
notion of an "instruction code table" containing 256 entries. Each entry |
| 751 |
defines either a single delta instruction or a pair of instructions that |
| 752 |
have been combined. Note that the code table itself only exists in main |
| 753 |
memory, not in the delta file (unless using an application-defined code |
| 754 |
table, described in Section 7). The encoded data simply includes the index |
| 755 |
of each instruction and, since there are only 256 indices, each index |
| 756 |
can be represented as a single byte. |
| 757 |
|
| 758 |
Each instruction code entry contains six fields, each of which |
| 759 |
is a single byte with unsigned value: |
| 760 |
|
| 761 |
+-----------------------------------------------+ |
| 762 |
| inst1 | size1 | mode1 | inst2 | size2 | mode2 | |
| 763 |
+-----------------------------------------------+ |
| 764 |
|
| 765 |
@@@ could be more compact |
| 766 |
|
| 767 |
Each triple (inst,size,mode) defines a delta instruction. The meanings |
| 768 |
of these fields are as follows: |
| 769 |
|
| 770 |
inst: An "inst" field can have one of the four values: NOOP (0), ADD (1), |
| 771 |
RUN (2) or COPY (3) to indicate the instruction types. NOOP means |
| 772 |
that no instruction is specified. In this case, both the corresponding |
| 773 |
size and mode fields will be zero. |
| 774 |
|
| 775 |
size: A "size" field is zero or positive. A value zero means that the |
| 776 |
size associated with the instruction is encoded separately as |
| 777 |
an integer in the "Instructions and sizes section" (Section 6). |
| 778 |
A positive value for "size" defines the actual data size. |
| 779 |
Note that since the size is restricted to a byte, the maximum |
| 780 |
value for any instruction with size implicitly defined in the code |
| 781 |
table is 255. |
| 782 |
|
| 783 |
mode: A "mode" field is significant only when the associated delta |
| 784 |
instruction is a COPY. It defines the mode used to encode the |
| 785 |
associated addresses. For other instructions, this is always zero. |
| 786 |
|
| 787 |
|
| 788 |
5.5 The Code Table |
| 789 |
|
| 790 |
Following the discussions on address modes and instruction code tables, |
| 791 |
we define a "Code Table" to have the data below: |
| 792 |
|
| 793 |
s_near: the size of the near cache, |
| 794 |
s_same: the size of the same cache, |
| 795 |
i_code: the 256-entry instruction code table. |
| 796 |
|
| 797 |
Vcdiff itself defines a "default code table" in which s_near is 4 |
| 798 |
and s_same is 3. Thus, there are 9 address modes for a COPY instruction. |
| 799 |
The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5 |
| 800 |
are for addresses coded against the near cache. And, modes 6, 7 and 8 |
| 801 |
are for addresses coded against the same cache. |
| 802 |
|
| 803 |
The default instruction code table is depicted below, in a compact |
| 804 |
representation that we use only for descriptive purposes. See section 7 |
| 805 |
for the specification of how an instruction code table is represented |
| 806 |
in the Vcdiff encoding format. In the depiction, a zero value for |
| 807 |
size indicates that the size is separately coded. The mode of non-COPY |
| 808 |
instructions is represented as 0 even though they are not used. |
| 809 |
|
| 810 |
|
| 811 |
TYPE SIZE MODE TYPE SIZE MODE INDEX |
| 812 |
--------------------------------------------------------------- |
| 813 |
1. RUN 0 0 NOOP 0 0 0 |
| 814 |
2. ADD 0, [1,17] 0 NOOP 0 0 [1,18] |
| 815 |
3. COPY 0, [4,18] 0 NOOP 0 0 [19,34] |
| 816 |
4. COPY 0, [4,18] 1 NOOP 0 0 [35,50] |
| 817 |
5. COPY 0, [4,18] 2 NOOP 0 0 [51,66] |
| 818 |
6. COPY 0, [4,18] 3 NOOP 0 0 [67,82] |
| 819 |
7. COPY 0, [4,18] 4 NOOP 0 0 [83,98] |
| 820 |
8. COPY 0, [4,18] 5 NOOP 0 0 [99,114] |
| 821 |
9. COPY 0, [4,18] 6 NOOP 0 0 [115,130] |
| 822 |
10. COPY 0, [4,18] 7 NOOP 0 0 [131,146] |
| 823 |
11. COPY 0, [4,18] 8 NOOP 0 0 [147,162] |
| 824 |
12. ADD [1,4] 0 COPY [4,6] 0 [163,174] |
| 825 |
13. ADD [1,4] 0 COPY [4,6] 1 [175,186] |
| 826 |
14. ADD [1,4] 0 COPY [4,6] 2 [187,198] |
| 827 |
15. ADD [1,4] 0 COPY [4,6] 3 [199,210] |
| 828 |
16. ADD [1,4] 0 COPY [4,6] 4 [211,222] |
| 829 |
17. ADD [1,4] 0 COPY [4,6] 5 [223,234] |
| 830 |
18. ADD [1,4] 0 COPY 4 6 [235,238] |
| 831 |
19. ADD [1,4] 0 COPY 4 7 [239,242] |
| 832 |
20. ADD [1,4] 0 COPY 4 8 [243,246] |
| 833 |
21. COPY 4 [0,8] ADD 1 0 [247,255] |
| 834 |
--------------------------------------------------------------- |
| 835 |
|
| 836 |
In the above depiction, each numbered line represents one or more |
| 837 |
entries in the actual instruction code table (recall that an entry in |
| 838 |
the instruction code table may represent up to two combined delta |
| 839 |
instructions.) The last column ("INDEX") shows which index value or |
| 840 |
range of index values of the entries covered by that line. The notation |
| 841 |
[i,j] means values from i through j, inclusive. The first 6 columns of |
| 842 |
a line in the depiction describe the pairs of instructions used for |
| 843 |
the corresponding index value(s). |
| 844 |
|
| 845 |
If a line in the depiction includes a column entry using the [i,j] |
| 846 |
notation, this means that the line is instantiated for each value |
| 847 |
in the range from i to j, inclusive. The notation "0, [i,j]" means |
| 848 |
that the line is instantiated for the value 0 and for each value |
| 849 |
in the range from i to j, inclusive. |
| 850 |
|
| 851 |
If a line in the depiction includes more than one entry using the [i,j] |
| 852 |
notation, implying a "nested loop" to convert the line to a range of |
| 853 |
table entries, the first such [i,j] range specifies the outer loop, |
| 854 |
and the second specifies the inner loop. |
| 855 |
|
| 856 |
The below examples should make clear the above description: |
| 857 |
|
| 858 |
Line 1 shows the single RUN instruction with index 0. As the size field |
| 859 |
is 0, this RUN instruction always has its actual size encoded separately. |
| 860 |
|
| 861 |
Line 2 shows the 18 single ADD instructions. The ADD instruction with |
| 862 |
size field 0 (i.e., the actual size is coded separately) has index 1. |
| 863 |
ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and |
| 864 |
their sizes are as given (so they will not be separately encoded.) |
| 865 |
|
| 866 |
Following the single ADD instructions are the single COPY instructions |
| 867 |
ordered by their address encoding modes. For example, line 11 shows the |
| 868 |
COPY instructions with mode 8, i.e., the last of the same cache. |
| 869 |
In this case, the COPY instruction with size field 0 has index 147. |
| 870 |
Again, the actual size of this instruction will be coded separately. |
| 871 |
|
| 872 |
Lines 12 to 21 show the pairs of instructions that are combined together. |
| 873 |
For example, line 12 depicts the 12 entries in which an ADD instruction |
| 874 |
is combined with an immediately following COPY instruction. The entries |
| 875 |
with indices 163, 164, 165 represent the pairs in which the ADD |
| 876 |
instructions all have size 1 while the COPY instructions has mode |
| 877 |
0 (VCD_SELF) and sizes 4, 5 and 6 respectively. |
| 878 |
|
| 879 |
The last line, line 21, shows the eight instruction pairs where the first |
| 880 |
instruction is a COPY and the second is an ADD. In this case, all COPY |
| 881 |
instructions have size 4 with mode ranging from 0 to 8 and all the ADD |
| 882 |
instructions have size 1. Thus, the entry with largest index 255 |
| 883 |
combines a COPY instruction of size 4 and mode 8 with an ADD instruction |
| 884 |
of size 1. |
| 885 |
|
| 886 |
The choice of the minimum size 4 for COPY instructions in the default code |
| 887 |
table was made from experiments that showed that excluding small matches |
| 888 |
(less then 4 bytes long) improved the compression rates. |
| 889 |
|
| 890 |
|
| 891 |
6. DECODING A TARGET WINDOW |
| 892 |
|
| 893 |
Section 4.3 discusses that the delta instructions and associated data |
| 894 |
are encoded in three arrays of bytes: |
| 895 |
|
| 896 |
Data section for ADDs and RUNs, |
| 897 |
Instructions and sizes section, and |
| 898 |
Addresses section for COPYs. |
| 899 |
|
| 900 |
|
| 901 |
Further, these data sections may have been further compressed by some |
| 902 |
secondary compressor. Assuming that any such compressed data has been |
| 903 |
decompressed so that we now have three arrays: |
| 904 |
|
| 905 |
inst: bytes coding the instructions and sizes. |
| 906 |
data: unmatched data associated with ADDs and RUNs. |
| 907 |
addr: bytes coding the addresses of COPYs. |
| 908 |
|
| 909 |
These arrays are organized as follows: |
| 910 |
|
| 911 |
inst: |
| 912 |
a sequence of (index, [size1], [size2]) tuples, where "index" |
| 913 |
is an index into the instruction code table, and size1 and size2 |
| 914 |
are integers that MAY or MAY NOT be included in the tuple as |
| 915 |
follows. The entry with the given "index" in the instruction |
| 916 |
code table potentially defines two delta instructions. If the |
| 917 |
first delta instruction is not a VCD_NOOP and its size is zero, |
| 918 |
then size1 MUST be present. Otherwise, size1 MUST be omitted and |
| 919 |
the size of the instruction (if it is not VCD_NOOP) is as defined |
| 920 |
in the table. The presence or absence of size2 is defined |
| 921 |
similarly with respect to the second delta instruction. |
| 922 |
|
| 923 |
data: |
| 924 |
a sequence of data values, encoded as bytes. |
| 925 |
|
| 926 |
addr: |
| 927 |
a sequence of address values. Addresses are normally encoded as |
| 928 |
integers as described in Section 2 (i.e., base 128). |
| 929 |
Since the same cache emits addresses in the range [0,255], |
| 930 |
however, same cache addresses are always encoded as a |
| 931 |
single byte. |
| 932 |
|
| 933 |
To summarize, each tuple in the "inst" array includes an index to some |
| 934 |
entry in the instruction code table that determines: |
| 935 |
|
| 936 |
a. Whether one or two instructions were encoded and their types. |
| 937 |
|
| 938 |
b. If the instructions have their sizes encoded separately, these |
| 939 |
sizes will follow, in order, in the tuple. |
| 940 |
|
| 941 |
c. If the instructions have accompanying data, i.e., ADDs or RUNs, |
| 942 |
their data will be in the array "data". |
| 943 |
|
| 944 |
d. Similarly, if the instructions are COPYs, the coded addresses are |
| 945 |
found in the array "addr". |
| 946 |
|
| 947 |
The decoding procedure simply processes the arrays by reading one code |
| 948 |
index at a time, looking up the corresponding instruction code entry, |
| 949 |
then consuming the respective sizes, data and addresses following the |
| 950 |
directions in this entry. In other words, the decoder maintains an implicit |
| 951 |
next-element pointer for each array; "consuming" an instruction tuple, |
| 952 |
data, or address value implies incrementing the associated pointer. |
| 953 |
|
| 954 |
For example, if during the processing of the target window, the next |
| 955 |
unconsumed tuple in the inst array has index value 19, then the first |
| 956 |
instruction is a COPY, whose size is found as the immediately following |
| 957 |
integer in the inst array. Since the mode of this COPY instruction is |
| 958 |
VCD_SELF, the corresponding address is found by consuming the next |
| 959 |
integer in the addr array. The data array is left intact. As the second |
| 960 |
instruction for code index 19 is a NOOP, this tuple is finished. |
| 961 |
|
| 962 |
|
| 963 |
7. APPLICATION-DEFINED CODE TABLES |
| 964 |
|
| 965 |
Although the default code table used in Vcdiff is good for general |
| 966 |
purpose encoders, there are times when other code tables may perform |
| 967 |
better. For example, to code a file with many identical segments of data, |
| 968 |
it may be advantageous to have a COPY instruction with the specific size |
| 969 |
of these data segments so that the instruction can be encoded in a single |
| 970 |
byte. Such a special code table MUST then be encoded in the delta file |
| 971 |
so that the decoder can reconstruct it before decoding the data. |
| 972 |
|
| 973 |
Vcdiff allows an application-defined code table to be specified |
| 974 |
in a delta file with the following data: |
| 975 |
|
| 976 |
Size of near cache - byte |
| 977 |
Size of same cache - byte |
| 978 |
Compressed code table data |
| 979 |
|
| 980 |
The "compressed code table data" encodes the delta between the default |
| 981 |
code table (source) and the new code table (target) in the same manner as |
| 982 |
described in Section 4.3 for encoding a target window in terms of a |
| 983 |
source window. This delta is computed using the following steps: |
| 984 |
|
| 985 |
a. Convert the new instruction code table into a string, "code", of |
| 986 |
1536 bytes using the below steps in order: |
| 987 |
|
| 988 |
i. Add in order the 256 bytes representing the types of the first |
| 989 |
instructions in the instruction pairs. |
| 990 |
ii. Add in order the 256 bytes representing the types of the second |
| 991 |
instructions in the instruction pairs. |
| 992 |
iii. Add in order the 256 bytes representing the sizes of the first |
| 993 |
instructions in the instruction pairs. |
| 994 |
iv. Add in order the 256 bytes representing the sizes of the second |
| 995 |
instructions in the instruction pairs. |
| 996 |
v. Add in order the 256 bytes representing the modes of the first |
| 997 |
instructions in the instruction pairs. |
| 998 |
vi. Add in order the 256 bytes representing the modes of the second |
| 999 |
instructions in the instruction pairs. |
| 1000 |
|
| 1001 |
b. Similarly, convert the default instruction code table into |
| 1002 |
a string "dflt". |
| 1003 |
|
| 1004 |
c. Treat the string "code" as a target window and "dflt" as the |
| 1005 |
corresponding source data and apply an encoding algorithm to |
| 1006 |
compute the delta encoding of "code" in terms of "dflt". |
| 1007 |
This computation MUST use the default code table for encoding |
| 1008 |
the delta instructions. |
| 1009 |
|
| 1010 |
The decoder can then reverse the above steps to decode the compressed |
| 1011 |
table data using the method of Section 6, employing the default code |
| 1012 |
table, to generate the new code table. Note that the decoder does not |
| 1013 |
need to know anything about the details of the encoding algorithm used |
| 1014 |
in step (c). The decoder is still able to decode the new code table |
| 1015 |
because the Vcdiff format is independent from the choice of encoding |
| 1016 |
algorithm, and because the encoder in step (c) uses the known, default |
| 1017 |
code table. |
| 1018 |
|
| 1019 |
|
| 1020 |
8. PERFORMANCE |
| 1021 |
|
| 1022 |
The encoding format is compact. For compression only, using the LZ-77 |
| 1023 |
string parsing strategy and without any secondary compressors, the typical |
| 1024 |
compression rate is better than Unix compress and close to gzip. For |
| 1025 |
differencing, the data format is better than all known methods in |
| 1026 |
terms of its stated goal, which is primarily decoding speed and |
| 1027 |
encoding efficiency. |
| 1028 |
|
| 1029 |
We compare the performance of compress, gzip and Vcdiff using the |
| 1030 |
archives of three versions of the Gnu C compiler, gcc-2.95.1.tar, |
| 1031 |
gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an |
| 1032 |
SGI-MIPS3, 400MHZ. Gzip was used at its default compression level. |
| 1033 |
Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13). |
| 1034 |
As string and window matching typically dominates the computation during |
| 1035 |
compression, the Vcdiff compression times were directly due to the |
| 1036 |
algorithms used in the Vcodex/Vcdiff software. However, the decompression |
| 1037 |
times should be generic and representative of any good implementation |
| 1038 |
of the Vcdiff data format. Timing was done by running each program |
| 1039 |
three times and taking the average of the total cpu+system times. |
| 1040 |
|
| 1041 |
Below are the different Vcdiff runs: |
| 1042 |
|
| 1043 |
Vcdiff: vcdiff is used as compressor only. |
| 1044 |
|
| 1045 |
Vcdiff-d: vcdiff is used as a differencer only. That is, it only |
| 1046 |
compares target data against source data. Since the files |
| 1047 |
involved are large, they are broken into windows. In this |
| 1048 |
case, each target window starting at some file offset in |
| 1049 |
the target file is compared against a source window with |
| 1050 |
the same file offset (in the source file). The source |
| 1051 |
window is also slightly larger than the target window |
| 1052 |
to increase matching opportunities. The -d option also gives |
| 1053 |
a hint to the string matching algorithm of Vcdiff that |
| 1054 |
the two files are very similar with long stretches of matches. |
| 1055 |
The algorithm takes advantage of this to minimize its |
| 1056 |
processing of source data and save time. |
| 1057 |
|
| 1058 |
Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare |
| 1059 |
target data against target data as applicable. Thus, vcdiff |
| 1060 |
both computes differences and compresses data. The windowing |
| 1061 |
algorithm is the same as above. However, the above hint is |
| 1062 |
recinded in this case. |
| 1063 |
|
| 1064 |
Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm |
| 1065 |
uses a content-based heuristic to select source data segments |
| 1066 |
that are more likely to match with a given target window. |
| 1067 |
Thus, the source data segment selected for a target window |
| 1068 |
often will not be aligned with the file offsets of this |
| 1069 |
target window. |
| 1070 |
|
| 1071 |
|
| 1072 |
gcc-2.95.1 gcc-2.95.2 compression decompression |
| 1073 |
raw size 55746560 55797760 |
| 1074 |
compress - 19939390 13.85s 7.09s |
| 1075 |
gzip - 12973443 42.99s 5.35s |
| 1076 |
Vcdiff - 15358786 20.04s 4.65s |
| 1077 |
Vcdiff-d - 100971 10.93s 1.92s |
| 1078 |
Vcdiff-dc - 97246 20.03s 1.84s |
| 1079 |
Vcdiff-dcs - 256445 44.81s 1.84s |
| 1080 |
|
| 1081 |
TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1 |
| 1082 |
|
| 1083 |
|
| 1084 |
TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the |
| 1085 |
sizes of the compressed results. As a pure compressor, the compression |
| 1086 |
rate for Vcdiff is worse than gzip and better than compress. The last |
| 1087 |
three rows shows that when two file versions are very similar, differencing |
| 1088 |
can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use |
| 1089 |
the same simple window selection method but Vcdiff-dc also does compression |
| 1090 |
so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on |
| 1091 |
data content to search for source data that likely will match a given target |
| 1092 |
window. Although it does a good job, the heuristic did not always find the |
| 1093 |
best matches which are given by the simple algorithm of Vcdiff-d. As a |
| 1094 |
result, the output size is slightly larger. Note also that there is a large |
| 1095 |
cost in computing matching windows this way. Finally, the compression times |
| 1096 |
of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude |
| 1097 |
that the compression feature causes the additional time in Vcdiff-dc |
| 1098 |
relative to Vcdiff-d. However, this is not the case. The hint given to |
| 1099 |
the Vcdiff string matching algorithm that the two files are likely to |
| 1100 |
have very long stretches of matches helps the algorithm to minimize |
| 1101 |
processing of the "source data", thus saving half the time. However, as we |
| 1102 |
shall see below when this hint is wrong, the result is even longer time. |
| 1103 |
|
| 1104 |
|
| 1105 |
gcc-2.95.2 gcc-2.95.3 compression decompression |
| 1106 |
raw size 55797760 55787520 |
| 1107 |
compress - 19939453 13.54s 7.00s |
| 1108 |
gzip - 12998097 42.63s 5.62s |
| 1109 |
Vcdiff - 15371737 20.09s 4.74s |
| 1110 |
Vcdiff-d - 26383849 71.41s 6.41s |
| 1111 |
Vcdiff-dc - 14461203 42.48s 4.82s |
| 1112 |
Vcdiff-dcs - 1248543 61.18s 1.99s |
| 1113 |
|
| 1114 |
TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2 |
| 1115 |
|
| 1116 |
|
| 1117 |
TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and |
| 1118 |
the sizes of the compressed results. In this case, the tar file of |
| 1119 |
gcc-2.95.3 is rearranged in a way that makes the straightforward method |
| 1120 |
of matching file offsets for source and target windows fail. As a |
| 1121 |
result, Vcdiff-d performs rather dismally both in time and output size. |
| 1122 |
The large time for Vcdiff-d is directly due to fact that the string |
| 1123 |
matching algorithm has to work much harder to find matches when the hint |
| 1124 |
that two files have long matching stretches fails to hold. On the other |
| 1125 |
hand, Vcdiff-dc does both differencing and compression resulting in good |
| 1126 |
output size. Finally, the window searching heuristic used in Vcdiff-dcs is |
| 1127 |
effective in finding the right matching source windows for target windows |
| 1128 |
resulting a small output size. This shows why the data format needs to |
| 1129 |
have a way to specify matching windows to gain performance. Finally, |
| 1130 |
we note that the decoding times are always good regardless of how |
| 1131 |
the string matching or window searching algorithms perform. |
| 1132 |
|
| 1133 |
|
| 1134 |
9. FURTHER ISSUES |
| 1135 |
|
| 1136 |
This document does not address a few issues: |
| 1137 |
|
| 1138 |
Secondary compressors: |
| 1139 |
As discussed in Section 4.3, certain sections in the delta encoding |
| 1140 |
of a window may be further compressed by a secondary compressor. |
| 1141 |
In our experience, the basic Vcdiff format is adequate for most |
| 1142 |
purposes so that secondary compressors are seldom needed. In |
| 1143 |
particular, for normal use of data differencing where the files to |
| 1144 |
be compared have long stretches of matches, much of the gain in |
| 1145 |
compression rate is already achieved by normal string matching. |
| 1146 |
Thus, the use of secondary compressors is seldom needed in this case. |
| 1147 |
However, for applications beyond differencing of such nearly identical |
| 1148 |
files, secondary compressors may be needed to achieve maximal |
| 1149 |
compressed results. |
| 1150 |
|
| 1151 |
Therefore, we recommend to leave the Vcdiff data format defined |
| 1152 |
as in this document so that the use of secondary compressors |
| 1153 |
can be implemented when they become needed in the future. |
| 1154 |
The formats of the compressed data via such compressors or any |
| 1155 |
compressors that may be defined in the future are left open to |
| 1156 |
their implementations. These could include Huffman encoding, |
| 1157 |
arithmetic encoding, and splay tree encoding [8,9]. |
| 1158 |
|
| 1159 |
Large file system vs. small file system: |
| 1160 |
As discussed in Section 4, a target window in a large file may be |
| 1161 |
compared against some source window in another file or in the same |
| 1162 |
file (from some earlier part). In that case, the file offset of the |
| 1163 |
source window is specified as a variable-sized integer in the delta |
| 1164 |
encoding. There is a possibility that the encoding was computed on |
| 1165 |
a system supporting much larger files than in a system where |
| 1166 |
the data may be decoded (e.g., 64-bit file systems vs. 32-bit file |
| 1167 |
systems). In that case, some target data may not be recoverable. |
| 1168 |
This problem could afflict any compression format, and ought |
| 1169 |
to be resolved with a generic negotiation mechanism in the |
| 1170 |
appropriate protocol(s). |
| 1171 |
|
| 1172 |
|
| 1173 |
10. SUMMARY |
| 1174 |
|
| 1175 |
We have described Vcdiff, a general and portable encoding format for |
| 1176 |
compression and differencing. The format is good in that it allows |
| 1177 |
implementing a decoder without knowledge of the encoders. Further, |
| 1178 |
ignoring the use of secondary compressors not defined within the format, |
| 1179 |
the decoding algorithms runs in linear time and requires working space |
| 1180 |
proportional to window sizes. |
| 1181 |
|
| 1182 |
|
| 1183 |
|
| 1184 |
11. ACKNOWLEDGEMENTS |
| 1185 |
|
| 1186 |
Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff |
| 1187 |
who provided much encouragement to publicize Vcdiff. In particular, Jeff |
| 1188 |
helped clarifying the description of the data format presented here. |
| 1189 |
|
| 1190 |
|
| 1191 |
|
| 1192 |
12. SECURITY CONSIDERATIONS |
| 1193 |
|
| 1194 |
Vcdiff only provides a format to encode compressed and differenced data. |
| 1195 |
It does not address any issues concerning how such data are, in fact, |
| 1196 |
stored in a given file system or the run-time memory of a computer system. |
| 1197 |
Therefore, we do not anticipate any security issues with respect to Vcdiff. |
| 1198 |
|
| 1199 |
|
| 1200 |
|
| 1201 |
13. SOURCE CODE AVAILABILITY |
| 1202 |
|
| 1203 |
Vcdiff is implemented as a data transforming method in Phong Vo's |
| 1204 |
Vcodex library. AT&T Corp. has made the source code for Vcodex available |
| 1205 |
for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11]. |
| 1206 |
The source code and according license is accessible at the below URL: |
| 1207 |
|
| 1208 |
http://www.research.att.com/sw/tools |
| 1209 |
|
| 1210 |
|
| 1211 |
14. INTELLECTUAL PROPERTY RIGHTS |
| 1212 |
|
| 1213 |
The IETF has been notified of intellectual property rights claimed in |
| 1214 |
regard to some or all of the specification contained in this |
| 1215 |
document. For more information consult the online list of claimed |
| 1216 |
rights, at <http://www.ietf.org/ipr.html>. |
| 1217 |
|
| 1218 |
The IETF takes no position regarding the validity or scope of any |
| 1219 |
intellectual property or other rights that might be claimed to |
| 1220 |
pertain to the implementation or use of the technology described in |
| 1221 |
this document or the extent to which any license under such rights |
| 1222 |
might or might not be available; neither does it represent that it |
| 1223 |
has made any effort to identify any such rights. Information on the |
| 1224 |
IETF's procedures with respect to rights in standards-track and |
| 1225 |
standards-related documentation can be found in BCP-11. Copies of |
| 1226 |
claims of rights made available for publication and any assurances of |
| 1227 |
licenses to be made available, or the result of an attempt made to |
| 1228 |
obtain a general license or permission for the use of such |
| 1229 |
proprietary rights by implementors or users of this specification can |
| 1230 |
be obtained from the IETF Secretariat. |
| 1231 |
|
| 1232 |
|
| 1233 |
|
| 1234 |
15. IANA CONSIDERATIONS |
| 1235 |
|
| 1236 |
The Internet Assigned Numbers Authority (IANA) administers the number |
| 1237 |
space for Secondary Compressor ID values. Values and their meaning |
| 1238 |
must be documented in an RFC or other peer-reviewed, permanent, and |
| 1239 |
readily available reference, in sufficient detail so that |
| 1240 |
interoperability between independent implementations is possible. |
| 1241 |
Subject to these constraints, name assignments are First Come, First |
| 1242 |
Served - see RFC2434 [13]. Legal ID values are in the range 1..255. |
| 1243 |
|
| 1244 |
This document does not define any values in this number space. |
| 1245 |
|
| 1246 |
|
| 1247 |
16. REFERENCES |
| 1248 |
|
| 1249 |
[1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, |
| 1250 |
Practical Reusable Unix Software, Editor B. Krishnamurthy, |
| 1251 |
John Wiley & Sons, Inc., 1995. |
| 1252 |
|
| 1253 |
[2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data |
| 1254 |
Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977. |
| 1255 |
|
| 1256 |
[3] W. Tichy, The String-to-String Correction Problem with Block Moves, |
| 1257 |
ACM Transactions on Computer Systems, 2(4):309-321, November 1984. |
| 1258 |
|
| 1259 |
[4] E.M. McCreight, A Space-Economical Suffix Tree Construction |
| 1260 |
Algorithm, Journal of the ACM, 23:262-272, 1976. |
| 1261 |
|
| 1262 |
[5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, |
| 1263 |
IEEE Software Configuration and Maintenance Workshop, 1996. |
| 1264 |
|
| 1265 |
[6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis, |
| 1266 |
ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998. |
| 1267 |
|
| 1268 |
[7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, |
| 1269 |
Proc. of the Summer '91 Usenix Conference, 1991. |
| 1270 |
|
| 1271 |
[8] D. W. Jones, Application of Splay Trees to Data Compression, |
| 1272 |
CACM, 31(8):996:1007. |
| 1273 |
|
| 1274 |
[9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1, |
| 1275 |
M&T Books, New York, NY, 1995. |
| 1276 |
|
| 1277 |
[10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, |
| 1278 |
Potential benefits of delta encoding and data compression for HTTP, |
| 1279 |
SIGCOMM '97, Cannes, France, 1997. |
| 1280 |
|
| 1281 |
[11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, |
| 1282 |
Y. Goland, and A. Van Hoff, Delta Encoding in HTTP, |
| 1283 |
IETF, draft-mogul-http-delta-10, 2001. |
| 1284 |
|
| 1285 |
[12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, |
| 1286 |
RFC 2119, March 1997. |
| 1287 |
|
| 1288 |
[13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA |
| 1289 |
Considerations Section in RFCs, RFC2434, October 1998. |
| 1290 |
|
| 1291 |
|
| 1292 |
|
| 1293 |
17. AUTHOR'S ADDRESS |
| 1294 |
|
| 1295 |
Kiem-Phong Vo (main contact) |
| 1296 |
AT&T Labs, Room D223 |
| 1297 |
180 Park Avenue |
| 1298 |
Florham Park, NJ 07932 |
| 1299 |
Email: kpv@research.att.com |
| 1300 |
Phone: 1 973 360 8630 |
| 1301 |
|
| 1302 |
David G. Korn |
| 1303 |
AT&T Labs, Room D237 |
| 1304 |
180 Park Avenue |
| 1305 |
Florham Park, NJ 07932 |
| 1306 |
Email: dgk@research.att.com |
| 1307 |
Phone: 1 973 360 8602 |
| 1308 |
|
| 1309 |
Jeffrey C. Mogul |
| 1310 |
Western Research Laboratory |
| 1311 |
Compaq Computer Corporation |
| 1312 |
250 University Avenue |
| 1313 |
Palo Alto, California, 94305, U.S.A. |
| 1314 |
Email: JeffMogul@acm.org |
| 1315 |
Phone: 1 650 617 3304 (email preferred) |
| 1316 |
|
| 1317 |
Joshua P. MacDonald |
| 1318 |
Computer Science Division |
| 1319 |
University of California, Berkeley |
| 1320 |
345 Soda Hall |
| 1321 |
Berkeley, CA 94720 |
| 1322 |
Email: jmacd@cs.berkeley.edu |