| 1 | David G. Korn, AT&T Labs | 
 
 
 
 
 | 2 | Joshua P. MacDonald, UC Berkeley | 
 
 
 
 
 | 3 | Jeffrey C. Mogul, Compaq WRL | 
 
 
 
 
 | 4 | Internet-Draft                                       Kiem-Phong Vo, AT&T Labs | 
 
 
 
 
 | 5 | Expires: 09 November 2002                                    09 November 2001 | 
 
 
 
 
 | 6 |  | 
 
 
 
 
 | 7 |  | 
 
 
 
 
 | 8 | The VCDIFF Generic Differencing and Compression Data Format | 
 
 
 
 
 | 9 |  | 
 
 
 
 
 | 10 | draft-korn-vcdiff-06.txt | 
 
 
 
 
 | 11 |  | 
 
 
 
 
 | 12 |  | 
 
 
 
 
 | 13 |  | 
 
 
 
 
 | 14 | Status of this Memo | 
 
 
 
 
 | 15 |  | 
 
 
 
 
 | 16 | This document is an Internet-Draft and is in full conformance | 
 
 
 
 
 | 17 | with all provisions of Section 10 of RFC2026. | 
 
 
 
 
 | 18 |  | 
 
 
 
 
 | 19 | Internet-Drafts are working documents of the Internet Engineering | 
 
 
 
 
 | 20 | Task Force (IETF), its areas, and its working groups.  Note that | 
 
 
 
 
 | 21 | other groups may also distribute working documents as | 
 
 
 
 
 | 22 | Internet-Drafts. | 
 
 
 
 
 | 23 |  | 
 
 
 
 
 | 24 | Internet-Drafts are draft documents valid for a maximum of six | 
 
 
 
 
 | 25 | months and may be updated, replaced, or obsoleted by other | 
 
 
 
 
 | 26 | documents at any time.  It is inappropriate to use Internet- | 
 
 
 
 
 | 27 | Drafts as reference material or to cite them other than as | 
 
 
 
 
 | 28 | "work in progress." | 
 
 
 
 
 | 29 |  | 
 
 
 
 
 | 30 | The list of current Internet-Drafts can be accessed at | 
 
 
 
 
 | 31 | http://www.ietf.org/ietf/1id-abstracts.txt | 
 
 
 
 
 | 32 |  | 
 
 
 
 
 | 33 | The list of Internet-Draft Shadow Directories can be accessed at | 
 
 
 
 
 | 34 | http://www.ietf.org/shadow.html. | 
 
 
 
 
 | 35 |  | 
 
 
 
 
 | 36 |  | 
 
 
 
 
 | 37 | Abstract | 
 
 
 
 
 | 38 |  | 
 
 
 
 
 | 39 | This memo describes a general, efficient and portable data format | 
 
 
 
 
 | 40 | suitable for encoding compressed and/or differencing data so that | 
 
 
 
 
 | 41 | they can be easily transported among computers. | 
 
 
 
 
 | 42 |  | 
 
 
 
 
 | 43 |  | 
 
 
 
 
 | 44 | Table of Contents: | 
 
 
 
 
 | 45 |  | 
 
 
 
 
 | 46 | 1.  EXECUTIVE SUMMARY ............................................  2 | 
 
 
 
 
 | 47 | 2.  CONVENTIONS ..................................................  3 | 
 
 
 
 
 | 48 | 3.  DELTA INSTRUCTIONS ...........................................  4 | 
 
 
 
 
 | 49 | 4.  DELTA FILE ORGANIZATION ......................................  5 | 
 
 
 
 
 | 50 | 5.  DELTA INSTRUCTION ENCODING ...................................  9 | 
 
 
 
 
 | 51 | 6.  DECODING A TARGET WINDOW ..................................... 14 | 
 
 
 
 
 | 52 | 7.  APPLICATION-DEFINED CODE TABLES .............................. 16 | 
 
 
 
 
 | 53 | 8.  PERFORMANCE .................................................. 16 | 
 
 
 
 
 | 54 | 9.  FURTHER ISSUES ............................................... 17 | 
 
 
 
 
 | 55 | 10.  SUMMARY ...................................................... 18 | 
 
 
 
 
 | 56 | 11.  ACKNOWLEDGEMENTS ............................................. 18 | 
 
 
 
 
 | 57 | 12.  SECURITY CONSIDERATIONS ...................................... 18 | 
 
 
 
 
 | 58 | 13.  SOURCE CODE AVAILABILITY ..................................... 18 | 
 
 
 
 
 | 59 | 14.  INTELLECTUAL PROPERTY RIGHTS ................................. 18 | 
 
 
 
 
 | 60 | 15.  IANA CONSIDERATIONS .......................................... 19 | 
 
 
 
 
 | 61 | 16.  REFERENCES ................................................... 19 | 
 
 
 
 
 | 62 | 17.  AUTHOR'S ADDRESS ............................................. 20 | 
 
 
 
 
 | 63 |  | 
 
 
 
 
 | 64 |  | 
 
 
 
 
 | 65 | 1.  EXECUTIVE SUMMARY | 
 
 
 
 
 | 66 |  | 
 
 
 
 
 | 67 | Compression and differencing techniques can greatly improve storage | 
 
 
 
 
 | 68 | and transmission of files and file versions.  Since files are often | 
 
 
 
 
 | 69 | transported across machines with distinct architectures and performance | 
 
 
 
 
 | 70 | characteristics, such data should be encoded in a form that is portable | 
 
 
 
 
 | 71 | and can be decoded with little or no knowledge of the encoders. | 
 
 
 
 
 | 72 | This document describes Vcdiff, a compact portable encoding format | 
 
 
 
 
 | 73 | designed for these purposes. | 
 
 
 
 
 | 74 |  | 
 
 
 
 
 | 75 | Data differencing is the process of computing a compact and invertible | 
 
 
 
 
 | 76 | encoding of a "target file" given a "source file".  Data compression | 
 
 
 
 
 | 77 | is similar but without the use of source data.  The UNIX utilities diff, | 
 
 
 
 
 | 78 | compress, and gzip are well-known examples of data differencing and | 
 
 
 
 
 | 79 | compression tools.  For data differencing, the computed encoding is | 
 
 
 
 
 | 80 | called a "delta file", and, for data compression, it is called | 
 
 
 
 
 | 81 | a "compressed file".  Delta and compressed files are good for storage | 
 
 
 
 
 | 82 | and transmission as they are often smaller than the originals. | 
 
 
 
 
 | 83 |  | 
 
 
 
 
 | 84 | Data differencing and data compression are traditionally treated | 
 
 
 
 
 | 85 | as distinct types of data processing.  However, as shown in the Vdelta | 
 
 
 
 
 | 86 | technique by Korn and Vo [1], compression can be thought of as a special | 
 
 
 
 
 | 87 | case of differencing in which the source data is empty. The basic idea | 
 
 
 
 
 | 88 | is to unify the string parsing scheme used in the Lempel-Ziv'77 style | 
 
 
 
 
 | 89 | compressors [2], and the block-move technique of Tichy [3].  Loosely | 
 
 
 
 
 | 90 | speaking, this works as follows: | 
 
 
 
 
 | 91 |  | 
 
 
 
 
 | 92 | a. Concatenate source and target data. | 
 
 
 
 
 | 93 | b. Parse the data from left to right as in LZ'77 but | 
 
 
 
 
 | 94 | make sure that a parsed segment starts the target data. | 
 
 
 
 
 | 95 | c. Start to output when reaching target data. | 
 
 
 
 
 | 96 |  | 
 
 
 
 
 | 97 | Parsing is based on string matching algorithms such as suffix trees [4] | 
 
 
 
 
 | 98 | or hashing with different time and space performance characteristics. | 
 
 
 
 
 | 99 | Vdelta uses a fast string matching algorithm that requires less memory | 
 
 
 
 
 | 100 | than other techniques [5,6].  However, even with this algorithm, the | 
 
 
 
 
 | 101 | memory requirement can still be prohibitive for large files.  A common | 
 
 
 
 
 | 102 | way to deal with memory limitation is to partition an input file into | 
 
 
 
 
 | 103 | chunks called "windows" and process them separately. Here, except for | 
 
 
 
 
 | 104 | unpublished work by Vo, little has been done on designing effective | 
 
 
 
 
 | 105 | windowing schemes. Current techniques, including Vdelta, simply use | 
 
 
 
 
 | 106 | source and target windows with corresponding addresses across source | 
 
 
 
 
 | 107 | and target files. | 
 
 
 
 
 | 108 |  | 
 
 
 
 
 | 109 | String matching and windowing algorithms have large influence on the | 
 
 
 
 
 | 110 | compression rate of delta and compressed files. However, it is desirable | 
 
 
 
 
 | 111 | to have a portable encoding format that is independent of such algorithms. | 
 
 
 
 
 | 112 | This enables construction of client-server applications in which a server | 
 
 
 
 
 | 113 | may serve clients with unknown computing characteristics.  Unfortunately, | 
 
 
 
 
 | 114 | all current differencing and compressing tools, including Vdelta, fall | 
 
 
 
 
 | 115 | short in this respect. Their storage formats are closely intertwined | 
 
 
 
 
 | 116 | with the implemented string matching and/or windowing algorithms. | 
 
 
 
 
 | 117 |  | 
 
 
 
 
 | 118 | The encoding format Vcdiff proposed here addresses the above issues. | 
 
 
 
 
 | 119 | Vcdiff achieves the below characteristics: | 
 
 
 
 
 | 120 |  | 
 
 
 
 
 | 121 | Output compactness: | 
 
 
 
 
 | 122 | The basic encoding format compactly represents compressed or delta | 
 
 
 
 
 | 123 | files. Applications can further extend the basic encoding format | 
 
 
 
 
 | 124 | with "secondary encoders" to achieve more compression. | 
 
 
 
 
 | 125 |  | 
 
 
 
 
 | 126 | Data portability: | 
 
 
 
 
 | 127 | The basic encoding format is free from machine byte order and | 
 
 
 
 
 | 128 | word size issues. This allows data to be encoded on one machine | 
 
 
 
 
 | 129 | and decoded on a different machine with different architecture. | 
 
 
 
 
 | 130 |  | 
 
 
 
 
 | 131 | Algorithm genericity: | 
 
 
 
 
 | 132 | The decoding algorithm is independent from string matching and | 
 
 
 
 
 | 133 | windowing algorithms. This allows competition among implementations | 
 
 
 
 
 | 134 | of the encoder while keeping the same decoder. | 
 
 
 
 
 | 135 |  | 
 
 
 
 
 | 136 | Decoding efficiency: | 
 
 
 
 
 | 137 | Except for secondary encoder issues, the decoding algorithm runs | 
 
 
 
 
 | 138 | in time proportional to the size of the target file and uses space | 
 
 
 
 
 | 139 | proportional to the maximal window size.  Vcdiff differs from more | 
 
 
 
 
 | 140 | conventional compressors in that it uses only byte-aligned | 
 
 
 
 
 | 141 | data, thus avoiding bit-level operations, which improves | 
 
 
 
 
 | 142 | decoding speed at the slight cost of compression efficiency. | 
 
 
 
 
 | 143 |  | 
 
 
 
 
 | 144 | The Vcdiff data format and the algorithms for decoding data shall be | 
 
 
 
 
 | 145 | described next.  Since Vcdiff treats compression as a special case of | 
 
 
 
 
 | 146 | differencing, we shall use the term "delta file" to indicate the | 
 
 
 
 
 | 147 | compressed output for both cases. | 
 
 
 
 
 | 148 |  | 
 
 
 
 
 | 149 |  | 
 
 
 
 
 | 150 | 2. CONVENTIONS | 
 
 
 
 
 | 151 |  | 
 
 
 
 
 | 152 | The basic data unit is a byte.  For portability, Vcdiff shall limit | 
 
 
 
 
 | 153 | a byte to its lower eight bits even on machines with larger bytes. | 
 
 
 
 
 | 154 | The bits in a byte are ordered from right to left so that the least | 
 
 
 
 
 | 155 | significant bit (LSB) has value 1, and the most significant bit (MSB), | 
 
 
 
 
 | 156 | has value 128. | 
 
 
 
 
 | 157 |  | 
 
 
 
 
 | 158 | For purposes of exposition in this document, we adopt the convention | 
 
 
 
 
 | 159 | that the LSB is numbered 0, and the MSB is numbered 7.  Bit numbers | 
 
 
 
 
 | 160 | never appear in the encoded format itself. | 
 
 
 
 
 | 161 |  | 
 
 
 
 
 | 162 | Vcdiff encodes unsigned integer values using a portable variable-sized | 
 
 
 
 
 | 163 | format (originally introduced in the Sfio library [7]). This encoding | 
 
 
 
 
 | 164 | treats an integer as a number in base 128. Then, each digit in this | 
 
 
 
 
 | 165 | representation is encoded in the lower seven bits of a byte. Except for | 
 
 
 
 
 | 166 | the least significant byte, other bytes have their most significant bit | 
 
 
 
 
 | 167 | turned on to indicate that there are still more digits in the encoding. | 
 
 
 
 
 | 168 | The two key properties of this integer encoding that are beneficial | 
 
 
 
 
 | 169 | to a data compression format are: | 
 
 
 
 
 | 170 |  | 
 
 
 
 
 | 171 | a. The encoding is portable among systems using 8-bit bytes, and | 
 
 
 
 
 | 172 | b. Small values are encoded compactly. | 
 
 
 
 
 | 173 |  | 
 
 
 
 
 | 174 | For example, consider the value 123456789 which can be represented with | 
 
 
 
 
 | 175 | four 7-bit digits whose values are 58, 111, 26, 21 in order from most | 
 
 
 
 
 | 176 | to least significant. Below is the 8-bit byte encoding of these digits. | 
 
 
 
 
 | 177 | Note that the MSBs of 58, 111 and 26 are on. | 
 
 
 
 
 | 178 |  | 
 
 
 
 
 | 179 | +-------------------------------------------+ | 
 
 
 
 
 | 180 | | 10111010 | 11101111 | 10011010 | 00010101 | | 
 
 
 
 
 | 181 | +-------------------------------------------+ | 
 
 
 
 
 | 182 | MSB+58     MSB+111    MSB+26     0+21 | 
 
 
 
 
 | 183 |  | 
 
 
 
 
 | 184 |  | 
 
 
 
 
 | 185 | Henceforth, the terms "byte" and "integer" will refer to a byte and an | 
 
 
 
 
 | 186 | unsigned integer as described. | 
 
 
 
 
 | 187 |  | 
 
 
 
 
 | 188 |  | 
 
 
 
 
 | 189 | From time to time, algorithms are exhibited to clarify the descriptions | 
 
 
 
 
 | 190 | of parts of the Vcdiff format. On such occasions, the C language will be | 
 
 
 
 
 | 191 | used to make precise the algorithms.  The C code shown in this | 
 
 
 
 
 | 192 | document is meant for clarification only, and is not part of the | 
 
 
 
 
 | 193 | actual specification of the Vcdiff format. | 
 
 
 
 
 | 194 |  | 
 
 
 
 
 | 195 | In this specification, the key words "MUST", "MUST NOT", | 
 
 
 
 
 | 196 | "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as | 
 
 
 
 
 | 197 | described in RFC2119 [12]. | 
 
 
 
 
 | 198 |  | 
 
 
 
 
 | 199 |  | 
 
 
 
 
 | 200 | 3.  DELTA INSTRUCTIONS | 
 
 
 
 
 | 201 |  | 
 
 
 
 
 | 202 | A large target file is partitioned into non-overlapping sections | 
 
 
 
 
 | 203 | called "target windows". These target windows are processed separately | 
 
 
 
 
 | 204 | and sequentially based on their order in the target file. | 
 
 
 
 
 | 205 |  | 
 
 
 
 
 | 206 | A target window T of length t may be compared against some source data | 
 
 
 
 
 | 207 | segment S of length s. By construction, this source data segment S | 
 
 
 
 
 | 208 | comes either from the source file, if one is used, or from a part of | 
 
 
 
 
 | 209 | the target file earlier than T.  In this way, during decoding, S is | 
 
 
 
 
 | 210 | completely known when T is being decoded. | 
 
 
 
 
 | 211 |  | 
 
 
 
 
 | 212 | The choices of T, t, S and s are made by some window selection algorithm | 
 
 
 
 
 | 213 | which can greatly affect the size of the encoding. However, as seen later, | 
 
 
 
 
 | 214 | these choices are encoded so that no knowledge of the window selection | 
 
 
 
 
 | 215 | algorithm is needed during decoding. | 
 
 
 
 
 | 216 |  | 
 
 
 
 
 | 217 | Assume that S[j] represents the jth byte in S, and T[k] represents | 
 
 
 
 
 | 218 | the kth byte in T.  Then, for the delta instructions, we treat the data | 
 
 
 
 
 | 219 | windows S and T as substrings of a superstring U formed by concatenating | 
 
 
 
 
 | 220 | them like this: | 
 
 
 
 
 | 221 |  | 
 
 
 
 
 | 222 | S[0]S[1]...S[s-1]T[0]T[1]...T[t-1] | 
 
 
 
 
 | 223 |  | 
 
 
 
 
 | 224 | The "address" of a byte in S or T is referred to by its location in U. | 
 
 
 
 
 | 225 | For example, the address of T[k] is s+k. | 
 
 
 
 
 | 226 |  | 
 
 
 
 
 | 227 | The instructions to encode and direct the reconstruction of a target | 
 
 
 
 
 | 228 | window are called delta instructions. There are three types: | 
 
 
 
 
 | 229 |  | 
 
 
 
 
 | 230 | ADD: This instruction has two arguments, a size x and a sequence of | 
 
 
 
 
 | 231 | x bytes to be copied. | 
 
 
 
 
 | 232 | COPY: This instruction has two arguments, a size x and an address p | 
 
 
 
 
 | 233 | in the string U. The arguments specify the substring of U that | 
 
 
 
 
 | 234 | must be copied. We shall assert that such a substring must be | 
 
 
 
 
 | 235 | entirely contained in either S or T. | 
 
 
 
 
 | 236 | RUN: This instruction has two arguments, a size x and a byte b that | 
 
 
 
 
 | 237 | will be repeated x times. | 
 
 
 
 
 | 238 |  | 
 
 
 
 
 | 239 | Below are example source and target windows and the delta instructions | 
 
 
 
 
 | 240 | that encode the target window in terms of the source window. | 
 
 
 
 
 | 241 |  | 
 
 
 
 
 | 242 | a b c d e f g h i j k l m n o p | 
 
 
 
 
 | 243 | a b c d w x y z e f g h e f g h e f g h e f g h z z z z | 
 
 
 
 
 | 244 |  | 
 
 
 
 
 | 245 | COPY  4, 0 | 
 
 
 
 
 | 246 | ADD   4, w x y z | 
 
 
 
 
 | 247 | COPY  4, 4 | 
 
 
 
 
 | 248 | COPY 12, 24 | 
 
 
 
 
 | 249 | RUN   4, z | 
 
 
 
 
 | 250 |  | 
 
 
 
 
 | 251 |  | 
 
 
 
 
 | 252 | Thus, the first letter 'a' in the target window is at location 16 | 
 
 
 
 
 | 253 | in the superstring. Note that the fourth instruction, "COPY 12, 24", | 
 
 
 
 
 | 254 | copies data from T itself since address 24 is position 8 in T. | 
 
 
 
 
 | 255 | This instruction also shows that it is fine to overlap the data to be | 
 
 
 
 
 | 256 | copied with the data being copied from as long as the latter starts | 
 
 
 
 
 | 257 | earlier. This enables efficient encoding of periodic sequences, | 
 
 
 
 
 | 258 | i.e., sequences with regularly repeated subsequences. The RUN instruction | 
 
 
 
 
 | 259 | is a compact way to encode a sequence repeating the same byte even though | 
 
 
 
 
 | 260 | such a sequence can be thought of as a periodic sequence with period 1. | 
 
 
 
 
 | 261 |  | 
 
 
 
 
 | 262 | To reconstruct the target window, one simply processes one delta | 
 
 
 
 
 | 263 | instruction at a time and copy the data either from the source window | 
 
 
 
 
 | 264 | or the being reconstructed target window based on the type of the | 
 
 
 
 
 | 265 | instruction and the associated address, if any. | 
 
 
 
 
 | 266 |  | 
 
 
 
 
 | 267 |  | 
 
 
 
 
 | 268 | 4.  DELTA FILE ORGANIZATION | 
 
 
 
 
 | 269 |  | 
 
 
 
 
 | 270 | A Vcdiff delta file starts with a Header section followed by a sequence | 
 
 
 
 
 | 271 | of Window sections. The Header section includes magic bytes to identify | 
 
 
 
 
 | 272 | the file type, and information concerning data processing beyond the | 
 
 
 
 
 | 273 | basic encoding format. The Window sections encode the target windows. | 
 
 
 
 
 | 274 |  | 
 
 
 
 
 | 275 | Below is the overall organization of a delta file. The indented items | 
 
 
 
 
 | 276 | refine the ones immediately above them. An item in square brackets may | 
 
 
 
 
 | 277 | or may not be present in the file depending on the information encoded | 
 
 
 
 
 | 278 | in the Indicator byte above it. | 
 
 
 
 
 | 279 |  | 
 
 
 
 
 | 280 | Header | 
 
 
 
 
 | 281 | Header1                                  - byte | 
 
 
 
 
 | 282 | Header2                                  - byte | 
 
 
 
 
 | 283 | Header3                                  - byte | 
 
 
 
 
 | 284 | Header4                                  - byte | 
 
 
 
 
 | 285 | Hdr_Indicator                            - byte | 
 
 
 
 
 | 286 | [Secondary compressor ID]                - byte | 
 
 
 
 
 | 287 |  | 
 
 
 
 
 | 288 | [@@@ Why is compressor ID not an integer? ] | 
 
 
 
 
 | 289 | [@@@ If we aren't defining any secondary compressors yet, then it seems | 
 
 
 
 
 | 290 | that defining the [Secondary compressor ID] and the corresponding | 
 
 
 
 
 | 291 | VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value.  An | 
 
 
 
 
 | 292 | implementation of this specification won't be able to decode a VCDIFF | 
 
 
 
 
 | 293 | encoded with this option if it doesn't know about any secondary | 
 
 
 
 
 | 294 | compressors.  It seems that you should specify the bits related to | 
 
 
 
 
 | 295 | secondary compressors once you have defined the first a secondary | 
 
 
 
 
 | 296 | compressor.  I can imagine a secondary-compressor might want to supply | 
 
 
 
 
 | 297 | extra information, such as a dictionary of some kind, in which case | 
 
 
 
 
 | 298 | this speculative treatment wouldn't go far enough.] | 
 
 
 
 
 | 299 |  | 
 
 
 
 
 | 300 | [Length of code table data]              - integer | 
 
 
 
 
 | 301 | [Code table data] | 
 
 
 
 
 | 302 | Size of near cache                   - byte | 
 
 
 
 
 | 303 | Size of same cache                   - byte | 
 
 
 
 
 | 304 | Compressed code table data | 
 
 
 
 
 | 305 | Window1 | 
 
 
 
 
 | 306 | Win_Indicator                            - byte | 
 
 
 
 
 | 307 | [Source segment size]                    - integer | 
 
 
 
 
 | 308 | [Source segment position]                - integer | 
 
 
 
 
 | 309 | The delta encoding of the target window | 
 
 
 
 
 | 310 | Length of the delta encoding         - integer | 
 
 
 
 
 | 311 | The delta encoding | 
 
 
 
 
 | 312 | Size of the target window        - integer | 
 
 
 
 
 | 313 | Delta_Indicator                  - byte | 
 
 
 
 
 | 314 | Length of data for ADDs and RUNs - integer | 
 
 
 
 
 | 315 | Length of instructions and sizes - integer | 
 
 
 
 
 | 316 | Length of addresses for COPYs    - integer | 
 
 
 
 
 | 317 | Data section for ADDs and RUNs   - array of bytes | 
 
 
 
 
 | 318 | Instructions and sizes section   - array of bytes | 
 
 
 
 
 | 319 | Addresses section for COPYs      - array of bytes | 
 
 
 
 
 | 320 | Window2 | 
 
 
 
 
 | 321 | ... | 
 
 
 
 
 | 322 |  | 
 
 
 
 
 | 323 |  | 
 
 
 
 
 | 324 |  | 
 
 
 
 
 | 325 | 4.1 The Header Section | 
 
 
 
 
 | 326 |  | 
 
 
 
 
 | 327 | Each delta file starts with a header section organized as below. | 
 
 
 
 
 | 328 | Note the convention that square-brackets enclose optional items. | 
 
 
 
 
 | 329 |  | 
 
 
 
 
 | 330 | Header1                                  - byte = 0xE6 | 
 
 
 
 
 | 331 | Header2                                  - byte = 0xD3 | 
 
 
 
 
 | 332 | Header3                                  - byte = 0xD4 | 
 
 
 
 
 | 333 |  | 
 
 
 
 
 | 334 | HMMM | 
 
 
 
 
 | 335 |  | 
 
 
 
 
 | 336 | 0xD6 | 
 
 
 
 
 | 337 | 0xC3 | 
 
 
 
 
 | 338 | 0xC4 | 
 
 
 
 
 | 339 |  | 
 
 
 
 
 | 340 | Header4                                  - byte | 
 
 
 
 
 | 341 | Hdr_Indicator                            - byte | 
 
 
 
 
 | 342 | [Secondary compressor ID]                - byte | 
 
 
 
 
 | 343 | [Length of code table data]              - integer | 
 
 
 
 
 | 344 | [Code table data] | 
 
 
 
 
 | 345 |  | 
 
 
 
 
 | 346 | The first three Header bytes are the ASCII characters 'V', 'C' and 'D' | 
 
 
 
 
 | 347 | with their most significant bits turned on (in hexadecimal, the values | 
 
 
 
 
 | 348 | are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to | 
 
 
 
 
 | 349 | zero. In the future, it might be used to indicate the version of Vcdiff. | 
 
 
 
 
 | 350 |  | 
 
 
 
 
 | 351 | The Hdr_Indicator byte shows if there are any initialization data | 
 
 
 
 
 | 352 | required to aid in the reconstruction of data in the Window sections. | 
 
 
 
 
 | 353 | This byte MAY have non-zero values for either, both, or neither of | 
 
 
 
 
 | 354 | the two bits VCD_DECOMPRESS and VCD_CODETABLE below: | 
 
 
 
 
 | 355 |  | 
 
 
 
 
 | 356 | 7 6 5 4 3 2 1 0 | 
 
 
 
 
 | 357 | +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 | 358 | | | | | | | | | | | 
 
 
 
 
 | 359 | +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 | 360 | ^ ^ | 
 
 
 
 
 | 361 | | | | 
 
 
 
 
 | 362 | | +-- VCD_DECOMPRESS | 
 
 
 
 
 | 363 | +---- VCD_CODETABLE | 
 
 
 
 
 | 364 |  | 
 
 
 
 
 | 365 | If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary | 
 
 
 
 
 | 366 | compressor may have been used to further compress certain parts of the | 
 
 
 
 
 | 367 | delta encoding data as described in Sections 4.3 and 6. In that case, | 
 
 
 
 
 | 368 | the ID of the secondary compressor is given next. If this bit is zero, | 
 
 
 
 
 | 369 | the compressor ID byte is not included. | 
 
 
 
 
 | 370 |  | 
 
 
 
 
 | 371 | [@@@ If we aren't defining any secondary compressors yet, then it seems | 
 
 
 
 
 | 372 | this bit has no real value yet..] | 
 
 
 
 
 | 373 |  | 
 
 
 
 
 | 374 | If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an | 
 
 
 
 
 | 375 | application-defined code table is to be used for decoding the delta | 
 
 
 
 
 | 376 | instructions. This table itself is compressed.  The length of the data | 
 
 
 
 
 | 377 | comprising this compressed code table and the data follow next. Section 7 | 
 
 
 
 
 | 378 | discusses application-defined code tables.  If this bit is zero, the code | 
 
 
 
 
 | 379 | table data length and the code table data are not included. | 
 
 
 
 
 | 380 |  | 
 
 
 
 
 | 381 | If both bits are set, then the compressor ID byte is included | 
 
 
 
 
 | 382 | before the code table data length and the code table data. | 
 
 
 
 
 | 383 |  | 
 
 
 
 
 | 384 |  | 
 
 
 
 
 | 385 | 4.2 The Format of a Window Section | 
 
 
 
 
 | 386 |  | 
 
 
 
 
 | 387 | Each Window section is organized as follows: | 
 
 
 
 
 | 388 |  | 
 
 
 
 
 | 389 | Win_Indicator                            - byte | 
 
 
 
 
 | 390 | [Source segment length]                  - integer | 
 
 
 
 
 | 391 | [Source segment position]                - integer | 
 
 
 
 
 | 392 | The delta encoding of the target window | 
 
 
 
 
 | 393 |  | 
 
 
 
 
 | 394 |  | 
 
 
 
 
 | 395 | Below are the detail of the various items: | 
 
 
 
 
 | 396 |  | 
 
 
 
 
 | 397 | [@@@ Here, I want to replace the Win_Indicator with a source-count, | 
 
 
 
 
 | 398 | followed by source-count length/position pairs?] | 
 
 
 
 
 | 399 |  | 
 
 
 
 
 | 400 | Win_Indicator: | 
 
 
 
 
 | 401 | This byte is a set of bits, as shown: | 
 
 
 
 
 | 402 |  | 
 
 
 
 
 | 403 | 7 6 5 4 3 2 1 0 | 
 
 
 
 
 | 404 | +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 | 405 | | | | | | | | | | | 
 
 
 
 
 | 406 | +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 | 407 | ^ ^ | 
 
 
 
 
 | 408 | | | | 
 
 
 
 
 | 409 | | +-- VCD_SOURCE | 
 
 
 
 
 | 410 | +---- VCD_TARGET | 
 
 
 
 
 | 411 |  | 
 
 
 
 
 | 412 |  | 
 
 
 
 
 | 413 | If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment | 
 
 
 
 
 | 414 | of data from the "source" file was used as the corresponding | 
 
 
 
 
 | 415 | source window of data to encode the target window. The decoder | 
 
 
 
 
 | 416 | will use this same source data segment to decode the target window. | 
 
 
 
 
 | 417 |  | 
 
 
 
 
 | 418 | If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment | 
 
 
 
 
 | 419 | of data from the "target" file was used as the corresponding | 
 
 
 
 
 | 420 | source window of data to encode the target window. As above, this | 
 
 
 
 
 | 421 | same source data segment is used to decode the target window. | 
 
 
 
 
 | 422 |  | 
 
 
 
 
 | 423 | The Win_Indicator byte MUST NOT have more than one of the bits | 
 
 
 
 
 | 424 | set (non-zero).  It MAY have none of these bits set. | 
 
 
 
 
 | 425 |  | 
 
 
 
 
 | 426 | If one of these bits is set, the byte is followed by two | 
 
 
 
 
 | 427 | integers to indicate respectively the length and position of | 
 
 
 
 
 | 428 | the source data segment in the relevant file.  If the | 
 
 
 
 
 | 429 | indicator byte is zero, the target window was compressed | 
 
 
 
 
 | 430 | by itself without comparing against another data segment, | 
 
 
 
 
 | 431 | and these two integers are not included. | 
 
 
 
 
 | 432 |  | 
 
 
 
 
 | 433 | The delta encoding of the target window: | 
 
 
 
 
 | 434 | This contains the delta encoding of the target window either | 
 
 
 
 
 | 435 | in terms of the source data segment (i.e., VCD_SOURCE | 
 
 
 
 
 | 436 | or VCD_TARGET was set) or by itself if no source window | 
 
 
 
 
 | 437 | is specified. This data format is discussed next. | 
 
 
 
 
 | 438 |  | 
 
 
 
 
 | 439 |  | 
 
 
 
 
 | 440 | 4.3 The Delta Encoding of a Target Window | 
 
 
 
 
 | 441 |  | 
 
 
 
 
 | 442 | The delta encoding of a target window is organized as follows: | 
 
 
 
 
 | 443 |  | 
 
 
 
 
 | 444 | Length of the delta encoding            - integer | 
 
 
 
 
 | 445 | The delta encoding | 
 
 
 
 
 | 446 | Length of the target window         - integer | 
 
 
 
 
 | 447 | Delta_Indicator                     - byte | 
 
 
 
 
 | 448 | Length of data for ADDs and RUNs    - integer | 
 
 
 
 
 | 449 | Length of instructions section      - integer | 
 
 
 
 
 | 450 | Length of addresses for COPYs       - integer | 
 
 
 
 
 | 451 | Data section for ADDs and RUNs      - array of bytes | 
 
 
 
 
 | 452 | Instructions and sizes section      - array of bytes | 
 
 
 
 
 | 453 | Addresses section for COPYs         - array of bytes | 
 
 
 
 
 | 454 |  | 
 
 
 
 
 | 455 |  | 
 
 
 
 
 | 456 | Length of the delta encoding: | 
 
 
 
 
 | 457 | This integer gives the total number of remaining bytes that | 
 
 
 
 
 | 458 | comprise data of the delta encoding for this target window. | 
 
 
 
 
 | 459 |  | 
 
 
 
 
 | 460 | The delta encoding: | 
 
 
 
 
 | 461 | This contains the data representing the delta encoding which | 
 
 
 
 
 | 462 | is described next. | 
 
 
 
 
 | 463 |  | 
 
 
 
 
 | 464 | Length of the target window: | 
 
 
 
 
 | 465 | This integer indicates the actual size of the target window | 
 
 
 
 
 | 466 | after decompression. A decoder can use this value to allocate | 
 
 
 
 
 | 467 | memory to store the uncompressed data. | 
 
 
 
 
 | 468 |  | 
 
 
 
 
 | 469 | Delta_Indicator: | 
 
 
 
 
 | 470 | This byte is a set of bits, as shown: | 
 
 
 
 
 | 471 |  | 
 
 
 
 
 | 472 | 7 6 5 4 3 2 1 0 | 
 
 
 
 
 | 473 | +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 | 474 | | | | | | | | | | | 
 
 
 
 
 | 475 | +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 | 476 | ^ ^ ^ | 
 
 
 
 
 | 477 | | | | | 
 
 
 
 
 | 478 | | | +-- VCD_DATACOMP | 
 
 
 
 
 | 479 | | +---- VCD_INSTCOMP | 
 
 
 
 
 | 480 | +------ VCD_ADDRCOMP | 
 
 
 
 
 | 481 |  | 
 
 
 
 
 | 482 | VCD_DATACOMP:   bit value 1. | 
 
 
 
 
 | 483 | VCD_INSTCOMP:   bit value 2. | 
 
 
 
 
 | 484 | VCD_ADDRCOMP:   bit value 4. | 
 
 
 
 
 | 485 |  | 
 
 
 
 
 | 486 | As discussed, the delta encoding consists of COPY, ADD and RUN | 
 
 
 
 
 | 487 | instructions. The ADD and RUN instructions have accompanying | 
 
 
 
 
 | 488 | unmatched data (that is, data that does not specifically match | 
 
 
 
 
 | 489 | any data in the source window or in some earlier part of the | 
 
 
 
 
 | 490 | target window) and the COPY instructions have addresses of where | 
 
 
 
 
 | 491 | the matches occur. OPTIONALLY, these types of data MAY be further | 
 
 
 
 
 | 492 | compressed using a secondary compressor. Thus, Vcdiff separates | 
 
 
 
 
 | 493 | the encoding of the delta instructions into three parts: | 
 
 
 
 
 | 494 |  | 
 
 
 
 
 | 495 | a. The unmatched data in the ADD and RUN instructions, | 
 
 
 
 
 | 496 | b. The delta instructions and accompanying sizes, and | 
 
 
 
 
 | 497 | c. The addresses of the COPY instructions. | 
 
 
 
 
 | 498 |  | 
 
 
 
 
 | 499 | If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these | 
 
 
 
 
 | 500 | sections may have been compressed using the specified secondary | 
 
 
 
 
 | 501 | compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP), | 
 
 
 
 
 | 502 | and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that | 
 
 
 
 
 | 503 | the corresponding parts are compressed. Then, these parts MUST | 
 
 
 
 
 | 504 | be decompressed before decoding the delta instructions. | 
 
 
 
 
 | 505 |  | 
 
 
 
 
 | 506 | Length of data for ADDs and RUNs: | 
 
 
 
 
 | 507 | This is the length (in bytes) of the section of data storing | 
 
 
 
 
 | 508 | the unmatched data accompanying the ADD and RUN instructions. | 
 
 
 
 
 | 509 |  | 
 
 
 
 
 | 510 | Length of instructions section: | 
 
 
 
 
 | 511 | This is the length (in bytes) of the delta instructions and | 
 
 
 
 
 | 512 | accompanying sizes. | 
 
 
 
 
 | 513 |  | 
 
 
 
 
 | 514 | Length of addresses for COPYs: | 
 
 
 
 
 | 515 | This is the length (in bytes) of the section storing | 
 
 
 
 
 | 516 | the addresses of the COPY instructions. | 
 
 
 
 
 | 517 |  | 
 
 
 
 
 | 518 | Data section for ADDs and RUNs: | 
 
 
 
 
 | 519 | This sequence of bytes encodes the unmatched data for the ADD | 
 
 
 
 
 | 520 | and RUN instructions. | 
 
 
 
 
 | 521 |  | 
 
 
 
 
 | 522 | Instructions and sizes section: | 
 
 
 
 
 | 523 | This sequence of bytes encodes the instructions and their sizes. | 
 
 
 
 
 | 524 |  | 
 
 
 
 
 | 525 | Addresses section for COPYs: | 
 
 
 
 
 | 526 | This sequence of bytes encodes the addresses of the COPY | 
 
 
 
 
 | 527 | instructions. | 
 
 
 
 
 | 528 |  | 
 
 
 
 
 | 529 |  | 
 
 
 
 
 | 530 | 5. DELTA INSTRUCTION ENCODING | 
 
 
 
 
 | 531 |  | 
 
 
 
 
 | 532 | The delta instructions described in Section 3 represent the results of | 
 
 
 
 
 | 533 | string matching. For many data differencing applications in which the | 
 
 
 
 
 | 534 | changes between source and target data are small, any straightforward | 
 
 
 
 
 | 535 | representation of these instructions would be adequate.  However, for | 
 
 
 
 
 | 536 | applications including data compression, it is important to encode | 
 
 
 
 
 | 537 | these instructions well to achieve good compression rates.  From our | 
 
 
 
 
 | 538 | experience, the following observations can be made: | 
 
 
 
 
 | 539 |  | 
 
 
 
 
 | 540 | a. The addresses in COPY instructions are locations of matches and | 
 
 
 
 
 | 541 | often occur close by or even exactly equal to one another. This is | 
 
 
 
 
 | 542 | because data in local regions are often replicated with minor changes. | 
 
 
 
 
 | 543 | In turn, this means that coding a newly matched address against some | 
 
 
 
 
 | 544 | set of recently matched addresses can be beneficial. | 
 
 
 
 
 | 545 |  | 
 
 
 
 
 | 546 | b. The matches are often short in length and separated by small amounts | 
 
 
 
 
 | 547 | of unmatched data. That is, the lengths of COPY and ADD instructions | 
 
 
 
 
 | 548 | are often small. This is particularly true of binary data such as | 
 
 
 
 
 | 549 | executable files or structured data such as HTML or XML. In such cases, | 
 
 
 
 
 | 550 | compression can be improved by combining the encoding of the sizes | 
 
 
 
 
 | 551 | and the instruction types as well as combining the encoding of adjacent | 
 
 
 
 
 | 552 | delta instructions with sufficiently small data sizes. | 
 
 
 
 
 | 553 |  | 
 
 
 
 
 | 554 | The below subsections discuss how the Vcdiff data format provides | 
 
 
 
 
 | 555 | mechanisms enabling encoders to use the above observations to improve | 
 
 
 
 
 | 556 | compression rates. | 
 
 
 
 
 | 557 |  | 
 
 
 
 
 | 558 |  | 
 
 
 
 
 | 559 | 5.1 Address Encoding Modes of COPY Instructions | 
 
 
 
 
 | 560 |  | 
 
 
 
 
 | 561 | As mentioned earlier, addresses of COPY instructions often occur close | 
 
 
 
 
 | 562 | to one another or are exactly equal. To take advantage of this phenomenon | 
 
 
 
 
 | 563 | and encode addresses of COPY instructions more efficiently, the Vcdiff | 
 
 
 
 
 | 564 | data format supports the use of two different types of address caches. | 
 
 
 
 
 | 565 | Both the encoder and decoder maintain these caches, so that decoder's | 
 
 
 
 
 | 566 | caches remain synchronized with the encoder's caches. | 
 
 
 
 
 | 567 |  | 
 
 
 
 
 | 568 | a. A "near" cache is an array with "s_near" slots, each containing an | 
 
 
 
 
 | 569 | address used for encoding addresses nearby to previously encoded | 
 
 
 
 
 | 570 | addresses (in the positive direction only).  The near cache also | 
 
 
 
 
 | 571 | maintains a "next_slot" index to the near cache.  New entries to the | 
 
 
 
 
 | 572 | near cache are always inserted in the next_slot index, which maintains | 
 
 
 
 
 | 573 | a circular buffer of the s_near most recent addresses. | 
 
 
 
 
 | 574 |  | 
 
 
 
 
 | 575 | b. A "same" cache is an array with "s_same" multiple of 256 slots, each | 
 
 
 
 
 | 576 | containing an address.  The same cache maintains a hash table of recent | 
 
 
 
 
 | 577 | addresses used for repeated encoding of the exact same address. | 
 
 
 
 
 | 578 |  | 
 
 
 
 
 | 579 |  | 
 
 
 
 
 | 580 | By default, the parameters s_near and s_same are respectively set to 4 | 
 
 
 
 
 | 581 | and 3. An encoder MAY modify these values, but then it MUST encode the | 
 
 
 
 
 | 582 | new values in the encoding itself, as discussed in Section 7, so that | 
 
 
 
 
 | 583 | the decoder can properly set up its own caches. | 
 
 
 
 
 | 584 |  | 
 
 
 
 
 | 585 | At the start of processing a target window, an implementation | 
 
 
 
 
 | 586 | (encoder or decoder) initializes all of the slots in both caches | 
 
 
 
 
 | 587 | to zero.  The next_slot pointer of the near cache is set | 
 
 
 
 
 | 588 | to point to slot zero. | 
 
 
 
 
 | 589 |  | 
 
 
 
 
 | 590 | Each time a COPY instruction is processed by the encoder or | 
 
 
 
 
 | 591 | decoder, the implementation's caches are updated as follows, where | 
 
 
 
 
 | 592 | "addr" is the address in the COPY instruction. | 
 
 
 
 
 | 593 |  | 
 
 
 
 
 | 594 | a. The slot in the near cache referenced by the next_slot | 
 
 
 
 
 | 595 | index is set to addr.  The next_slot index is then incremented | 
 
 
 
 
 | 596 | modulo s_near. | 
 
 
 
 
 | 597 |  | 
 
 
 
 
 | 598 | b. The slot in the same cache whose index is addr%(s_same*256) | 
 
 
 
 
 | 599 | is set to addr. [We use the C notations of % for modulo and | 
 
 
 
 
 | 600 | * for multiplication.] | 
 
 
 
 
 | 601 |  | 
 
 
 
 
 | 602 |  | 
 
 
 
 
 | 603 | 5.2 Example code for maintaining caches | 
 
 
 
 
 | 604 |  | 
 
 
 
 
 | 605 | To make clear the above description, below are example cache data | 
 
 
 
 
 | 606 | structures and algorithms to initialize and update them: | 
 
 
 
 
 | 607 |  | 
 
 
 
 
 | 608 | typedef struct _cache_s | 
 
 
 
 
 | 609 | { | 
 
 
 
 
 | 610 | int*  near;      /* array of size s_near        */ | 
 
 
 
 
 | 611 | int   s_near; | 
 
 
 
 
 | 612 | int   next_slot; /* the circular index for near */ | 
 
 
 
 
 | 613 | int*  same;      /* array of size s_same*256    */ | 
 
 
 
 
 | 614 | int   s_same; | 
 
 
 
 
 | 615 | } Cache_t; | 
 
 
 
 
 | 616 |  | 
 
 
 
 
 | 617 | cache_init(Cache_t* ka) | 
 
 
 
 
 | 618 | { | 
 
 
 
 
 | 619 | int   i; | 
 
 
 
 
 | 620 |  | 
 
 
 
 
 | 621 | ka->next_slot = 0; | 
 
 
 
 
 | 622 | for(i = 0; i < ka->s_near; ++i) | 
 
 
 
 
 | 623 | ka->near[i] = 0; | 
 
 
 
 
 | 624 |  | 
 
 
 
 
 | 625 | for(i = 0; i < ka->s_same*256; ++i) | 
 
 
 
 
 | 626 | ka->same[i] = 0; | 
 
 
 
 
 | 627 | } | 
 
 
 
 
 | 628 |  | 
 
 
 
 
 | 629 | cache_update(Cache_t* ka, int addr) | 
 
 
 
 
 | 630 | { | 
 
 
 
 
 | 631 | if(ka->s_near > 0) | 
 
 
 
 
 | 632 | {   ka->near[ka->next_slot] = addr; | 
 
 
 
 
 | 633 | ka->next_slot = (ka->next_slot + 1) % ka->s_near; | 
 
 
 
 
 | 634 | } | 
 
 
 
 
 | 635 |  | 
 
 
 
 
 | 636 | if(ka->s_same > 0) | 
 
 
 
 
 | 637 | ka->same[addr % (ka->s_same*256)] = addr; | 
 
 
 
 
 | 638 | } | 
 
 
 
 
 | 639 |  | 
 
 
 
 
 | 640 |  | 
 
 
 
 
 | 641 | 5.3 Encoding of COPY instruction addresses | 
 
 
 
 
 | 642 |  | 
 
 
 
 
 | 643 | The address of a COPY instruction is encoded using different modes | 
 
 
 
 
 | 644 | depending on the type of cached address used, if any. | 
 
 
 
 
 | 645 |  | 
 
 
 
 
 | 646 | Let "addr" be the address of a COPY instruction to be decoded and "here" | 
 
 
 
 
 | 647 | be the current location in the target data (i.e., the start of the data | 
 
 
 
 
 | 648 | about to be encoded or decoded).  Let near[j] be the jth element in | 
 
 
 
 
 | 649 | the near cache, and same[k] be the kth element in the same cache. | 
 
 
 
 
 | 650 | Below are the possible address modes: | 
 
 
 
 
 | 651 |  | 
 
 
 
 
 | 652 | VCD_SELF: This mode has value 0. The address was encoded by itself | 
 
 
 
 
 | 653 | as an integer. | 
 
 
 
 
 | 654 |  | 
 
 
 
 
 | 655 | VCD_HERE: This mode has value 1. The address was encoded as | 
 
 
 
 
 | 656 | the integer value "here - addr". | 
 
 
 
 
 | 657 |  | 
 
 
 
 
 | 658 | Near modes: The "near modes" are in the range [2,s_near+1]. Let m | 
 
 
 
 
 | 659 | be the mode of the address encoding. The address was encoded | 
 
 
 
 
 | 660 | as the integer value "addr - near[m-2]". | 
 
 
 
 
 | 661 |  | 
 
 
 
 
 | 662 | Same modes: The "same modes" are in the range | 
 
 
 
 
 | 663 | [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding. | 
 
 
 
 
 | 664 | The address was encoded as a single byte b such that | 
 
 
 
 
 | 665 | "addr == same[(m - (s_near+2))*256 + b]". | 
 
 
 
 
 | 666 |  | 
 
 
 
 
 | 667 |  | 
 
 
 
 
 | 668 | 5.3 Example code for encoding and decoding of COPY instruction addresses | 
 
 
 
 
 | 669 |  | 
 
 
 
 
 | 670 | We show example algorithms below to demonstrate use of address modes more | 
 
 
 
 
 | 671 | clearly. The encoder has freedom to choose address modes, the sample | 
 
 
 
 
 | 672 | addr_encode() algorithm merely shows one way of picking the address | 
 
 
 
 
 | 673 | mode. The decoding algorithm addr_decode() will uniquely decode addresses | 
 
 
 
 
 | 674 | regardless of the encoder's algorithm choice. | 
 
 
 
 
 | 675 |  | 
 
 
 
 
 | 676 | Note that the address caches are updated immediately after an address is | 
 
 
 
 
 | 677 | encoded or decoded. In this way, the decoder is always synchronized with | 
 
 
 
 
 | 678 | the encoder. | 
 
 
 
 
 | 679 |  | 
 
 
 
 
 | 680 | int addr_encode(Cache_t* ka, int addr, int here, int* mode) | 
 
 
 
 
 | 681 | { | 
 
 
 
 
 | 682 | int  i, d, bestd, bestm; | 
 
 
 
 
 | 683 |  | 
 
 
 
 
 | 684 | /* Attempt to find the address mode that yields the | 
 
 
 
 
 | 685 | * smallest integer value for "d", the encoded address | 
 
 
 
 
 | 686 | * value, thereby minimizing the encoded size of the | 
 
 
 
 
 | 687 | * address. */ | 
 
 
 
 
 | 688 |  | 
 
 
 
 
 | 689 | bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */ | 
 
 
 
 
 | 690 |  | 
 
 
 
 
 | 691 | if((d = here-addr) < bestd) | 
 
 
 
 
 | 692 | { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */ | 
 
 
 
 
 | 693 |  | 
 
 
 
 
 | 694 | for(i = 0; i < ka->s_near; ++i) | 
 
 
 
 
 | 695 | if((d = addr - ka->near[i]) >= 0 && d < bestd) | 
 
 
 
 
 | 696 | { bestd = d; bestm = i+2; } | 
 
 
 
 
 | 697 |  | 
 
 
 
 
 | 698 | if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr) | 
 
 
 
 
 | 699 | { bestd = d%256; bestm = ka->s_near + 2 + d/256; } | 
 
 
 
 
 | 700 |  | 
 
 
 
 
 | 701 | cache_update(ka,addr); | 
 
 
 
 
 | 702 |  | 
 
 
 
 
 | 703 | *mode = bestm; /* this returns the address encoding mode */ | 
 
 
 
 
 | 704 | return  bestd; /* this returns the encoded address       */ | 
 
 
 
 
 | 705 | } | 
 
 
 
 
 | 706 |  | 
 
 
 
 
 | 707 | Note that the addr_encode() algorithm chooses the best address mode using a | 
 
 
 
 
 | 708 | local optimization, but that may not lead to the best encoding efficiency | 
 
 
 
 
 | 709 | because different modes lead to different instruction encodings, as    described below. | 
 
 
 
 
 | 710 |  | 
 
 
 
 
 | 711 | The functions addrint() and addrbyte() used in addr_decode() obtain from | 
 
 
 
 
 | 712 | the "Addresses section for COPYs" (Section 4.3) an integer or a byte, | 
 
 
 
 
 | 713 | respectively. These utilities will not be described here.  We simply | 
 
 
 
 
 | 714 | recall that an integer is represented as a compact variable-sized string | 
 
 
 
 
 | 715 | of bytes as described in Section 2 (i.e., base 128). | 
 
 
 
 
 | 716 |  | 
 
 
 
 
 | 717 | int addr_decode(Cache_t* ka, int here, int mode) | 
 
 
 
 
 | 718 | {   int  addr, m; | 
 
 
 
 
 | 719 |  | 
 
 
 
 
 | 720 | if(mode == VCD_SELF) | 
 
 
 
 
 | 721 | addr = addrint(); | 
 
 
 
 
 | 722 | else if(mode == VCD_HERE) | 
 
 
 
 
 | 723 | addr = here - addrint(); | 
 
 
 
 
 | 724 | else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */ | 
 
 
 
 
 | 725 | addr = ka->near[m] + addrint(); | 
 
 
 
 
 | 726 | else /* same cache */ | 
 
 
 
 
 | 727 | {    m = mode - (2 + ka->s_near); | 
 
 
 
 
 | 728 | addr = ka->same[m*256 + addrbyte()]; | 
 
 
 
 
 | 729 | } | 
 
 
 
 
 | 730 |  | 
 
 
 
 
 | 731 | cache_update(ka, addr); | 
 
 
 
 
 | 732 |  | 
 
 
 
 
 | 733 | return addr; | 
 
 
 
 
 | 734 | } | 
 
 
 
 
 | 735 |  | 
 
 
 
 
 | 736 |  | 
 
 
 
 
 | 737 | 5.4 Instruction Codes | 
 
 
 
 
 | 738 |  | 
 
 
 
 
 | 739 | As noted, the data sizes associated with delta instructions are often | 
 
 
 
 
 | 740 | small. Thus, compression efficiency can be improved by combining the sizes | 
 
 
 
 
 | 741 | and instruction types in a single encoding, as well by combining certain | 
 
 
 
 
 | 742 | pairs of adjacent delta instructions. Effective choices of when to perform | 
 
 
 
 
 | 743 | such combinations depend on many factors including the data being processed | 
 
 
 
 
 | 744 | and the string matching algorithm in use. For example, if many COPY | 
 
 
 
 
 | 745 | instructions have the same data sizes, it may be worth to encode these | 
 
 
 
 
 | 746 | instructions more compactly than others. | 
 
 
 
 
 | 747 |  | 
 
 
 
 
 | 748 | The Vcdiff data format is designed so that a decoder does not need to be | 
 
 
 
 
 | 749 | aware of the choices made in encoding algorithms. This is achieved with the | 
 
 
 
 
 | 750 | notion of an "instruction code table" containing 256 entries. Each entry | 
 
 
 
 
 | 751 | defines either a single delta instruction or a pair of instructions that | 
 
 
 
 
 | 752 | have been combined.  Note that the code table itself only exists in main | 
 
 
 
 
 | 753 | memory, not in the delta file (unless using an application-defined code | 
 
 
 
 
 | 754 | table, described in Section 7). The encoded data simply includes the index | 
 
 
 
 
 | 755 | of each instruction and, since there are only 256 indices, each index | 
 
 
 
 
 | 756 | can be represented as a single byte. | 
 
 
 
 
 | 757 |  | 
 
 
 
 
 | 758 | Each instruction code entry contains six fields, each of which | 
 
 
 
 
 | 759 | is a single byte with unsigned value: | 
 
 
 
 
 | 760 |  | 
 
 
 
 
 | 761 | +-----------------------------------------------+ | 
 
 
 
 
 | 762 | | inst1 | size1 | mode1 | inst2 | size2 | mode2 | | 
 
 
 
 
 | 763 | +-----------------------------------------------+ | 
 
 
 
 
 | 764 |  | 
 
 
 
 
 | 765 | @@@ could be more compact | 
 
 
 
 
 | 766 |  | 
 
 
 
 
 | 767 | Each triple (inst,size,mode) defines a delta instruction. The meanings | 
 
 
 
 
 | 768 | of these fields are as follows: | 
 
 
 
 
 | 769 |  | 
 
 
 
 
 | 770 | inst: An "inst" field can have one of the four values: NOOP (0), ADD (1), | 
 
 
 
 
 | 771 | RUN (2) or COPY (3) to indicate the instruction types. NOOP means | 
 
 
 
 
 | 772 | that no instruction is specified. In this case, both the corresponding | 
 
 
 
 
 | 773 | size and mode fields will be zero. | 
 
 
 
 
 | 774 |  | 
 
 
 
 
 | 775 | size: A "size" field is zero or positive. A value zero means that the | 
 
 
 
 
 | 776 | size associated with the instruction is encoded separately as | 
 
 
 
 
 | 777 | an integer in the "Instructions and sizes section" (Section 6). | 
 
 
 
 
 | 778 | A positive value for "size" defines the actual data size. | 
 
 
 
 
 | 779 | Note that since the size is restricted to a byte, the maximum | 
 
 
 
 
 | 780 | value for any instruction with size implicitly defined in the code | 
 
 
 
 
 | 781 | table is 255. | 
 
 
 
 
 | 782 |  | 
 
 
 
 
 | 783 | mode: A "mode" field is significant only when the associated delta | 
 
 
 
 
 | 784 | instruction is a COPY. It defines the mode used to encode the | 
 
 
 
 
 | 785 | associated addresses. For other instructions, this is always zero. | 
 
 
 
 
 | 786 |  | 
 
 
 
 
 | 787 |  | 
 
 
 
 
 | 788 | 5.5 The Code Table | 
 
 
 
 
 | 789 |  | 
 
 
 
 
 | 790 | Following the discussions on address modes and instruction code tables, | 
 
 
 
 
 | 791 | we define a "Code Table" to have the data below: | 
 
 
 
 
 | 792 |  | 
 
 
 
 
 | 793 | s_near: the size of the near cache, | 
 
 
 
 
 | 794 | s_same: the size of the same cache, | 
 
 
 
 
 | 795 | i_code: the 256-entry instruction code table. | 
 
 
 
 
 | 796 |  | 
 
 
 
 
 | 797 | Vcdiff itself defines a "default code table" in which s_near is 4 | 
 
 
 
 
 | 798 | and s_same is 3. Thus, there are 9 address modes for a COPY instruction. | 
 
 
 
 
 | 799 | The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5 | 
 
 
 
 
 | 800 | are for addresses coded against the near cache. And, modes 6, 7  and 8 | 
 
 
 
 
 | 801 | are for addresses coded against the same cache. | 
 
 
 
 
 | 802 |  | 
 
 
 
 
 | 803 | The default instruction code table is depicted below, in a compact | 
 
 
 
 
 | 804 | representation that we use only for descriptive purposes.  See section 7 | 
 
 
 
 
 | 805 | for the specification of how an instruction code table is represented | 
 
 
 
 
 | 806 | in the Vcdiff encoding format.  In the depiction, a zero value for | 
 
 
 
 
 | 807 | size indicates that the size is separately coded. The mode of non-COPY | 
 
 
 
 
 | 808 | instructions is represented as 0 even though they are not used. | 
 
 
 
 
 | 809 |  | 
 
 
 
 
 | 810 |  | 
 
 
 
 
 | 811 | TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX | 
 
 
 
 
 | 812 | --------------------------------------------------------------- | 
 
 
 
 
 | 813 | 1.  RUN         0        0     NOOP       0        0        0 | 
 
 
 
 
 | 814 | 2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18] | 
 
 
 
 
 | 815 | 3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34] | 
 
 
 
 
 | 816 | 4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50] | 
 
 
 
 
 | 817 | 5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66] | 
 
 
 
 
 | 818 | 6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82] | 
 
 
 
 
 | 819 | 7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98] | 
 
 
 
 
 | 820 | 8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114] | 
 
 
 
 
 | 821 | 9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130] | 
 
 
 
 
 | 822 | 10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146] | 
 
 
 
 
 | 823 | 11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162] | 
 
 
 
 
 | 824 | 12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174] | 
 
 
 
 
 | 825 | 13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186] | 
 
 
 
 
 | 826 | 14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198] | 
 
 
 
 
 | 827 | 15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210] | 
 
 
 
 
 | 828 | 16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222] | 
 
 
 
 
 | 829 | 17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234] | 
 
 
 
 
 | 830 | 18.  ADD       [1,4]      0     COPY       4        6    [235,238] | 
 
 
 
 
 | 831 | 19.  ADD       [1,4]      0     COPY       4        7    [239,242] | 
 
 
 
 
 | 832 | 20.  ADD       [1,4]      0     COPY       4        8    [243,246] | 
 
 
 
 
 | 833 | 21.  COPY        4      [0,8]   ADD        1        0    [247,255] | 
 
 
 
 
 | 834 | --------------------------------------------------------------- | 
 
 
 
 
 | 835 |  | 
 
 
 
 
 | 836 | In the above depiction, each numbered line represents one or more | 
 
 
 
 
 | 837 | entries in the actual instruction code table (recall that an entry in | 
 
 
 
 
 | 838 | the instruction code table may represent up to two combined delta | 
 
 
 
 
 | 839 | instructions.) The last column ("INDEX") shows which index value or | 
 
 
 
 
 | 840 | range of index values of the entries covered by that line. The notation | 
 
 
 
 
 | 841 | [i,j] means values from i through j, inclusive. The first 6 columns of | 
 
 
 
 
 | 842 | a line in the depiction describe the pairs of instructions used for | 
 
 
 
 
 | 843 | the corresponding index value(s). | 
 
 
 
 
 | 844 |  | 
 
 
 
 
 | 845 | If a line in the depiction includes a column entry using the [i,j] | 
 
 
 
 
 | 846 | notation, this means that the line is instantiated for each value | 
 
 
 
 
 | 847 | in the range from i to j, inclusive.  The notation "0, [i,j]" means | 
 
 
 
 
 | 848 | that the line is instantiated for the value 0 and for each value | 
 
 
 
 
 | 849 | in the range from i to j, inclusive. | 
 
 
 
 
 | 850 |  | 
 
 
 
 
 | 851 | If a line in the depiction includes more than one entry using the [i,j] | 
 
 
 
 
 | 852 | notation, implying a "nested loop" to convert the line to a range of | 
 
 
 
 
 | 853 | table entries, the first such [i,j] range specifies the outer loop, | 
 
 
 
 
 | 854 | and the second specifies the inner loop. | 
 
 
 
 
 | 855 |  | 
 
 
 
 
 | 856 | The below examples should make clear the above description: | 
 
 
 
 
 | 857 |  | 
 
 
 
 
 | 858 | Line 1 shows the single RUN instruction with index 0. As the size field | 
 
 
 
 
 | 859 | is 0, this RUN instruction always has its actual size encoded separately. | 
 
 
 
 
 | 860 |  | 
 
 
 
 
 | 861 | Line 2 shows the 18 single ADD instructions. The ADD instruction with | 
 
 
 
 
 | 862 | size field 0 (i.e., the actual size is coded separately) has index 1. | 
 
 
 
 
 | 863 | ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and | 
 
 
 
 
 | 864 | their sizes are as given (so they will not be separately encoded.) | 
 
 
 
 
 | 865 |  | 
 
 
 
 
 | 866 | Following the single ADD instructions are the single COPY instructions | 
 
 
 
 
 | 867 | ordered by their address encoding modes. For example, line 11 shows the | 
 
 
 
 
 | 868 | COPY instructions with mode 8, i.e., the last of the same cache. | 
 
 
 
 
 | 869 | In this case, the COPY instruction with size field 0 has index 147. | 
 
 
 
 
 | 870 | Again, the actual size of this instruction will be coded separately. | 
 
 
 
 
 | 871 |  | 
 
 
 
 
 | 872 | Lines 12 to 21 show the pairs of instructions that are combined together. | 
 
 
 
 
 | 873 | For example, line 12 depicts the 12 entries in which an ADD instruction | 
 
 
 
 
 | 874 | is combined with an immediately following COPY instruction. The entries | 
 
 
 
 
 | 875 | with indices 163, 164, 165 represent the pairs in which the ADD | 
 
 
 
 
 | 876 | instructions all have size 1 while the COPY instructions has mode | 
 
 
 
 
 | 877 | 0 (VCD_SELF) and sizes 4, 5 and 6 respectively. | 
 
 
 
 
 | 878 |  | 
 
 
 
 
 | 879 | The last line, line 21, shows the eight instruction pairs where the first | 
 
 
 
 
 | 880 | instruction is a COPY and the second is an ADD. In this case, all COPY | 
 
 
 
 
 | 881 | instructions have size 4 with mode ranging from 0 to 8 and all the ADD | 
 
 
 
 
 | 882 | instructions have size 1. Thus, the entry with largest index 255 | 
 
 
 
 
 | 883 | combines a COPY instruction of size 4 and mode 8 with an ADD instruction | 
 
 
 
 
 | 884 | of size 1. | 
 
 
 
 
 | 885 |  | 
 
 
 
 
 | 886 | The choice of the minimum size 4 for COPY instructions in the default code | 
 
 
 
 
 | 887 | table was made from experiments that showed that excluding small matches | 
 
 
 
 
 | 888 | (less then 4 bytes long) improved the compression rates. | 
 
 
 
 
 | 889 |  | 
 
 
 
 
 | 890 |  | 
 
 
 
 
 | 891 | 6. DECODING A TARGET WINDOW | 
 
 
 
 
 | 892 |  | 
 
 
 
 
 | 893 | Section 4.3 discusses that the delta instructions and associated data | 
 
 
 
 
 | 894 | are encoded in three arrays of bytes: | 
 
 
 
 
 | 895 |  | 
 
 
 
 
 | 896 | Data section for ADDs and RUNs, | 
 
 
 
 
 | 897 | Instructions and sizes section, and | 
 
 
 
 
 | 898 | Addresses section for COPYs. | 
 
 
 
 
 | 899 |  | 
 
 
 
 
 | 900 |  | 
 
 
 
 
 | 901 | Further, these data sections may have been further compressed by some | 
 
 
 
 
 | 902 | secondary compressor. Assuming that any such compressed data has been | 
 
 
 
 
 | 903 | decompressed so that we now have three arrays: | 
 
 
 
 
 | 904 |  | 
 
 
 
 
 | 905 | inst: bytes coding the instructions and sizes. | 
 
 
 
 
 | 906 | data: unmatched data associated with ADDs and RUNs. | 
 
 
 
 
 | 907 | addr: bytes coding the addresses of COPYs. | 
 
 
 
 
 | 908 |  | 
 
 
 
 
 | 909 | These arrays are organized as follows: | 
 
 
 
 
 | 910 |  | 
 
 
 
 
 | 911 | inst: | 
 
 
 
 
 | 912 | a sequence of (index, [size1], [size2]) tuples, where "index" | 
 
 
 
 
 | 913 | is an index into the instruction code table, and size1 and size2 | 
 
 
 
 
 | 914 | are integers that MAY or MAY NOT be included in the tuple as | 
 
 
 
 
 | 915 | follows. The entry with the given "index" in the instruction | 
 
 
 
 
 | 916 | code table potentially defines two delta instructions. If the | 
 
 
 
 
 | 917 | first delta instruction is not a VCD_NOOP and its size is zero, | 
 
 
 
 
 | 918 | then size1 MUST be present. Otherwise, size1 MUST be omitted and | 
 
 
 
 
 | 919 | the size of the instruction (if it is not VCD_NOOP) is as defined | 
 
 
 
 
 | 920 | in the table. The presence or absence of size2 is defined | 
 
 
 
 
 | 921 | similarly with respect to the second delta instruction. | 
 
 
 
 
 | 922 |  | 
 
 
 
 
 | 923 | data: | 
 
 
 
 
 | 924 | a sequence of data values, encoded as bytes. | 
 
 
 
 
 | 925 |  | 
 
 
 
 
 | 926 | addr: | 
 
 
 
 
 | 927 | a sequence of address values. Addresses are normally encoded as | 
 
 
 
 
 | 928 | integers as described in Section 2 (i.e., base 128). | 
 
 
 
 
 | 929 | Since the same cache emits addresses in the range [0,255], | 
 
 
 
 
 | 930 | however, same cache addresses are always encoded as a | 
 
 
 
 
 | 931 | single byte. | 
 
 
 
 
 | 932 |  | 
 
 
 
 
 | 933 | To summarize, each tuple in the "inst" array includes an index to some | 
 
 
 
 
 | 934 | entry in the instruction code table that determines: | 
 
 
 
 
 | 935 |  | 
 
 
 
 
 | 936 | a. Whether one or two instructions were encoded and their types. | 
 
 
 
 
 | 937 |  | 
 
 
 
 
 | 938 | b. If the instructions have their sizes encoded separately, these | 
 
 
 
 
 | 939 | sizes will follow, in order, in the tuple. | 
 
 
 
 
 | 940 |  | 
 
 
 
 
 | 941 | c. If the instructions have accompanying data, i.e., ADDs or RUNs, | 
 
 
 
 
 | 942 | their data will be in the array "data". | 
 
 
 
 
 | 943 |  | 
 
 
 
 
 | 944 | d. Similarly, if the instructions are COPYs, the coded addresses are | 
 
 
 
 
 | 945 | found in the array "addr". | 
 
 
 
 
 | 946 |  | 
 
 
 
 
 | 947 | The decoding procedure simply processes the arrays by reading one code | 
 
 
 
 
 | 948 | index at a time, looking up the corresponding instruction code entry, | 
 
 
 
 
 | 949 | then consuming the respective sizes, data and addresses following the | 
 
 
 
 
 | 950 | directions in this entry. In other words, the decoder maintains an implicit | 
 
 
 
 
 | 951 | next-element pointer for each array; "consuming" an instruction tuple, | 
 
 
 
 
 | 952 | data, or address value implies incrementing the associated pointer. | 
 
 
 
 
 | 953 |  | 
 
 
 
 
 | 954 | For example, if during the processing of the target window, the next | 
 
 
 
 
 | 955 | unconsumed tuple in the inst array has index value 19, then the first | 
 
 
 
 
 | 956 | instruction is a COPY, whose size is found as the immediately following | 
 
 
 
 
 | 957 | integer in the inst array.  Since the mode of this COPY instruction is | 
 
 
 
 
 | 958 | VCD_SELF, the corresponding address is found by consuming the next | 
 
 
 
 
 | 959 | integer in the addr array.  The data array is left intact. As the second | 
 
 
 
 
 | 960 | instruction for code index 19 is a NOOP, this tuple is finished. | 
 
 
 
 
 | 961 |  | 
 
 
 
 
 | 962 |  | 
 
 
 
 
 | 963 | 7. APPLICATION-DEFINED CODE TABLES | 
 
 
 
 
 | 964 |  | 
 
 
 
 
 | 965 | Although the default code table used in Vcdiff is good for general | 
 
 
 
 
 | 966 | purpose encoders, there are times when other code tables may perform | 
 
 
 
 
 | 967 | better. For example, to code a file with many identical segments of data, | 
 
 
 
 
 | 968 | it may be advantageous to have a COPY instruction with the specific size | 
 
 
 
 
 | 969 | of these data segments so that the instruction can be encoded in a single | 
 
 
 
 
 | 970 | byte. Such a special code table MUST then be encoded in the delta file | 
 
 
 
 
 | 971 | so that the decoder can reconstruct it before decoding the data. | 
 
 
 
 
 | 972 |  | 
 
 
 
 
 | 973 | Vcdiff allows an application-defined code table to be specified | 
 
 
 
 
 | 974 | in a delta file with the following data: | 
 
 
 
 
 | 975 |  | 
 
 
 
 
 | 976 | Size of near cache            - byte | 
 
 
 
 
 | 977 | Size of same cache            - byte | 
 
 
 
 
 | 978 | Compressed code table data | 
 
 
 
 
 | 979 |  | 
 
 
 
 
 | 980 | The "compressed code table data" encodes the delta between the default | 
 
 
 
 
 | 981 | code table (source) and the new code table (target) in the same manner as | 
 
 
 
 
 | 982 | described in Section 4.3 for encoding a target window in terms of a | 
 
 
 
 
 | 983 | source window. This delta is computed using the following steps: | 
 
 
 
 
 | 984 |  | 
 
 
 
 
 | 985 | a.  Convert the new instruction code table into a string, "code", of | 
 
 
 
 
 | 986 | 1536 bytes using the below steps in order: | 
 
 
 
 
 | 987 |  | 
 
 
 
 
 | 988 | i. Add in order the 256 bytes representing the types of the first | 
 
 
 
 
 | 989 | instructions in the instruction pairs. | 
 
 
 
 
 | 990 | ii. Add in order the 256 bytes representing the types of the second | 
 
 
 
 
 | 991 | instructions in the instruction pairs. | 
 
 
 
 
 | 992 | iii. Add in order the 256 bytes representing the sizes of the first | 
 
 
 
 
 | 993 | instructions in the instruction pairs. | 
 
 
 
 
 | 994 | iv. Add in order the 256 bytes representing the sizes of the second | 
 
 
 
 
 | 995 | instructions in the instruction pairs. | 
 
 
 
 
 | 996 | v. Add in order the 256 bytes representing the modes of the first | 
 
 
 
 
 | 997 | instructions in the instruction pairs. | 
 
 
 
 
 | 998 | vi. Add in order the 256 bytes representing the modes of the second | 
 
 
 
 
 | 999 | instructions in the instruction pairs. | 
 
 
 
 
 | 1000 |  | 
 
 
 
 
 | 1001 | b.  Similarly, convert the default instruction code table into | 
 
 
 
 
 | 1002 | a string "dflt". | 
 
 
 
 
 | 1003 |  | 
 
 
 
 
 | 1004 | c.  Treat the string "code" as a target window and "dflt" as the | 
 
 
 
 
 | 1005 | corresponding source data and apply an encoding algorithm to | 
 
 
 
 
 | 1006 | compute the delta encoding of "code" in terms of "dflt". | 
 
 
 
 
 | 1007 | This computation MUST use the default code table for encoding | 
 
 
 
 
 | 1008 | the delta instructions. | 
 
 
 
 
 | 1009 |  | 
 
 
 
 
 | 1010 | The decoder can then reverse the above steps to decode the compressed | 
 
 
 
 
 | 1011 | table data using the method of Section 6, employing the default code | 
 
 
 
 
 | 1012 | table, to generate the new code table.  Note that the decoder does not | 
 
 
 
 
 | 1013 | need to know anything about the details of the encoding algorithm used | 
 
 
 
 
 | 1014 | in step (c). The decoder is still able to decode the new code table | 
 
 
 
 
 | 1015 | because the Vcdiff format is independent from the choice of encoding | 
 
 
 
 
 | 1016 | algorithm, and because the encoder in step (c) uses the known, default | 
 
 
 
 
 | 1017 | code table. | 
 
 
 
 
 | 1018 |  | 
 
 
 
 
 | 1019 |  | 
 
 
 
 
 | 1020 | 8. PERFORMANCE | 
 
 
 
 
 | 1021 |  | 
 
 
 
 
 | 1022 | The encoding format is compact. For compression only, using the LZ-77 | 
 
 
 
 
 | 1023 | string parsing strategy and without any secondary compressors, the typical | 
 
 
 
 
 | 1024 | compression rate is better than Unix compress and close to gzip.  For | 
 
 
 
 
 | 1025 | differencing, the data format is better than all known methods in | 
 
 
 
 
 | 1026 | terms of its stated goal, which is primarily decoding speed and | 
 
 
 
 
 | 1027 | encoding efficiency. | 
 
 
 
 
 | 1028 |  | 
 
 
 
 
 | 1029 | We compare the performance of compress, gzip and Vcdiff using the | 
 
 
 
 
 | 1030 | archives of three versions of the Gnu C compiler, gcc-2.95.1.tar, | 
 
 
 
 
 | 1031 | gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an | 
 
 
 
 
 | 1032 | SGI-MIPS3, 400MHZ. Gzip was used at its default compression level. | 
 
 
 
 
 | 1033 | Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13). | 
 
 
 
 
 | 1034 | As string and window matching typically dominates the computation during | 
 
 
 
 
 | 1035 | compression, the Vcdiff compression times were directly due to the | 
 
 
 
 
 | 1036 | algorithms used in the Vcodex/Vcdiff software. However, the decompression | 
 
 
 
 
 | 1037 | times should be generic and representative of any good implementation | 
 
 
 
 
 | 1038 | of the Vcdiff data format. Timing was done by running each program | 
 
 
 
 
 | 1039 | three times and taking the average of the total cpu+system times. | 
 
 
 
 
 | 1040 |  | 
 
 
 
 
 | 1041 | Below are the different Vcdiff runs: | 
 
 
 
 
 | 1042 |  | 
 
 
 
 
 | 1043 | Vcdiff: vcdiff is used as compressor only. | 
 
 
 
 
 | 1044 |  | 
 
 
 
 
 | 1045 | Vcdiff-d: vcdiff is used as a differencer only. That is, it only | 
 
 
 
 
 | 1046 | compares target data against source data.  Since the files | 
 
 
 
 
 | 1047 | involved are large, they are broken into windows. In this | 
 
 
 
 
 | 1048 | case, each target window starting at some file offset in | 
 
 
 
 
 | 1049 | the target file is compared against a source window with | 
 
 
 
 
 | 1050 | the same file offset (in the source file). The source | 
 
 
 
 
 | 1051 | window is also slightly larger than the target window | 
 
 
 
 
 | 1052 | to increase matching opportunities. The -d option also gives | 
 
 
 
 
 | 1053 | a hint to the string matching algorithm of Vcdiff that | 
 
 
 
 
 | 1054 | the two files are very similar with long stretches of matches. | 
 
 
 
 
 | 1055 | The algorithm takes advantage of this to minimize its | 
 
 
 
 
 | 1056 | processing of source data and save time. | 
 
 
 
 
 | 1057 |  | 
 
 
 
 
 | 1058 | Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare | 
 
 
 
 
 | 1059 | target data against target data as applicable. Thus, vcdiff | 
 
 
 
 
 | 1060 | both computes differences and compresses data. The windowing | 
 
 
 
 
 | 1061 | algorithm is the same as above. However, the above hint is | 
 
 
 
 
 | 1062 | recinded in this case. | 
 
 
 
 
 | 1063 |  | 
 
 
 
 
 | 1064 | Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm | 
 
 
 
 
 | 1065 | uses a content-based heuristic to select source data segments | 
 
 
 
 
 | 1066 | that are more likely to match with a given target window. | 
 
 
 
 
 | 1067 | Thus, the source data segment selected for a target window | 
 
 
 
 
 | 1068 | often will not be aligned with the file offsets of this | 
 
 
 
 
 | 1069 | target window. | 
 
 
 
 
 | 1070 |  | 
 
 
 
 
 | 1071 |  | 
 
 
 
 
 | 1072 | gcc-2.95.1    gcc-2.95.2    compression   decompression | 
 
 
 
 
 | 1073 | raw size      55746560      55797760 | 
 
 
 
 
 | 1074 | compress         -          19939390       13.85s         7.09s | 
 
 
 
 
 | 1075 | gzip             -          12973443       42.99s         5.35s | 
 
 
 
 
 | 1076 | Vcdiff           -          15358786       20.04s         4.65s | 
 
 
 
 
 | 1077 | Vcdiff-d         -            100971       10.93s         1.92s | 
 
 
 
 
 | 1078 | Vcdiff-dc        -             97246       20.03s         1.84s | 
 
 
 
 
 | 1079 | Vcdiff-dcs       -            256445       44.81s         1.84s | 
 
 
 
 
 | 1080 |  | 
 
 
 
 
 | 1081 | TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1 | 
 
 
 
 
 | 1082 |  | 
 
 
 
 
 | 1083 |  | 
 
 
 
 
 | 1084 | TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the | 
 
 
 
 
 | 1085 | sizes of the compressed results. As a pure compressor, the compression | 
 
 
 
 
 | 1086 | rate for Vcdiff is worse than gzip and better than compress. The last | 
 
 
 
 
 | 1087 | three rows shows that when two file versions are very similar, differencing | 
 
 
 
 
 | 1088 | can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use | 
 
 
 
 
 | 1089 | the same simple window selection method but Vcdiff-dc also does compression | 
 
 
 
 
 | 1090 | so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on | 
 
 
 
 
 | 1091 | data content to search for source data that likely will match a given target | 
 
 
 
 
 | 1092 | window. Although it does a good job, the heuristic did not always find the | 
 
 
 
 
 | 1093 | best matches which are given by the simple algorithm of Vcdiff-d.  As a | 
 
 
 
 
 | 1094 | result, the output size is slightly larger. Note also that there is a large | 
 
 
 
 
 | 1095 | cost in computing matching windows this way. Finally, the compression times | 
 
 
 
 
 | 1096 | of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude | 
 
 
 
 
 | 1097 | that the compression feature causes the additional time in Vcdiff-dc | 
 
 
 
 
 | 1098 | relative to Vcdiff-d.  However, this is not the case. The hint given to | 
 
 
 
 
 | 1099 | the Vcdiff string matching algorithm that the two files are likely to | 
 
 
 
 
 | 1100 | have very long stretches of matches helps the algorithm to minimize | 
 
 
 
 
 | 1101 | processing of the "source data", thus saving half the time. However, as we | 
 
 
 
 
 | 1102 | shall see below when this hint is wrong, the result is even longer time. | 
 
 
 
 
 | 1103 |  | 
 
 
 
 
 | 1104 |  | 
 
 
 
 
 | 1105 | gcc-2.95.2    gcc-2.95.3    compression   decompression | 
 
 
 
 
 | 1106 | raw size      55797760      55787520 | 
 
 
 
 
 | 1107 | compress         -          19939453       13.54s         7.00s | 
 
 
 
 
 | 1108 | gzip             -          12998097       42.63s         5.62s | 
 
 
 
 
 | 1109 | Vcdiff           -          15371737       20.09s         4.74s | 
 
 
 
 
 | 1110 | Vcdiff-d         -          26383849       71.41s         6.41s | 
 
 
 
 
 | 1111 | Vcdiff-dc        -          14461203       42.48s         4.82s | 
 
 
 
 
 | 1112 | Vcdiff-dcs       -           1248543       61.18s         1.99s | 
 
 
 
 
 | 1113 |  | 
 
 
 
 
 | 1114 | TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2 | 
 
 
 
 
 | 1115 |  | 
 
 
 
 
 | 1116 |  | 
 
 
 
 
 | 1117 | TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and | 
 
 
 
 
 | 1118 | the sizes of the compressed results. In this case, the tar file of | 
 
 
 
 
 | 1119 | gcc-2.95.3 is rearranged in a way that makes the straightforward method | 
 
 
 
 
 | 1120 | of matching file offsets for source and target windows fail. As a | 
 
 
 
 
 | 1121 | result, Vcdiff-d performs rather dismally both in time and output size. | 
 
 
 
 
 | 1122 | The large time for Vcdiff-d is directly due to fact that the string | 
 
 
 
 
 | 1123 | matching algorithm has to work much harder to find matches when the hint | 
 
 
 
 
 | 1124 | that two files have long matching stretches fails to hold. On the other | 
 
 
 
 
 | 1125 | hand, Vcdiff-dc does both differencing and compression resulting in good | 
 
 
 
 
 | 1126 | output size. Finally, the window searching heuristic used in Vcdiff-dcs is | 
 
 
 
 
 | 1127 | effective in finding the right matching source windows for target windows | 
 
 
 
 
 | 1128 | resulting a small output size. This shows why the data format needs to | 
 
 
 
 
 | 1129 | have a way to specify matching windows to gain performance. Finally, | 
 
 
 
 
 | 1130 | we note that the decoding times are always good regardless of how | 
 
 
 
 
 | 1131 | the string matching or window searching algorithms perform. | 
 
 
 
 
 | 1132 |  | 
 
 
 
 
 | 1133 |  | 
 
 
 
 
 | 1134 | 9. FURTHER ISSUES | 
 
 
 
 
 | 1135 |  | 
 
 
 
 
 | 1136 | This document does not address a few issues: | 
 
 
 
 
 | 1137 |  | 
 
 
 
 
 | 1138 | Secondary compressors: | 
 
 
 
 
 | 1139 | As discussed in Section 4.3, certain sections in the delta encoding | 
 
 
 
 
 | 1140 | of a window may be further compressed by a secondary compressor. | 
 
 
 
 
 | 1141 | In our experience, the basic Vcdiff format is adequate for most | 
 
 
 
 
 | 1142 | purposes so that secondary compressors are seldom needed. In | 
 
 
 
 
 | 1143 | particular, for normal use of data differencing where the files to | 
 
 
 
 
 | 1144 | be compared have long stretches of matches, much of the gain in | 
 
 
 
 
 | 1145 | compression rate is already achieved by normal string matching. | 
 
 
 
 
 | 1146 | Thus, the use of secondary compressors is seldom needed in this case. | 
 
 
 
 
 | 1147 | However, for applications beyond differencing of such nearly identical | 
 
 
 
 
 | 1148 | files, secondary compressors may be needed to achieve maximal | 
 
 
 
 
 | 1149 | compressed results. | 
 
 
 
 
 | 1150 |  | 
 
 
 
 
 | 1151 | Therefore, we recommend to leave the Vcdiff data format defined | 
 
 
 
 
 | 1152 | as in this document so that the use of secondary compressors | 
 
 
 
 
 | 1153 | can be implemented when they become needed in the future. | 
 
 
 
 
 | 1154 | The formats of the compressed data via such compressors or any | 
 
 
 
 
 | 1155 | compressors that may be defined in the future are left open to | 
 
 
 
 
 | 1156 | their implementations.  These could include Huffman encoding, | 
 
 
 
 
 | 1157 | arithmetic encoding, and splay tree encoding [8,9]. | 
 
 
 
 
 | 1158 |  | 
 
 
 
 
 | 1159 | Large file system vs. small file system: | 
 
 
 
 
 | 1160 | As discussed in Section 4, a target window in a large file may be | 
 
 
 
 
 | 1161 | compared against some source window in another file or in the same | 
 
 
 
 
 | 1162 | file (from some earlier part). In that case, the file offset of the | 
 
 
 
 
 | 1163 | source window is specified as a variable-sized integer in the delta | 
 
 
 
 
 | 1164 | encoding. There is a possibility that the encoding was computed on | 
 
 
 
 
 | 1165 | a system supporting much larger files than in a system where | 
 
 
 
 
 | 1166 | the data may be decoded (e.g., 64-bit file systems vs. 32-bit file | 
 
 
 
 
 | 1167 | systems). In that case, some target data may not be recoverable. | 
 
 
 
 
 | 1168 | This problem could afflict any compression format, and ought | 
 
 
 
 
 | 1169 | to be resolved with a generic negotiation mechanism in the | 
 
 
 
 
 | 1170 | appropriate protocol(s). | 
 
 
 
 
 | 1171 |  | 
 
 
 
 
 | 1172 |  | 
 
 
 
 
 | 1173 | 10.  SUMMARY | 
 
 
 
 
 | 1174 |  | 
 
 
 
 
 | 1175 | We have described Vcdiff, a general and portable encoding format for | 
 
 
 
 
 | 1176 | compression and differencing. The format is good in that it allows | 
 
 
 
 
 | 1177 | implementing a decoder without knowledge of the encoders. Further, | 
 
 
 
 
 | 1178 | ignoring the use of secondary compressors not defined within the format, | 
 
 
 
 
 | 1179 | the decoding algorithms runs in linear time and requires working space | 
 
 
 
 
 | 1180 | proportional to window sizes. | 
 
 
 
 
 | 1181 |  | 
 
 
 
 
 | 1182 |  | 
 
 
 
 
 | 1183 |  | 
 
 
 
 
 | 1184 | 11. ACKNOWLEDGEMENTS | 
 
 
 
 
 | 1185 |  | 
 
 
 
 
 | 1186 | Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff | 
 
 
 
 
 | 1187 | who provided much encouragement to publicize Vcdiff. In particular, Jeff | 
 
 
 
 
 | 1188 | helped clarifying the description of the data format presented here. | 
 
 
 
 
 | 1189 |  | 
 
 
 
 
 | 1190 |  | 
 
 
 
 
 | 1191 |  | 
 
 
 
 
 | 1192 | 12. SECURITY CONSIDERATIONS | 
 
 
 
 
 | 1193 |  | 
 
 
 
 
 | 1194 | Vcdiff only provides a format to encode compressed and differenced data. | 
 
 
 
 
 | 1195 | It does not address any issues concerning how such data are, in fact, | 
 
 
 
 
 | 1196 | stored in a given file system or the run-time memory of a computer system. | 
 
 
 
 
 | 1197 | Therefore, we do not anticipate any security issues with respect to Vcdiff. | 
 
 
 
 
 | 1198 |  | 
 
 
 
 
 | 1199 |  | 
 
 
 
 
 | 1200 |  | 
 
 
 
 
 | 1201 | 13. SOURCE CODE AVAILABILITY | 
 
 
 
 
 | 1202 |  | 
 
 
 
 
 | 1203 | Vcdiff is implemented as a data transforming method in Phong Vo's | 
 
 
 
 
 | 1204 | Vcodex library. AT&T Corp. has made the source code for Vcodex available | 
 
 
 
 
 | 1205 | for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11]. | 
 
 
 
 
 | 1206 | The source code and according license is accessible at the below URL: | 
 
 
 
 
 | 1207 |  | 
 
 
 
 
 | 1208 | http://www.research.att.com/sw/tools | 
 
 
 
 
 | 1209 |  | 
 
 
 
 
 | 1210 |  | 
 
 
 
 
 | 1211 | 14. INTELLECTUAL PROPERTY RIGHTS | 
 
 
 
 
 | 1212 |  | 
 
 
 
 
 | 1213 | The IETF has been notified of intellectual property rights claimed in | 
 
 
 
 
 | 1214 | regard to some or all of the specification contained in this | 
 
 
 
 
 | 1215 | document.  For more information consult the online list of claimed | 
 
 
 
 
 | 1216 | rights, at <http://www.ietf.org/ipr.html>. | 
 
 
 
 
 | 1217 |  | 
 
 
 
 
 | 1218 | The IETF takes no position regarding the validity or scope of any | 
 
 
 
 
 | 1219 | intellectual property or other rights that might be claimed to | 
 
 
 
 
 | 1220 | pertain to the implementation or use of the technology described in | 
 
 
 
 
 | 1221 | this document or the extent to which any license under such rights | 
 
 
 
 
 | 1222 | might or might not be available; neither does it represent that it | 
 
 
 
 
 | 1223 | has made any effort to identify any such rights.  Information on the | 
 
 
 
 
 | 1224 | IETF's procedures with respect to rights in standards-track and | 
 
 
 
 
 | 1225 | standards-related documentation can be found in BCP-11.  Copies of | 
 
 
 
 
 | 1226 | claims of rights made available for publication and any assurances of | 
 
 
 
 
 | 1227 | licenses to be made available, or the result of an attempt made to | 
 
 
 
 
 | 1228 | obtain a general license or permission for the use of such | 
 
 
 
 
 | 1229 | proprietary rights by implementors or users of this specification can | 
 
 
 
 
 | 1230 | be obtained from the IETF Secretariat. | 
 
 
 
 
 | 1231 |  | 
 
 
 
 
 | 1232 |  | 
 
 
 
 
 | 1233 |  | 
 
 
 
 
 | 1234 | 15. IANA CONSIDERATIONS | 
 
 
 
 
 | 1235 |  | 
 
 
 
 
 | 1236 | The Internet Assigned Numbers Authority (IANA) administers the number | 
 
 
 
 
 | 1237 | space for Secondary Compressor ID values.  Values and their meaning | 
 
 
 
 
 | 1238 | must be documented in an RFC or other peer-reviewed, permanent, and | 
 
 
 
 
 | 1239 | readily available reference, in sufficient detail so that | 
 
 
 
 
 | 1240 | interoperability between independent implementations is possible. | 
 
 
 
 
 | 1241 | Subject to these constraints, name assignments are First Come, First | 
 
 
 
 
 | 1242 | Served - see RFC2434 [13].  Legal ID values are in the range 1..255. | 
 
 
 
 
 | 1243 |  | 
 
 
 
 
 | 1244 | This document does not define any values in this number space. | 
 
 
 
 
 | 1245 |  | 
 
 
 
 
 | 1246 |  | 
 
 
 
 
 | 1247 | 16. REFERENCES | 
 
 
 
 
 | 1248 |  | 
 
 
 
 
 | 1249 | [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, | 
 
 
 
 
 | 1250 | Practical Reusable Unix Software, Editor B. Krishnamurthy, | 
 
 
 
 
 | 1251 | John Wiley & Sons, Inc., 1995. | 
 
 
 
 
 | 1252 |  | 
 
 
 
 
 | 1253 | [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data | 
 
 
 
 
 | 1254 | Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977. | 
 
 
 
 
 | 1255 |  | 
 
 
 
 
 | 1256 | [3] W. Tichy, The String-to-String Correction Problem with Block Moves, | 
 
 
 
 
 | 1257 | ACM Transactions on Computer Systems, 2(4):309-321, November 1984. | 
 
 
 
 
 | 1258 |  | 
 
 
 
 
 | 1259 | [4] E.M. McCreight, A Space-Economical Suffix Tree Construction | 
 
 
 
 
 | 1260 | Algorithm, Journal of the ACM, 23:262-272, 1976. | 
 
 
 
 
 | 1261 |  | 
 
 
 
 
 | 1262 | [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, | 
 
 
 
 
 | 1263 | IEEE Software Configuration and Maintenance Workshop, 1996. | 
 
 
 
 
 | 1264 |  | 
 
 
 
 
 | 1265 | [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis, | 
 
 
 
 
 | 1266 | ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998. | 
 
 
 
 
 | 1267 |  | 
 
 
 
 
 | 1268 | [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, | 
 
 
 
 
 | 1269 | Proc. of the Summer '91 Usenix Conference, 1991. | 
 
 
 
 
 | 1270 |  | 
 
 
 
 
 | 1271 | [8] D. W. Jones, Application of Splay Trees to Data Compression, | 
 
 
 
 
 | 1272 | CACM, 31(8):996:1007. | 
 
 
 
 
 | 1273 |  | 
 
 
 
 
 | 1274 | [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1, | 
 
 
 
 
 | 1275 | M&T Books, New York, NY, 1995. | 
 
 
 
 
 | 1276 |  | 
 
 
 
 
 | 1277 | [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, | 
 
 
 
 
 | 1278 | Potential benefits of delta encoding and data compression for HTTP, | 
 
 
 
 
 | 1279 | SIGCOMM '97, Cannes, France, 1997. | 
 
 
 
 
 | 1280 |  | 
 
 
 
 
 | 1281 | [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, | 
 
 
 
 
 | 1282 | Y. Goland, and A. Van Hoff, Delta Encoding in HTTP, | 
 
 
 
 
 | 1283 | IETF, draft-mogul-http-delta-10, 2001. | 
 
 
 
 
 | 1284 |  | 
 
 
 
 
 | 1285 | [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, | 
 
 
 
 
 | 1286 | RFC 2119, March 1997. | 
 
 
 
 
 | 1287 |  | 
 
 
 
 
 | 1288 | [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA | 
 
 
 
 
 | 1289 | Considerations Section in RFCs, RFC2434, October 1998. | 
 
 
 
 
 | 1290 |  | 
 
 
 
 
 | 1291 |  | 
 
 
 
 
 | 1292 |  | 
 
 
 
 
 | 1293 | 17. AUTHOR'S ADDRESS | 
 
 
 
 
 | 1294 |  | 
 
 
 
 
 | 1295 | Kiem-Phong Vo (main contact) | 
 
 
 
 
 | 1296 | AT&T Labs, Room D223 | 
 
 
 
 
 | 1297 | 180 Park Avenue | 
 
 
 
 
 | 1298 | Florham Park, NJ 07932 | 
 
 
 
 
 | 1299 | Email: kpv@research.att.com | 
 
 
 
 
 | 1300 | Phone: 1 973 360 8630 | 
 
 
 
 
 | 1301 |  | 
 
 
 
 
 | 1302 | David G. Korn | 
 
 
 
 
 | 1303 | AT&T Labs, Room D237 | 
 
 
 
 
 | 1304 | 180 Park Avenue | 
 
 
 
 
 | 1305 | Florham Park, NJ 07932 | 
 
 
 
 
 | 1306 | Email: dgk@research.att.com | 
 
 
 
 
 | 1307 | Phone: 1 973 360 8602 | 
 
 
 
 
 | 1308 |  | 
 
 
 
 
 | 1309 | Jeffrey C. Mogul | 
 
 
 
 
 | 1310 | Western Research Laboratory | 
 
 
 
 
 | 1311 | Compaq Computer Corporation | 
 
 
 
 
 | 1312 | 250 University Avenue | 
 
 
 
 
 | 1313 | Palo Alto, California, 94305, U.S.A. | 
 
 
 
 
 | 1314 | Email: JeffMogul@acm.org | 
 
 
 
 
 | 1315 | Phone: 1 650 617 3304 (email preferred) | 
 
 
 
 
 | 1316 |  | 
 
 
 
 
 | 1317 | Joshua P. MacDonald | 
 
 
 
 
 | 1318 | Computer Science Division | 
 
 
 
 
 | 1319 | University of California, Berkeley | 
 
 
 
 
 | 1320 | 345 Soda Hall | 
 
 
 
 
 | 1321 | Berkeley, CA 94720 | 
 
 
 
 
 | 1322 | Email: jmacd@cs.berkeley.edu |