| 1 | 
                                                      David G. Korn, AT&T Labs | 
 
 
 
 
 
 | 2 | 
                                              Joshua P. MacDonald, UC Berkeley | 
 
 
 
 
 
 | 3 | 
                                                  Jeffrey C. Mogul, Compaq WRL | 
 
 
 
 
 
 | 4 | 
 Internet-Draft                                       Kiem-Phong Vo, AT&T Labs | 
 
 
 
 
 
 | 5 | 
 Expires: 09 November 2002                                    09 November 2001 | 
 
 
 
 
 
 | 6 | 
  | 
 
 
 
 
 
 | 7 | 
  | 
 
 
 
 
 
 | 8 | 
         The VCDIFF Generic Differencing and Compression Data Format | 
 
 
 
 
 
 | 9 | 
  | 
 
 
 
 
 
 | 10 | 
                          draft-korn-vcdiff-06.txt | 
 
 
 
 
 
 | 11 | 
  | 
 
 
 
 
 
 | 12 | 
  | 
 
 
 
 
 
 | 13 | 
  | 
 
 
 
 
 
 | 14 | 
 Status of this Memo | 
 
 
 
 
 
 | 15 | 
  | 
 
 
 
 
 
 | 16 | 
     This document is an Internet-Draft and is in full conformance | 
 
 
 
 
 
 | 17 | 
     with all provisions of Section 10 of RFC2026. | 
 
 
 
 
 
 | 18 | 
  | 
 
 
 
 
 
 | 19 | 
     Internet-Drafts are working documents of the Internet Engineering | 
 
 
 
 
 
 | 20 | 
     Task Force (IETF), its areas, and its working groups.  Note that | 
 
 
 
 
 
 | 21 | 
     other groups may also distribute working documents as | 
 
 
 
 
 
 | 22 | 
     Internet-Drafts. | 
 
 
 
 
 
 | 23 | 
  | 
 
 
 
 
 
 | 24 | 
     Internet-Drafts are draft documents valid for a maximum of six | 
 
 
 
 
 
 | 25 | 
     months and may be updated, replaced, or obsoleted by other | 
 
 
 
 
 
 | 26 | 
     documents at any time.  It is inappropriate to use Internet- | 
 
 
 
 
 
 | 27 | 
     Drafts as reference material or to cite them other than as | 
 
 
 
 
 
 | 28 | 
     "work in progress." | 
 
 
 
 
 
 | 29 | 
  | 
 
 
 
 
 
 | 30 | 
     The list of current Internet-Drafts can be accessed at | 
 
 
 
 
 
 | 31 | 
     http://www.ietf.org/ietf/1id-abstracts.txt | 
 
 
 
 
 
 | 32 | 
  | 
 
 
 
 
 
 | 33 | 
     The list of Internet-Draft Shadow Directories can be accessed at | 
 
 
 
 
 
 | 34 | 
     http://www.ietf.org/shadow.html. | 
 
 
 
 
 
 | 35 | 
  | 
 
 
 
 
 
 | 36 | 
  | 
 
 
 
 
 
 | 37 | 
 Abstract | 
 
 
 
 
 
 | 38 | 
  | 
 
 
 
 
 
 | 39 | 
     This memo describes a general, efficient and portable data format | 
 
 
 
 
 
 | 40 | 
     suitable for encoding compressed and/or differencing data so that | 
 
 
 
 
 
 | 41 | 
     they can be easily transported among computers. | 
 
 
 
 
 
 | 42 | 
  | 
 
 
 
 
 
 | 43 | 
  | 
 
 
 
 
 
 | 44 | 
 Table of Contents: | 
 
 
 
 
 
 | 45 | 
  | 
 
 
 
 
 
 | 46 | 
     1.  EXECUTIVE SUMMARY ............................................  2 | 
 
 
 
 
 
 | 47 | 
     2.  CONVENTIONS ..................................................  3 | 
 
 
 
 
 
 | 48 | 
     3.  DELTA INSTRUCTIONS ...........................................  4 | 
 
 
 
 
 
 | 49 | 
     4.  DELTA FILE ORGANIZATION ......................................  5 | 
 
 
 
 
 
 | 50 | 
     5.  DELTA INSTRUCTION ENCODING ...................................  9 | 
 
 
 
 
 
 | 51 | 
     6.  DECODING A TARGET WINDOW ..................................... 14 | 
 
 
 
 
 
 | 52 | 
     7.  APPLICATION-DEFINED CODE TABLES .............................. 16 | 
 
 
 
 
 
 | 53 | 
     8.  PERFORMANCE .................................................. 16 | 
 
 
 
 
 
 | 54 | 
     9.  FURTHER ISSUES ............................................... 17 | 
 
 
 
 
 
 | 55 | 
    10.  SUMMARY ...................................................... 18 | 
 
 
 
 
 
 | 56 | 
    11.  ACKNOWLEDGEMENTS ............................................. 18 | 
 
 
 
 
 
 | 57 | 
    12.  SECURITY CONSIDERATIONS ...................................... 18 | 
 
 
 
 
 
 | 58 | 
    13.  SOURCE CODE AVAILABILITY ..................................... 18 | 
 
 
 
 
 
 | 59 | 
    14.  INTELLECTUAL PROPERTY RIGHTS ................................. 18 | 
 
 
 
 
 
 | 60 | 
    15.  IANA CONSIDERATIONS .......................................... 19 | 
 
 
 
 
 
 | 61 | 
    16.  REFERENCES ................................................... 19 | 
 
 
 
 
 
 | 62 | 
    17.  AUTHOR'S ADDRESS ............................................. 20 | 
 
 
 
 
 
 | 63 | 
  | 
 
 
 
 
 
 | 64 | 
  | 
 
 
 
 
 
 | 65 | 
 1.  EXECUTIVE SUMMARY | 
 
 
 
 
 
 | 66 | 
  | 
 
 
 
 
 
 | 67 | 
     Compression and differencing techniques can greatly improve storage | 
 
 
 
 
 
 | 68 | 
     and transmission of files and file versions.  Since files are often | 
 
 
 
 
 
 | 69 | 
     transported across machines with distinct architectures and performance | 
 
 
 
 
 
 | 70 | 
     characteristics, such data should be encoded in a form that is portable | 
 
 
 
 
 
 | 71 | 
     and can be decoded with little or no knowledge of the encoders. | 
 
 
 
 
 
 | 72 | 
     This document describes Vcdiff, a compact portable encoding format | 
 
 
 
 
 
 | 73 | 
     designed for these purposes. | 
 
 
 
 
 
 | 74 | 
  | 
 
 
 
 
 
 | 75 | 
     Data differencing is the process of computing a compact and invertible | 
 
 
 
 
 
 | 76 | 
     encoding of a "target file" given a "source file".  Data compression | 
 
 
 
 
 
 | 77 | 
     is similar but without the use of source data.  The UNIX utilities diff, | 
 
 
 
 
 
 | 78 | 
     compress, and gzip are well-known examples of data differencing and | 
 
 
 
 
 
 | 79 | 
     compression tools.  For data differencing, the computed encoding is | 
 
 
 
 
 
 | 80 | 
     called a "delta file", and, for data compression, it is called | 
 
 
 
 
 
 | 81 | 
     a "compressed file".  Delta and compressed files are good for storage | 
 
 
 
 
 
 | 82 | 
     and transmission as they are often smaller than the originals. | 
 
 
 
 
 
 | 83 | 
  | 
 
 
 
 
 
 | 84 | 
     Data differencing and data compression are traditionally treated | 
 
 
 
 
 
 | 85 | 
     as distinct types of data processing.  However, as shown in the Vdelta | 
 
 
 
 
 
 | 86 | 
     technique by Korn and Vo [1], compression can be thought of as a special | 
 
 
 
 
 
 | 87 | 
     case of differencing in which the source data is empty. The basic idea | 
 
 
 
 
 
 | 88 | 
     is to unify the string parsing scheme used in the Lempel-Ziv'77 style | 
 
 
 
 
 
 | 89 | 
     compressors [2], and the block-move technique of Tichy [3].  Loosely | 
 
 
 
 
 
 | 90 | 
     speaking, this works as follows: | 
 
 
 
 
 
 | 91 | 
  | 
 
 
 
 
 
 | 92 | 
         a. Concatenate source and target data. | 
 
 
 
 
 
 | 93 | 
         b. Parse the data from left to right as in LZ'77 but | 
 
 
 
 
 
 | 94 | 
            make sure that a parsed segment starts the target data. | 
 
 
 
 
 
 | 95 | 
         c. Start to output when reaching target data. | 
 
 
 
 
 
 | 96 | 
  | 
 
 
 
 
 
 | 97 | 
     Parsing is based on string matching algorithms such as suffix trees [4] | 
 
 
 
 
 
 | 98 | 
     or hashing with different time and space performance characteristics. | 
 
 
 
 
 
 | 99 | 
     Vdelta uses a fast string matching algorithm that requires less memory | 
 
 
 
 
 
 | 100 | 
     than other techniques [5,6].  However, even with this algorithm, the | 
 
 
 
 
 
 | 101 | 
     memory requirement can still be prohibitive for large files.  A common | 
 
 
 
 
 
 | 102 | 
     way to deal with memory limitation is to partition an input file into | 
 
 
 
 
 
 | 103 | 
     chunks called "windows" and process them separately. Here, except for | 
 
 
 
 
 
 | 104 | 
     unpublished work by Vo, little has been done on designing effective | 
 
 
 
 
 
 | 105 | 
     windowing schemes. Current techniques, including Vdelta, simply use | 
 
 
 
 
 
 | 106 | 
     source and target windows with corresponding addresses across source | 
 
 
 
 
 
 | 107 | 
     and target files. | 
 
 
 
 
 
 | 108 | 
  | 
 
 
 
 
 
 | 109 | 
     String matching and windowing algorithms have large influence on the | 
 
 
 
 
 
 | 110 | 
     compression rate of delta and compressed files. However, it is desirable | 
 
 
 
 
 
 | 111 | 
     to have a portable encoding format that is independent of such algorithms. | 
 
 
 
 
 
 | 112 | 
     This enables construction of client-server applications in which a server | 
 
 
 
 
 
 | 113 | 
     may serve clients with unknown computing characteristics.  Unfortunately, | 
 
 
 
 
 
 | 114 | 
     all current differencing and compressing tools, including Vdelta, fall | 
 
 
 
 
 
 | 115 | 
     short in this respect. Their storage formats are closely intertwined | 
 
 
 
 
 
 | 116 | 
     with the implemented string matching and/or windowing algorithms. | 
 
 
 
 
 
 | 117 | 
  | 
 
 
 
 
 
 | 118 | 
     The encoding format Vcdiff proposed here addresses the above issues. | 
 
 
 
 
 
 | 119 | 
     Vcdiff achieves the below characteristics: | 
 
 
 
 
 
 | 120 | 
  | 
 
 
 
 
 
 | 121 | 
         Output compactness: | 
 
 
 
 
 
 | 122 | 
             The basic encoding format compactly represents compressed or delta | 
 
 
 
 
 
 | 123 | 
             files. Applications can further extend the basic encoding format | 
 
 
 
 
 
 | 124 | 
             with "secondary encoders" to achieve more compression. | 
 
 
 
 
 
 | 125 | 
  | 
 
 
 
 
 
 | 126 | 
         Data portability: | 
 
 
 
 
 
 | 127 | 
             The basic encoding format is free from machine byte order and | 
 
 
 
 
 
 | 128 | 
             word size issues. This allows data to be encoded on one machine | 
 
 
 
 
 
 | 129 | 
             and decoded on a different machine with different architecture. | 
 
 
 
 
 
 | 130 | 
  | 
 
 
 
 
 
 | 131 | 
         Algorithm genericity: | 
 
 
 
 
 
 | 132 | 
             The decoding algorithm is independent from string matching and | 
 
 
 
 
 
 | 133 | 
             windowing algorithms. This allows competition among implementations | 
 
 
 
 
 
 | 134 | 
             of the encoder while keeping the same decoder. | 
 
 
 
 
 
 | 135 | 
  | 
 
 
 
 
 
 | 136 | 
         Decoding efficiency: | 
 
 
 
 
 
 | 137 | 
             Except for secondary encoder issues, the decoding algorithm runs | 
 
 
 
 
 
 | 138 | 
             in time proportional to the size of the target file and uses space | 
 
 
 
 
 
 | 139 | 
             proportional to the maximal window size.  Vcdiff differs from more | 
 
 
 
 
 
 | 140 | 
             conventional compressors in that it uses only byte-aligned | 
 
 
 
 
 
 | 141 | 
             data, thus avoiding bit-level operations, which improves | 
 
 
 
 
 
 | 142 | 
             decoding speed at the slight cost of compression efficiency. | 
 
 
 
 
 
 | 143 | 
  | 
 
 
 
 
 
 | 144 | 
     The Vcdiff data format and the algorithms for decoding data shall be | 
 
 
 
 
 
 | 145 | 
     described next.  Since Vcdiff treats compression as a special case of | 
 
 
 
 
 
 | 146 | 
     differencing, we shall use the term "delta file" to indicate the | 
 
 
 
 
 
 | 147 | 
     compressed output for both cases. | 
 
 
 
 
 
 | 148 | 
  | 
 
 
 
 
 
 | 149 | 
  | 
 
 
 
 
 
 | 150 | 
 2. CONVENTIONS | 
 
 
 
 
 
 | 151 | 
  | 
 
 
 
 
 
 | 152 | 
     The basic data unit is a byte.  For portability, Vcdiff shall limit | 
 
 
 
 
 
 | 153 | 
     a byte to its lower eight bits even on machines with larger bytes. | 
 
 
 
 
 
 | 154 | 
     The bits in a byte are ordered from right to left so that the least | 
 
 
 
 
 
 | 155 | 
     significant bit (LSB) has value 1, and the most significant bit (MSB), | 
 
 
 
 
 
 | 156 | 
     has value 128. | 
 
 
 
 
 
 | 157 | 
  | 
 
 
 
 
 
 | 158 | 
     For purposes of exposition in this document, we adopt the convention | 
 
 
 
 
 
 | 159 | 
     that the LSB is numbered 0, and the MSB is numbered 7.  Bit numbers | 
 
 
 
 
 
 | 160 | 
     never appear in the encoded format itself. | 
 
 
 
 
 
 | 161 | 
  | 
 
 
 
 
 
 | 162 | 
     Vcdiff encodes unsigned integer values using a portable variable-sized | 
 
 
 
 
 
 | 163 | 
     format (originally introduced in the Sfio library [7]). This encoding | 
 
 
 
 
 
 | 164 | 
     treats an integer as a number in base 128. Then, each digit in this | 
 
 
 
 
 
 | 165 | 
     representation is encoded in the lower seven bits of a byte. Except for | 
 
 
 
 
 
 | 166 | 
     the least significant byte, other bytes have their most significant bit | 
 
 
 
 
 
 | 167 | 
     turned on to indicate that there are still more digits in the encoding. | 
 
 
 
 
 
 | 168 | 
     The two key properties of this integer encoding that are beneficial | 
 
 
 
 
 
 | 169 | 
     to a data compression format are: | 
 
 
 
 
 
 | 170 | 
  | 
 
 
 
 
 
 | 171 | 
         a. The encoding is portable among systems using 8-bit bytes, and | 
 
 
 
 
 
 | 172 | 
         b. Small values are encoded compactly. | 
 
 
 
 
 
 | 173 | 
  | 
 
 
 
 
 
 | 174 | 
     For example, consider the value 123456789 which can be represented with | 
 
 
 
 
 
 | 175 | 
     four 7-bit digits whose values are 58, 111, 26, 21 in order from most | 
 
 
 
 
 
 | 176 | 
     to least significant. Below is the 8-bit byte encoding of these digits. | 
 
 
 
 
 
 | 177 | 
     Note that the MSBs of 58, 111 and 26 are on. | 
 
 
 
 
 
 | 178 | 
  | 
 
 
 
 
 
 | 179 | 
                  +-------------------------------------------+ | 
 
 
 
 
 
 | 180 | 
                  | 10111010 | 11101111 | 10011010 | 00010101 | | 
 
 
 
 
 
 | 181 | 
                  +-------------------------------------------+ | 
 
 
 
 
 
 | 182 | 
                    MSB+58     MSB+111    MSB+26     0+21 | 
 
 
 
 
 
 | 183 | 
  | 
 
 
 
 
 
 | 184 | 
  | 
 
 
 
 
 
 | 185 | 
     Henceforth, the terms "byte" and "integer" will refer to a byte and an | 
 
 
 
 
 
 | 186 | 
     unsigned integer as described. | 
 
 
 
 
 
 | 187 | 
  | 
 
 
 
 
 
 | 188 | 
  | 
 
 
 
 
 
 | 189 | 
     From time to time, algorithms are exhibited to clarify the descriptions | 
 
 
 
 
 
 | 190 | 
     of parts of the Vcdiff format. On such occasions, the C language will be | 
 
 
 
 
 
 | 191 | 
     used to make precise the algorithms.  The C code shown in this | 
 
 
 
 
 
 | 192 | 
     document is meant for clarification only, and is not part of the | 
 
 
 
 
 
 | 193 | 
     actual specification of the Vcdiff format. | 
 
 
 
 
 
 | 194 | 
  | 
 
 
 
 
 
 | 195 | 
     In this specification, the key words "MUST", "MUST NOT", | 
 
 
 
 
 
 | 196 | 
     "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as | 
 
 
 
 
 
 | 197 | 
     described in RFC2119 [12]. | 
 
 
 
 
 
 | 198 | 
  | 
 
 
 
 
 
 | 199 | 
  | 
 
 
 
 
 
 | 200 | 
 3.  DELTA INSTRUCTIONS | 
 
 
 
 
 
 | 201 | 
  | 
 
 
 
 
 
 | 202 | 
     A large target file is partitioned into non-overlapping sections | 
 
 
 
 
 
 | 203 | 
     called "target windows". These target windows are processed separately | 
 
 
 
 
 
 | 204 | 
     and sequentially based on their order in the target file. | 
 
 
 
 
 
 | 205 | 
  | 
 
 
 
 
 
 | 206 | 
     A target window T of length t may be compared against some source data | 
 
 
 
 
 
 | 207 | 
     segment S of length s. By construction, this source data segment S | 
 
 
 
 
 
 | 208 | 
     comes either from the source file, if one is used, or from a part of | 
 
 
 
 
 
 | 209 | 
     the target file earlier than T.  In this way, during decoding, S is | 
 
 
 
 
 
 | 210 | 
     completely known when T is being decoded. | 
 
 
 
 
 
 | 211 | 
  | 
 
 
 
 
 
 | 212 | 
     The choices of T, t, S and s are made by some window selection algorithm | 
 
 
 
 
 
 | 213 | 
     which can greatly affect the size of the encoding. However, as seen later, | 
 
 
 
 
 
 | 214 | 
     these choices are encoded so that no knowledge of the window selection | 
 
 
 
 
 
 | 215 | 
     algorithm is needed during decoding. | 
 
 
 
 
 
 | 216 | 
  | 
 
 
 
 
 
 | 217 | 
     Assume that S[j] represents the jth byte in S, and T[k] represents | 
 
 
 
 
 
 | 218 | 
     the kth byte in T.  Then, for the delta instructions, we treat the data | 
 
 
 
 
 
 | 219 | 
     windows S and T as substrings of a superstring U formed by concatenating | 
 
 
 
 
 
 | 220 | 
     them like this: | 
 
 
 
 
 
 | 221 | 
  | 
 
 
 
 
 
 | 222 | 
         S[0]S[1]...S[s-1]T[0]T[1]...T[t-1] | 
 
 
 
 
 
 | 223 | 
  | 
 
 
 
 
 
 | 224 | 
     The "address" of a byte in S or T is referred to by its location in U. | 
 
 
 
 
 
 | 225 | 
     For example, the address of T[k] is s+k. | 
 
 
 
 
 
 | 226 | 
  | 
 
 
 
 
 
 | 227 | 
     The instructions to encode and direct the reconstruction of a target | 
 
 
 
 
 
 | 228 | 
     window are called delta instructions. There are three types: | 
 
 
 
 
 
 | 229 | 
  | 
 
 
 
 
 
 | 230 | 
         ADD: This instruction has two arguments, a size x and a sequence of | 
 
 
 
 
 
 | 231 | 
             x bytes to be copied. | 
 
 
 
 
 
 | 232 | 
         COPY: This instruction has two arguments, a size x and an address p | 
 
 
 
 
 
 | 233 | 
             in the string U. The arguments specify the substring of U that | 
 
 
 
 
 
 | 234 | 
             must be copied. We shall assert that such a substring must be | 
 
 
 
 
 
 | 235 | 
             entirely contained in either S or T. | 
 
 
 
 
 
 | 236 | 
         RUN: This instruction has two arguments, a size x and a byte b that | 
 
 
 
 
 
 | 237 | 
             will be repeated x times. | 
 
 
 
 
 
 | 238 | 
  | 
 
 
 
 
 
 | 239 | 
     Below are example source and target windows and the delta instructions | 
 
 
 
 
 
 | 240 | 
     that encode the target window in terms of the source window. | 
 
 
 
 
 
 | 241 | 
  | 
 
 
 
 
 
 | 242 | 
         a b c d e f g h i j k l m n o p | 
 
 
 
 
 
 | 243 | 
         a b c d w x y z e f g h e f g h e f g h e f g h z z z z | 
 
 
 
 
 
 | 244 | 
  | 
 
 
 
 
 
 | 245 | 
         COPY  4, 0 | 
 
 
 
 
 
 | 246 | 
         ADD   4, w x y z | 
 
 
 
 
 
 | 247 | 
         COPY  4, 4 | 
 
 
 
 
 
 | 248 | 
         COPY 12, 24 | 
 
 
 
 
 
 | 249 | 
         RUN   4, z | 
 
 
 
 
 
 | 250 | 
  | 
 
 
 
 
 
 | 251 | 
  | 
 
 
 
 
 
 | 252 | 
     Thus, the first letter 'a' in the target window is at location 16 | 
 
 
 
 
 
 | 253 | 
     in the superstring. Note that the fourth instruction, "COPY 12, 24", | 
 
 
 
 
 
 | 254 | 
     copies data from T itself since address 24 is position 8 in T. | 
 
 
 
 
 
 | 255 | 
     This instruction also shows that it is fine to overlap the data to be | 
 
 
 
 
 
 | 256 | 
     copied with the data being copied from as long as the latter starts | 
 
 
 
 
 
 | 257 | 
     earlier. This enables efficient encoding of periodic sequences, | 
 
 
 
 
 
 | 258 | 
     i.e., sequences with regularly repeated subsequences. The RUN instruction | 
 
 
 
 
 
 | 259 | 
     is a compact way to encode a sequence repeating the same byte even though | 
 
 
 
 
 
 | 260 | 
     such a sequence can be thought of as a periodic sequence with period 1. | 
 
 
 
 
 
 | 261 | 
  | 
 
 
 
 
 
 | 262 | 
     To reconstruct the target window, one simply processes one delta | 
 
 
 
 
 
 | 263 | 
     instruction at a time and copy the data either from the source window | 
 
 
 
 
 
 | 264 | 
     or the being reconstructed target window based on the type of the | 
 
 
 
 
 
 | 265 | 
     instruction and the associated address, if any. | 
 
 
 
 
 
 | 266 | 
  | 
 
 
 
 
 
 | 267 | 
  | 
 
 
 
 
 
 | 268 | 
 4.  DELTA FILE ORGANIZATION | 
 
 
 
 
 
 | 269 | 
  | 
 
 
 
 
 
 | 270 | 
     A Vcdiff delta file starts with a Header section followed by a sequence | 
 
 
 
 
 
 | 271 | 
     of Window sections. The Header section includes magic bytes to identify | 
 
 
 
 
 
 | 272 | 
     the file type, and information concerning data processing beyond the | 
 
 
 
 
 
 | 273 | 
     basic encoding format. The Window sections encode the target windows. | 
 
 
 
 
 
 | 274 | 
  | 
 
 
 
 
 
 | 275 | 
     Below is the overall organization of a delta file. The indented items | 
 
 
 
 
 
 | 276 | 
     refine the ones immediately above them. An item in square brackets may | 
 
 
 
 
 
 | 277 | 
     or may not be present in the file depending on the information encoded | 
 
 
 
 
 
 | 278 | 
     in the Indicator byte above it. | 
 
 
 
 
 
 | 279 | 
  | 
 
 
 
 
 
 | 280 | 
         Header | 
 
 
 
 
 
 | 281 | 
             Header1                                  - byte | 
 
 
 
 
 
 | 282 | 
             Header2                                  - byte | 
 
 
 
 
 
 | 283 | 
             Header3                                  - byte | 
 
 
 
 
 
 | 284 | 
             Header4                                  - byte | 
 
 
 
 
 
 | 285 | 
             Hdr_Indicator                            - byte | 
 
 
 
 
 
 | 286 | 
             [Secondary compressor ID]                - byte | 
 
 
 
 
 
 | 287 | 
  | 
 
 
 
 
 
 | 288 | 
 [@@@ Why is compressor ID not an integer? ] | 
 
 
 
 
 
 | 289 | 
 [@@@ If we aren't defining any secondary compressors yet, then it seems | 
 
 
 
 
 
 | 290 | 
 that defining the [Secondary compressor ID] and the corresponding | 
 
 
 
 
 
 | 291 | 
 VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value.  An | 
 
 
 
 
 
 | 292 | 
 implementation of this specification won't be able to decode a VCDIFF | 
 
 
 
 
 
 | 293 | 
 encoded with this option if it doesn't know about any secondary | 
 
 
 
 
 
 | 294 | 
 compressors.  It seems that you should specify the bits related to | 
 
 
 
 
 
 | 295 | 
 secondary compressors once you have defined the first a secondary | 
 
 
 
 
 
 | 296 | 
 compressor.  I can imagine a secondary-compressor might want to supply | 
 
 
 
 
 
 | 297 | 
 extra information, such as a dictionary of some kind, in which case | 
 
 
 
 
 
 | 298 | 
 this speculative treatment wouldn't go far enough.] | 
 
 
 
 
 
 | 299 | 
  | 
 
 
 
 
 
 | 300 | 
             [Length of code table data]              - integer | 
 
 
 
 
 
 | 301 | 
             [Code table data] | 
 
 
 
 
 
 | 302 | 
                 Size of near cache                   - byte | 
 
 
 
 
 
 | 303 | 
                 Size of same cache                   - byte | 
 
 
 
 
 
 | 304 | 
                 Compressed code table data | 
 
 
 
 
 
 | 305 | 
         Window1 | 
 
 
 
 
 
 | 306 | 
             Win_Indicator                            - byte | 
 
 
 
 
 
 | 307 | 
             [Source segment size]                    - integer | 
 
 
 
 
 
 | 308 | 
             [Source segment position]                - integer | 
 
 
 
 
 
 | 309 | 
             The delta encoding of the target window | 
 
 
 
 
 
 | 310 | 
                 Length of the delta encoding         - integer | 
 
 
 
 
 
 | 311 | 
                 The delta encoding | 
 
 
 
 
 
 | 312 | 
                     Size of the target window        - integer | 
 
 
 
 
 
 | 313 | 
                     Delta_Indicator                  - byte | 
 
 
 
 
 
 | 314 | 
                     Length of data for ADDs and RUNs - integer | 
 
 
 
 
 
 | 315 | 
                     Length of instructions and sizes - integer | 
 
 
 
 
 
 | 316 | 
                     Length of addresses for COPYs    - integer | 
 
 
 
 
 
 | 317 | 
                     Data section for ADDs and RUNs   - array of bytes | 
 
 
 
 
 
 | 318 | 
                     Instructions and sizes section   - array of bytes | 
 
 
 
 
 
 | 319 | 
                     Addresses section for COPYs      - array of bytes | 
 
 
 
 
 
 | 320 | 
         Window2 | 
 
 
 
 
 
 | 321 | 
         ... | 
 
 
 
 
 
 | 322 | 
  | 
 
 
 
 
 
 | 323 | 
  | 
 
 
 
 
 
 | 324 | 
  | 
 
 
 
 
 
 | 325 | 
 4.1 The Header Section | 
 
 
 
 
 
 | 326 | 
  | 
 
 
 
 
 
 | 327 | 
     Each delta file starts with a header section organized as below. | 
 
 
 
 
 
 | 328 | 
     Note the convention that square-brackets enclose optional items. | 
 
 
 
 
 
 | 329 | 
  | 
 
 
 
 
 
 | 330 | 
             Header1                                  - byte = 0xE6 | 
 
 
 
 
 
 | 331 | 
             Header2                                  - byte = 0xD3 | 
 
 
 
 
 
 | 332 | 
             Header3                                  - byte = 0xD4 | 
 
 
 
 
 
 | 333 | 
  | 
 
 
 
 
 
 | 334 | 
 HMMM | 
 
 
 
 
 
 | 335 | 
  | 
 
 
 
 
 
 | 336 | 
 0xD6 | 
 
 
 
 
 
 | 337 | 
 0xC3 | 
 
 
 
 
 
 | 338 | 
 0xC4 | 
 
 
 
 
 
 | 339 | 
  | 
 
 
 
 
 
 | 340 | 
             Header4                                  - byte | 
 
 
 
 
 
 | 341 | 
             Hdr_Indicator                            - byte | 
 
 
 
 
 
 | 342 | 
             [Secondary compressor ID]                - byte | 
 
 
 
 
 
 | 343 | 
             [Length of code table data]              - integer | 
 
 
 
 
 
 | 344 | 
             [Code table data] | 
 
 
 
 
 
 | 345 | 
  | 
 
 
 
 
 
 | 346 | 
     The first three Header bytes are the ASCII characters 'V', 'C' and 'D' | 
 
 
 
 
 
 | 347 | 
     with their most significant bits turned on (in hexadecimal, the values | 
 
 
 
 
 
 | 348 | 
     are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to | 
 
 
 
 
 
 | 349 | 
     zero. In the future, it might be used to indicate the version of Vcdiff. | 
 
 
 
 
 
 | 350 | 
  | 
 
 
 
 
 
 | 351 | 
     The Hdr_Indicator byte shows if there are any initialization data | 
 
 
 
 
 
 | 352 | 
     required to aid in the reconstruction of data in the Window sections. | 
 
 
 
 
 
 | 353 | 
     This byte MAY have non-zero values for either, both, or neither of | 
 
 
 
 
 
 | 354 | 
     the two bits VCD_DECOMPRESS and VCD_CODETABLE below: | 
 
 
 
 
 
 | 355 | 
  | 
 
 
 
 
 
 | 356 | 
             7 6 5 4 3 2 1 0 | 
 
 
 
 
 
 | 357 | 
            +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 
 | 358 | 
            | | | | | | | | | | 
 
 
 
 
 
 | 359 | 
            +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 
 | 360 | 
                         ^ ^ | 
 
 
 
 
 
 | 361 | 
                         | | | 
 
 
 
 
 
 | 362 | 
                         | +-- VCD_DECOMPRESS | 
 
 
 
 
 
 | 363 | 
                         +---- VCD_CODETABLE | 
 
 
 
 
 
 | 364 | 
  | 
 
 
 
 
 
 | 365 | 
     If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary | 
 
 
 
 
 
 | 366 | 
     compressor may have been used to further compress certain parts of the | 
 
 
 
 
 
 | 367 | 
     delta encoding data as described in Sections 4.3 and 6. In that case, | 
 
 
 
 
 
 | 368 | 
     the ID of the secondary compressor is given next. If this bit is zero, | 
 
 
 
 
 
 | 369 | 
     the compressor ID byte is not included. | 
 
 
 
 
 
 | 370 | 
  | 
 
 
 
 
 
 | 371 | 
 [@@@ If we aren't defining any secondary compressors yet, then it seems | 
 
 
 
 
 
 | 372 | 
 this bit has no real value yet..] | 
 
 
 
 
 
 | 373 | 
  | 
 
 
 
 
 
 | 374 | 
     If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an | 
 
 
 
 
 
 | 375 | 
     application-defined code table is to be used for decoding the delta | 
 
 
 
 
 
 | 376 | 
     instructions. This table itself is compressed.  The length of the data | 
 
 
 
 
 
 | 377 | 
     comprising this compressed code table and the data follow next. Section 7 | 
 
 
 
 
 
 | 378 | 
     discusses application-defined code tables.  If this bit is zero, the code | 
 
 
 
 
 
 | 379 | 
     table data length and the code table data are not included. | 
 
 
 
 
 
 | 380 | 
  | 
 
 
 
 
 
 | 381 | 
     If both bits are set, then the compressor ID byte is included | 
 
 
 
 
 
 | 382 | 
     before the code table data length and the code table data. | 
 
 
 
 
 
 | 383 | 
  | 
 
 
 
 
 
 | 384 | 
  | 
 
 
 
 
 
 | 385 | 
 4.2 The Format of a Window Section | 
 
 
 
 
 
 | 386 | 
  | 
 
 
 
 
 
 | 387 | 
     Each Window section is organized as follows: | 
 
 
 
 
 
 | 388 | 
  | 
 
 
 
 
 
 | 389 | 
             Win_Indicator                            - byte | 
 
 
 
 
 
 | 390 | 
             [Source segment length]                  - integer | 
 
 
 
 
 
 | 391 | 
             [Source segment position]                - integer | 
 
 
 
 
 
 | 392 | 
             The delta encoding of the target window | 
 
 
 
 
 
 | 393 | 
  | 
 
 
 
 
 
 | 394 | 
  | 
 
 
 
 
 
 | 395 | 
     Below are the detail of the various items: | 
 
 
 
 
 
 | 396 | 
  | 
 
 
 
 
 
 | 397 | 
 [@@@ Here, I want to replace the Win_Indicator with a source-count, | 
 
 
 
 
 
 | 398 | 
 followed by source-count length/position pairs?] | 
 
 
 
 
 
 | 399 | 
  | 
 
 
 
 
 
 | 400 | 
         Win_Indicator: | 
 
 
 
 
 
 | 401 | 
             This byte is a set of bits, as shown: | 
 
 
 
 
 
 | 402 | 
  | 
 
 
 
 
 
 | 403 | 
             7 6 5 4 3 2 1 0 | 
 
 
 
 
 
 | 404 | 
            +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 
 | 405 | 
            | | | | | | | | | | 
 
 
 
 
 
 | 406 | 
            +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 
 | 407 | 
                         ^ ^ | 
 
 
 
 
 
 | 408 | 
                         | | | 
 
 
 
 
 
 | 409 | 
                         | +-- VCD_SOURCE | 
 
 
 
 
 
 | 410 | 
                         +---- VCD_TARGET | 
 
 
 
 
 
 | 411 | 
  | 
 
 
 
 
 
 | 412 | 
  | 
 
 
 
 
 
 | 413 | 
             If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment | 
 
 
 
 
 
 | 414 | 
             of data from the "source" file was used as the corresponding | 
 
 
 
 
 
 | 415 | 
             source window of data to encode the target window. The decoder | 
 
 
 
 
 
 | 416 | 
             will use this same source data segment to decode the target window. | 
 
 
 
 
 
 | 417 | 
  | 
 
 
 
 
 
 | 418 | 
             If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment | 
 
 
 
 
 
 | 419 | 
             of data from the "target" file was used as the corresponding | 
 
 
 
 
 
 | 420 | 
             source window of data to encode the target window. As above, this | 
 
 
 
 
 
 | 421 | 
             same source data segment is used to decode the target window. | 
 
 
 
 
 
 | 422 | 
  | 
 
 
 
 
 
 | 423 | 
             The Win_Indicator byte MUST NOT have more than one of the bits | 
 
 
 
 
 
 | 424 | 
             set (non-zero).  It MAY have none of these bits set. | 
 
 
 
 
 
 | 425 | 
  | 
 
 
 
 
 
 | 426 | 
             If one of these bits is set, the byte is followed by two | 
 
 
 
 
 
 | 427 | 
             integers to indicate respectively the length and position of | 
 
 
 
 
 
 | 428 | 
             the source data segment in the relevant file.  If the | 
 
 
 
 
 
 | 429 | 
             indicator byte is zero, the target window was compressed | 
 
 
 
 
 
 | 430 | 
             by itself without comparing against another data segment, | 
 
 
 
 
 
 | 431 | 
             and these two integers are not included. | 
 
 
 
 
 
 | 432 | 
  | 
 
 
 
 
 
 | 433 | 
         The delta encoding of the target window: | 
 
 
 
 
 
 | 434 | 
             This contains the delta encoding of the target window either | 
 
 
 
 
 
 | 435 | 
             in terms of the source data segment (i.e., VCD_SOURCE | 
 
 
 
 
 
 | 436 | 
             or VCD_TARGET was set) or by itself if no source window | 
 
 
 
 
 
 | 437 | 
             is specified. This data format is discussed next. | 
 
 
 
 
 
 | 438 | 
  | 
 
 
 
 
 
 | 439 | 
  | 
 
 
 
 
 
 | 440 | 
 4.3 The Delta Encoding of a Target Window | 
 
 
 
 
 
 | 441 | 
  | 
 
 
 
 
 
 | 442 | 
     The delta encoding of a target window is organized as follows: | 
 
 
 
 
 
 | 443 | 
  | 
 
 
 
 
 
 | 444 | 
         Length of the delta encoding            - integer | 
 
 
 
 
 
 | 445 | 
         The delta encoding | 
 
 
 
 
 
 | 446 | 
             Length of the target window         - integer | 
 
 
 
 
 
 | 447 | 
             Delta_Indicator                     - byte | 
 
 
 
 
 
 | 448 | 
             Length of data for ADDs and RUNs    - integer | 
 
 
 
 
 
 | 449 | 
             Length of instructions section      - integer | 
 
 
 
 
 
 | 450 | 
             Length of addresses for COPYs       - integer | 
 
 
 
 
 
 | 451 | 
             Data section for ADDs and RUNs      - array of bytes | 
 
 
 
 
 
 | 452 | 
             Instructions and sizes section      - array of bytes | 
 
 
 
 
 
 | 453 | 
             Addresses section for COPYs         - array of bytes | 
 
 
 
 
 
 | 454 | 
  | 
 
 
 
 
 
 | 455 | 
  | 
 
 
 
 
 
 | 456 | 
         Length of the delta encoding: | 
 
 
 
 
 
 | 457 | 
             This integer gives the total number of remaining bytes that | 
 
 
 
 
 
 | 458 | 
             comprise data of the delta encoding for this target window. | 
 
 
 
 
 
 | 459 | 
  | 
 
 
 
 
 
 | 460 | 
         The delta encoding: | 
 
 
 
 
 
 | 461 | 
             This contains the data representing the delta encoding which | 
 
 
 
 
 
 | 462 | 
             is described next. | 
 
 
 
 
 
 | 463 | 
  | 
 
 
 
 
 
 | 464 | 
         Length of the target window: | 
 
 
 
 
 
 | 465 | 
             This integer indicates the actual size of the target window | 
 
 
 
 
 
 | 466 | 
             after decompression. A decoder can use this value to allocate | 
 
 
 
 
 
 | 467 | 
             memory to store the uncompressed data. | 
 
 
 
 
 
 | 468 | 
  | 
 
 
 
 
 
 | 469 | 
         Delta_Indicator: | 
 
 
 
 
 
 | 470 | 
             This byte is a set of bits, as shown: | 
 
 
 
 
 
 | 471 | 
  | 
 
 
 
 
 
 | 472 | 
             7 6 5 4 3 2 1 0 | 
 
 
 
 
 
 | 473 | 
            +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 
 | 474 | 
            | | | | | | | | | | 
 
 
 
 
 
 | 475 | 
            +-+-+-+-+-+-+-+-+ | 
 
 
 
 
 
 | 476 | 
                       ^ ^ ^ | 
 
 
 
 
 
 | 477 | 
                       | | | | 
 
 
 
 
 
 | 478 | 
                       | | +-- VCD_DATACOMP | 
 
 
 
 
 
 | 479 | 
                       | +---- VCD_INSTCOMP | 
 
 
 
 
 
 | 480 | 
                       +------ VCD_ADDRCOMP | 
 
 
 
 
 
 | 481 | 
  | 
 
 
 
 
 
 | 482 | 
                 VCD_DATACOMP:   bit value 1. | 
 
 
 
 
 
 | 483 | 
                 VCD_INSTCOMP:   bit value 2. | 
 
 
 
 
 
 | 484 | 
                 VCD_ADDRCOMP:   bit value 4. | 
 
 
 
 
 
 | 485 | 
  | 
 
 
 
 
 
 | 486 | 
             As discussed, the delta encoding consists of COPY, ADD and RUN | 
 
 
 
 
 
 | 487 | 
             instructions. The ADD and RUN instructions have accompanying | 
 
 
 
 
 
 | 488 | 
             unmatched data (that is, data that does not specifically match | 
 
 
 
 
 
 | 489 | 
             any data in the source window or in some earlier part of the | 
 
 
 
 
 
 | 490 | 
             target window) and the COPY instructions have addresses of where | 
 
 
 
 
 
 | 491 | 
             the matches occur. OPTIONALLY, these types of data MAY be further | 
 
 
 
 
 
 | 492 | 
             compressed using a secondary compressor. Thus, Vcdiff separates | 
 
 
 
 
 
 | 493 | 
             the encoding of the delta instructions into three parts: | 
 
 
 
 
 
 | 494 | 
  | 
 
 
 
 
 
 | 495 | 
                 a. The unmatched data in the ADD and RUN instructions, | 
 
 
 
 
 
 | 496 | 
                 b. The delta instructions and accompanying sizes, and | 
 
 
 
 
 
 | 497 | 
                 c. The addresses of the COPY instructions. | 
 
 
 
 
 
 | 498 | 
  | 
 
 
 
 
 
 | 499 | 
             If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these | 
 
 
 
 
 
 | 500 | 
             sections may have been compressed using the specified secondary | 
 
 
 
 
 
 | 501 | 
             compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP), | 
 
 
 
 
 
 | 502 | 
             and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that | 
 
 
 
 
 
 | 503 | 
             the corresponding parts are compressed. Then, these parts MUST | 
 
 
 
 
 
 | 504 | 
             be decompressed before decoding the delta instructions. | 
 
 
 
 
 
 | 505 | 
  | 
 
 
 
 
 
 | 506 | 
         Length of data for ADDs and RUNs: | 
 
 
 
 
 
 | 507 | 
             This is the length (in bytes) of the section of data storing | 
 
 
 
 
 
 | 508 | 
             the unmatched data accompanying the ADD and RUN instructions. | 
 
 
 
 
 
 | 509 | 
  | 
 
 
 
 
 
 | 510 | 
         Length of instructions section: | 
 
 
 
 
 
 | 511 | 
             This is the length (in bytes) of the delta instructions and | 
 
 
 
 
 
 | 512 | 
             accompanying sizes. | 
 
 
 
 
 
 | 513 | 
  | 
 
 
 
 
 
 | 514 | 
         Length of addresses for COPYs: | 
 
 
 
 
 
 | 515 | 
             This is the length (in bytes) of the section storing | 
 
 
 
 
 
 | 516 | 
             the addresses of the COPY instructions. | 
 
 
 
 
 
 | 517 | 
  | 
 
 
 
 
 
 | 518 | 
         Data section for ADDs and RUNs: | 
 
 
 
 
 
 | 519 | 
             This sequence of bytes encodes the unmatched data for the ADD | 
 
 
 
 
 
 | 520 | 
             and RUN instructions. | 
 
 
 
 
 
 | 521 | 
  | 
 
 
 
 
 
 | 522 | 
         Instructions and sizes section: | 
 
 
 
 
 
 | 523 | 
             This sequence of bytes encodes the instructions and their sizes. | 
 
 
 
 
 
 | 524 | 
  | 
 
 
 
 
 
 | 525 | 
         Addresses section for COPYs: | 
 
 
 
 
 
 | 526 | 
             This sequence of bytes encodes the addresses of the COPY | 
 
 
 
 
 
 | 527 | 
             instructions. | 
 
 
 
 
 
 | 528 | 
  | 
 
 
 
 
 
 | 529 | 
  | 
 
 
 
 
 
 | 530 | 
 5. DELTA INSTRUCTION ENCODING | 
 
 
 
 
 
 | 531 | 
  | 
 
 
 
 
 
 | 532 | 
     The delta instructions described in Section 3 represent the results of | 
 
 
 
 
 
 | 533 | 
     string matching. For many data differencing applications in which the | 
 
 
 
 
 
 | 534 | 
     changes between source and target data are small, any straightforward | 
 
 
 
 
 
 | 535 | 
     representation of these instructions would be adequate.  However, for | 
 
 
 
 
 
 | 536 | 
     applications including data compression, it is important to encode | 
 
 
 
 
 
 | 537 | 
     these instructions well to achieve good compression rates.  From our | 
 
 
 
 
 
 | 538 | 
     experience, the following observations can be made: | 
 
 
 
 
 
 | 539 | 
  | 
 
 
 
 
 
 | 540 | 
     a. The addresses in COPY instructions are locations of matches and | 
 
 
 
 
 
 | 541 | 
        often occur close by or even exactly equal to one another. This is | 
 
 
 
 
 
 | 542 | 
        because data in local regions are often replicated with minor changes. | 
 
 
 
 
 
 | 543 | 
        In turn, this means that coding a newly matched address against some | 
 
 
 
 
 
 | 544 | 
        set of recently matched addresses can be beneficial. | 
 
 
 
 
 
 | 545 | 
  | 
 
 
 
 
 
 | 546 | 
     b. The matches are often short in length and separated by small amounts | 
 
 
 
 
 
 | 547 | 
        of unmatched data. That is, the lengths of COPY and ADD instructions | 
 
 
 
 
 
 | 548 | 
        are often small. This is particularly true of binary data such as | 
 
 
 
 
 
 | 549 | 
        executable files or structured data such as HTML or XML. In such cases, | 
 
 
 
 
 
 | 550 | 
        compression can be improved by combining the encoding of the sizes | 
 
 
 
 
 
 | 551 | 
        and the instruction types as well as combining the encoding of adjacent | 
 
 
 
 
 
 | 552 | 
        delta instructions with sufficiently small data sizes. | 
 
 
 
 
 
 | 553 | 
  | 
 
 
 
 
 
 | 554 | 
     The below subsections discuss how the Vcdiff data format provides | 
 
 
 
 
 
 | 555 | 
     mechanisms enabling encoders to use the above observations to improve | 
 
 
 
 
 
 | 556 | 
     compression rates. | 
 
 
 
 
 
 | 557 | 
  | 
 
 
 
 
 
 | 558 | 
  | 
 
 
 
 
 
 | 559 | 
 5.1 Address Encoding Modes of COPY Instructions | 
 
 
 
 
 
 | 560 | 
  | 
 
 
 
 
 
 | 561 | 
     As mentioned earlier, addresses of COPY instructions often occur close | 
 
 
 
 
 
 | 562 | 
     to one another or are exactly equal. To take advantage of this phenomenon | 
 
 
 
 
 
 | 563 | 
     and encode addresses of COPY instructions more efficiently, the Vcdiff | 
 
 
 
 
 
 | 564 | 
     data format supports the use of two different types of address caches. | 
 
 
 
 
 
 | 565 | 
     Both the encoder and decoder maintain these caches, so that decoder's | 
 
 
 
 
 
 | 566 | 
     caches remain synchronized with the encoder's caches. | 
 
 
 
 
 
 | 567 | 
  | 
 
 
 
 
 
 | 568 | 
     a. A "near" cache is an array with "s_near" slots, each containing an | 
 
 
 
 
 
 | 569 | 
        address used for encoding addresses nearby to previously encoded | 
 
 
 
 
 
 | 570 | 
        addresses (in the positive direction only).  The near cache also | 
 
 
 
 
 
 | 571 | 
        maintains a "next_slot" index to the near cache.  New entries to the | 
 
 
 
 
 
 | 572 | 
        near cache are always inserted in the next_slot index, which maintains | 
 
 
 
 
 
 | 573 | 
        a circular buffer of the s_near most recent addresses. | 
 
 
 
 
 
 | 574 | 
  | 
 
 
 
 
 
 | 575 | 
     b. A "same" cache is an array with "s_same" multiple of 256 slots, each | 
 
 
 
 
 
 | 576 | 
        containing an address.  The same cache maintains a hash table of recent | 
 
 
 
 
 
 | 577 | 
        addresses used for repeated encoding of the exact same address. | 
 
 
 
 
 
 | 578 | 
  | 
 
 
 
 
 
 | 579 | 
  | 
 
 
 
 
 
 | 580 | 
     By default, the parameters s_near and s_same are respectively set to 4 | 
 
 
 
 
 
 | 581 | 
     and 3. An encoder MAY modify these values, but then it MUST encode the | 
 
 
 
 
 
 | 582 | 
     new values in the encoding itself, as discussed in Section 7, so that | 
 
 
 
 
 
 | 583 | 
     the decoder can properly set up its own caches. | 
 
 
 
 
 
 | 584 | 
  | 
 
 
 
 
 
 | 585 | 
     At the start of processing a target window, an implementation | 
 
 
 
 
 
 | 586 | 
     (encoder or decoder) initializes all of the slots in both caches | 
 
 
 
 
 
 | 587 | 
     to zero.  The next_slot pointer of the near cache is set | 
 
 
 
 
 
 | 588 | 
     to point to slot zero. | 
 
 
 
 
 
 | 589 | 
  | 
 
 
 
 
 
 | 590 | 
     Each time a COPY instruction is processed by the encoder or | 
 
 
 
 
 
 | 591 | 
     decoder, the implementation's caches are updated as follows, where | 
 
 
 
 
 
 | 592 | 
     "addr" is the address in the COPY instruction. | 
 
 
 
 
 
 | 593 | 
  | 
 
 
 
 
 
 | 594 | 
     a. The slot in the near cache referenced by the next_slot | 
 
 
 
 
 
 | 595 | 
        index is set to addr.  The next_slot index is then incremented | 
 
 
 
 
 
 | 596 | 
        modulo s_near. | 
 
 
 
 
 
 | 597 | 
  | 
 
 
 
 
 
 | 598 | 
     b. The slot in the same cache whose index is addr%(s_same*256) | 
 
 
 
 
 
 | 599 | 
        is set to addr. [We use the C notations of % for modulo and | 
 
 
 
 
 
 | 600 | 
        * for multiplication.] | 
 
 
 
 
 
 | 601 | 
  | 
 
 
 
 
 
 | 602 | 
  | 
 
 
 
 
 
 | 603 | 
 5.2 Example code for maintaining caches | 
 
 
 
 
 
 | 604 | 
  | 
 
 
 
 
 
 | 605 | 
     To make clear the above description, below are example cache data | 
 
 
 
 
 
 | 606 | 
     structures and algorithms to initialize and update them: | 
 
 
 
 
 
 | 607 | 
  | 
 
 
 
 
 
 | 608 | 
         typedef struct _cache_s | 
 
 
 
 
 
 | 609 | 
         { | 
 
 
 
 
 
 | 610 | 
             int*  near;      /* array of size s_near        */ | 
 
 
 
 
 
 | 611 | 
             int   s_near; | 
 
 
 
 
 
 | 612 | 
             int   next_slot; /* the circular index for near */ | 
 
 
 
 
 
 | 613 | 
             int*  same;      /* array of size s_same*256    */ | 
 
 
 
 
 
 | 614 | 
             int   s_same; | 
 
 
 
 
 
 | 615 | 
         } Cache_t; | 
 
 
 
 
 
 | 616 | 
  | 
 
 
 
 
 
 | 617 | 
         cache_init(Cache_t* ka) | 
 
 
 
 
 
 | 618 | 
         { | 
 
 
 
 
 
 | 619 | 
             int   i; | 
 
 
 
 
 
 | 620 | 
  | 
 
 
 
 
 
 | 621 | 
             ka->next_slot = 0; | 
 
 
 
 
 
 | 622 | 
             for(i = 0; i < ka->s_near; ++i) | 
 
 
 
 
 
 | 623 | 
                  ka->near[i] = 0; | 
 
 
 
 
 
 | 624 | 
  | 
 
 
 
 
 
 | 625 | 
             for(i = 0; i < ka->s_same*256; ++i) | 
 
 
 
 
 
 | 626 | 
                  ka->same[i] = 0; | 
 
 
 
 
 
 | 627 | 
         } | 
 
 
 
 
 
 | 628 | 
  | 
 
 
 
 
 
 | 629 | 
         cache_update(Cache_t* ka, int addr) | 
 
 
 
 
 
 | 630 | 
         { | 
 
 
 
 
 
 | 631 | 
             if(ka->s_near > 0) | 
 
 
 
 
 
 | 632 | 
             {   ka->near[ka->next_slot] = addr; | 
 
 
 
 
 
 | 633 | 
                 ka->next_slot = (ka->next_slot + 1) % ka->s_near; | 
 
 
 
 
 
 | 634 | 
             } | 
 
 
 
 
 
 | 635 | 
  | 
 
 
 
 
 
 | 636 | 
             if(ka->s_same > 0) | 
 
 
 
 
 
 | 637 | 
                 ka->same[addr % (ka->s_same*256)] = addr; | 
 
 
 
 
 
 | 638 | 
         } | 
 
 
 
 
 
 | 639 | 
  | 
 
 
 
 
 
 | 640 | 
  | 
 
 
 
 
 
 | 641 | 
 5.3 Encoding of COPY instruction addresses | 
 
 
 
 
 
 | 642 | 
  | 
 
 
 
 
 
 | 643 | 
     The address of a COPY instruction is encoded using different modes | 
 
 
 
 
 
 | 644 | 
     depending on the type of cached address used, if any. | 
 
 
 
 
 
 | 645 | 
  | 
 
 
 
 
 
 | 646 | 
     Let "addr" be the address of a COPY instruction to be decoded and "here" | 
 
 
 
 
 
 | 647 | 
     be the current location in the target data (i.e., the start of the data | 
 
 
 
 
 
 | 648 | 
     about to be encoded or decoded).  Let near[j] be the jth element in | 
 
 
 
 
 
 | 649 | 
     the near cache, and same[k] be the kth element in the same cache. | 
 
 
 
 
 
 | 650 | 
     Below are the possible address modes: | 
 
 
 
 
 
 | 651 | 
  | 
 
 
 
 
 
 | 652 | 
         VCD_SELF: This mode has value 0. The address was encoded by itself | 
 
 
 
 
 
 | 653 | 
             as an integer. | 
 
 
 
 
 
 | 654 | 
  | 
 
 
 
 
 
 | 655 | 
         VCD_HERE: This mode has value 1. The address was encoded as | 
 
 
 
 
 
 | 656 | 
             the integer value "here - addr". | 
 
 
 
 
 
 | 657 | 
  | 
 
 
 
 
 
 | 658 | 
         Near modes: The "near modes" are in the range [2,s_near+1]. Let m | 
 
 
 
 
 
 | 659 | 
             be the mode of the address encoding. The address was encoded | 
 
 
 
 
 
 | 660 | 
             as the integer value "addr - near[m-2]". | 
 
 
 
 
 
 | 661 | 
  | 
 
 
 
 
 
 | 662 | 
         Same modes: The "same modes" are in the range | 
 
 
 
 
 
 | 663 | 
             [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding. | 
 
 
 
 
 
 | 664 | 
             The address was encoded as a single byte b such that | 
 
 
 
 
 
 | 665 | 
             "addr == same[(m - (s_near+2))*256 + b]". | 
 
 
 
 
 
 | 666 | 
  | 
 
 
 
 
 
 | 667 | 
  | 
 
 
 
 
 
 | 668 | 
 5.3 Example code for encoding and decoding of COPY instruction addresses | 
 
 
 
 
 
 | 669 | 
  | 
 
 
 
 
 
 | 670 | 
     We show example algorithms below to demonstrate use of address modes more | 
 
 
 
 
 
 | 671 | 
     clearly. The encoder has freedom to choose address modes, the sample | 
 
 
 
 
 
 | 672 | 
     addr_encode() algorithm merely shows one way of picking the address | 
 
 
 
 
 
 | 673 | 
     mode. The decoding algorithm addr_decode() will uniquely decode addresses | 
 
 
 
 
 
 | 674 | 
     regardless of the encoder's algorithm choice. | 
 
 
 
 
 
 | 675 | 
  | 
 
 
 
 
 
 | 676 | 
     Note that the address caches are updated immediately after an address is | 
 
 
 
 
 
 | 677 | 
     encoded or decoded. In this way, the decoder is always synchronized with | 
 
 
 
 
 
 | 678 | 
     the encoder. | 
 
 
 
 
 
 | 679 | 
  | 
 
 
 
 
 
 | 680 | 
         int addr_encode(Cache_t* ka, int addr, int here, int* mode) | 
 
 
 
 
 
 | 681 | 
         { | 
 
 
 
 
 
 | 682 | 
             int  i, d, bestd, bestm; | 
 
 
 
 
 
 | 683 | 
  | 
 
 
 
 
 
 | 684 | 
             /* Attempt to find the address mode that yields the | 
 
 
 
 
 
 | 685 | 
              * smallest integer value for "d", the encoded address | 
 
 
 
 
 
 | 686 | 
              * value, thereby minimizing the encoded size of the | 
 
 
 
 
 
 | 687 | 
              * address. */ | 
 
 
 
 
 
 | 688 | 
  | 
 
 
 
 
 
 | 689 | 
             bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */ | 
 
 
 
 
 
 | 690 | 
  | 
 
 
 
 
 
 | 691 | 
             if((d = here-addr) < bestd) | 
 
 
 
 
 
 | 692 | 
                 { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */ | 
 
 
 
 
 
 | 693 | 
  | 
 
 
 
 
 
 | 694 | 
             for(i = 0; i < ka->s_near; ++i) | 
 
 
 
 
 
 | 695 | 
                 if((d = addr - ka->near[i]) >= 0 && d < bestd) | 
 
 
 
 
 
 | 696 | 
                     { bestd = d; bestm = i+2; } | 
 
 
 
 
 
 | 697 | 
  | 
 
 
 
 
 
 | 698 | 
             if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr) | 
 
 
 
 
 
 | 699 | 
                 { bestd = d%256; bestm = ka->s_near + 2 + d/256; } | 
 
 
 
 
 
 | 700 | 
  | 
 
 
 
 
 
 | 701 | 
             cache_update(ka,addr); | 
 
 
 
 
 
 | 702 | 
  | 
 
 
 
 
 
 | 703 | 
             *mode = bestm; /* this returns the address encoding mode */ | 
 
 
 
 
 
 | 704 | 
             return  bestd; /* this returns the encoded address       */ | 
 
 
 
 
 
 | 705 | 
         } | 
 
 
 
 
 
 | 706 | 
  | 
 
 
 
 
 
 | 707 | 
     Note that the addr_encode() algorithm chooses the best address mode using a | 
 
 
 
 
 
 | 708 | 
     local optimization, but that may not lead to the best encoding efficiency | 
 
 
 
 
 
 | 709 | 
     because different modes lead to different instruction encodings, as    described below. | 
 
 
 
 
 
 | 710 | 
  | 
 
 
 
 
 
 | 711 | 
     The functions addrint() and addrbyte() used in addr_decode() obtain from | 
 
 
 
 
 
 | 712 | 
     the "Addresses section for COPYs" (Section 4.3) an integer or a byte, | 
 
 
 
 
 
 | 713 | 
     respectively. These utilities will not be described here.  We simply | 
 
 
 
 
 
 | 714 | 
     recall that an integer is represented as a compact variable-sized string | 
 
 
 
 
 
 | 715 | 
     of bytes as described in Section 2 (i.e., base 128). | 
 
 
 
 
 
 | 716 | 
  | 
 
 
 
 
 
 | 717 | 
         int addr_decode(Cache_t* ka, int here, int mode) | 
 
 
 
 
 
 | 718 | 
         {   int  addr, m; | 
 
 
 
 
 
 | 719 | 
  | 
 
 
 
 
 
 | 720 | 
             if(mode == VCD_SELF) | 
 
 
 
 
 
 | 721 | 
                  addr = addrint(); | 
 
 
 
 
 
 | 722 | 
             else if(mode == VCD_HERE) | 
 
 
 
 
 
 | 723 | 
                  addr = here - addrint(); | 
 
 
 
 
 
 | 724 | 
             else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */ | 
 
 
 
 
 
 | 725 | 
                  addr = ka->near[m] + addrint(); | 
 
 
 
 
 
 | 726 | 
             else /* same cache */ | 
 
 
 
 
 
 | 727 | 
             {    m = mode - (2 + ka->s_near); | 
 
 
 
 
 
 | 728 | 
                  addr = ka->same[m*256 + addrbyte()]; | 
 
 
 
 
 
 | 729 | 
             } | 
 
 
 
 
 
 | 730 | 
  | 
 
 
 
 
 
 | 731 | 
             cache_update(ka, addr); | 
 
 
 
 
 
 | 732 | 
  | 
 
 
 
 
 
 | 733 | 
             return addr; | 
 
 
 
 
 
 | 734 | 
         } | 
 
 
 
 
 
 | 735 | 
  | 
 
 
 
 
 
 | 736 | 
  | 
 
 
 
 
 
 | 737 | 
 5.4 Instruction Codes | 
 
 
 
 
 
 | 738 | 
  | 
 
 
 
 
 
 | 739 | 
     As noted, the data sizes associated with delta instructions are often | 
 
 
 
 
 
 | 740 | 
     small. Thus, compression efficiency can be improved by combining the sizes | 
 
 
 
 
 
 | 741 | 
     and instruction types in a single encoding, as well by combining certain | 
 
 
 
 
 
 | 742 | 
     pairs of adjacent delta instructions. Effective choices of when to perform | 
 
 
 
 
 
 | 743 | 
     such combinations depend on many factors including the data being processed | 
 
 
 
 
 
 | 744 | 
     and the string matching algorithm in use. For example, if many COPY | 
 
 
 
 
 
 | 745 | 
     instructions have the same data sizes, it may be worth to encode these | 
 
 
 
 
 
 | 746 | 
     instructions more compactly than others. | 
 
 
 
 
 
 | 747 | 
  | 
 
 
 
 
 
 | 748 | 
     The Vcdiff data format is designed so that a decoder does not need to be | 
 
 
 
 
 
 | 749 | 
     aware of the choices made in encoding algorithms. This is achieved with the | 
 
 
 
 
 
 | 750 | 
     notion of an "instruction code table" containing 256 entries. Each entry | 
 
 
 
 
 
 | 751 | 
     defines either a single delta instruction or a pair of instructions that | 
 
 
 
 
 
 | 752 | 
     have been combined.  Note that the code table itself only exists in main | 
 
 
 
 
 
 | 753 | 
     memory, not in the delta file (unless using an application-defined code | 
 
 
 
 
 
 | 754 | 
     table, described in Section 7). The encoded data simply includes the index | 
 
 
 
 
 
 | 755 | 
     of each instruction and, since there are only 256 indices, each index | 
 
 
 
 
 
 | 756 | 
     can be represented as a single byte. | 
 
 
 
 
 
 | 757 | 
  | 
 
 
 
 
 
 | 758 | 
     Each instruction code entry contains six fields, each of which | 
 
 
 
 
 
 | 759 | 
     is a single byte with unsigned value: | 
 
 
 
 
 
 | 760 | 
  | 
 
 
 
 
 
 | 761 | 
             +-----------------------------------------------+ | 
 
 
 
 
 
 | 762 | 
             | inst1 | size1 | mode1 | inst2 | size2 | mode2 | | 
 
 
 
 
 
 | 763 | 
             +-----------------------------------------------+ | 
 
 
 
 
 
 | 764 | 
  | 
 
 
 
 
 
 | 765 | 
 @@@ could be more compact | 
 
 
 
 
 
 | 766 | 
  | 
 
 
 
 
 
 | 767 | 
     Each triple (inst,size,mode) defines a delta instruction. The meanings | 
 
 
 
 
 
 | 768 | 
     of these fields are as follows: | 
 
 
 
 
 
 | 769 | 
  | 
 
 
 
 
 
 | 770 | 
     inst: An "inst" field can have one of the four values: NOOP (0), ADD (1), | 
 
 
 
 
 
 | 771 | 
         RUN (2) or COPY (3) to indicate the instruction types. NOOP means | 
 
 
 
 
 
 | 772 | 
         that no instruction is specified. In this case, both the corresponding | 
 
 
 
 
 
 | 773 | 
         size and mode fields will be zero. | 
 
 
 
 
 
 | 774 | 
  | 
 
 
 
 
 
 | 775 | 
     size: A "size" field is zero or positive. A value zero means that the | 
 
 
 
 
 
 | 776 | 
         size associated with the instruction is encoded separately as | 
 
 
 
 
 
 | 777 | 
         an integer in the "Instructions and sizes section" (Section 6). | 
 
 
 
 
 
 | 778 | 
         A positive value for "size" defines the actual data size. | 
 
 
 
 
 
 | 779 | 
         Note that since the size is restricted to a byte, the maximum | 
 
 
 
 
 
 | 780 | 
         value for any instruction with size implicitly defined in the code | 
 
 
 
 
 
 | 781 | 
         table is 255. | 
 
 
 
 
 
 | 782 | 
  | 
 
 
 
 
 
 | 783 | 
     mode: A "mode" field is significant only when the associated delta | 
 
 
 
 
 
 | 784 | 
         instruction is a COPY. It defines the mode used to encode the | 
 
 
 
 
 
 | 785 | 
         associated addresses. For other instructions, this is always zero. | 
 
 
 
 
 
 | 786 | 
  | 
 
 
 
 
 
 | 787 | 
  | 
 
 
 
 
 
 | 788 | 
 5.5 The Code Table | 
 
 
 
 
 
 | 789 | 
  | 
 
 
 
 
 
 | 790 | 
     Following the discussions on address modes and instruction code tables, | 
 
 
 
 
 
 | 791 | 
     we define a "Code Table" to have the data below: | 
 
 
 
 
 
 | 792 | 
  | 
 
 
 
 
 
 | 793 | 
         s_near: the size of the near cache, | 
 
 
 
 
 
 | 794 | 
         s_same: the size of the same cache, | 
 
 
 
 
 
 | 795 | 
         i_code: the 256-entry instruction code table. | 
 
 
 
 
 
 | 796 | 
  | 
 
 
 
 
 
 | 797 | 
     Vcdiff itself defines a "default code table" in which s_near is 4 | 
 
 
 
 
 
 | 798 | 
     and s_same is 3. Thus, there are 9 address modes for a COPY instruction. | 
 
 
 
 
 
 | 799 | 
     The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5 | 
 
 
 
 
 
 | 800 | 
     are for addresses coded against the near cache. And, modes 6, 7  and 8 | 
 
 
 
 
 
 | 801 | 
     are for addresses coded against the same cache. | 
 
 
 
 
 
 | 802 | 
  | 
 
 
 
 
 
 | 803 | 
     The default instruction code table is depicted below, in a compact | 
 
 
 
 
 
 | 804 | 
     representation that we use only for descriptive purposes.  See section 7 | 
 
 
 
 
 
 | 805 | 
     for the specification of how an instruction code table is represented | 
 
 
 
 
 
 | 806 | 
     in the Vcdiff encoding format.  In the depiction, a zero value for | 
 
 
 
 
 
 | 807 | 
     size indicates that the size is separately coded. The mode of non-COPY | 
 
 
 
 
 
 | 808 | 
     instructions is represented as 0 even though they are not used. | 
 
 
 
 
 
 | 809 | 
  | 
 
 
 
 
 
 | 810 | 
  | 
 
 
 
 
 
 | 811 | 
          TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX | 
 
 
 
 
 
 | 812 | 
         --------------------------------------------------------------- | 
 
 
 
 
 
 | 813 | 
      1.  RUN         0        0     NOOP       0        0        0 | 
 
 
 
 
 
 | 814 | 
      2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18] | 
 
 
 
 
 
 | 815 | 
      3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34] | 
 
 
 
 
 
 | 816 | 
      4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50] | 
 
 
 
 
 
 | 817 | 
      5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66] | 
 
 
 
 
 
 | 818 | 
      6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82] | 
 
 
 
 
 
 | 819 | 
      7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98] | 
 
 
 
 
 
 | 820 | 
      8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114] | 
 
 
 
 
 
 | 821 | 
      9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130] | 
 
 
 
 
 
 | 822 | 
     10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146] | 
 
 
 
 
 
 | 823 | 
     11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162] | 
 
 
 
 
 
 | 824 | 
     12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174] | 
 
 
 
 
 
 | 825 | 
     13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186] | 
 
 
 
 
 
 | 826 | 
     14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198] | 
 
 
 
 
 
 | 827 | 
     15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210] | 
 
 
 
 
 
 | 828 | 
     16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222] | 
 
 
 
 
 
 | 829 | 
     17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234] | 
 
 
 
 
 
 | 830 | 
     18.  ADD       [1,4]      0     COPY       4        6    [235,238] | 
 
 
 
 
 
 | 831 | 
     19.  ADD       [1,4]      0     COPY       4        7    [239,242] | 
 
 
 
 
 
 | 832 | 
     20.  ADD       [1,4]      0     COPY       4        8    [243,246] | 
 
 
 
 
 
 | 833 | 
     21.  COPY        4      [0,8]   ADD        1        0    [247,255] | 
 
 
 
 
 
 | 834 | 
         --------------------------------------------------------------- | 
 
 
 
 
 
 | 835 | 
  | 
 
 
 
 
 
 | 836 | 
     In the above depiction, each numbered line represents one or more | 
 
 
 
 
 
 | 837 | 
     entries in the actual instruction code table (recall that an entry in | 
 
 
 
 
 
 | 838 | 
     the instruction code table may represent up to two combined delta | 
 
 
 
 
 
 | 839 | 
     instructions.) The last column ("INDEX") shows which index value or | 
 
 
 
 
 
 | 840 | 
     range of index values of the entries covered by that line. The notation | 
 
 
 
 
 
 | 841 | 
     [i,j] means values from i through j, inclusive. The first 6 columns of | 
 
 
 
 
 
 | 842 | 
     a line in the depiction describe the pairs of instructions used for | 
 
 
 
 
 
 | 843 | 
     the corresponding index value(s). | 
 
 
 
 
 
 | 844 | 
  | 
 
 
 
 
 
 | 845 | 
     If a line in the depiction includes a column entry using the [i,j] | 
 
 
 
 
 
 | 846 | 
     notation, this means that the line is instantiated for each value | 
 
 
 
 
 
 | 847 | 
     in the range from i to j, inclusive.  The notation "0, [i,j]" means | 
 
 
 
 
 
 | 848 | 
     that the line is instantiated for the value 0 and for each value | 
 
 
 
 
 
 | 849 | 
     in the range from i to j, inclusive. | 
 
 
 
 
 
 | 850 | 
  | 
 
 
 
 
 
 | 851 | 
     If a line in the depiction includes more than one entry using the [i,j] | 
 
 
 
 
 
 | 852 | 
     notation, implying a "nested loop" to convert the line to a range of | 
 
 
 
 
 
 | 853 | 
     table entries, the first such [i,j] range specifies the outer loop, | 
 
 
 
 
 
 | 854 | 
     and the second specifies the inner loop. | 
 
 
 
 
 
 | 855 | 
  | 
 
 
 
 
 
 | 856 | 
     The below examples should make clear the above description: | 
 
 
 
 
 
 | 857 | 
  | 
 
 
 
 
 
 | 858 | 
     Line 1 shows the single RUN instruction with index 0. As the size field | 
 
 
 
 
 
 | 859 | 
     is 0, this RUN instruction always has its actual size encoded separately. | 
 
 
 
 
 
 | 860 | 
  | 
 
 
 
 
 
 | 861 | 
     Line 2 shows the 18 single ADD instructions. The ADD instruction with | 
 
 
 
 
 
 | 862 | 
     size field 0 (i.e., the actual size is coded separately) has index 1. | 
 
 
 
 
 
 | 863 | 
     ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and | 
 
 
 
 
 
 | 864 | 
     their sizes are as given (so they will not be separately encoded.) | 
 
 
 
 
 
 | 865 | 
  | 
 
 
 
 
 
 | 866 | 
     Following the single ADD instructions are the single COPY instructions | 
 
 
 
 
 
 | 867 | 
     ordered by their address encoding modes. For example, line 11 shows the | 
 
 
 
 
 
 | 868 | 
     COPY instructions with mode 8, i.e., the last of the same cache. | 
 
 
 
 
 
 | 869 | 
     In this case, the COPY instruction with size field 0 has index 147. | 
 
 
 
 
 
 | 870 | 
     Again, the actual size of this instruction will be coded separately. | 
 
 
 
 
 
 | 871 | 
  | 
 
 
 
 
 
 | 872 | 
     Lines 12 to 21 show the pairs of instructions that are combined together. | 
 
 
 
 
 
 | 873 | 
     For example, line 12 depicts the 12 entries in which an ADD instruction | 
 
 
 
 
 
 | 874 | 
     is combined with an immediately following COPY instruction. The entries | 
 
 
 
 
 
 | 875 | 
     with indices 163, 164, 165 represent the pairs in which the ADD | 
 
 
 
 
 
 | 876 | 
     instructions all have size 1 while the COPY instructions has mode | 
 
 
 
 
 
 | 877 | 
     0 (VCD_SELF) and sizes 4, 5 and 6 respectively. | 
 
 
 
 
 
 | 878 | 
  | 
 
 
 
 
 
 | 879 | 
     The last line, line 21, shows the eight instruction pairs where the first | 
 
 
 
 
 
 | 880 | 
     instruction is a COPY and the second is an ADD. In this case, all COPY | 
 
 
 
 
 
 | 881 | 
     instructions have size 4 with mode ranging from 0 to 8 and all the ADD | 
 
 
 
 
 
 | 882 | 
     instructions have size 1. Thus, the entry with largest index 255 | 
 
 
 
 
 
 | 883 | 
     combines a COPY instruction of size 4 and mode 8 with an ADD instruction | 
 
 
 
 
 
 | 884 | 
     of size 1. | 
 
 
 
 
 
 | 885 | 
  | 
 
 
 
 
 
 | 886 | 
     The choice of the minimum size 4 for COPY instructions in the default code | 
 
 
 
 
 
 | 887 | 
     table was made from experiments that showed that excluding small matches | 
 
 
 
 
 
 | 888 | 
     (less then 4 bytes long) improved the compression rates. | 
 
 
 
 
 
 | 889 | 
  | 
 
 
 
 
 
 | 890 | 
  | 
 
 
 
 
 
 | 891 | 
 6. DECODING A TARGET WINDOW | 
 
 
 
 
 
 | 892 | 
  | 
 
 
 
 
 
 | 893 | 
     Section 4.3 discusses that the delta instructions and associated data | 
 
 
 
 
 
 | 894 | 
     are encoded in three arrays of bytes: | 
 
 
 
 
 
 | 895 | 
  | 
 
 
 
 
 
 | 896 | 
         Data section for ADDs and RUNs, | 
 
 
 
 
 
 | 897 | 
         Instructions and sizes section, and | 
 
 
 
 
 
 | 898 | 
         Addresses section for COPYs. | 
 
 
 
 
 
 | 899 | 
  | 
 
 
 
 
 
 | 900 | 
  | 
 
 
 
 
 
 | 901 | 
     Further, these data sections may have been further compressed by some | 
 
 
 
 
 
 | 902 | 
     secondary compressor. Assuming that any such compressed data has been | 
 
 
 
 
 
 | 903 | 
     decompressed so that we now have three arrays: | 
 
 
 
 
 
 | 904 | 
  | 
 
 
 
 
 
 | 905 | 
         inst: bytes coding the instructions and sizes. | 
 
 
 
 
 
 | 906 | 
         data: unmatched data associated with ADDs and RUNs. | 
 
 
 
 
 
 | 907 | 
         addr: bytes coding the addresses of COPYs. | 
 
 
 
 
 
 | 908 | 
  | 
 
 
 
 
 
 | 909 | 
     These arrays are organized as follows: | 
 
 
 
 
 
 | 910 | 
  | 
 
 
 
 
 
 | 911 | 
         inst: | 
 
 
 
 
 
 | 912 | 
             a sequence of (index, [size1], [size2]) tuples, where "index" | 
 
 
 
 
 
 | 913 | 
             is an index into the instruction code table, and size1 and size2 | 
 
 
 
 
 
 | 914 | 
             are integers that MAY or MAY NOT be included in the tuple as | 
 
 
 
 
 
 | 915 | 
             follows. The entry with the given "index" in the instruction | 
 
 
 
 
 
 | 916 | 
             code table potentially defines two delta instructions. If the | 
 
 
 
 
 
 | 917 | 
             first delta instruction is not a VCD_NOOP and its size is zero, | 
 
 
 
 
 
 | 918 | 
             then size1 MUST be present. Otherwise, size1 MUST be omitted and | 
 
 
 
 
 
 | 919 | 
             the size of the instruction (if it is not VCD_NOOP) is as defined | 
 
 
 
 
 
 | 920 | 
             in the table. The presence or absence of size2 is defined | 
 
 
 
 
 
 | 921 | 
             similarly with respect to the second delta instruction. | 
 
 
 
 
 
 | 922 | 
  | 
 
 
 
 
 
 | 923 | 
         data: | 
 
 
 
 
 
 | 924 | 
             a sequence of data values, encoded as bytes. | 
 
 
 
 
 
 | 925 | 
  | 
 
 
 
 
 
 | 926 | 
         addr: | 
 
 
 
 
 
 | 927 | 
             a sequence of address values. Addresses are normally encoded as | 
 
 
 
 
 
 | 928 | 
             integers as described in Section 2 (i.e., base 128). | 
 
 
 
 
 
 | 929 | 
             Since the same cache emits addresses in the range [0,255], | 
 
 
 
 
 
 | 930 | 
             however, same cache addresses are always encoded as a | 
 
 
 
 
 
 | 931 | 
             single byte. | 
 
 
 
 
 
 | 932 | 
  | 
 
 
 
 
 
 | 933 | 
     To summarize, each tuple in the "inst" array includes an index to some | 
 
 
 
 
 
 | 934 | 
     entry in the instruction code table that determines: | 
 
 
 
 
 
 | 935 | 
  | 
 
 
 
 
 
 | 936 | 
     a. Whether one or two instructions were encoded and their types. | 
 
 
 
 
 
 | 937 | 
  | 
 
 
 
 
 
 | 938 | 
     b. If the instructions have their sizes encoded separately, these | 
 
 
 
 
 
 | 939 | 
        sizes will follow, in order, in the tuple. | 
 
 
 
 
 
 | 940 | 
  | 
 
 
 
 
 
 | 941 | 
     c. If the instructions have accompanying data, i.e., ADDs or RUNs, | 
 
 
 
 
 
 | 942 | 
        their data will be in the array "data". | 
 
 
 
 
 
 | 943 | 
  | 
 
 
 
 
 
 | 944 | 
     d. Similarly, if the instructions are COPYs, the coded addresses are | 
 
 
 
 
 
 | 945 | 
        found in the array "addr". | 
 
 
 
 
 
 | 946 | 
  | 
 
 
 
 
 
 | 947 | 
     The decoding procedure simply processes the arrays by reading one code | 
 
 
 
 
 
 | 948 | 
     index at a time, looking up the corresponding instruction code entry, | 
 
 
 
 
 
 | 949 | 
     then consuming the respective sizes, data and addresses following the | 
 
 
 
 
 
 | 950 | 
     directions in this entry. In other words, the decoder maintains an implicit | 
 
 
 
 
 
 | 951 | 
     next-element pointer for each array; "consuming" an instruction tuple, | 
 
 
 
 
 
 | 952 | 
     data, or address value implies incrementing the associated pointer. | 
 
 
 
 
 
 | 953 | 
  | 
 
 
 
 
 
 | 954 | 
     For example, if during the processing of the target window, the next | 
 
 
 
 
 
 | 955 | 
     unconsumed tuple in the inst array has index value 19, then the first | 
 
 
 
 
 
 | 956 | 
     instruction is a COPY, whose size is found as the immediately following | 
 
 
 
 
 
 | 957 | 
     integer in the inst array.  Since the mode of this COPY instruction is | 
 
 
 
 
 
 | 958 | 
     VCD_SELF, the corresponding address is found by consuming the next | 
 
 
 
 
 
 | 959 | 
     integer in the addr array.  The data array is left intact. As the second | 
 
 
 
 
 
 | 960 | 
     instruction for code index 19 is a NOOP, this tuple is finished. | 
 
 
 
 
 
 | 961 | 
  | 
 
 
 
 
 
 | 962 | 
  | 
 
 
 
 
 
 | 963 | 
 7. APPLICATION-DEFINED CODE TABLES | 
 
 
 
 
 
 | 964 | 
  | 
 
 
 
 
 
 | 965 | 
     Although the default code table used in Vcdiff is good for general | 
 
 
 
 
 
 | 966 | 
     purpose encoders, there are times when other code tables may perform | 
 
 
 
 
 
 | 967 | 
     better. For example, to code a file with many identical segments of data, | 
 
 
 
 
 
 | 968 | 
     it may be advantageous to have a COPY instruction with the specific size | 
 
 
 
 
 
 | 969 | 
     of these data segments so that the instruction can be encoded in a single | 
 
 
 
 
 
 | 970 | 
     byte. Such a special code table MUST then be encoded in the delta file | 
 
 
 
 
 
 | 971 | 
     so that the decoder can reconstruct it before decoding the data. | 
 
 
 
 
 
 | 972 | 
  | 
 
 
 
 
 
 | 973 | 
     Vcdiff allows an application-defined code table to be specified | 
 
 
 
 
 
 | 974 | 
     in a delta file with the following data: | 
 
 
 
 
 
 | 975 | 
  | 
 
 
 
 
 
 | 976 | 
         Size of near cache            - byte | 
 
 
 
 
 
 | 977 | 
         Size of same cache            - byte | 
 
 
 
 
 
 | 978 | 
         Compressed code table data | 
 
 
 
 
 
 | 979 | 
  | 
 
 
 
 
 
 | 980 | 
     The "compressed code table data" encodes the delta between the default | 
 
 
 
 
 
 | 981 | 
     code table (source) and the new code table (target) in the same manner as | 
 
 
 
 
 
 | 982 | 
     described in Section 4.3 for encoding a target window in terms of a | 
 
 
 
 
 
 | 983 | 
     source window. This delta is computed using the following steps: | 
 
 
 
 
 
 | 984 | 
  | 
 
 
 
 
 
 | 985 | 
     a.  Convert the new instruction code table into a string, "code", of | 
 
 
 
 
 
 | 986 | 
         1536 bytes using the below steps in order: | 
 
 
 
 
 
 | 987 | 
  | 
 
 
 
 
 
 | 988 | 
         i. Add in order the 256 bytes representing the types of the first | 
 
 
 
 
 
 | 989 | 
            instructions in the instruction pairs. | 
 
 
 
 
 
 | 990 | 
        ii. Add in order the 256 bytes representing the types of the second | 
 
 
 
 
 
 | 991 | 
            instructions in the instruction pairs. | 
 
 
 
 
 
 | 992 | 
       iii. Add in order the 256 bytes representing the sizes of the first | 
 
 
 
 
 
 | 993 | 
            instructions in the instruction pairs. | 
 
 
 
 
 
 | 994 | 
        iv. Add in order the 256 bytes representing the sizes of the second | 
 
 
 
 
 
 | 995 | 
            instructions in the instruction pairs. | 
 
 
 
 
 
 | 996 | 
         v. Add in order the 256 bytes representing the modes of the first | 
 
 
 
 
 
 | 997 | 
            instructions in the instruction pairs. | 
 
 
 
 
 
 | 998 | 
        vi. Add in order the 256 bytes representing the modes of the second | 
 
 
 
 
 
 | 999 | 
            instructions in the instruction pairs. | 
 
 
 
 
 
 | 1000 | 
  | 
 
 
 
 
 
 | 1001 | 
     b.  Similarly, convert the default instruction code table into | 
 
 
 
 
 
 | 1002 | 
         a string "dflt". | 
 
 
 
 
 
 | 1003 | 
  | 
 
 
 
 
 
 | 1004 | 
     c.  Treat the string "code" as a target window and "dflt" as the | 
 
 
 
 
 
 | 1005 | 
         corresponding source data and apply an encoding algorithm to | 
 
 
 
 
 
 | 1006 | 
         compute the delta encoding of "code" in terms of "dflt". | 
 
 
 
 
 
 | 1007 | 
         This computation MUST use the default code table for encoding | 
 
 
 
 
 
 | 1008 | 
         the delta instructions. | 
 
 
 
 
 
 | 1009 | 
  | 
 
 
 
 
 
 | 1010 | 
     The decoder can then reverse the above steps to decode the compressed | 
 
 
 
 
 
 | 1011 | 
     table data using the method of Section 6, employing the default code | 
 
 
 
 
 
 | 1012 | 
     table, to generate the new code table.  Note that the decoder does not | 
 
 
 
 
 
 | 1013 | 
     need to know anything about the details of the encoding algorithm used | 
 
 
 
 
 
 | 1014 | 
     in step (c). The decoder is still able to decode the new code table | 
 
 
 
 
 
 | 1015 | 
     because the Vcdiff format is independent from the choice of encoding | 
 
 
 
 
 
 | 1016 | 
     algorithm, and because the encoder in step (c) uses the known, default | 
 
 
 
 
 
 | 1017 | 
     code table. | 
 
 
 
 
 
 | 1018 | 
  | 
 
 
 
 
 
 | 1019 | 
  | 
 
 
 
 
 
 | 1020 | 
 8. PERFORMANCE | 
 
 
 
 
 
 | 1021 | 
  | 
 
 
 
 
 
 | 1022 | 
     The encoding format is compact. For compression only, using the LZ-77 | 
 
 
 
 
 
 | 1023 | 
     string parsing strategy and without any secondary compressors, the typical | 
 
 
 
 
 
 | 1024 | 
     compression rate is better than Unix compress and close to gzip.  For | 
 
 
 
 
 
 | 1025 | 
     differencing, the data format is better than all known methods in | 
 
 
 
 
 
 | 1026 | 
     terms of its stated goal, which is primarily decoding speed and | 
 
 
 
 
 
 | 1027 | 
     encoding efficiency. | 
 
 
 
 
 
 | 1028 | 
  | 
 
 
 
 
 
 | 1029 | 
     We compare the performance of compress, gzip and Vcdiff using the | 
 
 
 
 
 
 | 1030 | 
     archives of three versions of the Gnu C compiler, gcc-2.95.1.tar, | 
 
 
 
 
 
 | 1031 | 
     gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an | 
 
 
 
 
 
 | 1032 | 
     SGI-MIPS3, 400MHZ. Gzip was used at its default compression level. | 
 
 
 
 
 
 | 1033 | 
     Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13). | 
 
 
 
 
 
 | 1034 | 
     As string and window matching typically dominates the computation during | 
 
 
 
 
 
 | 1035 | 
     compression, the Vcdiff compression times were directly due to the | 
 
 
 
 
 
 | 1036 | 
     algorithms used in the Vcodex/Vcdiff software. However, the decompression | 
 
 
 
 
 
 | 1037 | 
     times should be generic and representative of any good implementation | 
 
 
 
 
 
 | 1038 | 
     of the Vcdiff data format. Timing was done by running each program | 
 
 
 
 
 
 | 1039 | 
     three times and taking the average of the total cpu+system times. | 
 
 
 
 
 
 | 1040 | 
  | 
 
 
 
 
 
 | 1041 | 
     Below are the different Vcdiff runs: | 
 
 
 
 
 
 | 1042 | 
  | 
 
 
 
 
 
 | 1043 | 
         Vcdiff: vcdiff is used as compressor only. | 
 
 
 
 
 
 | 1044 | 
  | 
 
 
 
 
 
 | 1045 | 
         Vcdiff-d: vcdiff is used as a differencer only. That is, it only | 
 
 
 
 
 
 | 1046 | 
                 compares target data against source data.  Since the files | 
 
 
 
 
 
 | 1047 | 
                 involved are large, they are broken into windows. In this | 
 
 
 
 
 
 | 1048 | 
                 case, each target window starting at some file offset in | 
 
 
 
 
 
 | 1049 | 
                 the target file is compared against a source window with | 
 
 
 
 
 
 | 1050 | 
                 the same file offset (in the source file). The source | 
 
 
 
 
 
 | 1051 | 
                 window is also slightly larger than the target window | 
 
 
 
 
 
 | 1052 | 
                 to increase matching opportunities. The -d option also gives | 
 
 
 
 
 
 | 1053 | 
                 a hint to the string matching algorithm of Vcdiff that | 
 
 
 
 
 
 | 1054 | 
                 the two files are very similar with long stretches of matches. | 
 
 
 
 
 
 | 1055 | 
                 The algorithm takes advantage of this to minimize its | 
 
 
 
 
 
 | 1056 | 
                 processing of source data and save time. | 
 
 
 
 
 
 | 1057 | 
  | 
 
 
 
 
 
 | 1058 | 
         Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare | 
 
 
 
 
 
 | 1059 | 
                 target data against target data as applicable. Thus, vcdiff | 
 
 
 
 
 
 | 1060 | 
                 both computes differences and compresses data. The windowing | 
 
 
 
 
 
 | 1061 | 
                 algorithm is the same as above. However, the above hint is | 
 
 
 
 
 
 | 1062 | 
                 recinded in this case. | 
 
 
 
 
 
 | 1063 | 
  | 
 
 
 
 
 
 | 1064 | 
         Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm | 
 
 
 
 
 
 | 1065 | 
                 uses a content-based heuristic to select source data segments | 
 
 
 
 
 
 | 1066 | 
                 that are more likely to match with a given target window. | 
 
 
 
 
 
 | 1067 | 
                 Thus, the source data segment selected for a target window | 
 
 
 
 
 
 | 1068 | 
                 often will not be aligned with the file offsets of this | 
 
 
 
 
 
 | 1069 | 
                 target window. | 
 
 
 
 
 
 | 1070 | 
  | 
 
 
 
 
 
 | 1071 | 
  | 
 
 
 
 
 
 | 1072 | 
                 gcc-2.95.1    gcc-2.95.2    compression   decompression | 
 
 
 
 
 
 | 1073 | 
     raw size      55746560      55797760 | 
 
 
 
 
 
 | 1074 | 
     compress         -          19939390       13.85s         7.09s | 
 
 
 
 
 
 | 1075 | 
     gzip             -          12973443       42.99s         5.35s | 
 
 
 
 
 
 | 1076 | 
     Vcdiff           -          15358786       20.04s         4.65s | 
 
 
 
 
 
 | 1077 | 
     Vcdiff-d         -            100971       10.93s         1.92s | 
 
 
 
 
 
 | 1078 | 
     Vcdiff-dc        -             97246       20.03s         1.84s | 
 
 
 
 
 
 | 1079 | 
     Vcdiff-dcs       -            256445       44.81s         1.84s | 
 
 
 
 
 
 | 1080 | 
  | 
 
 
 
 
 
 | 1081 | 
                 TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1 | 
 
 
 
 
 
 | 1082 | 
  | 
 
 
 
 
 
 | 1083 | 
  | 
 
 
 
 
 
 | 1084 | 
     TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the | 
 
 
 
 
 
 | 1085 | 
     sizes of the compressed results. As a pure compressor, the compression | 
 
 
 
 
 
 | 1086 | 
     rate for Vcdiff is worse than gzip and better than compress. The last | 
 
 
 
 
 
 | 1087 | 
     three rows shows that when two file versions are very similar, differencing | 
 
 
 
 
 
 | 1088 | 
     can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use | 
 
 
 
 
 
 | 1089 | 
     the same simple window selection method but Vcdiff-dc also does compression | 
 
 
 
 
 
 | 1090 | 
     so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on | 
 
 
 
 
 
 | 1091 | 
     data content to search for source data that likely will match a given target | 
 
 
 
 
 
 | 1092 | 
     window. Although it does a good job, the heuristic did not always find the | 
 
 
 
 
 
 | 1093 | 
     best matches which are given by the simple algorithm of Vcdiff-d.  As a | 
 
 
 
 
 
 | 1094 | 
     result, the output size is slightly larger. Note also that there is a large | 
 
 
 
 
 
 | 1095 | 
     cost in computing matching windows this way. Finally, the compression times | 
 
 
 
 
 
 | 1096 | 
     of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude | 
 
 
 
 
 
 | 1097 | 
     that the compression feature causes the additional time in Vcdiff-dc | 
 
 
 
 
 
 | 1098 | 
     relative to Vcdiff-d.  However, this is not the case. The hint given to | 
 
 
 
 
 
 | 1099 | 
     the Vcdiff string matching algorithm that the two files are likely to | 
 
 
 
 
 
 | 1100 | 
     have very long stretches of matches helps the algorithm to minimize | 
 
 
 
 
 
 | 1101 | 
     processing of the "source data", thus saving half the time. However, as we | 
 
 
 
 
 
 | 1102 | 
     shall see below when this hint is wrong, the result is even longer time. | 
 
 
 
 
 
 | 1103 | 
  | 
 
 
 
 
 
 | 1104 | 
  | 
 
 
 
 
 
 | 1105 | 
                 gcc-2.95.2    gcc-2.95.3    compression   decompression | 
 
 
 
 
 
 | 1106 | 
     raw size      55797760      55787520 | 
 
 
 
 
 
 | 1107 | 
     compress         -          19939453       13.54s         7.00s | 
 
 
 
 
 
 | 1108 | 
     gzip             -          12998097       42.63s         5.62s | 
 
 
 
 
 
 | 1109 | 
     Vcdiff           -          15371737       20.09s         4.74s | 
 
 
 
 
 
 | 1110 | 
     Vcdiff-d         -          26383849       71.41s         6.41s | 
 
 
 
 
 
 | 1111 | 
     Vcdiff-dc        -          14461203       42.48s         4.82s | 
 
 
 
 
 
 | 1112 | 
     Vcdiff-dcs       -           1248543       61.18s         1.99s | 
 
 
 
 
 
 | 1113 | 
  | 
 
 
 
 
 
 | 1114 | 
                 TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2 | 
 
 
 
 
 
 | 1115 | 
  | 
 
 
 
 
 
 | 1116 | 
  | 
 
 
 
 
 
 | 1117 | 
     TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and | 
 
 
 
 
 
 | 1118 | 
     the sizes of the compressed results. In this case, the tar file of | 
 
 
 
 
 
 | 1119 | 
     gcc-2.95.3 is rearranged in a way that makes the straightforward method | 
 
 
 
 
 
 | 1120 | 
     of matching file offsets for source and target windows fail. As a | 
 
 
 
 
 
 | 1121 | 
     result, Vcdiff-d performs rather dismally both in time and output size. | 
 
 
 
 
 
 | 1122 | 
     The large time for Vcdiff-d is directly due to fact that the string | 
 
 
 
 
 
 | 1123 | 
     matching algorithm has to work much harder to find matches when the hint | 
 
 
 
 
 
 | 1124 | 
     that two files have long matching stretches fails to hold. On the other | 
 
 
 
 
 
 | 1125 | 
     hand, Vcdiff-dc does both differencing and compression resulting in good | 
 
 
 
 
 
 | 1126 | 
     output size. Finally, the window searching heuristic used in Vcdiff-dcs is | 
 
 
 
 
 
 | 1127 | 
     effective in finding the right matching source windows for target windows | 
 
 
 
 
 
 | 1128 | 
     resulting a small output size. This shows why the data format needs to | 
 
 
 
 
 
 | 1129 | 
     have a way to specify matching windows to gain performance. Finally, | 
 
 
 
 
 
 | 1130 | 
     we note that the decoding times are always good regardless of how | 
 
 
 
 
 
 | 1131 | 
     the string matching or window searching algorithms perform. | 
 
 
 
 
 
 | 1132 | 
  | 
 
 
 
 
 
 | 1133 | 
  | 
 
 
 
 
 
 | 1134 | 
 9. FURTHER ISSUES | 
 
 
 
 
 
 | 1135 | 
  | 
 
 
 
 
 
 | 1136 | 
     This document does not address a few issues: | 
 
 
 
 
 
 | 1137 | 
  | 
 
 
 
 
 
 | 1138 | 
     Secondary compressors: | 
 
 
 
 
 
 | 1139 | 
         As discussed in Section 4.3, certain sections in the delta encoding | 
 
 
 
 
 
 | 1140 | 
         of a window may be further compressed by a secondary compressor. | 
 
 
 
 
 
 | 1141 | 
         In our experience, the basic Vcdiff format is adequate for most | 
 
 
 
 
 
 | 1142 | 
         purposes so that secondary compressors are seldom needed. In | 
 
 
 
 
 
 | 1143 | 
         particular, for normal use of data differencing where the files to | 
 
 
 
 
 
 | 1144 | 
         be compared have long stretches of matches, much of the gain in | 
 
 
 
 
 
 | 1145 | 
         compression rate is already achieved by normal string matching. | 
 
 
 
 
 
 | 1146 | 
         Thus, the use of secondary compressors is seldom needed in this case. | 
 
 
 
 
 
 | 1147 | 
         However, for applications beyond differencing of such nearly identical | 
 
 
 
 
 
 | 1148 | 
         files, secondary compressors may be needed to achieve maximal | 
 
 
 
 
 
 | 1149 | 
         compressed results. | 
 
 
 
 
 
 | 1150 | 
  | 
 
 
 
 
 
 | 1151 | 
         Therefore, we recommend to leave the Vcdiff data format defined | 
 
 
 
 
 
 | 1152 | 
         as in this document so that the use of secondary compressors | 
 
 
 
 
 
 | 1153 | 
         can be implemented when they become needed in the future. | 
 
 
 
 
 
 | 1154 | 
         The formats of the compressed data via such compressors or any | 
 
 
 
 
 
 | 1155 | 
         compressors that may be defined in the future are left open to | 
 
 
 
 
 
 | 1156 | 
         their implementations.  These could include Huffman encoding, | 
 
 
 
 
 
 | 1157 | 
         arithmetic encoding, and splay tree encoding [8,9]. | 
 
 
 
 
 
 | 1158 | 
  | 
 
 
 
 
 
 | 1159 | 
     Large file system vs. small file system: | 
 
 
 
 
 
 | 1160 | 
         As discussed in Section 4, a target window in a large file may be | 
 
 
 
 
 
 | 1161 | 
         compared against some source window in another file or in the same | 
 
 
 
 
 
 | 1162 | 
         file (from some earlier part). In that case, the file offset of the | 
 
 
 
 
 
 | 1163 | 
         source window is specified as a variable-sized integer in the delta | 
 
 
 
 
 
 | 1164 | 
         encoding. There is a possibility that the encoding was computed on | 
 
 
 
 
 
 | 1165 | 
         a system supporting much larger files than in a system where | 
 
 
 
 
 
 | 1166 | 
         the data may be decoded (e.g., 64-bit file systems vs. 32-bit file | 
 
 
 
 
 
 | 1167 | 
         systems). In that case, some target data may not be recoverable. | 
 
 
 
 
 
 | 1168 | 
         This problem could afflict any compression format, and ought | 
 
 
 
 
 
 | 1169 | 
         to be resolved with a generic negotiation mechanism in the | 
 
 
 
 
 
 | 1170 | 
         appropriate protocol(s). | 
 
 
 
 
 
 | 1171 | 
  | 
 
 
 
 
 
 | 1172 | 
  | 
 
 
 
 
 
 | 1173 | 
 10.  SUMMARY | 
 
 
 
 
 
 | 1174 | 
  | 
 
 
 
 
 
 | 1175 | 
     We have described Vcdiff, a general and portable encoding format for | 
 
 
 
 
 
 | 1176 | 
     compression and differencing. The format is good in that it allows | 
 
 
 
 
 
 | 1177 | 
     implementing a decoder without knowledge of the encoders. Further, | 
 
 
 
 
 
 | 1178 | 
     ignoring the use of secondary compressors not defined within the format, | 
 
 
 
 
 
 | 1179 | 
     the decoding algorithms runs in linear time and requires working space | 
 
 
 
 
 
 | 1180 | 
     proportional to window sizes. | 
 
 
 
 
 
 | 1181 | 
  | 
 
 
 
 
 
 | 1182 | 
  | 
 
 
 
 
 
 | 1183 | 
  | 
 
 
 
 
 
 | 1184 | 
 11. ACKNOWLEDGEMENTS | 
 
 
 
 
 
 | 1185 | 
  | 
 
 
 
 
 
 | 1186 | 
     Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff | 
 
 
 
 
 
 | 1187 | 
     who provided much encouragement to publicize Vcdiff. In particular, Jeff | 
 
 
 
 
 
 | 1188 | 
     helped clarifying the description of the data format presented here. | 
 
 
 
 
 
 | 1189 | 
  | 
 
 
 
 
 
 | 1190 | 
  | 
 
 
 
 
 
 | 1191 | 
  | 
 
 
 
 
 
 | 1192 | 
 12. SECURITY CONSIDERATIONS | 
 
 
 
 
 
 | 1193 | 
  | 
 
 
 
 
 
 | 1194 | 
     Vcdiff only provides a format to encode compressed and differenced data. | 
 
 
 
 
 
 | 1195 | 
     It does not address any issues concerning how such data are, in fact, | 
 
 
 
 
 
 | 1196 | 
     stored in a given file system or the run-time memory of a computer system. | 
 
 
 
 
 
 | 1197 | 
     Therefore, we do not anticipate any security issues with respect to Vcdiff. | 
 
 
 
 
 
 | 1198 | 
  | 
 
 
 
 
 
 | 1199 | 
  | 
 
 
 
 
 
 | 1200 | 
  | 
 
 
 
 
 
 | 1201 | 
 13. SOURCE CODE AVAILABILITY | 
 
 
 
 
 
 | 1202 | 
  | 
 
 
 
 
 
 | 1203 | 
     Vcdiff is implemented as a data transforming method in Phong Vo's | 
 
 
 
 
 
 | 1204 | 
     Vcodex library. AT&T Corp. has made the source code for Vcodex available | 
 
 
 
 
 
 | 1205 | 
     for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11]. | 
 
 
 
 
 
 | 1206 | 
     The source code and according license is accessible at the below URL: | 
 
 
 
 
 
 | 1207 | 
  | 
 
 
 
 
 
 | 1208 | 
           http://www.research.att.com/sw/tools | 
 
 
 
 
 
 | 1209 | 
  | 
 
 
 
 
 
 | 1210 | 
  | 
 
 
 
 
 
 | 1211 | 
 14. INTELLECTUAL PROPERTY RIGHTS | 
 
 
 
 
 
 | 1212 | 
  | 
 
 
 
 
 
 | 1213 | 
    The IETF has been notified of intellectual property rights claimed in | 
 
 
 
 
 
 | 1214 | 
    regard to some or all of the specification contained in this | 
 
 
 
 
 
 | 1215 | 
    document.  For more information consult the online list of claimed | 
 
 
 
 
 
 | 1216 | 
    rights, at <http://www.ietf.org/ipr.html>. | 
 
 
 
 
 
 | 1217 | 
  | 
 
 
 
 
 
 | 1218 | 
    The IETF takes no position regarding the validity or scope of any | 
 
 
 
 
 
 | 1219 | 
    intellectual property or other rights that might be claimed to | 
 
 
 
 
 
 | 1220 | 
    pertain to the implementation or use of the technology described in | 
 
 
 
 
 
 | 1221 | 
    this document or the extent to which any license under such rights | 
 
 
 
 
 
 | 1222 | 
    might or might not be available; neither does it represent that it | 
 
 
 
 
 
 | 1223 | 
    has made any effort to identify any such rights.  Information on the | 
 
 
 
 
 
 | 1224 | 
    IETF's procedures with respect to rights in standards-track and | 
 
 
 
 
 
 | 1225 | 
    standards-related documentation can be found in BCP-11.  Copies of | 
 
 
 
 
 
 | 1226 | 
    claims of rights made available for publication and any assurances of | 
 
 
 
 
 
 | 1227 | 
    licenses to be made available, or the result of an attempt made to | 
 
 
 
 
 
 | 1228 | 
    obtain a general license or permission for the use of such | 
 
 
 
 
 
 | 1229 | 
    proprietary rights by implementors or users of this specification can | 
 
 
 
 
 
 | 1230 | 
    be obtained from the IETF Secretariat. | 
 
 
 
 
 
 | 1231 | 
  | 
 
 
 
 
 
 | 1232 | 
  | 
 
 
 
 
 
 | 1233 | 
  | 
 
 
 
 
 
 | 1234 | 
 15. IANA CONSIDERATIONS | 
 
 
 
 
 
 | 1235 | 
  | 
 
 
 
 
 
 | 1236 | 
    The Internet Assigned Numbers Authority (IANA) administers the number | 
 
 
 
 
 
 | 1237 | 
    space for Secondary Compressor ID values.  Values and their meaning | 
 
 
 
 
 
 | 1238 | 
    must be documented in an RFC or other peer-reviewed, permanent, and | 
 
 
 
 
 
 | 1239 | 
    readily available reference, in sufficient detail so that | 
 
 
 
 
 
 | 1240 | 
    interoperability between independent implementations is possible. | 
 
 
 
 
 
 | 1241 | 
    Subject to these constraints, name assignments are First Come, First | 
 
 
 
 
 
 | 1242 | 
    Served - see RFC2434 [13].  Legal ID values are in the range 1..255. | 
 
 
 
 
 
 | 1243 | 
  | 
 
 
 
 
 
 | 1244 | 
    This document does not define any values in this number space. | 
 
 
 
 
 
 | 1245 | 
  | 
 
 
 
 
 
 | 1246 | 
  | 
 
 
 
 
 
 | 1247 | 
 16. REFERENCES | 
 
 
 
 
 
 | 1248 | 
  | 
 
 
 
 
 
 | 1249 | 
     [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, | 
 
 
 
 
 
 | 1250 | 
         Practical Reusable Unix Software, Editor B. Krishnamurthy, | 
 
 
 
 
 
 | 1251 | 
         John Wiley & Sons, Inc., 1995. | 
 
 
 
 
 
 | 1252 | 
  | 
 
 
 
 
 
 | 1253 | 
     [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data | 
 
 
 
 
 
 | 1254 | 
         Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977. | 
 
 
 
 
 
 | 1255 | 
  | 
 
 
 
 
 
 | 1256 | 
     [3] W. Tichy, The String-to-String Correction Problem with Block Moves, | 
 
 
 
 
 
 | 1257 | 
         ACM Transactions on Computer Systems, 2(4):309-321, November 1984. | 
 
 
 
 
 
 | 1258 | 
  | 
 
 
 
 
 
 | 1259 | 
     [4] E.M. McCreight, A Space-Economical Suffix Tree Construction | 
 
 
 
 
 
 | 1260 | 
         Algorithm, Journal of the ACM, 23:262-272, 1976. | 
 
 
 
 
 
 | 1261 | 
  | 
 
 
 
 
 
 | 1262 | 
     [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, | 
 
 
 
 
 
 | 1263 | 
         IEEE Software Configuration and Maintenance Workshop, 1996. | 
 
 
 
 
 
 | 1264 | 
  | 
 
 
 
 
 
 | 1265 | 
     [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis, | 
 
 
 
 
 
 | 1266 | 
         ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998. | 
 
 
 
 
 
 | 1267 | 
  | 
 
 
 
 
 
 | 1268 | 
     [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, | 
 
 
 
 
 
 | 1269 | 
         Proc. of the Summer '91 Usenix Conference, 1991. | 
 
 
 
 
 
 | 1270 | 
  | 
 
 
 
 
 
 | 1271 | 
     [8] D. W. Jones, Application of Splay Trees to Data Compression, | 
 
 
 
 
 
 | 1272 | 
         CACM, 31(8):996:1007. | 
 
 
 
 
 
 | 1273 | 
  | 
 
 
 
 
 
 | 1274 | 
     [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1, | 
 
 
 
 
 
 | 1275 | 
         M&T Books, New York, NY, 1995. | 
 
 
 
 
 
 | 1276 | 
  | 
 
 
 
 
 
 | 1277 | 
    [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, | 
 
 
 
 
 
 | 1278 | 
         Potential benefits of delta encoding and data compression for HTTP, | 
 
 
 
 
 
 | 1279 | 
         SIGCOMM '97, Cannes, France, 1997. | 
 
 
 
 
 
 | 1280 | 
  | 
 
 
 
 
 
 | 1281 | 
    [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, | 
 
 
 
 
 
 | 1282 | 
         Y. Goland, and A. Van Hoff, Delta Encoding in HTTP, | 
 
 
 
 
 
 | 1283 | 
         IETF, draft-mogul-http-delta-10, 2001. | 
 
 
 
 
 
 | 1284 | 
  | 
 
 
 
 
 
 | 1285 | 
    [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, | 
 
 
 
 
 
 | 1286 | 
         RFC 2119, March 1997. | 
 
 
 
 
 
 | 1287 | 
  | 
 
 
 
 
 
 | 1288 | 
    [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA | 
 
 
 
 
 
 | 1289 | 
         Considerations Section in RFCs, RFC2434, October 1998. | 
 
 
 
 
 
 | 1290 | 
  | 
 
 
 
 
 
 | 1291 | 
  | 
 
 
 
 
 
 | 1292 | 
  | 
 
 
 
 
 
 | 1293 | 
 17. AUTHOR'S ADDRESS | 
 
 
 
 
 
 | 1294 | 
  | 
 
 
 
 
 
 | 1295 | 
     Kiem-Phong Vo (main contact) | 
 
 
 
 
 
 | 1296 | 
     AT&T Labs, Room D223 | 
 
 
 
 
 
 | 1297 | 
     180 Park Avenue | 
 
 
 
 
 
 | 1298 | 
     Florham Park, NJ 07932 | 
 
 
 
 
 
 | 1299 | 
     Email: kpv@research.att.com | 
 
 
 
 
 
 | 1300 | 
     Phone: 1 973 360 8630 | 
 
 
 
 
 
 | 1301 | 
  | 
 
 
 
 
 
 | 1302 | 
     David G. Korn | 
 
 
 
 
 
 | 1303 | 
     AT&T Labs, Room D237 | 
 
 
 
 
 
 | 1304 | 
     180 Park Avenue | 
 
 
 
 
 
 | 1305 | 
     Florham Park, NJ 07932 | 
 
 
 
 
 
 | 1306 | 
     Email: dgk@research.att.com | 
 
 
 
 
 
 | 1307 | 
     Phone: 1 973 360 8602 | 
 
 
 
 
 
 | 1308 | 
  | 
 
 
 
 
 
 | 1309 | 
     Jeffrey C. Mogul | 
 
 
 
 
 
 | 1310 | 
     Western Research Laboratory | 
 
 
 
 
 
 | 1311 | 
     Compaq Computer Corporation | 
 
 
 
 
 
 | 1312 | 
     250 University Avenue | 
 
 
 
 
 
 | 1313 | 
     Palo Alto, California, 94305, U.S.A. | 
 
 
 
 
 
 | 1314 | 
     Email: JeffMogul@acm.org | 
 
 
 
 
 
 | 1315 | 
     Phone: 1 650 617 3304 (email preferred) | 
 
 
 
 
 
 | 1316 | 
  | 
 
 
 
 
 
 | 1317 | 
     Joshua P. MacDonald | 
 
 
 
 
 
 | 1318 | 
     Computer Science Division | 
 
 
 
 
 
 | 1319 | 
     University of California, Berkeley | 
 
 
 
 
 
 | 1320 | 
     345 Soda Hall | 
 
 
 
 
 
 | 1321 | 
     Berkeley, CA 94720 | 
 
 
 
 
 
 | 1322 | 
     Email: jmacd@cs.berkeley.edu |