| 1 |
|
| 2 |
|
| 3 |
|
| 4 |
|
| 5 |
|
| 6 |
|
| 7 |
Network Working Group P. Deutsch |
| 8 |
Request for Comments: 1951 Aladdin Enterprises |
| 9 |
Category: Informational May 1996 |
| 10 |
|
| 11 |
|
| 12 |
DEFLATE Compressed Data Format Specification version 1.3 |
| 13 |
|
| 14 |
Status of This Memo |
| 15 |
|
| 16 |
This memo provides information for the Internet community. This memo |
| 17 |
does not specify an Internet standard of any kind. Distribution of |
| 18 |
this memo is unlimited. |
| 19 |
|
| 20 |
IESG Note: |
| 21 |
|
| 22 |
The IESG takes no position on the validity of any Intellectual |
| 23 |
Property Rights statements contained in this document. |
| 24 |
|
| 25 |
Notices |
| 26 |
|
| 27 |
Copyright (c) 1996 L. Peter Deutsch |
| 28 |
|
| 29 |
Permission is granted to copy and distribute this document for any |
| 30 |
purpose and without charge, including translations into other |
| 31 |
languages and incorporation into compilations, provided that the |
| 32 |
copyright notice and this notice are preserved, and that any |
| 33 |
substantive changes or deletions from the original are clearly |
| 34 |
marked. |
| 35 |
|
| 36 |
A pointer to the latest version of this and related documentation in |
| 37 |
HTML format can be found at the URL |
| 38 |
<ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. |
| 39 |
|
| 40 |
Abstract |
| 41 |
|
| 42 |
This specification defines a lossless compressed data format that |
| 43 |
compresses data using a combination of the LZ77 algorithm and Huffman |
| 44 |
coding, with efficiency comparable to the best currently available |
| 45 |
general-purpose compression methods. The data can be produced or |
| 46 |
consumed, even for an arbitrarily long sequentially presented input |
| 47 |
data stream, using only an a priori bounded amount of intermediate |
| 48 |
storage. The format can be implemented readily in a manner not |
| 49 |
covered by patents. |
| 50 |
|
| 51 |
|
| 52 |
|
| 53 |
|
| 54 |
|
| 55 |
|
| 56 |
|
| 57 |
|
| 58 |
Deutsch Informational [Page 1] |
| 59 |
|
| 60 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 61 |
|
| 62 |
|
| 63 |
Table of Contents |
| 64 |
|
| 65 |
1. Introduction ................................................... 2 |
| 66 |
1.1. Purpose ................................................... 2 |
| 67 |
1.2. Intended audience ......................................... 3 |
| 68 |
1.3. Scope ..................................................... 3 |
| 69 |
1.4. Compliance ................................................ 3 |
| 70 |
1.5. Definitions of terms and conventions used ................ 3 |
| 71 |
1.6. Changes from previous versions ............................ 4 |
| 72 |
2. Compressed representation overview ............................. 4 |
| 73 |
3. Detailed specification ......................................... 5 |
| 74 |
3.1. Overall conventions ....................................... 5 |
| 75 |
3.1.1. Packing into bytes .................................. 5 |
| 76 |
3.2. Compressed block format ................................... 6 |
| 77 |
3.2.1. Synopsis of prefix and Huffman coding ............... 6 |
| 78 |
3.2.2. Use of Huffman coding in the "deflate" format ....... 7 |
| 79 |
3.2.3. Details of block format ............................. 9 |
| 80 |
3.2.4. Non-compressed blocks (BTYPE=00) ................... 11 |
| 81 |
3.2.5. Compressed blocks (length and distance codes) ...... 11 |
| 82 |
3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12 |
| 83 |
3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13 |
| 84 |
3.3. Compliance ............................................... 14 |
| 85 |
4. Compression algorithm details ................................. 14 |
| 86 |
5. References .................................................... 16 |
| 87 |
6. Security Considerations ....................................... 16 |
| 88 |
7. Source code ................................................... 16 |
| 89 |
8. Acknowledgements .............................................. 16 |
| 90 |
9. Author's Address .............................................. 17 |
| 91 |
|
| 92 |
1. Introduction |
| 93 |
|
| 94 |
1.1. Purpose |
| 95 |
|
| 96 |
The purpose of this specification is to define a lossless |
| 97 |
compressed data format that: |
| 98 |
* Is independent of CPU type, operating system, file system, |
| 99 |
and character set, and hence can be used for interchange; |
| 100 |
* Can be produced or consumed, even for an arbitrarily long |
| 101 |
sequentially presented input data stream, using only an a |
| 102 |
priori bounded amount of intermediate storage, and hence |
| 103 |
can be used in data communications or similar structures |
| 104 |
such as Unix filters; |
| 105 |
* Compresses data with efficiency comparable to the best |
| 106 |
currently available general-purpose compression methods, |
| 107 |
and in particular considerably better than the "compress" |
| 108 |
program; |
| 109 |
* Can be implemented readily in a manner not covered by |
| 110 |
patents, and hence can be practiced freely; |
| 111 |
|
| 112 |
|
| 113 |
|
| 114 |
Deutsch Informational [Page 2] |
| 115 |
|
| 116 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 117 |
|
| 118 |
|
| 119 |
* Is compatible with the file format produced by the current |
| 120 |
widely used gzip utility, in that conforming decompressors |
| 121 |
will be able to read data produced by the existing gzip |
| 122 |
compressor. |
| 123 |
|
| 124 |
The data format defined by this specification does not attempt to: |
| 125 |
|
| 126 |
* Allow random access to compressed data; |
| 127 |
* Compress specialized data (e.g., raster graphics) as well |
| 128 |
as the best currently available specialized algorithms. |
| 129 |
|
| 130 |
A simple counting argument shows that no lossless compression |
| 131 |
algorithm can compress every possible input data set. For the |
| 132 |
format defined here, the worst case expansion is 5 bytes per 32K- |
| 133 |
byte block, i.e., a size increase of 0.015% for large data sets. |
| 134 |
English text usually compresses by a factor of 2.5 to 3; |
| 135 |
executable files usually compress somewhat less; graphical data |
| 136 |
such as raster images may compress much more. |
| 137 |
|
| 138 |
1.2. Intended audience |
| 139 |
|
| 140 |
This specification is intended for use by implementors of software |
| 141 |
to compress data into "deflate" format and/or decompress data from |
| 142 |
"deflate" format. |
| 143 |
|
| 144 |
The text of the specification assumes a basic background in |
| 145 |
programming at the level of bits and other primitive data |
| 146 |
representations. Familiarity with the technique of Huffman coding |
| 147 |
is helpful but not required. |
| 148 |
|
| 149 |
1.3. Scope |
| 150 |
|
| 151 |
The specification specifies a method for representing a sequence |
| 152 |
of bytes as a (usually shorter) sequence of bits, and a method for |
| 153 |
packing the latter bit sequence into bytes. |
| 154 |
|
| 155 |
1.4. Compliance |
| 156 |
|
| 157 |
Unless otherwise indicated below, a compliant decompressor must be |
| 158 |
able to accept and decompress any data set that conforms to all |
| 159 |
the specifications presented here; a compliant compressor must |
| 160 |
produce data sets that conform to all the specifications presented |
| 161 |
here. |
| 162 |
|
| 163 |
1.5. Definitions of terms and conventions used |
| 164 |
|
| 165 |
Byte: 8 bits stored or transmitted as a unit (same as an octet). |
| 166 |
For this specification, a byte is exactly 8 bits, even on machines |
| 167 |
|
| 168 |
|
| 169 |
|
| 170 |
Deutsch Informational [Page 3] |
| 171 |
|
| 172 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 173 |
|
| 174 |
|
| 175 |
which store a character on a number of bits different from eight. |
| 176 |
See below, for the numbering of bits within a byte. |
| 177 |
|
| 178 |
String: a sequence of arbitrary bytes. |
| 179 |
|
| 180 |
1.6. Changes from previous versions |
| 181 |
|
| 182 |
There have been no technical changes to the deflate format since |
| 183 |
version 1.1 of this specification. In version 1.2, some |
| 184 |
terminology was changed. Version 1.3 is a conversion of the |
| 185 |
specification to RFC style. |
| 186 |
|
| 187 |
2. Compressed representation overview |
| 188 |
|
| 189 |
A compressed data set consists of a series of blocks, corresponding |
| 190 |
to successive blocks of input data. The block sizes are arbitrary, |
| 191 |
except that non-compressible blocks are limited to 65,535 bytes. |
| 192 |
|
| 193 |
Each block is compressed using a combination of the LZ77 algorithm |
| 194 |
and Huffman coding. The Huffman trees for each block are independent |
| 195 |
of those for previous or subsequent blocks; the LZ77 algorithm may |
| 196 |
use a reference to a duplicated string occurring in a previous block, |
| 197 |
up to 32K input bytes before. |
| 198 |
|
| 199 |
Each block consists of two parts: a pair of Huffman code trees that |
| 200 |
describe the representation of the compressed data part, and a |
| 201 |
compressed data part. (The Huffman trees themselves are compressed |
| 202 |
using Huffman encoding.) The compressed data consists of a series of |
| 203 |
elements of two types: literal bytes (of strings that have not been |
| 204 |
detected as duplicated within the previous 32K input bytes), and |
| 205 |
pointers to duplicated strings, where a pointer is represented as a |
| 206 |
pair <length, backward distance>. The representation used in the |
| 207 |
"deflate" format limits distances to 32K bytes and lengths to 258 |
| 208 |
bytes, but does not limit the size of a block, except for |
| 209 |
uncompressible blocks, which are limited as noted above. |
| 210 |
|
| 211 |
Each type of value (literals, distances, and lengths) in the |
| 212 |
compressed data is represented using a Huffman code, using one code |
| 213 |
tree for literals and lengths and a separate code tree for distances. |
| 214 |
The code trees for each block appear in a compact form just before |
| 215 |
the compressed data for that block. |
| 216 |
|
| 217 |
|
| 218 |
|
| 219 |
|
| 220 |
|
| 221 |
|
| 222 |
|
| 223 |
|
| 224 |
|
| 225 |
|
| 226 |
Deutsch Informational [Page 4] |
| 227 |
|
| 228 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 229 |
|
| 230 |
|
| 231 |
3. Detailed specification |
| 232 |
|
| 233 |
3.1. Overall conventions In the diagrams below, a box like this: |
| 234 |
|
| 235 |
+---+ |
| 236 |
| | <-- the vertical bars might be missing |
| 237 |
+---+ |
| 238 |
|
| 239 |
represents one byte; a box like this: |
| 240 |
|
| 241 |
+==============+ |
| 242 |
| | |
| 243 |
+==============+ |
| 244 |
|
| 245 |
represents a variable number of bytes. |
| 246 |
|
| 247 |
Bytes stored within a computer do not have a "bit order", since |
| 248 |
they are always treated as a unit. However, a byte considered as |
| 249 |
an integer between 0 and 255 does have a most- and least- |
| 250 |
significant bit, and since we write numbers with the most- |
| 251 |
significant digit on the left, we also write bytes with the most- |
| 252 |
significant bit on the left. In the diagrams below, we number the |
| 253 |
bits of a byte so that bit 0 is the least-significant bit, i.e., |
| 254 |
the bits are numbered: |
| 255 |
|
| 256 |
+--------+ |
| 257 |
|76543210| |
| 258 |
+--------+ |
| 259 |
|
| 260 |
Within a computer, a number may occupy multiple bytes. All |
| 261 |
multi-byte numbers in the format described here are stored with |
| 262 |
the least-significant byte first (at the lower memory address). |
| 263 |
For example, the decimal number 520 is stored as: |
| 264 |
|
| 265 |
0 1 |
| 266 |
+--------+--------+ |
| 267 |
|00001000|00000010| |
| 268 |
+--------+--------+ |
| 269 |
^ ^ |
| 270 |
| | |
| 271 |
| + more significant byte = 2 x 256 |
| 272 |
+ less significant byte = 8 |
| 273 |
|
| 274 |
3.1.1. Packing into bytes |
| 275 |
|
| 276 |
This document does not address the issue of the order in which |
| 277 |
bits of a byte are transmitted on a bit-sequential medium, |
| 278 |
since the final data format described here is byte- rather than |
| 279 |
|
| 280 |
|
| 281 |
|
| 282 |
Deutsch Informational [Page 5] |
| 283 |
|
| 284 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 285 |
|
| 286 |
|
| 287 |
bit-oriented. However, we describe the compressed block format |
| 288 |
in below, as a sequence of data elements of various bit |
| 289 |
lengths, not a sequence of bytes. We must therefore specify |
| 290 |
how to pack these data elements into bytes to form the final |
| 291 |
compressed byte sequence: |
| 292 |
|
| 293 |
* Data elements are packed into bytes in order of |
| 294 |
increasing bit number within the byte, i.e., starting |
| 295 |
with the least-significant bit of the byte. |
| 296 |
* Data elements other than Huffman codes are packed |
| 297 |
starting with the least-significant bit of the data |
| 298 |
element. |
| 299 |
* Huffman codes are packed starting with the most- |
| 300 |
significant bit of the code. |
| 301 |
|
| 302 |
In other words, if one were to print out the compressed data as |
| 303 |
a sequence of bytes, starting with the first byte at the |
| 304 |
*right* margin and proceeding to the *left*, with the most- |
| 305 |
significant bit of each byte on the left as usual, one would be |
| 306 |
able to parse the result from right to left, with fixed-width |
| 307 |
elements in the correct MSB-to-LSB order and Huffman codes in |
| 308 |
bit-reversed order (i.e., with the first bit of the code in the |
| 309 |
relative LSB position). |
| 310 |
|
| 311 |
3.2. Compressed block format |
| 312 |
|
| 313 |
3.2.1. Synopsis of prefix and Huffman coding |
| 314 |
|
| 315 |
Prefix coding represents symbols from an a priori known |
| 316 |
alphabet by bit sequences (codes), one code for each symbol, in |
| 317 |
a manner such that different symbols may be represented by bit |
| 318 |
sequences of different lengths, but a parser can always parse |
| 319 |
an encoded string unambiguously symbol-by-symbol. |
| 320 |
|
| 321 |
We define a prefix code in terms of a binary tree in which the |
| 322 |
two edges descending from each non-leaf node are labeled 0 and |
| 323 |
1 and in which the leaf nodes correspond one-for-one with (are |
| 324 |
labeled with) the symbols of the alphabet; then the code for a |
| 325 |
symbol is the sequence of 0's and 1's on the edges leading from |
| 326 |
the root to the leaf labeled with that symbol. For example: |
| 327 |
|
| 328 |
|
| 329 |
|
| 330 |
|
| 331 |
|
| 332 |
|
| 333 |
|
| 334 |
|
| 335 |
|
| 336 |
|
| 337 |
|
| 338 |
Deutsch Informational [Page 6] |
| 339 |
|
| 340 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 341 |
|
| 342 |
|
| 343 |
/\ Symbol Code |
| 344 |
0 1 ------ ---- |
| 345 |
/ \ A 00 |
| 346 |
/\ B B 1 |
| 347 |
0 1 C 011 |
| 348 |
/ \ D 010 |
| 349 |
A /\ |
| 350 |
0 1 |
| 351 |
/ \ |
| 352 |
D C |
| 353 |
|
| 354 |
A parser can decode the next symbol from an encoded input |
| 355 |
stream by walking down the tree from the root, at each step |
| 356 |
choosing the edge corresponding to the next input bit. |
| 357 |
|
| 358 |
Given an alphabet with known symbol frequencies, the Huffman |
| 359 |
algorithm allows the construction of an optimal prefix code |
| 360 |
(one which represents strings with those symbol frequencies |
| 361 |
using the fewest bits of any possible prefix codes for that |
| 362 |
alphabet). Such a code is called a Huffman code. (See |
| 363 |
reference [1] in Chapter 5, references for additional |
| 364 |
information on Huffman codes.) |
| 365 |
|
| 366 |
Note that in the "deflate" format, the Huffman codes for the |
| 367 |
various alphabets must not exceed certain maximum code lengths. |
| 368 |
This constraint complicates the algorithm for computing code |
| 369 |
lengths from symbol frequencies. Again, see Chapter 5, |
| 370 |
references for details. |
| 371 |
|
| 372 |
3.2.2. Use of Huffman coding in the "deflate" format |
| 373 |
|
| 374 |
The Huffman codes used for each alphabet in the "deflate" |
| 375 |
format have two additional rules: |
| 376 |
|
| 377 |
* All codes of a given bit length have lexicographically |
| 378 |
consecutive values, in the same order as the symbols |
| 379 |
they represent; |
| 380 |
|
| 381 |
* Shorter codes lexicographically precede longer codes. |
| 382 |
|
| 383 |
|
| 384 |
|
| 385 |
|
| 386 |
|
| 387 |
|
| 388 |
|
| 389 |
|
| 390 |
|
| 391 |
|
| 392 |
|
| 393 |
|
| 394 |
Deutsch Informational [Page 7] |
| 395 |
|
| 396 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 397 |
|
| 398 |
|
| 399 |
We could recode the example above to follow this rule as |
| 400 |
follows, assuming that the order of the alphabet is ABCD: |
| 401 |
|
| 402 |
Symbol Code |
| 403 |
------ ---- |
| 404 |
A 10 |
| 405 |
B 0 |
| 406 |
C 110 |
| 407 |
D 111 |
| 408 |
|
| 409 |
I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are |
| 410 |
lexicographically consecutive. |
| 411 |
|
| 412 |
Given this rule, we can define the Huffman code for an alphabet |
| 413 |
just by giving the bit lengths of the codes for each symbol of |
| 414 |
the alphabet in order; this is sufficient to determine the |
| 415 |
actual codes. In our example, the code is completely defined |
| 416 |
by the sequence of bit lengths (2, 1, 3, 3). The following |
| 417 |
algorithm generates the codes as integers, intended to be read |
| 418 |
from most- to least-significant bit. The code lengths are |
| 419 |
initially in tree[I].Len; the codes are produced in |
| 420 |
tree[I].Code. |
| 421 |
|
| 422 |
1) Count the number of codes for each code length. Let |
| 423 |
bl_count[N] be the number of codes of length N, N >= 1. |
| 424 |
|
| 425 |
2) Find the numerical value of the smallest code for each |
| 426 |
code length: |
| 427 |
|
| 428 |
code = 0; |
| 429 |
bl_count[0] = 0; |
| 430 |
for (bits = 1; bits <= MAX_BITS; bits++) { |
| 431 |
code = (code + bl_count[bits-1]) << 1; |
| 432 |
next_code[bits] = code; |
| 433 |
} |
| 434 |
|
| 435 |
3) Assign numerical values to all codes, using consecutive |
| 436 |
values for all codes of the same length with the base |
| 437 |
values determined at step 2. Codes that are never used |
| 438 |
(which have a bit length of zero) must not be assigned a |
| 439 |
value. |
| 440 |
|
| 441 |
for (n = 0; n <= max_code; n++) { |
| 442 |
len = tree[n].Len; |
| 443 |
if (len != 0) { |
| 444 |
tree[n].Code = next_code[len]; |
| 445 |
next_code[len]++; |
| 446 |
} |
| 447 |
|
| 448 |
|
| 449 |
|
| 450 |
Deutsch Informational [Page 8] |
| 451 |
|
| 452 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 453 |
|
| 454 |
|
| 455 |
} |
| 456 |
|
| 457 |
Example: |
| 458 |
|
| 459 |
Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, |
| 460 |
3, 2, 4, 4). After step 1, we have: |
| 461 |
|
| 462 |
N bl_count[N] |
| 463 |
- ----------- |
| 464 |
2 1 |
| 465 |
3 5 |
| 466 |
4 2 |
| 467 |
|
| 468 |
Step 2 computes the following next_code values: |
| 469 |
|
| 470 |
N next_code[N] |
| 471 |
- ------------ |
| 472 |
1 0 |
| 473 |
2 0 |
| 474 |
3 2 |
| 475 |
4 14 |
| 476 |
|
| 477 |
Step 3 produces the following code values: |
| 478 |
|
| 479 |
Symbol Length Code |
| 480 |
------ ------ ---- |
| 481 |
A 3 010 |
| 482 |
B 3 011 |
| 483 |
C 3 100 |
| 484 |
D 3 101 |
| 485 |
E 3 110 |
| 486 |
F 2 00 |
| 487 |
G 4 1110 |
| 488 |
H 4 1111 |
| 489 |
|
| 490 |
3.2.3. Details of block format |
| 491 |
|
| 492 |
Each block of compressed data begins with 3 header bits |
| 493 |
containing the following data: |
| 494 |
|
| 495 |
first bit BFINAL |
| 496 |
next 2 bits BTYPE |
| 497 |
|
| 498 |
Note that the header bits do not necessarily begin on a byte |
| 499 |
boundary, since a block does not necessarily occupy an integral |
| 500 |
number of bytes. |
| 501 |
|
| 502 |
|
| 503 |
|
| 504 |
|
| 505 |
|
| 506 |
Deutsch Informational [Page 9] |
| 507 |
|
| 508 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 509 |
|
| 510 |
|
| 511 |
BFINAL is set if and only if this is the last block of the data |
| 512 |
set. |
| 513 |
|
| 514 |
BTYPE specifies how the data are compressed, as follows: |
| 515 |
|
| 516 |
00 - no compression |
| 517 |
01 - compressed with fixed Huffman codes |
| 518 |
10 - compressed with dynamic Huffman codes |
| 519 |
11 - reserved (error) |
| 520 |
|
| 521 |
The only difference between the two compressed cases is how the |
| 522 |
Huffman codes for the literal/length and distance alphabets are |
| 523 |
defined. |
| 524 |
|
| 525 |
In all cases, the decoding algorithm for the actual data is as |
| 526 |
follows: |
| 527 |
|
| 528 |
do |
| 529 |
read block header from input stream. |
| 530 |
if stored with no compression |
| 531 |
skip any remaining bits in current partially |
| 532 |
processed byte |
| 533 |
read LEN and NLEN (see next section) |
| 534 |
copy LEN bytes of data to output |
| 535 |
otherwise |
| 536 |
if compressed with dynamic Huffman codes |
| 537 |
read representation of code trees (see |
| 538 |
subsection below) |
| 539 |
loop (until end of block code recognized) |
| 540 |
decode literal/length value from input stream |
| 541 |
if value < 256 |
| 542 |
copy value (literal byte) to output stream |
| 543 |
otherwise |
| 544 |
if value = end of block (256) |
| 545 |
break from loop |
| 546 |
otherwise (value = 257..285) |
| 547 |
decode distance from input stream |
| 548 |
|
| 549 |
move backwards distance bytes in the output |
| 550 |
stream, and copy length bytes from this |
| 551 |
position to the output stream. |
| 552 |
end loop |
| 553 |
while not last block |
| 554 |
|
| 555 |
Note that a duplicated string reference may refer to a string |
| 556 |
in a previous block; i.e., the backward distance may cross one |
| 557 |
or more block boundaries. However a distance cannot refer past |
| 558 |
the beginning of the output stream. (An application using a |
| 559 |
|
| 560 |
|
| 561 |
|
| 562 |
Deutsch Informational [Page 10] |
| 563 |
|
| 564 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 565 |
|
| 566 |
|
| 567 |
preset dictionary might discard part of the output stream; a |
| 568 |
distance can refer to that part of the output stream anyway) |
| 569 |
Note also that the referenced string may overlap the current |
| 570 |
position; for example, if the last 2 bytes decoded have values |
| 571 |
X and Y, a string reference with <length = 5, distance = 2> |
| 572 |
adds X,Y,X,Y,X to the output stream. |
| 573 |
|
| 574 |
We now specify each compression method in turn. |
| 575 |
|
| 576 |
3.2.4. Non-compressed blocks (BTYPE=00) |
| 577 |
|
| 578 |
Any bits of input up to the next byte boundary are ignored. |
| 579 |
The rest of the block consists of the following information: |
| 580 |
|
| 581 |
0 1 2 3 4... |
| 582 |
+---+---+---+---+================================+ |
| 583 |
| LEN | NLEN |... LEN bytes of literal data...| |
| 584 |
+---+---+---+---+================================+ |
| 585 |
|
| 586 |
LEN is the number of data bytes in the block. NLEN is the |
| 587 |
one's complement of LEN. |
| 588 |
|
| 589 |
3.2.5. Compressed blocks (length and distance codes) |
| 590 |
|
| 591 |
As noted above, encoded data blocks in the "deflate" format |
| 592 |
consist of sequences of symbols drawn from three conceptually |
| 593 |
distinct alphabets: either literal bytes, from the alphabet of |
| 594 |
byte values (0..255), or <length, backward distance> pairs, |
| 595 |
where the length is drawn from (3..258) and the distance is |
| 596 |
drawn from (1..32,768). In fact, the literal and length |
| 597 |
alphabets are merged into a single alphabet (0..285), where |
| 598 |
values 0..255 represent literal bytes, the value 256 indicates |
| 599 |
end-of-block, and values 257..285 represent length codes |
| 600 |
(possibly in conjunction with extra bits following the symbol |
| 601 |
code) as follows: |
| 602 |
|
| 603 |
|
| 604 |
|
| 605 |
|
| 606 |
|
| 607 |
|
| 608 |
|
| 609 |
|
| 610 |
|
| 611 |
|
| 612 |
|
| 613 |
|
| 614 |
|
| 615 |
|
| 616 |
|
| 617 |
|
| 618 |
Deutsch Informational [Page 11] |
| 619 |
|
| 620 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 621 |
|
| 622 |
|
| 623 |
Extra Extra Extra |
| 624 |
Code Bits Length(s) Code Bits Lengths Code Bits Length(s) |
| 625 |
---- ---- ------ ---- ---- ------- ---- ---- ------- |
| 626 |
257 0 3 267 1 15,16 277 4 67-82 |
| 627 |
258 0 4 268 1 17,18 278 4 83-98 |
| 628 |
259 0 5 269 2 19-22 279 4 99-114 |
| 629 |
260 0 6 270 2 23-26 280 4 115-130 |
| 630 |
261 0 7 271 2 27-30 281 5 131-162 |
| 631 |
262 0 8 272 2 31-34 282 5 163-194 |
| 632 |
263 0 9 273 3 35-42 283 5 195-226 |
| 633 |
264 0 10 274 3 43-50 284 5 227-257 |
| 634 |
265 1 11,12 275 3 51-58 285 0 258 |
| 635 |
266 1 13,14 276 3 59-66 |
| 636 |
|
| 637 |
The extra bits should be interpreted as a machine integer |
| 638 |
stored with the most-significant bit first, e.g., bits 1110 |
| 639 |
represent the value 14. |
| 640 |
|
| 641 |
Extra Extra Extra |
| 642 |
Code Bits Dist Code Bits Dist Code Bits Distance |
| 643 |
---- ---- ---- ---- ---- ------ ---- ---- -------- |
| 644 |
0 0 1 10 4 33-48 20 9 1025-1536 |
| 645 |
1 0 2 11 4 49-64 21 9 1537-2048 |
| 646 |
2 0 3 12 5 65-96 22 10 2049-3072 |
| 647 |
3 0 4 13 5 97-128 23 10 3073-4096 |
| 648 |
4 1 5,6 14 6 129-192 24 11 4097-6144 |
| 649 |
5 1 7,8 15 6 193-256 25 11 6145-8192 |
| 650 |
6 2 9-12 16 7 257-384 26 12 8193-12288 |
| 651 |
7 2 13-16 17 7 385-512 27 12 12289-16384 |
| 652 |
8 3 17-24 18 8 513-768 28 13 16385-24576 |
| 653 |
9 3 25-32 19 8 769-1024 29 13 24577-32768 |
| 654 |
|
| 655 |
3.2.6. Compression with fixed Huffman codes (BTYPE=01) |
| 656 |
|
| 657 |
The Huffman codes for the two alphabets are fixed, and are not |
| 658 |
represented explicitly in the data. The Huffman code lengths |
| 659 |
for the literal/length alphabet are: |
| 660 |
|
| 661 |
Lit Value Bits Codes |
| 662 |
--------- ---- ----- |
| 663 |
0 - 143 8 00110000 through |
| 664 |
10111111 |
| 665 |
144 - 255 9 110010000 through |
| 666 |
111111111 |
| 667 |
256 - 279 7 0000000 through |
| 668 |
0010111 |
| 669 |
280 - 287 8 11000000 through |
| 670 |
11000111 |
| 671 |
|
| 672 |
|
| 673 |
|
| 674 |
Deutsch Informational [Page 12] |
| 675 |
|
| 676 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 677 |
|
| 678 |
|
| 679 |
The code lengths are sufficient to generate the actual codes, |
| 680 |
as described above; we show the codes in the table for added |
| 681 |
clarity. Literal/length values 286-287 will never actually |
| 682 |
occur in the compressed data, but participate in the code |
| 683 |
construction. |
| 684 |
|
| 685 |
Distance codes 0-31 are represented by (fixed-length) 5-bit |
| 686 |
codes, with possible additional bits as shown in the table |
| 687 |
shown in Paragraph 3.2.5, above. Note that distance codes 30- |
| 688 |
31 will never actually occur in the compressed data. |
| 689 |
|
| 690 |
3.2.7. Compression with dynamic Huffman codes (BTYPE=10) |
| 691 |
|
| 692 |
The Huffman codes for the two alphabets appear in the block |
| 693 |
immediately after the header bits and before the actual |
| 694 |
compressed data, first the literal/length code and then the |
| 695 |
distance code. Each code is defined by a sequence of code |
| 696 |
lengths, as discussed in Paragraph 3.2.2, above. For even |
| 697 |
greater compactness, the code length sequences themselves are |
| 698 |
compressed using a Huffman code. The alphabet for code lengths |
| 699 |
is as follows: |
| 700 |
|
| 701 |
0 - 15: Represent code lengths of 0 - 15 |
| 702 |
16: Copy the previous code length 3 - 6 times. |
| 703 |
The next 2 bits indicate repeat length |
| 704 |
(0 = 3, ... , 3 = 6) |
| 705 |
Example: Codes 8, 16 (+2 bits 11), |
| 706 |
16 (+2 bits 10) will expand to |
| 707 |
12 code lengths of 8 (1 + 6 + 5) |
| 708 |
17: Repeat a code length of 0 for 3 - 10 times. |
| 709 |
(3 bits of length) |
| 710 |
18: Repeat a code length of 0 for 11 - 138 times |
| 711 |
(7 bits of length) |
| 712 |
|
| 713 |
A code length of 0 indicates that the corresponding symbol in |
| 714 |
the literal/length or distance alphabet will not occur in the |
| 715 |
block, and should not participate in the Huffman code |
| 716 |
construction algorithm given earlier. If only one distance |
| 717 |
code is used, it is encoded using one bit, not zero bits; in |
| 718 |
this case there is a single code length of one, with one unused |
| 719 |
code. One distance code of zero bits means that there are no |
| 720 |
distance codes used at all (the data is all literals). |
| 721 |
|
| 722 |
We can now define the format of the block: |
| 723 |
|
| 724 |
5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286) |
| 725 |
5 Bits: HDIST, # of Distance codes - 1 (1 - 32) |
| 726 |
4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19) |
| 727 |
|
| 728 |
|
| 729 |
|
| 730 |
Deutsch Informational [Page 13] |
| 731 |
|
| 732 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 733 |
|
| 734 |
|
| 735 |
(HCLEN + 4) x 3 bits: code lengths for the code length |
| 736 |
alphabet given just above, in the order: 16, 17, 18, |
| 737 |
0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 |
| 738 |
|
| 739 |
These code lengths are interpreted as 3-bit integers |
| 740 |
(0-7); as above, a code length of 0 means the |
| 741 |
corresponding symbol (literal/length or distance code |
| 742 |
length) is not used. |
| 743 |
|
| 744 |
HLIT + 257 code lengths for the literal/length alphabet, |
| 745 |
encoded using the code length Huffman code |
| 746 |
|
| 747 |
HDIST + 1 code lengths for the distance alphabet, |
| 748 |
encoded using the code length Huffman code |
| 749 |
|
| 750 |
The actual compressed data of the block, |
| 751 |
encoded using the literal/length and distance Huffman |
| 752 |
codes |
| 753 |
|
| 754 |
The literal/length symbol 256 (end of data), |
| 755 |
encoded using the literal/length Huffman code |
| 756 |
|
| 757 |
The code length repeat codes can cross from HLIT + 257 to the |
| 758 |
HDIST + 1 code lengths. In other words, all code lengths form |
| 759 |
a single sequence of HLIT + HDIST + 258 values. |
| 760 |
|
| 761 |
3.3. Compliance |
| 762 |
|
| 763 |
A compressor may limit further the ranges of values specified in |
| 764 |
the previous section and still be compliant; for example, it may |
| 765 |
limit the range of backward pointers to some value smaller than |
| 766 |
32K. Similarly, a compressor may limit the size of blocks so that |
| 767 |
a compressible block fits in memory. |
| 768 |
|
| 769 |
A compliant decompressor must accept the full range of possible |
| 770 |
values defined in the previous section, and must accept blocks of |
| 771 |
arbitrary size. |
| 772 |
|
| 773 |
4. Compression algorithm details |
| 774 |
|
| 775 |
While it is the intent of this document to define the "deflate" |
| 776 |
compressed data format without reference to any particular |
| 777 |
compression algorithm, the format is related to the compressed |
| 778 |
formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below); |
| 779 |
since many variations of LZ77 are patented, it is strongly |
| 780 |
recommended that the implementor of a compressor follow the general |
| 781 |
algorithm presented here, which is known not to be patented per se. |
| 782 |
The material in this section is not part of the definition of the |
| 783 |
|
| 784 |
|
| 785 |
|
| 786 |
Deutsch Informational [Page 14] |
| 787 |
|
| 788 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 789 |
|
| 790 |
|
| 791 |
specification per se, and a compressor need not follow it in order to |
| 792 |
be compliant. |
| 793 |
|
| 794 |
The compressor terminates a block when it determines that starting a |
| 795 |
new block with fresh trees would be useful, or when the block size |
| 796 |
fills up the compressor's block buffer. |
| 797 |
|
| 798 |
The compressor uses a chained hash table to find duplicated strings, |
| 799 |
using a hash function that operates on 3-byte sequences. At any |
| 800 |
given point during compression, let XYZ be the next 3 input bytes to |
| 801 |
be examined (not necessarily all different, of course). First, the |
| 802 |
compressor examines the hash chain for XYZ. If the chain is empty, |
| 803 |
the compressor simply writes out X as a literal byte and advances one |
| 804 |
byte in the input. If the hash chain is not empty, indicating that |
| 805 |
the sequence XYZ (or, if we are unlucky, some other 3 bytes with the |
| 806 |
same hash function value) has occurred recently, the compressor |
| 807 |
compares all strings on the XYZ hash chain with the actual input data |
| 808 |
sequence starting at the current point, and selects the longest |
| 809 |
match. |
| 810 |
|
| 811 |
The compressor searches the hash chains starting with the most recent |
| 812 |
strings, to favor small distances and thus take advantage of the |
| 813 |
Huffman encoding. The hash chains are singly linked. There are no |
| 814 |
deletions from the hash chains; the algorithm simply discards matches |
| 815 |
that are too old. To avoid a worst-case situation, very long hash |
| 816 |
chains are arbitrarily truncated at a certain length, determined by a |
| 817 |
run-time parameter. |
| 818 |
|
| 819 |
To improve overall compression, the compressor optionally defers the |
| 820 |
selection of matches ("lazy matching"): after a match of length N has |
| 821 |
been found, the compressor searches for a longer match starting at |
| 822 |
the next input byte. If it finds a longer match, it truncates the |
| 823 |
previous match to a length of one (thus producing a single literal |
| 824 |
byte) and then emits the longer match. Otherwise, it emits the |
| 825 |
original match, and, as described above, advances N bytes before |
| 826 |
continuing. |
| 827 |
|
| 828 |
Run-time parameters also control this "lazy match" procedure. If |
| 829 |
compression ratio is most important, the compressor attempts a |
| 830 |
complete second search regardless of the length of the first match. |
| 831 |
In the normal case, if the current match is "long enough", the |
| 832 |
compressor reduces the search for a longer match, thus speeding up |
| 833 |
the process. If speed is most important, the compressor inserts new |
| 834 |
strings in the hash table only when no match was found, or when the |
| 835 |
match is not "too long". This degrades the compression ratio but |
| 836 |
saves time since there are both fewer insertions and fewer searches. |
| 837 |
|
| 838 |
|
| 839 |
|
| 840 |
|
| 841 |
|
| 842 |
Deutsch Informational [Page 15] |
| 843 |
|
| 844 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 845 |
|
| 846 |
|
| 847 |
5. References |
| 848 |
|
| 849 |
[1] Huffman, D. A., "A Method for the Construction of Minimum |
| 850 |
Redundancy Codes", Proceedings of the Institute of Radio |
| 851 |
Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101. |
| 852 |
|
| 853 |
[2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data |
| 854 |
Compression", IEEE Transactions on Information Theory, Vol. 23, |
| 855 |
No. 3, pp. 337-343. |
| 856 |
|
| 857 |
[3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources, |
| 858 |
available in ftp://ftp.uu.net/pub/archiving/zip/doc/ |
| 859 |
|
| 860 |
[4] Gailly, J.-L., and Adler, M., GZIP documentation and sources, |
| 861 |
available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/ |
| 862 |
|
| 863 |
[5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix |
| 864 |
encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169. |
| 865 |
|
| 866 |
[6] Hirschberg and Lelewer, "Efficient decoding of prefix codes," |
| 867 |
Comm. ACM, 33,4, April 1990, pp. 449-459. |
| 868 |
|
| 869 |
6. Security Considerations |
| 870 |
|
| 871 |
Any data compression method involves the reduction of redundancy in |
| 872 |
the data. Consequently, any corruption of the data is likely to have |
| 873 |
severe effects and be difficult to correct. Uncompressed text, on |
| 874 |
the other hand, will probably still be readable despite the presence |
| 875 |
of some corrupted bytes. |
| 876 |
|
| 877 |
It is recommended that systems using this data format provide some |
| 878 |
means of validating the integrity of the compressed data. See |
| 879 |
reference [3], for example. |
| 880 |
|
| 881 |
7. Source code |
| 882 |
|
| 883 |
Source code for a C language implementation of a "deflate" compliant |
| 884 |
compressor and decompressor is available within the zlib package at |
| 885 |
ftp://ftp.uu.net/pub/archiving/zip/zlib/. |
| 886 |
|
| 887 |
8. Acknowledgements |
| 888 |
|
| 889 |
Trademarks cited in this document are the property of their |
| 890 |
respective owners. |
| 891 |
|
| 892 |
Phil Katz designed the deflate format. Jean-Loup Gailly and Mark |
| 893 |
Adler wrote the related software described in this specification. |
| 894 |
Glenn Randers-Pehrson converted this document to RFC and HTML format. |
| 895 |
|
| 896 |
|
| 897 |
|
| 898 |
Deutsch Informational [Page 16] |
| 899 |
|
| 900 |
RFC 1951 DEFLATE Compressed Data Format Specification May 1996 |
| 901 |
|
| 902 |
|
| 903 |
9. Author's Address |
| 904 |
|
| 905 |
L. Peter Deutsch |
| 906 |
Aladdin Enterprises |
| 907 |
203 Santa Margarita Ave. |
| 908 |
Menlo Park, CA 94025 |
| 909 |
|
| 910 |
Phone: (415) 322-0103 (AM only) |
| 911 |
FAX: (415) 322-1734 |
| 912 |
EMail: <ghost@aladdin.com> |
| 913 |
|
| 914 |
Questions about the technical content of this specification can be |
| 915 |
sent by email to: |
| 916 |
|
| 917 |
Jean-Loup Gailly <gzip@prep.ai.mit.edu> and |
| 918 |
Mark Adler <madler@alumni.caltech.edu> |
| 919 |
|
| 920 |
Editorial comments on this specification can be sent by email to: |
| 921 |
|
| 922 |
L. Peter Deutsch <ghost@aladdin.com> and |
| 923 |
Glenn Randers-Pehrson <randeg@alumni.rpi.edu> |
| 924 |
|
| 925 |
|
| 926 |
|
| 927 |
|
| 928 |
|
| 929 |
|
| 930 |
|
| 931 |
|
| 932 |
|
| 933 |
|
| 934 |
|
| 935 |
|
| 936 |
|
| 937 |
|
| 938 |
|
| 939 |
|
| 940 |
|
| 941 |
|
| 942 |
|
| 943 |
|
| 944 |
|
| 945 |
|
| 946 |
|
| 947 |
|
| 948 |
|
| 949 |
|
| 950 |
|
| 951 |
|
| 952 |
|
| 953 |
|
| 954 |
Deutsch Informational [Page 17] |
| 955 |
|