I am learning the following spec by analyzing a sample of a file
compressed with gzip.
Network Working Group P. Deutsch
Request for Comments: 1951 Aladdin Enterprises
Category: Informational May 1996
DEFLATE Compressed Data Format Specification version 1.3
Full hex data of compressed file and original file are both given at
bottom of this post.
I think the beginning of the compressed block is:
6d90 5d6b 8330 1486 afcd
Binary:
0110 1101 1001 0000 0101 1101 0110 1011
1000 0011 0011 0000 0001 0100 1000 0110
1010 1111 1100 1101
Is the above analysis correct? If that's correct, then I have some
guesses of the HLIT HDIST HCLEN and CL codes data.
The spec defines:
(HCLEN + 4) x 3 bits: code lengths for the code length
alphabet given just above, in the order: 16, 17, 18,
0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
I think it might be:
HLIT: 10110 (# LL codes - 257 = 22) 279 LL codes
HDIST: 00001 (# D codes -1 = 1) 2 D codes
HCLEN: 0011 (# CL codes -4 = 3) 7 CL codes
3 x 7 = 21 bits of CL codes should be:
011 101 011 010 110 110 000 (3 5 3 2 6 6 0) or
110 101 110 010 011 011 000 (6 5 6 2 3 3 0).
I don't know which is correct.
Then the literal/length table ll is created as:
ll = {2, 0, 0, 0, 0, 0, 0, 6, 6, 0, 0, 0, 0, 0, 0, 0, 3, 5, 3, ...
(not defined) ... } or
ll = {2, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 6, 5, 6, ...
(not defined) ... }.
I don't know which is correct.
279 code lengths for LL alhabet.
2 code lengths for distance alphabet.
The actual compressed data of the block, encoded using the literal/
length and distance Huffman codes.
The literal/length symbol 256 (end of data), encoded using the literal/
length Huffman code.
It might also be:
HLIT: 01101 (# LL codes - 257 = 13) 270 LL codes
HDIST: 10000 (# D codes -1 = 16) 17 D codes
HCLEN: 1100 (# CL codes -4 = 12) 16 CL codes
3 x 16 = 48 bits of CL codes should be:
011 101 011 010 110 110 000 010
000 110 000 101 000 011 000 111
( 3 5 3 2 6 6 0 2 0 6 0 5 0 3 0 3 ) or
110 101 110 010 011 011 000 010
000 011 000 101 000 110 000 110
( 6 5 6 2 3 3 0 2 0 3 0 5 0 6 0 6 ).
I don't know which is correct.
Then the literal/length table ll is created as:
ll = {2, 0, 3, 3, 5, 6, 2, 6, 6, 0, 0, 0, 0, 0, 0, 0, 3, 5, 3, ...
(not defined) ... } or
ll = {2, 0, 6, 6, 5, 3, 2, 3, 3, 0, 0, 0, 0, 0, 0, 0, 6, 5, 6, ...
(not defined) ... }.
I don't know which is correct.
270 code lengths for LL alhabet.
17 code lengths for distance alphabet.
The actual compressed data of the block, encoded using the literal/
length and distance Huffman codes.
The literal/length symbol 256 (end of data), encoded using the literal/
length Huffman code.
Which analysis is correct? Would you please give some idea?
Thanks.
Compressed file:
0000000: 1f8b 0808 35fd a64a 0003 4c56 4c00 6d90 ....5..J..LVL.m.
0000010: 5d6b 8330 1486 afcd af38 735e cc0d c6f0 ]k.0.....8s^....
0000020: b2ec 266a 58c3 1294 7c94 ca18 a514 0b16 ..&jX...|.......
0000030: fc40 6d19 88ff 7d89 9375 6ebb 4ace 739e .@m...}..un.J.s.
0000040: f372 92db 1be8 2f47 54ed cbbc 6bf6 871c .r..../GT...k...
0000050: 8ab2 a9db dec2 d5ea 1ea1 e964 384b b482 ...........d8K..
0000060: 50f3 3400 2faf 2e77 86ec a6d2 5f2a 2956 P.4./..w...._*)V
0000070: ebab 612b 7fd9 1794 6391 fd50 0cf0 ff44 ..a+....c..P...D
0000080: 04cb 8ce0 df90 6099 6257 d910 1162 4539 ......`.bW...bE9
0000090: 0c68 7665 2615 e1f0 124b 4a97 2c98 612c .hve&....KJ.,.a,
00000a0: 2210 446a a624 c458 e110 4b02 2edb b0c7 ".Dj.$.X..K.....
00000b0: fadc bb80 6534 4b52 f369 7741 d244 a82f ....e4KR.iwA.D./
00000c0: a53b 97ee d4e4 784b b9e6 df49 9831 3422 .;....xK...I.14"
00000d0: 74ac 5b18 babc 8702 9e46 18bc 029e af5b t.[......F.....[
00000e0: 33ca a9f2 0d2e aa43 0b85 b920 c7ba 2778 3......C... ..'x
00000f0: cb3f 9a16 8cfd f0fb bbdf 9123 3423 d19a .?.........#4#..
0000100: 44af b04d c4ce 4866 cc31 ef32 1211 76c8 D..M..Hf.1.2..v.
0000110: 60f0 4ec8 19ed 069f e68c 2d89 e101 0000 `.N.......-.....
Original file:
0000000: 2321 2074 7666 0a6e 616d 6573 7061 6365 #! tvf.namespace
0000010: 2069 6d70 6f72 7420 7476 663a 3a2a 0a0a import tvf::*..
0000020: 7476 663a 3a4c 4159 4f55 5420 4255 4d50 tvf::LAYOUT BUMP
0000030: 3220 2465 6e76 284c 4159 5f42 554d 5032 2 $env(LAY_BUMP2
0000040: 290a 0a74 7666 3a3a 4c41 594f 5554 2050 )..tvf::LAYOUT P
0000050: 4154 4820 2465 6e76 284c 4159 5f50 4154 ATH $env(LAY_PAT
0000060: 4829 0a74 7666 3a3a 4c41 594f 5554 2050 H).tvf::LAYOUT P
0000070: 5249 4d41 5259 2024 656e 7628 4c41 595f RIMARY $env(LAY_
0000080: 5052 494d 290a 7476 663a 3a4c 4159 4f55 PRIM).tvf::LAYOU
0000090: 5420 5041 5448 3220 2465 6e76 284c 4159 T PATH2 $env(LAY
00000a0: 5f50 4154 4832 290a 7476 663a 3a4c 4159 _PATH2).tvf::LAY
00000b0: 4f55 5420 5052 494d 4152 5932 2024 656e OUT PRIMARY2 $en
00000c0: 7628 4c41 595f 5052 494d 3229 0a0a 5645 v(LAY_PRIM2)..VE
00000d0: 5242 4154 494d 207b 0a4c 4159 4f55 5420 RBATIM {.LAYOUT
00000e0: 5359 5354 454d 2047 4453 4949 0a4c 4159 SYSTEM GDSII.LAY
00000f0: 4f55 5420 5359 5354 454d 3220 4744 5349 OUT SYSTEM2 GDSI
0000100: 490a 4452 4320 5245 5355 4c54 5320 4441 I.DRC RESULTS DA
0000110: 5441 4241 5345 2022 4c56 4c2e 6f75 7422 TABASE "LVL.out"
0000120: 2041 5343 4949 0a44 5243 2053 554d 4d41 ASCII.DRC SUMMA
0000130: 5259 2052 4550 4f52 5420 224c 564c 2e73 RY REPORT "LVL.s
0000140: 756d 220a 4452 4320 4d41 5849 4d55 4d20 um".DRC MAXIMUM
0000150: 5245 5355 4c54 5320 414c 4c0a 7d0a 0a66 RESULTS ALL.}..f
0000160: 6f72 207b 7365 7420 6920 307d 207b 2469 or {set i 0} {$i
0000170: 203c 2024 656e 7628 4c41 595f 4c49 4d49 < $env(LAY_LIMI
0000180: 5429 7d20 7b69 6e63 7220 697d 207b 0a09 T)} {incr i} {..
0000190: 7365 7420 6a20 5b65 7870 7220 2469 202b set j [expr $i +
00001a0: 2024 656e 7628 4c41 595f 4255 4d50 3229 $env(LAY_BUMP2)
00001b0: 5d0a 0952 554c 4543 4845 434b 2058 4f52 ]..RULECHECK XOR
00001c0: 5f24 6920 7b0a 0909 4f55 544c 4159 4552 _$i {...OUTLAYER
00001d0: 2024 6920 584f 5220 246a 0a09 7d0a 7d0a $i XOR $j..}.}.
00001e0: 0a .
|
|
0
|
|
|
|
Reply
|
chen_zhitao (70)
|
9/11/2009 3:13:08 PM |
|
In article <46610802-97f7-411c-811b-817109e875c5@x6g2000prc.googlegroups.com>,
Kuhl <chen_zhitao@yahoo.com> wrote:
>
>Full hex data of compressed file and original file are both given at
>bottom of this post.
>
>I think the beginning of the compressed block is:
>6d90 5d6b 8330 1486 afcd
>Binary:
>0110 1101 1001 0000 0101 1101 0110 1011
>1000 0011 0011 0000 0001 0100 1000 0110
>1010 1111 1100 1101
Correct so far.
>
>Is the above analysis correct? If that's correct, then I have some
>guesses of the HLIT HDIST HCLEN and CL codes data.
You should mention BFINAL and BTYPE which come first. BFINAL is the first bit
in the stream, found in the least-significant position of the first byte (the
byte with value 6d = 01101101) so BFINAL = 0x6d & 1 = 1. BTYPE is in the next 2
bits. To extract the 2-bit field, you have to take the 2 bits next up from
the least-significant bit just consumed (in the 0x6d & 6 position). Those 2
bits are "10", so BTYPE = 2.
Don't think because you had to read from right-to-left to locate the 2 bits
that you should reverse them. The designer of this format was only half
insane.
>
>The spec defines:
> (HCLEN + 4) x 3 bits: code lengths for the code length
> alphabet given just above, in the order: 16, 17, 18,
> 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
>
>I think it might be:
>
>HLIT: 10110 (# LL codes - 257 = 22) 279 LL codes
Wrong. The 5 bits you need are the upper bits of the first byte, since the
low 3 bits have already been used as BFINAL and BTYPE. And don't reverse
them. HLIT should be 01101, which was your second guess so I'll skip ahead...
>It might also be:
>
>HLIT: 01101 (# LL codes - 257 = 13) 270 LL codes
>HDIST: 10000 (# D codes -1 = 16) 17 D codes
>HCLEN: 1100 (# CL codes -4 = 12) 16 CL codes
Those are all correct.
>
>3 x 16 = 48 bits of CL codes should be:
>011 101 011 010 110 110 000 010
>000 110 000 101 000 011 000 111
>( 3 5 3 2 6 6 0 2 0 6 0 5 0 3 0 3 ) or
Those are backwards.
>110 101 110 010 011 011 000 010
>000 011 000 101 000 110 000 110
>( 6 5 6 2 3 3 0 2 0 3 0 5 0 6 0 6 ).
>I don't know which is correct.
The second version is correct.
HLIT=01101 BTYPE=10 BFINAL=1 HDIST=10000 HCLEN=1100
CL=110 101 110 010 011 011 000 010
000 011 000 101 000 110 000 110
>
>Then the literal/length table ll is created as:
>ll = {2, 0, 3, 3, 5, 6, 2, 6, 6, 0, 0, 0, 0, 0, 0, 0, 3, 5, 3, ...
>(not defined) ... } or
>ll = {2, 0, 6, 6, 5, 3, 2, 3, 3, 0, 0, 0, 0, 0, 0, 0, 6, 5, 6, ...
>(not defined) ... }.
>I don't know which is correct.
The second one looks like a reasonable brief description of the code length
Huffman code. It may help to write it out in a longer form:
Value "0" is encoded with a code of length 2
Value "1" will not occur (code length 0)
Value "2" is encoded with a code of length 6
Value "3" is encoded with a code of length 6
Value "4" is encoded with a code of length 5
Value "5" is encoded with a code of length 3
Value "6" is encoded with a code of length 2
Value "7" is encoded with a code of length 3
Value "8" is encoded with a code of length 3
Value "9" will not occur (code length 0)
Value "10" will not occur (code length 0)
Value "11" will not occur (code length 0)
Value "12" will not occur (code length 0)
Value "13" will not occur (code length 0)
Value "14" will not occur (code length 0)
Value "15" will not occur (code length 0)
Value "16" is encoded with a code of length 6
Value "17" is encoded with a code of length 5
Value "18" is encoded with a code of length 6
Next you have to actually assign the codes, given those lengths. You need
them to decode the next segment. Do you know how to do that?
--
Alan Curry
|
|
0
|
|
|
|
Reply
|
pacman5 (288)
|
9/13/2009 5:11:33 AM
|
|
Hi, Alan:
Thank you for your very clear answer. I will try to continue the next
step. If I come across with any issue, then I will start a new post to
ask the question. Many Thanks.
|
|
0
|
|
|
|
Reply
|
chen_zhitao (70)
|
9/14/2009 3:04:14 AM
|
|
Oops, there's still something that I need help in next step.
In my understanding, the 270 code lengths for LL alphabet and 17 code
lengths for D alphabet should start from the af (except the LSB of af,
which was already in the CL codes) at address 016, and end before the
23 at address 0fb. Like below. Is this correct?
0000010: 5d6b 8330 1486 afcd af38 735e cc0d c6f0 ]k.0.....8s^....
... ...
00000f0: cb3f 9a16 8cfd f0fb bbdf 9123 3423 d19a .?.........#4#..
Then I try to interpret the data in this part based on two ideas.
One idea is the following statement in your answer above, like this:
Value "0" is encoded with a code of length 2
... ...
Value "18" is encoded with a code of length 6
The other idea is the following description in the spec:
0 - 15: Represent code lengths of 0 - 15
16: Copy the previous code length 3 - 6 times.
The next 2 bits indicate repeat length
(0 = 3, ... , 3 = 6)
17: Repeat a code length of 0 for 3 - 10 times.
(3 bits of length)
18: Repeat a code length of 0 for 11 - 138 times
(7 bits of length)
Perhaps my understanding to your answer is incorrect, so that I got
the following issue:
Value "0" is encoded with a code of length 2
I think this means let's use 00 (length=2) to represent 0.
Value "6" is encoded with a code of length 2.
I think this means let's use two bits to represent 6. But two bits
cannot represent 6.
So there must be something wrong in my understanding, and I need a
further explain. Would you please help?
Thanks.
|
|
0
|
|
|
|
Reply
|
chen_zhitao (70)
|
9/14/2009 7:47:17 AM
|
|
In article <03b771a8-6e0e-4f19-8f5b-480f2226abfc@a39g2000pre.googlegroups.com>,
Kuhl <chen_zhitao@yahoo.com> wrote:
>Oops, there's still something that I need help in next step.
>
>In my understanding, the 270 code lengths for LL alphabet and 17 code
>lengths for D alphabet should start from the af (except the LSB of af,
>which was already in the CL codes) at address 016, and end before the
>23 at address 0fb. Like below. Is this correct?
The starting point is correct. We're at byte 0x16 of the file, its value is
0xaf, and the low bit has been used. But the rest of what you say is
suspicious. How do you know where the LL code lengths are going to end? This
next part of the file, the LL code lengths, is encoded using the code length
Huffman code. You know that there are 270 symbols encoded here, but you can't
know how many bits that's going to take, because a Huffman code uses
different amounts of bits for the symbols!
>
>One idea is the following statement in your answer above, like this:
> Value "0" is encoded with a code of length 2
> ... ...
> Value "18" is encoded with a code of length 6
You will need to use that information to build the Huffman decoding tree.
>
>The other idea is the following description in the spec:
> 0 - 15: Represent code lengths of 0 - 15
> 16: Copy the previous code length 3 - 6 times.
> The next 2 bits indicate repeat length
> (0 = 3, ... , 3 = 6)
> 17: Repeat a code length of 0 for 3 - 10 times.
> (3 bits of length)
> 18: Repeat a code length of 0 for 11 - 138 times
> (7 bits of length)
Those instructions are also important, because they tell you what to do with
the decoded symbols coming out of the Huffman tree.
>
>Perhaps my understanding to your answer is incorrect, so that I got
>the following issue:
> Value "0" is encoded with a code of length 2
> I think this means let's use 00 (length=2) to represent 0.
> Value "6" is encoded with a code of length 2.
> I think this means let's use two bits to represent 6. But two bits
>cannot represent 6.
If you think two bits cannot represent 6, you've got a long way to go
understanding Huffman codes. Go back and re-read sections 3.2.1 and 3.2.2 of
RFC1951. Also on the subject of recommended reading, when I was learning this
stuff, the original paper written by Huffman himself was helpful. It's from
1952, so the terminology is different from what we use now, but it's really
good for explaining why the algorithm works. Find a link to that, and other
educational material, at http://en.wikipedia.org/wiki/Huffman_coding
Take your time...
Back already? Here's how to build the Huffman decoder you need to take the
next step. It's sort of hard to keep track of everything because we're
looking at a list of lengths of codes the represent lengths of other codes...
It's easy to get lost. But you've apparently chosen to jump right in learning
the deflate format without studying Huffman in isolation, so this is what you
get for being in a hurry.
I'm going to prefix one set of them with "CLS". Those are the "code length
symbols". They are the output of the first Huffman decoder, and they are used
to build the other 2 Huffman decoders.
Here's the specification of the first Huffman decoder, which should look a
lot like the table in my previous article.
CLS0 is represented by 2 bits
CLS2 is represented by 6 bits
CLS3 is represented by 6 bits
CLS4 is represented by 5 bits
CLS5 is represented by 3 bits
CLS6 is represented by 2 bits
CLS7 is represented by 3 bits
CLS8 is represented by 3 bits
CLS16 is represented by 6 bits
CLS17 is represented by 5 bits
CLS18 is represented by 6 bits
We need to assign codes (bit sequences) matching those lengths.
Note: at this stage, the symbols "CLS0", "CLS18", etc. are just abstract
symbols. They could be called "A", "B", "C" instead and the job would be the
same. We must take the list of lengths {2 6 6 5 3 2 3 3 6 5 6} and transform
it into a list of bit sequences. The algorithm in section 3.2.2 of RFC1951
can be used, or we can do it by hand using the principles:
1. Short codes come first
2. Codes of equal length are numerically consecutive
Doing it by hand is more enlightening. To help in following the "short codes
come first" rule, let's put the list of lengths into increasing order.
{2 6 6 5 3 2 3 3 6 5 6} becomes {2 2 3 3 3 5 5 6 6 6 6}. We'll restore the
original order and match them up with symbols later.
The first code is always a bit sequence composed entirely of zeros. How long
is it? Look at the list of lengths {2 2 3 3 3 5 5 6 6 6 6} and pick the
shortest one: the first code, the shortest code, has length 2. So the bit
sequence for that code is: 00
We need another length=2 code, so we take that 00 and increment it. The next
code is: 01
Now we need a length 3 code. What bit sequence shall we use? Remember,
Huffman is a prefix-free code. We must choose a code that does not have any
preceding codes as a prefix. We can't use 000 or 001, because we've already
decided that 00 is one of the length=2 codes. Likewise 010 and 011 are ruled
out. So we take the lowest available 3-bit value, which is: 100
Take a look at that selection process from another angle. We needed a
sequence of 3 bits, where the first 2 bits were "new". The quick way to do it
(which you'll see in all serious implementations, including the sample code
in the RFC) is to increment the last code of the previous length, and then
append as many zeros as are needed. In this case, we had 00 and 01, so we can
increment the 01 which yields 10, then add an extra zero on the end to get
100, our first length=3 code.
Back to the length list. From {2 2 3 3 3 5 5 6 6 6 6} we've done {2 2 3}, now
we need a couple more 3's. Take the 100 code and increment it to get 101,
that's the next code. Increment it again to get 110, that's the next code. So
far we've got: 00 01 100 101 110 and now the next length in the list is 5.
Jumping from length=3 to length=5 can be handled the same way as the jump
from length=2 to length=3. Take the last code (110), increment it (now it's
111), and append zeros until it has the proper length (11100). That's the
next code.
Hopefully you get the pattern by now. The complete list is:
Length Code
------ ----
2 00
2 01 (= 00 + 1)
3 100 (= 01 + 1, with 0 appended)
3 101 (= 100 + 1)
3 110 (= 101 + 1)
5 11100 (= 110 + 1, with 00 appended)
5 11101 (= 11100 + 1)
6 111100 (= 11101 + 1, with 0 appended)
6 111101 (= 111100 + 1)
6 111110 (= 111101 + 1)
6 111111 (= 111110 + 1)
If you do the math correctly, the last code in the list will always be a
string of 1's. No bits are wasted!
Now take that list of codes and match them up with the "CLS" symbols. Take
the length=2 codes, and write them next to the symbols that require 2-bit
representations. The first one was CLS0, so it gets the 00 code. The second
one was CLS6, so it gets the 01 code. Do the same with the other lengths. At
length 3, CLS5 gets the first code (100), CLS7 gets 101, and CLS8 gets 110.
CLS0 is represented by 2 bits: 00
CLS2 is represented by 6 bits: 111100
CLS3 is represented by 6 bits: 111101
CLS4 is represented by 5 bits: 11100
CLS5 is represented by 3 bits: 100
CLS6 is represented by 2 bits: 01
CLS7 is represented by 3 bits: 101
CLS8 is represented by 3 bits: 110
CLS16 is represented by 6 bits: 111110
CLS17 is represented by 5 bits: 11101
CLS18 is represented by 6 bits: 111111
And now you are ready to read the next piece of the file. You're going to be
building a list of 270 code lengths, using the Huffman table above. When
reading a Huffman code, you don't know ahead of time how many bits are in it
(could be 2, 3, 5, or 6 in this case) so you have to process 1 bit at a time.
When we left off, we were at byte offset 0x16, the next bytes are:
af cd
or in binary:
10101111 11001101
but one bit has already been used, so really we have this input:
1010111- 11001101
The next bit to be read is a 1. What do we do with it? Check the Huffman
table. (In a serious implementation there would be a tree structure; never
mind that for now.) Is "1" a valid code? No. Get another bit. It's another 1.
Now we have "11". That's not a valid code either. Get another bit. It's
another 1. Now we have "111". Still not a valid code. The next bit is a 0,
making "1110", not a valid code. The next bit is a 1, now we have "11101",
and that's one of our codes! It represents CLS17. We have successfully
decoded the first symbol.
Hopefully, working through that, you get a feeling for how a prefix-free code
works. Going a bit at a time, you always know whether you have read a
complete code or not, because no code is a prefix of another code. If we had
"111010" in our list, we wouldn't know whether to interpret "11101" as CLS17,
or grab another bit.
So we got CLS17. What do you do with it? I'll let you try to figure it out.
I've written enough for now.
--
Alan Curry
|
|
0
|
|
|
|
Reply
|
pacman5 (288)
|
9/14/2009 11:41:55 AM
|
|
Thank you so much. :-)
I need some time to read and understand your article. It's a very good
explain.
Thanks and regards.
|
|
0
|
|
|
|
Reply
|
chen_zhitao (70)
|
9/15/2009 1:16:55 PM
|
|
|
5 Replies
82 Views
(page loaded in 0.062 seconds)
|