Analysing actual compressed data of a block with dynamic Huffman coding

  • Follow


I am learning the following spec by analyzing a sample of a file
compressed with gzip.

Network Working Group        P. Deutsch
Request for Comments: 1951    Aladdin Enterprises
Category: Informational        May 1996
DEFLATE Compressed Data Format Specification version 1.3

Full hex data of compressed file and original file are both given at
bottom of this post. I give the part that is already decoded first,
then describe the part that I  am trying to figure out, but have
questions.

After Huffman decoding of creating the code length symbol (CLS) table,
it's found that:
CLS0 is represented by 2 bits: 00
CLS2 is represented by 6 bits: 111100
CLS3 is represented by 6 bits: 111101
CLS4 is represented by 5 bits: 11100
CLS5 is represented by 3 bits: 100
CLS6 is represented by 2 bits: 01
CLS7 is represented by 3 bits: 101
CLS8 is represented by 3 bits: 110
CLS16 is represented by 6 bits: 111110
CLS17 is represented by 5 bits: 11101
CLS18 is represented by 6 bits: 111111

Characters existing in original file (57):

09 0a 20 21 22 23 24 28 29 2a 2b 2e 30 32 3a 3c 41 42 43 44
45 47 48 49 4b 4c 4d 4f 50 52 53 54 55 56 58 59 5b 5d 5f 61
63 65 66 69 6a 6d 6e 6f 70 72 73 74 75 76 78 7b 7d

Literal/Length table (1006 elements, [0-1005]):

[0-255]:
0000000006500000 0000000000000000
4868600086880080 8060000000708000
0566650875085506 6055557076080807
0707067006700666 6066677080060600
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000

[256-511]:
8556860878688800 0005542424304550
0000000000000700 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000006066 6667308600000075
8000000000000000 0000000000000000
0000000000000680 6606608866688837

[512-767]:
0000000876826007 5786080686705667
6006007652750000 3036685000000650
5800007607000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000605706766 7636650887000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000

[768-1005]:
0000000000000000 0000000000000000
0000000000063006 6046706080000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000065000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
00006060600670

Distance table (40 elements, [0-39]):

8 6 5 6 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 6 0 5 0 6 7 0 8 6

I am not sure whether the Literal/Length table and Distance table are
correct. At least, [0-255] of Literal/Length table seems to be
correct, because the non-0 elements  match exactly with the characters
existing in original file. You guys are experts. If there's something
wrong apparently in these two tables, please point out.

Next part of the compressed file starts at address 088 hex, which is
the byte 62. But questions come now.

1. In this case, it happens that the Distance table ends at an entire
byte, so there's no question that the actual compressed data starts at
next byte. But if the distance table ends inside a byte, then where
does the actual compressed data start?

2. The first byte of the original file is 23. So in my understanding,
the first element in this actual compressed data should also be 23, or
23 stored with different number of bits. But the data here is like 62
57, which does not look like 23 in any way. So How is the actual
compressed data decoded?

Thanks.

Compressed file:

0000000: 1f8b 0808 35fd a64a 0003 4c56 4c00 6d90  ....5..J..LVL.m.
0000010: 5d6b 8330 1486 afcd af38 735e cc0d c6f0  ]k.0.....8s^....
0000020: b2ec 266a 58c3 1294 7c94 ca18 a514 0b16  ..&jX...|.......
0000030: fc40 6d19 88ff 7d89 9375 6ebb 4ace 739e  .@m...}..un.J.s.
0000040: f372 92db 1be8 2f47 54ed cbbc 6bf6 871c  .r..../GT...k...
0000050: 8ab2 a9db dec2 d5ea 1ea1 e964 384b b482  ...........d8K..
0000060: 50f3 3400 2faf 2e77 86ec a6d2 5f2a 2956  P.4./..w...._*)V
0000070: ebab 612b 7fd9 1794 6391 fd50 0cf0 ff44  ..a+....c..P...D
0000080: 04cb 8ce0 df90 6099 6257 d910 1162 4539  ......`.bW...bE9
0000090: 0c68 7665 2615 e1f0 124b 4a97 2c98 612c  .hve&....KJ.,.a,
00000a0: 2210 446a a624 c458 e110 4b02 2edb b0c7  ".Dj.$.X..K.....
00000b0: fadc bb80 6534 4b52 f369 7741 d244 a82f  ....e4KR.iwA.D./
00000c0: a53b 97ee d4e4 784b b9e6 df49 9831 3422  .;....xK...I.14"
00000d0: 74ac 5b18 babc 8702 9e46 18bc 029e af5b  t.[......F.....[
00000e0: 33ca a9f2 0d2e aa43 0b85 b920 c7ba 2778  3......C... ..'x
00000f0: cb3f 9a16 8cfd f0fb bbdf 9123 3423 d19a  .?.........#4#..
0000100: 44af b04d c4ce 4866 cc31 ef32 1211 76c8  D..M..Hf.1.2..v.
0000110: 60f0 4ec8 19ed 069f e68c 2d89 e101 0000  `.N.......-.....

Original file:

0000000: 2321 2074 7666 0a6e 616d 6573 7061 6365  #! tvf.namespace
0000010: 2069 6d70 6f72 7420 7476 663a 3a2a 0a0a   import tvf::*..
0000020: 7476 663a 3a4c 4159 4f55 5420 4255 4d50  tvf::LAYOUT BUMP
0000030: 3220 2465 6e76 284c 4159 5f42 554d 5032  2 $env(LAY_BUMP2
0000040: 290a 0a74 7666 3a3a 4c41 594f 5554 2050  )..tvf::LAYOUT P
0000050: 4154 4820 2465 6e76 284c 4159 5f50 4154  ATH $env(LAY_PAT
0000060: 4829 0a74 7666 3a3a 4c41 594f 5554 2050  H).tvf::LAYOUT P
0000070: 5249 4d41 5259 2024 656e 7628 4c41 595f  RIMARY $env(LAY_
0000080: 5052 494d 290a 7476 663a 3a4c 4159 4f55  PRIM).tvf::LAYOU
0000090: 5420 5041 5448 3220 2465 6e76 284c 4159  T PATH2 $env(LAY
00000a0: 5f50 4154 4832 290a 7476 663a 3a4c 4159  _PATH2).tvf::LAY
00000b0: 4f55 5420 5052 494d 4152 5932 2024 656e  OUT PRIMARY2 $en
00000c0: 7628 4c41 595f 5052 494d 3229 0a0a 5645  v(LAY_PRIM2)..VE
00000d0: 5242 4154 494d 207b 0a4c 4159 4f55 5420  RBATIM {.LAYOUT
00000e0: 5359 5354 454d 2047 4453 4949 0a4c 4159  SYSTEM GDSII.LAY
00000f0: 4f55 5420 5359 5354 454d 3220 4744 5349  OUT SYSTEM2 GDSI
0000100: 490a 4452 4320 5245 5355 4c54 5320 4441  I.DRC RESULTS DA
0000110: 5441 4241 5345 2022 4c56 4c2e 6f75 7422  TABASE "LVL.out"
0000120: 2041 5343 4949 0a44 5243 2053 554d 4d41   ASCII.DRC SUMMA
0000130: 5259 2052 4550 4f52 5420 224c 564c 2e73  RY REPORT "LVL.s
0000140: 756d 220a 4452 4320 4d41 5849 4d55 4d20  um".DRC MAXIMUM
0000150: 5245 5355 4c54 5320 414c 4c0a 7d0a 0a66  RESULTS ALL.}..f
0000160: 6f72 207b 7365 7420 6920 307d 207b 2469  or {set i 0} {$i
0000170: 203c 2024 656e 7628 4c41 595f 4c49 4d49   < $env(LAY_LIMI
0000180: 5429 7d20 7b69 6e63 7220 697d 207b 0a09  T)} {incr i} {..
0000190: 7365 7420 6a20 5b65 7870 7220 2469 202b  set j [expr $i +
00001a0: 2024 656e 7628 4c41 595f 4255 4d50 3229   $env(LAY_BUMP2)
00001b0: 5d0a 0952 554c 4543 4845 434b 2058 4f52  ]..RULECHECK XOR
00001c0: 5f24 6920 7b0a 0909 4f55 544c 4159 4552  _$i {...OUTLAYER
00001d0: 2024 6920 584f 5220 246a 0a09 7d0a 7d0a   $i XOR $j..}.}.
00001e0: 0a                                       .
0
Reply chen_zhitao (70) 9/23/2009 11:59:29 AM

In article <4cae53ba-0dfc-430a-8371-77316c1359fd@z4g2000prh.googlegroups.com>,
Kuhl  <chen_zhitao@yahoo.com> wrote:
>
>Literal/Length table (1006 elements, [0-1005]):

Where did you get the idea that there are 1006 elements in this table? It was
previously established (in the other thread) that the number of lit/len codes
in this file is 270. You got the first 270 elements decoded correctly. After
that you should have moved on to the next step.

>
>[0-255]:
>0000000006500000 0000000000000000
>4868600086880080 8060000000708000
>0566650875085506 6055557076080807
>0707067006700666 6066677080060600
>0000000000000000 0000000000000000
>0000000000000000 0000000000000000
>0000000000000000 0000000000000000
>0000000000000000 0000000000000000
>
>[256-511]:
>85568608786888

Stop there. That's 270. Now you've got a table of 270 code lengths for the
lit/len table. Move on to the next step.

-- 
Alan Curry
0
Reply pacman5 (288) 9/24/2009 12:30:27 AM


Hi, thanks.

I got a misunderstanding about the 270 of literal / length data
before. I though it mean that I should read 270 numbers from the file,
while each is encoded with Huffman coding. But according to the ZLIB
spec, some of these numbers (16, 17, 18) are extracted to be multiple
elements. So the 270 numbers became 1006 elements.

Since the correct understanding is that there should be 270 elements
in this LL table, I think the next 17 elements should be the Distance
table. So I will try to match the following part with the actual
compressed data.

Thanks and regards.
0
Reply chen_zhitao (70) 9/24/2009 1:10:42 PM

Literal/Length table (270 elements, [0-269]):
[0-255]: (Ignored here.)
[256-269]: 85568608786888

Distance table (17 elements, [0-16]): 00000 55424 24304 55

Now we are at address 042 (content is 92 hex), while address 043 is db
hex. Bit 0-7 of address 042 is already used by distance table. Only
the MSB (1) is remaining.

If we continue to get bits one by one, it will be 111011011 ... It
does not look like 23 hex at all. Even if we follow the static Huffman
coding like below, it's still a number between 144 (90 hex) and 255
(ff hex). It's by no means 23 hex.
144 - 255  9  110010000 through 111111111

What's the correct way to decode this?

Thanks.
0
Reply chen_zhitao (70) 9/25/2009 1:19:00 AM

In article <0e96ef74-84c4-4c14-84d6-a16dcc844125@p10g2000prm.googlegroups.com>,
Kuhl  <chen_zhitao@yahoo.com> wrote:
>Literal/Length table (270 elements, [0-269]):
>[0-255]: (Ignored here.)
>[256-269]: 85568608786888
>
>Distance table (17 elements, [0-16]): 00000 55424 24304 55
>
>Now we are at address 042 (content is 92 hex), while address 043 is db
>hex. Bit 0-7 of address 042 is already used by distance table. Only
>the MSB (1) is remaining.

I think you meant bit 0-6 are already used. 0-7 would be all of them.
Otherwise, you're doing good so far.

>
>If we continue to get bits one by one, it will be 111011011 ... It

Those are the correct bits, so you're in the right place. But you aren't
ready to do anything with those bits yet.

>does not look like 23 hex at all. Even if we follow the static Huffman
>coding like below, it's still a number between 144 (90 hex) and 255
>(ff hex). It's by no means 23 hex.
>144 - 255  9  110010000 through 111111111

You're way off. The static Huffman codes are not used in a dynamic Huffman
code block (BTYPE=10).

>
>What's the correct way to decode this?

You just finished decoding the 270-element array of lit/len code lengths, and
the 17-element array of distance code lengths. Use them!

You have to build the list of code-to-symbol mappings for both the lit/len
alphabet and the distance alphabet. Remember how we had the list of lengths
for the Code Length code:

  6 5 6 2 3 3 0 2 0 3 0 5 0 6 0 6

and turned it into a table like this:

  CLS0 is represented by 00
  CLS2 is represented by 111100
  CLS3 is represented by 111101
  CLS4 is represented by 11100
  CLS5 is represented by 100
  CLS6 is represented by 01
  CLS7 is represented by 101
  CLS8 is represented by 110
  CLS16 is represented by 111110
  CLS17 is represented by 11101
  CLS18 is represented by 111111

Well now take this list of lengths (270 of them):
  0 0 0 0 0 0 0 0 0 6 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 4 8 6 8 6 0 0 0 8 6 8 8 0 0 8 0 8 0 6 0 0 0 0 0 0 0 7 0
  8 0 0 0 0 5 6 6 6 5 0 8 7 5 0 8 5 5 0 6 6 0 5 5 5 5 7 0 7 6
  0 8 0 8 0 7 0 7 0 7 0 6 7 0 0 6 7 0 0 6 6 6 6 0 6 6 6 7 7 0
  8 0 0 6 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 5 5 6 8 6 0 8 7 8 6 8 8 8
and follow the same procedure. That'll create the decoding table for lit/len
symbols.

Then take this list of lengths (17 of them):
  0 0 0 0 0 5 5 4 2 4 2 4 3 0 4 5 5
and do it again, to create the decoding table for distance symbols.

Then you'll be ready to start reading the data bits.

-- 
Alan Curry
0
Reply pacman5 (288) 9/25/2009 4:07:24 AM

I am reading the bits from the real compressed data. Let's analyze the
initial part.

-- Original file --
#! tvf
namespace import tvf::*

tvf::LAYOUT BUMP2 $env(LAY_BUMP2)

.... ... (Ignored here.)

-- End of original file --

-- Hex display of the original file --
0000000: 2321 2074 7666 0a6e 616d 6573 7061 6365  #! tvf.namespace
0000010: 2069 6d70 6f72 7420 7476 663a 3a2a 0a0a   import tvf::*..
0000020: 7476 663a 3a4c 4159 4f55 5420 4255 4d50  tvf::LAYOUT BUMP
0000030: 3220 2465 6e76 284c 4159 5f42 554d 5032  2 $env(LAY_BUMP2
0000040: 290a 0a74 7666 3a3a 4c41 594f 5554 2050  )..tvf::LAYOUT P

.... ... (Ignored here.)

-- End of hex display of original file --


Distance Table
No.	L	code
0	0
1	0
2	0
3	0
4	0
5	5	11100
6	5	11101
7	4	1010
8	2	00
9	4	1011
10	2	01
11	4	1100
12	3	100
13	0
14	4	1101
15	5	11110
16	5	11111


In the compressed file, we arrived at the following segment:
0000050: 8ab2 a9db dec2 d5ea 1ea1 e964 384b b482

I already matched the following characters successfully:
#! tvf
namespace import

The last character already matched (the t in import) is MSB of address
053 (1 in byte db) and lower 5 bits of address 054 (01111 in byte de),
which is 101111, that represents 74 hex (t) in the Huffman coding of
this table.

The next number I read from compressed data is 01101, which represents
258 in this Huffman table. I found in the spec:
258, Extra bits 0, Length 4. And following this should be the backward
distance.

In my understanding, the length 4 means a 4-byte-string in the already
decoded data should be repeated here. I think it should be " tvf", and
backward distance means the relative location that this string starts.

But I cannot read the backward distance correctly. I think it should
be how many characters need to be counted back from current location.
I counted it manually, and got 21. But the distance table has only 17
items [0-16]. What's more, the search window size of gunzip is said to
be 32768, and the spec also mentioned that the backward distance
should be 1-32768. My sample is a small file. For a large file, I
don't know how the distance is represented either.

What is the mistake in my analysis above? And what's the correct way
to do it?

Thanks.
0
Reply chen_zhitao (70) 9/26/2009 1:14:41 AM

In article <5da14549-256a-43a2-ba51-c5881021d34d@y10g2000prg.googlegroups.com>,
Kuhl  <chen_zhitao@yahoo.com> wrote:
[snip]
>
>The next number I read from compressed data is 01101, which represents
>258 in this Huffman table. I found in the spec:
>258, Extra bits 0, Length 4. And following this should be the backward
>distance.
>
>In my understanding, the length 4 means a 4-byte-string in the already
>decoded data should be repeated here. I think it should be " tvf", and
>backward distance means the relative location that this string starts.

Everything above here is correct. You have a lit/len code 258 which means
the next 4 bytes will be a repeat of a string that appeared previously.

>
>But I cannot read the backward distance correctly. I think it should
>be how many characters need to be counted back from current location.
>I counted it manually, and got 21. But the distance table has only 17
>items [0-16]. What's more, the search window size of gunzip is said to
>be 32768, and the spec also mentioned that the backward distance
>should be 1-32768. My sample is a small file. For a large file, I
>don't know how the distance is represented either.

Next you need to use the distance Huffman code to retrieve a distance symbol
from the input. The distance symbol is a value from 0-16, but the distance
symbol is not the distance. The following table (condensed from what you
posted) is correct, but would be clearer if the left column said "DS5",
"DS6", and so on... to make it clear that these are Distance Symbols, not
distances.

>Distance Table
>No.	L	code
>5	5	11100
>6	5	11101
>7	4	1010
>8	2	00
>9	4	1011
>10	2	01
>11	4	1100
>12	3	100
>14	4	1101
>15	5	11110
>16	5	11111

Take the next few bits in the input, match them up to one of the codes in the
above table, and you get a distance symbol. Then you use this table (copied
directly from RFC1951):

                  Extra           Extra               Extra
             Code Bits Dist  Code Bits   Dist     Code Bits Distance
             ---- ---- ----  ---- ----  ------    ---- ---- --------
               0   0    1     10   4     33-48    20    9   1025-1536
               1   0    2     11   4     49-64    21    9   1537-2048
               2   0    3     12   5     65-96    22   10   2049-3072
               3   0    4     13   5     97-128   23   10   3073-4096
               4   1   5,6    14   6    129-192   24   11   4097-6144
               5   1   7,8    15   6    193-256   25   11   6145-8192
               6   2   9-12   16   7    257-384   26   12  8193-12288
               7   2  13-16   17   7    385-512   27   12 12289-16384
               8   3  17-24   18   8    513-768   28   13 16385-24576
               9   3  25-32   19   8   769-1024   29   13 24577-32768

to turn it into a distance. Distances up to 32768 are represented using codes
0-29, thanks to the extra bits that can follow a distance code. In your
sample file the distance codes only go up to 16, but that's enough for
distances up to 384.

-- 
Alan Curry
0
Reply pacman5 (288) 9/26/2009 3:49:50 AM

The entire compressed data block is already matched with the original
file now. But there are several new questions.

Reference 1950 mentioned that the 4 bytes following the real
compressed data is the checksum value of the uncompressed data
computed according to Adler-32 algorithm. In this sample case, the
compressed data happens to end at byte boundary. There are exectly
last 8 bytes in the compressed file remaining now. They are address
118-11F, while the real data is e68c 2d89 e101 0000.

1. If the real compressed data does not end at byte boundary, then
where does the checksum value start?

2. I think that e68c 2d89 is the checksum value, is it correct? The
description in reference 1950 is not detailed enough. I have no access
to the journals either. Is there any document available online that
can be accessed easily?

3. What's the data following the checksum, such like e101 0000 in this
sample? And if the checksum does not start at byte boundary, then
where does this part start?

Thanks.
0
Reply chen_zhitao (70) 9/27/2009 2:34:10 AM

In article <a8f45f7c-b2b7-4ec2-9688-0ffb447a3e0b@o9g2000prg.googlegroups.com>,
Kuhl  <chen_zhitao@yahoo.com> wrote:
>The entire compressed data block is already matched with the original
>file now. But there are several new questions.
>
>Reference 1950 mentioned that the 4 bytes following the real
>compressed data is the checksum value of the uncompressed data
>computed according to Adler-32 algorithm. In this sample case, the

Wrong document. You have a gzip file, not zlib. See RFC1952. The last 8 bytes
are a gzip trailer.

>compressed data happens to end at byte boundary. There are exectly
>last 8 bytes in the compressed file remaining now. They are address
>118-11F, while the real data is e68c 2d89 e101 0000.
>
>1. If the real compressed data does not end at byte boundary, then
>where does the checksum value start?

RFC1952 says the gzip format is byte-oriented, so the trailer must start at
the next bytes boundary.

>
>2. I think that e68c 2d89 is the checksum value, is it correct? The
>description in reference 1950 is not detailed enough. I have no access
>to the journals either. Is there any document available online that
>can be accessed easily?

I guess you'll have trouble understanding gzip's CRC32 checksum then. It's
more complicated than zlib's adler32. But there's sample code, so just study
that.

>
>3. What's the data following the checksum, such like e101 0000 in this
>sample? And if the checksum does not start at byte boundary, then
>where does this part start?

little-endian 0x000001e1 == 481, your original file's size. ISIZE in RFC1952

-- 
Alan Curry
0
Reply pacman5 (288) 9/27/2009 4:52:12 AM

OK, I see the C code of CRC32 in document 1952 now. I will study it.
Since this is another topic, if I come across with any trouble in
understanding CRC32, I will start a new thread on this board. Thanks
for your great help. :-)
0
Reply chen_zhitao (70) 9/27/2009 5:13:26 AM

9 Replies
51 Views

(page loaded in 0.125 seconds)

Similiar Articles:




7/26/2012 5:35:24 PM


Reply: