Analysis of dataset in the large text compression benchmark

  • Follow


One problem in the large text compression benchmark is that the entropy
of the data set is not known.  Shannon and others estimated the entropy
of English text by having human subjects guess successive characters in
text reduced to a 27 character alphabet (a-z and space).  The problem
with the data set (enwik9) is that there is a lot of "junk" like XML
markup, tables, links, formatting codes, etc.  This doesn't matter as
long as the extra data is not hard to model; the best natural language
model will still win.  However, it would still be nice to know where
the finish line is.

I wrote a program to filter out the junk and generate readable English
text using a 27 character alphabet (as readable as you can get without
punctuation).  Essentially I removed any text that would not be visible
on the Wikipedia website.  For example, I removed hypertext links but
retained the anchor text.  I also removed non-English text, tables and
markup, but retained image captions.  When this is reduced to only
letters, digits, and nonconsecutive spaces, the resulting text is about
70% of the original size.  This is expanded to 74% after spelling out
digits (a common technique in speech models).

Then I compared compression ratios for 100 MB of clean text with enwik8
(also 100 MB) using 25 different compressors.  Popular compressors like
zip, gzip and bzip2 compress clean text about 10% smaller than the
original text.  High end compressors like paq8h, durilca, slim, and
ppmonstr compress about 1% to 5% smaller.  (Did not test WinRK).
Details are here:
http://cs.fit.edu/~mmahoney/compression/text.html#data

Now we have a rough estimate of the entropy of enwik9.  Shannon
estimated 0.6 to 1.3 bpc for 27 character English.  Cover and King
estimated an upper bound of 1.3 bpc.  It is likely that the entropy of
enwik9 is at most 5% more.  I plan to test on the "cleaned" enwik9 to
see if the results hold for larger files.

The uncertainty in Shannon's estimate is still a problem because the
top compressors are now in the range of uncertainty.  paq8h compresses
enwik9 to 1.18 bpc.  The uncertainty is due to the fact that different
probability distributions can lead to the same observed sequence of
guesses in the Shannon game (guessing successive characters).  Cover
and King tried to rectify this problem using a gambling game to assign
probabilities, but this method is tedious and they were only able to
measure a single sentence.  Years ago I did simulations of the Shannon
game wth known models which suggest the true entropy is in the middle
of the Shannon bounds, around 0.9 to 1.1 bpc, but this is speculative.

-- Matt Mahoney

0
Reply matmahoney (29) 6/10/2006 8:48:29 PM

I did a similar program to remove punctuation and xml-tags. I split the
enwik9 file in two files: one with punctuation (and xml/xhtml tags) and
another one with the real text (keeping enough information to
reconstruct the initial file). I got an overall improvement of 5% in
the compression ratio (around 10mb). The file, "text.txt", with the
real text was a big list with words with only a space between two
words.

I also wrote a program that guesses words (~21% of the words on enwik9,
and ~17% on enwik8) in order to reduce the size of "text.txt" file.
Unfortunately, the compression ratio (of bzip and chile) was not
affected in a positive way (both programs use BWT). Anyway, if anyone
wants to test my idea I can give him the necessary files (3 source
files written in C++, very simple to use it within his program just by
including a header).

I want to do something like T9 on mobiles phone: replacing letters with
digits, but I didn't figure a clever way to group letters into digits
in order to have a very high success rate of guessing the words from
the first shot.

Some other problems with the test:

1. A human is well "trained" at the age of 21 so he processes much less
than one gigabyte of language. A child who just learned to read/write
can also do very good on Shannon's tests.

2. If you give someone a very big book written in an unknown language
he/she won't be able to learn that language and understand book's
meaning (even if the book written using a known alphabet).

3. Do not remove punctuation. A sentence ending in "?" is different
from a sentence ending in "." (eg. the distribution of the first word
in the sentence is different, or the order of words may be different).

4. Humans do have memory! Shannon didn't included humans' memory in his
tests, but their capability to guess letters in different contexts
(humans do not train online). Your benchmark includes also the
compressor size. I suggest you to estimate the humans' memory and allow
compressors to have other files (like dictionaries) up to that limit.

Matt Mahoney wrote:
> One problem in the large text compression benchmark is that the entropy
> of the data set is not known.  Shannon and others estimated the entropy
> of English text by having human subjects guess successive characters in
> text reduced to a 27 character alphabet (a-z and space).  The problem
> with the data set (enwik9) is that there is a lot of "junk" like XML
> markup, tables, links, formatting codes, etc.  This doesn't matter as
> long as the extra data is not hard to model; the best natural language
> model will still win.  However, it would still be nice to know where
> the finish line is.
>
> I wrote a program to filter out the junk and generate readable English
> text using a 27 character alphabet (as readable as you can get without
> punctuation).  Essentially I removed any text that would not be visible
> on the Wikipedia website.  For example, I removed hypertext links but
> retained the anchor text.  I also removed non-English text, tables and
> markup, but retained image captions.  When this is reduced to only
> letters, digits, and nonconsecutive spaces, the resulting text is about
> 70% of the original size.  This is expanded to 74% after spelling out
> digits (a common technique in speech models).
>
> Then I compared compression ratios for 100 MB of clean text with enwik8
> (also 100 MB) using 25 different compressors.  Popular compressors like
> zip, gzip and bzip2 compress clean text about 10% smaller than the
> original text.  High end compressors like paq8h, durilca, slim, and
> ppmonstr compress about 1% to 5% smaller.  (Did not test WinRK).
> Details are here:
> http://cs.fit.edu/~mmahoney/compression/text.html#data
>
> Now we have a rough estimate of the entropy of enwik9.  Shannon
> estimated 0.6 to 1.3 bpc for 27 character English.  Cover and King
> estimated an upper bound of 1.3 bpc.  It is likely that the entropy of
> enwik9 is at most 5% more.  I plan to test on the "cleaned" enwik9 to
> see if the results hold for larger files.
>
> The uncertainty in Shannon's estimate is still a problem because the
> top compressors are now in the range of uncertainty.  paq8h compresses
> enwik9 to 1.18 bpc.  The uncertainty is due to the fact that different
> probability distributions can lead to the same observed sequence of
> guesses in the Shannon game (guessing successive characters).  Cover
> and King tried to rectify this problem using a gambling game to assign
> probabilities, but this method is tedious and they were only able to
> measure a single sentence.  Years ago I did simulations of the Shannon
> game wth known models which suggest the true entropy is in the middle
> of the Shannon bounds, around 0.9 to 1.1 bpc, but this is speculative.
> 
> -- Matt Mahoney

0
Reply Alexandru 6/11/2006 3:35:21 PM


Alexandru wrote:
> I did a similar program to remove punctuation and xml-tags. I split the
> enwik9 file in two files: one with punctuation (and xml/xhtml tags) and
> another one with the real text (keeping enough information to
> reconstruct the initial file). I got an overall improvement of 5% in
> the compression ratio (around 10mb). The file, "text.txt", with the
> real text was a big list with words with only a space between two
> words.
>
> I also wrote a program that guesses words (~21% of the words on enwik9,
> and ~17% on enwik8) in order to reduce the size of "text.txt" file.
> Unfortunately, the compression ratio (of bzip and chile) was not
> affected in a positive way (both programs use BWT). Anyway, if anyone
> wants to test my idea I can give him the necessary files (3 source
> files written in C++, very simple to use it within his program just by
> including a header).

If you post the programs, I will add them to the benchmark.
Preprocessors (like xml-wrt) combined with existing compressors are
useful.

> Some other problems with the test:
>
> 1. A human is well "trained" at the age of 21 so he processes much less
> than one gigabyte of language. A child who just learned to read/write
> can also do very good on Shannon's tests.

True, 1 GB was just a rough guess.  Has anyone measured how well
children do in Shannon's tests compared to adults?

> 2. If you give someone a very big book written in an unknown language
> he/she won't be able to learn that language and understand book's
> meaning (even if the book written using a known alphabet).

Also true, because people can't learn a language without grounding the
meanings of words to their other senses.  People only learn what is
relevant.  However, I think that language has properties that allow
machines to do this.  Words with similar meanings or similar
grammatical roles will appear in similar context.  A machine can learn
those groupings by analyzing the statistics of the text.  A human
cannot because he or she would be overwhelmed by attempting to read a
foreign text.  It takes people many years to learn such associations.
Machines can learn them in minutes.

> 3. Do not remove punctuation. A sentence ending in "?" is different
> from a sentence ending in "." (eg. the distribution of the first word
> in the sentence is different, or the order of words may be different).

I agree this would be better.  I only removed punctuation because
Shannon did.  It would be more accurate to measure the entropy using
more realistic text.  I am considering ideas to improve Shannon's tests
to measure the entropy of enwik9, but I am unsure how to do it.  One
idea I have is to have a human-assisted computer language model.  The
computer would make its best guesses about the next word or character
and present these to the user.  The user could either accept the
machine's guesses as is, or substitute his own guess either from the
list or by typing it in.

I would like to do this in a way that minimizes the uncertainty in
Shannon's method, which is due to different probability distributions
leading to the same series of observed guesses by the human.  I would
also like to make the test easy enough that a large number of samples
could be measured.  (The tests by Cover and King were very tedious.  It
took several people several hours to measure one sentence).

I suspect that the distribution of entropy in enwik9 and in text in
general is nonuniform and possibly fractal in nature.  This will
complicate how I pick samples to measure.  I will need to study how
compression varies across the file.

> 4. Humans do have memory! Shannon didn't included humans' memory in his
> tests, but their capability to guess letters in different contexts
> (humans do not train online). Your benchmark includes also the
> compressor size. I suggest you to estimate the humans' memory and allow
> compressors to have other files (like dictionaries) up to that limit.

The purpose of including the size of the decompressor is to prevent
information from being moved from the compressed file to the
decompressor in the form of dictionaries or prebuilt models.  Otherwise
it is trivial to compress any file in most benchmarks to 1 byte.  Of
course, humans can predict text using vast knowledge learned over a
lifetime in addition to the data itself.  However, humans have limited
memory.  What I want to know is just how much knowledge that is, in
bits.  How much memory is needed to understand language, not including
the knowledge needed to ground the meanings of words to other senses?
How much knowledge is needed to build a language model that can match
human performance, as measured by prediction or recognition?

> Matt Mahoney wrote:
> > One problem in the large text compression benchmark is that the entropy
> > of the data set is not known.  Shannon and others estimated the entropy
> > of English text by having human subjects guess successive characters in
> > text reduced to a 27 character alphabet (a-z and space).  The problem
> > with the data set (enwik9) is that there is a lot of "junk" like XML
> > markup, tables, links, formatting codes, etc.  This doesn't matter as
> > long as the extra data is not hard to model; the best natural language
> > model will still win.  However, it would still be nice to know where
> > the finish line is.
> >
> > I wrote a program to filter out the junk and generate readable English
> > text using a 27 character alphabet (as readable as you can get without
> > punctuation).  Essentially I removed any text that would not be visible
> > on the Wikipedia website.  For example, I removed hypertext links but
> > retained the anchor text.  I also removed non-English text, tables and
> > markup, but retained image captions.  When this is reduced to only
> > letters, digits, and nonconsecutive spaces, the resulting text is about
> > 70% of the original size.  This is expanded to 74% after spelling out
> > digits (a common technique in speech models).
> >
> > Then I compared compression ratios for 100 MB of clean text with enwik8
> > (also 100 MB) using 25 different compressors.  Popular compressors like
> > zip, gzip and bzip2 compress clean text about 10% smaller than the
> > original text.  High end compressors like paq8h, durilca, slim, and
> > ppmonstr compress about 1% to 5% smaller.  (Did not test WinRK).
> > Details are here:
> > http://cs.fit.edu/~mmahoney/compression/text.html#data
> >
> > Now we have a rough estimate of the entropy of enwik9.  Shannon
> > estimated 0.6 to 1.3 bpc for 27 character English.  Cover and King
> > estimated an upper bound of 1.3 bpc.  It is likely that the entropy of
> > enwik9 is at most 5% more.  I plan to test on the "cleaned" enwik9 to
> > see if the results hold for larger files.
> >
> > The uncertainty in Shannon's estimate is still a problem because the
> > top compressors are now in the range of uncertainty.  paq8h compresses
> > enwik9 to 1.18 bpc.  The uncertainty is due to the fact that different
> > probability distributions can lead to the same observed sequence of
> > guesses in the Shannon game (guessing successive characters).  Cover
> > and King tried to rectify this problem using a gambling game to assign
> > probabilities, but this method is tedious and they were only able to
> > measure a single sentence.  Years ago I did simulations of the Shannon
> > game wth known models which suggest the true entropy is in the middle
> > of the Shannon bounds, around 0.9 to 1.1 bpc, but this is speculative.
> >
> > -- Matt Mahoney

I have an update on this study.  I tested the filtered enwik9 (715 MB)
on 17 compressors.  It turns out that compressing the 100 MB text8 file
underestimates the compression ratio on the full filtered data set by
about 2% to 4%.  The compression ratio is in fact about the same on
both the raw and filtered text, within 3% in most cases.  In fact, on
the filtered text, paq8h compression is about 3% worse than on the raw
text, and would take 2nd place behind xml-wrt|ppmonstr when the
decompressor size is taken into account.

-- Matt Mahoney

0
Reply Matt 6/11/2006 9:39:01 PM

Alexandru wrote:
) I also wrote a program that guesses words (~21% of the words on enwik9,
) and ~17% on enwik8) in order to reduce the size of "text.txt" file.
) Unfortunately, the compression ratio (of bzip and chile) was not
) affected in a positive way (both programs use BWT).

BWT implicitly 'guesses' words already, so it's no surprise
that it doesn't improve the compression ratio.


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
0
Reply Willem 6/11/2006 9:59:06 PM

Matt Mahoney wrote:
> I have an update on this study.  I tested the filtered enwik9 (715 MB)
> on 17 compressors.  It turns out that compressing the 100 MB text8 file
> underestimates the compression ratio on the full filtered data set by
> about 2% to 4%.  The compression ratio is in fact about the same on
> both the raw and filtered text, within 3% in most cases.  In fact, on
> the filtered text, paq8h compression is about 3% worse than on the raw
> text, and would take 2nd place behind xml-wrt|ppmonstr when the
> decompressor size is taken into account.

Another update.  I graphed the incremental compression of enwik9 with
ppmd (modifying the source code to print the x,y coordinates I needed).
 The graph is here:
http://cs.fit.edu/~mmahoney/compression/text.html#data (scroll down a
bit).  It turns out the entropy is not uniform as might be expected.
There is a large dip in the middle where the incremental compression
ratio drops from 0.22 to about 0.08 for about 150 MB.  It seems like
there are several thousand articles on U.S. towns that may have been
automatically generated from a table of census data, so the articles
have a regular structure.  This probably explains why most compressors
did better on enwik9 than enwik8, even those without much memory.

-- Matt Mahoney

0
Reply Matt 6/12/2006 3:38:20 AM

This sort of thing is actually relatively common in real text.  Sports
stories are much more predictable than other genres and you find times
when lots of new stuff happens which increases the entropy temporarily
(think 9/11 for an example).

Matt Mahoney wrote:
> Matt Mahoney wrote:
....
> There is a large dip in the middle where the incremental compression
> ratio drops from 0.22 to about 0.08 for about 150 MB.  It seems like
> there are several thousand articles on U.S. towns that may have been
> automatically generated from a table of census data, so the articles
> have a regular structure.  This probably explains why most compressors
> did better on enwik9 than enwik8, even those without much memory.
> 
> -- Matt Mahoney

0
Reply Ted 6/12/2006 9:13:08 PM

I have a suggestion for the data set:

Assign each word in the text an unique 24-bit number and then transform
the text into a list of integers (a binary file). This way you'll force
people to think about grammar and word relations and not to try to
predict individual letters (or bits). Since the file will be much
smaller (I expect half of the current size, ie ~400MB) you may use the
whole wikipedia as a test to compress.

0
Reply Alexandru 6/14/2006 7:17:23 PM

Alexandru wrote:
> I have a suggestion for the data set:
>
> Assign each word in the text an unique 24-bit number and then transform
> the text into a list of integers (a binary file). This way you'll force
> people to think about grammar and word relations and not to try to
> predict individual letters (or bits). Since the file will be much
> smaller (I expect half of the current size, ie ~400MB) you may use the
> whole wikipedia as a test to compress.

I am currently doing some experiments along these lines, writing my own
text preprocessor similar to xml-wrt.  In experiments with xml-wrt,
replacing the most common words with 1-2 byte tokens improves
compression with almost all compressors.  However it is better not to
replace every word, just the most common ones, because you have to
compress the dictionary too.  With xml-wrt the optimal threshold for
adding to the dictionary is about 180 (-f option) for enwik8 and 800
for enwik9.  In both cases the dictionary size is several thousand
words, so 16 bits is enough.

I'm also experimenting with my own version of a text preprocessor that
can be integrated into a context mixing compressor so I can also use
semantic and syntactic modeling.  Currently, paq8h (the top compressor)
is not improved by xml-wrt because it already is preprocessed with a
static Engish dictionary and xml-wrt messes up its own internal word
recognition models.

What I plan to do, hopefully by late summer, is abstract out the
lexical model by introducing symbols to override default capitalization
and spacing rules, add symbols corresponding to morphemes (root words,
suffixes, prefixes, punctuation), and spell out rare words in letters
and common n-grams, and split off nontext and structured data into a
separate stream.  (The spelling of novel words can sometimes reveal
information about their meanings).  After this preprocessing I will
cluster words (including those spelled out) by their immediate and
longer range context to learn their syntactic and semantic roles,
respectively, and use those classifications as context in a
context-mixing compressor.  I may experiment with a "soft" clustering
using online LSA (but for both semantics and syntax) implemented with a
neural network.

-- Matt Mahoney

0
Reply Matt 6/14/2006 9:06:30 PM

Matt Mahoney wrote:
> Alexandru wrote:
> > I have a suggestion for the data set:

> > Assign each word in the text an unique 24-bit number and then transform
> > the text into a list of integers (a binary file). This way you'll force
> > people to think about grammar and word relations and not to try to
> > predict individual letters (or bits). Since the file will be much
> > smaller (I expect half of the current size, ie ~400MB) you may use the
> > whole wikipedia as a test to compress.

> I am currently doing some experiments along these lines, writing my own
> text preprocessor similar to xml-wrt.  In experiments with xml-wrt,
> replacing the most common words with 1-2 byte tokens improves
> compression with almost all compressors.  However it is better not to
> replace every word, just the most common ones, because you have to
> compress the dictionary too.  With xml-wrt the optimal threshold for
> adding to the dictionary is about 180 (-f option) for enwik8 and 800
> for enwik9.  In both cases the dictionary size is several thousand
> words, so 16 bits is enough.

My idea was to make the dictionary publicly available (with the option
to allow decompresors to use it or not) and to have an easy-to-read
file. I don't think anybody will try to mess with xml-wrt file format
or codes generated by a Huffman tree. You need a very simple format
whose main utility is not compression (although is compresses somehow),
but speed (eg. comparing two ints is faster than comparing two words)
and eficiency (eg. no memory needed to remember words, no parsing, and
it is *perfect hash function*).

> I'm also experimenting with my own version of a text preprocessor that
> can be integrated into a context mixing compressor so I can also use
> semantic and syntactic modeling.  Currently, paq8h (the top compressor)
> is not improved by xml-wrt because it already is preprocessed with a
> static Engish dictionary and xml-wrt messes up its own internal word
> recognition models.

I've made my self a primitive text preprocessor. It parses the text and
it tries to guess each word using the text already seen. It outputs the
word only if the guess was wrong. The input is a big list of words with
only one space between two consecutive tokens. The output is a list of
words with one or more spaces between two consecutive items. If there
is no word between two consecutive spaces then the preprocessor guessed
the word.

eg.: (_ stands for space) "blah_mumu_bubu" -> "blah__bubu" (the
preprocessor guessed the word mumu).

You can download the preprocessor from this location:
http://mosoi.lumina.ro/downloads/skip.tar.bz2

Because these days I'm having my final exams I cannot continue my work
on this program (at least not for the next month), but if you have any
idea (or question), please drop a line. At the moment the context is
only the last word (and sometimes, no words). The next step will be to
make variable length context (but it already consumes too much memory
~250mb).

> -- Matt Mahoney

-- Alexandru Mosoi

0
Reply Alexandru 6/14/2006 10:21:26 PM

Willem wrote:
> Alexandru wrote:
> ) I also wrote a program that guesses words (~21% of the words on enwik9,
> ) and ~17% on enwik8) in order to reduce the size of "text.txt" file.
> ) Unfortunately, the compression ratio (of bzip and chile) was not
> ) affected in a positive way (both programs use BWT).
>
> BWT implicitly 'guesses' words already, so it's no surprise
> that it doesn't improve the compression ratio.

How is that? BWT doesn't change zero-order entropy. My program does
remove words.

>
>
> SaSW, Willem
> --
> Disclaimer: I am in no way responsible for any of the statements
>             made in the above text. For all I know I might be
>             drugged or something..
>             No I'm not paranoid. You all think I'm paranoid, don't you !
> #EOT

0
Reply Alexandru 6/14/2006 11:22:25 PM

Alexandru wrote:
> Matt Mahoney wrote:
> > Alexandru wrote:
> > > I have a suggestion for the data set:
>
> > > Assign each word in the text an unique 24-bit number and then transform
> > > the text into a list of integers (a binary file). This way you'll force
> > > people to think about grammar and word relations and not to try to
> > > predict individual letters (or bits). Since the file will be much
> > > smaller (I expect half of the current size, ie ~400MB) you may use the
> > > whole wikipedia as a test to compress.
>
> > I am currently doing some experiments along these lines, writing my own
> > text preprocessor similar to xml-wrt.  In experiments with xml-wrt,
> > replacing the most common words with 1-2 byte tokens improves
> > compression with almost all compressors.  However it is better not to
> > replace every word, just the most common ones, because you have to
> > compress the dictionary too.  With xml-wrt the optimal threshold for
> > adding to the dictionary is about 180 (-f option) for enwik8 and 800
> > for enwik9.  In both cases the dictionary size is several thousand
> > words, so 16 bits is enough.
>
> My idea was to make the dictionary publicly available (with the option
> to allow decompresors to use it or not) and to have an easy-to-read
> file. I don't think anybody will try to mess with xml-wrt file format
> or codes generated by a Huffman tree. You need a very simple format
> whose main utility is not compression (although is compresses somehow),
> but speed (eg. comparing two ints is faster than comparing two words)
> and eficiency (eg. no memory needed to remember words, no parsing, and
> it is *perfect hash function*).
>
> > I'm also experimenting with my own version of a text preprocessor that
> > can be integrated into a context mixing compressor so I can also use
> > semantic and syntactic modeling.  Currently, paq8h (the top compressor)
> > is not improved by xml-wrt because it already is preprocessed with a
> > static Engish dictionary and xml-wrt messes up its own internal word
> > recognition models.
>
> I've made my self a primitive text preprocessor. It parses the text and
> it tries to guess each word using the text already seen. It outputs the
> word only if the guess was wrong. The input is a big list of words with
> only one space between two consecutive tokens. The output is a list of
> words with one or more spaces between two consecutive items. If there
> is no word between two consecutive spaces then the preprocessor guessed
> the word.
>
> eg.: (_ stands for space) "blah_mumu_bubu" -> "blah__bubu" (the
> preprocessor guessed the word mumu).
>
> You can download the preprocessor from this location:
> http://mosoi.lumina.ro/downloads/skip.tar.bz2

The program moved here:
http://mosoi.lumina.ro/downloads/Chile/large/skip.tar.bz2

I did some programs that implement my idea (construct a dictionary and
transform the list of words into a list of integers). The dictionary
size (as a bzip2 archive): ~450k for text8.txt and ~1.5mb (not very
big) for text9.txt.

The results are (bzip2):
text8.txt               100,000,000 bytes
text8.txt.bz2         26,395,389 bytes

dict8.txt                1,187,596 bytes
outp8.txt               51,015,621 bytes      -> ~1/2 of original file
(as I predicted)

dict8.txt.bz2          458,295 bytes           -> dictionary file
outp8.txt.bz2         23,382,764 bytes      -> the binary file with
integers
total                      23,841,059               -> total size

An improvement of 2,554,330 bytes!!

For gzip2
text8.txt.gz            33,048,243
outp8.txt.gz           27,768,161
dict8.txt.gz            554,332
total                      28,322,493

Again an improvement of 4,725,756 bytes!!!

with xml-wrt (-f180):
text8.txt.xwrt         37,082,250
text8.txt.xwrt.bz2   23,869,114 bytes (a little bit worse than
compressing the list of integers)

Moreover, my opinion is that by assigning shorter codes to more
frequent words you'll increase the entropy of the file and,
consequently, compression performance will be affected.

My programs can be downloaded from:
http://mosoi.lumina.ro/downloads/Chile/large/.

0
Reply Alexandru 6/15/2006 5:26:08 PM

Alexandru wrote:
> Alexandru wrote:
> > Matt Mahoney wrote:
> > > Alexandru wrote:
> > > > I have a suggestion for the data set:
> >
> > > > Assign each word in the text an unique 24-bit number and then transform
> > > > the text into a list of integers (a binary file). This way you'll force
> > > > people to think about grammar and word relations and not to try to
> > > > predict individual letters (or bits). Since the file will be much
> > > > smaller (I expect half of the current size, ie ~400MB) you may use the
> > > > whole wikipedia as a test to compress.
> >
> > > I am currently doing some experiments along these lines, writing my own
> > > text preprocessor similar to xml-wrt.  In experiments with xml-wrt,
> > > replacing the most common words with 1-2 byte tokens improves
> > > compression with almost all compressors.  However it is better not to
> > > replace every word, just the most common ones, because you have to
> > > compress the dictionary too.  With xml-wrt the optimal threshold for
> > > adding to the dictionary is about 180 (-f option) for enwik8 and 800
> > > for enwik9.  In both cases the dictionary size is several thousand
> > > words, so 16 bits is enough.
> >
> > My idea was to make the dictionary publicly available (with the option
> > to allow decompresors to use it or not) and to have an easy-to-read
> > file. I don't think anybody will try to mess with xml-wrt file format
> > or codes generated by a Huffman tree. You need a very simple format
> > whose main utility is not compression (although is compresses somehow),
> > but speed (eg. comparing two ints is faster than comparing two words)
> > and eficiency (eg. no memory needed to remember words, no parsing, and
> > it is *perfect hash function*).
> >
> > > I'm also experimenting with my own version of a text preprocessor that
> > > can be integrated into a context mixing compressor so I can also use
> > > semantic and syntactic modeling.  Currently, paq8h (the top compressor)
> > > is not improved by xml-wrt because it already is preprocessed with a
> > > static Engish dictionary and xml-wrt messes up its own internal word
> > > recognition models.
> >
> > I've made my self a primitive text preprocessor. It parses the text and
> > it tries to guess each word using the text already seen. It outputs the
> > word only if the guess was wrong. The input is a big list of words with
> > only one space between two consecutive tokens. The output is a list of
> > words with one or more spaces between two consecutive items. If there
> > is no word between two consecutive spaces then the preprocessor guessed
> > the word.
> >
> > eg.: (_ stands for space) "blah_mumu_bubu" -> "blah__bubu" (the
> > preprocessor guessed the word mumu).
> >
> > You can download the preprocessor from this location:
> > http://mosoi.lumina.ro/downloads/skip.tar.bz2
>
> The program moved here:
> http://mosoi.lumina.ro/downloads/Chile/large/skip.tar.bz2
>
> I did some programs that implement my idea (construct a dictionary and
> transform the list of words into a list of integers). The dictionary
> size (as a bzip2 archive): ~450k for text8.txt and ~1.5mb (not very
> big) for text9.txt.
>
> The results are (bzip2):
> text8.txt               100,000,000 bytes
> text8.txt.bz2         26,395,389 bytes
>
> dict8.txt                1,187,596 bytes
> outp8.txt               51,015,621 bytes      -> ~1/2 of original file
> (as I predicted)
>
> dict8.txt.bz2          458,295 bytes           -> dictionary file
> outp8.txt.bz2         23,382,764 bytes      -> the binary file with
> integers
> total                      23,841,059               -> total size
>
> An improvement of 2,554,330 bytes!!
>
> For gzip2
> text8.txt.gz            33,048,243
> outp8.txt.gz           27,768,161
> dict8.txt.gz            554,332
> total                      28,322,493
>
> Again an improvement of 4,725,756 bytes!!!
>
> with xml-wrt (-f180):
> text8.txt.xwrt         37,082,250
> text8.txt.xwrt.bz2   23,869,114 bytes (a little bit worse than
> compressing the list of integers)
>
> Moreover, my opinion is that by assigning shorter codes to more
> frequent words you'll increase the entropy of the file and,
> consequently, compression performance will be affected.
>
> My programs can be downloaded from:
> http://mosoi.lumina.ro/downloads/Chile/large/.

Does this work on enwik8/9 also?  If so, send me some results and I
will post them.

Your "skip" program is like word-level LZP compression.

-- Matt Mahoney

0
Reply Matt 6/15/2006 8:04:23 PM

Matt Mahoney wrote:
> Alexandru wrote:
> > Alexandru wrote:
> > > Matt Mahoney wrote:
> > > > Alexandru wrote:
> > > > > I have a suggestion for the data set:
> > >
> > > > > Assign each word in the text an unique 24-bit number and then transform
> > > > > the text into a list of integers (a binary file). This way you'll force
> > > > > people to think about grammar and word relations and not to try to
> > > > > predict individual letters (or bits). Since the file will be much
> > > > > smaller (I expect half of the current size, ie ~400MB) you may use the
> > > > > whole wikipedia as a test to compress.
> > >
> > > > I am currently doing some experiments along these lines, writing my own
> > > > text preprocessor similar to xml-wrt.  In experiments with xml-wrt,
> > > > replacing the most common words with 1-2 byte tokens improves
> > > > compression with almost all compressors.  However it is better not to
> > > > replace every word, just the most common ones, because you have to
> > > > compress the dictionary too.  With xml-wrt the optimal threshold for
> > > > adding to the dictionary is about 180 (-f option) for enwik8 and 800
> > > > for enwik9.  In both cases the dictionary size is several thousand
> > > > words, so 16 bits is enough.
> > >
> > > My idea was to make the dictionary publicly available (with the option
> > > to allow decompresors to use it or not) and to have an easy-to-read
> > > file. I don't think anybody will try to mess with xml-wrt file format
> > > or codes generated by a Huffman tree. You need a very simple format
> > > whose main utility is not compression (although is compresses somehow),
> > > but speed (eg. comparing two ints is faster than comparing two words)
> > > and eficiency (eg. no memory needed to remember words, no parsing, and
> > > it is *perfect hash function*).
> > >
> > > > I'm also experimenting with my own version of a text preprocessor that
> > > > can be integrated into a context mixing compressor so I can also use
> > > > semantic and syntactic modeling.  Currently, paq8h (the top compressor)
> > > > is not improved by xml-wrt because it already is preprocessed with a
> > > > static Engish dictionary and xml-wrt messes up its own internal word
> > > > recognition models.
> > >
> > > I've made my self a primitive text preprocessor. It parses the text and
> > > it tries to guess each word using the text already seen. It outputs the
> > > word only if the guess was wrong. The input is a big list of words with
> > > only one space between two consecutive tokens. The output is a list of
> > > words with one or more spaces between two consecutive items. If there
> > > is no word between two consecutive spaces then the preprocessor guessed
> > > the word.
> > >
> > > eg.: (_ stands for space) "blah_mumu_bubu" -> "blah__bubu" (the
> > > preprocessor guessed the word mumu).
> > >
> > > You can download the preprocessor from this location:
> > > http://mosoi.lumina.ro/downloads/skip.tar.bz2
> >
> > The program moved here:
> > http://mosoi.lumina.ro/downloads/Chile/large/skip.tar.bz2
> >
> > I did some programs that implement my idea (construct a dictionary and
> > transform the list of words into a list of integers). The dictionary
> > size (as a bzip2 archive): ~450k for text8.txt and ~1.5mb (not very
> > big) for text9.txt.
> >
> > The results are (bzip2):
> > text8.txt               100,000,000 bytes
> > text8.txt.bz2         26,395,389 bytes
> >
> > dict8.txt                1,187,596 bytes
> > outp8.txt               51,015,621 bytes      -> ~1/2 of original file
> > (as I predicted)
> >
> > dict8.txt.bz2          458,295 bytes           -> dictionary file
> > outp8.txt.bz2         23,382,764 bytes      -> the binary file with
> > integers
> > total                      23,841,059               -> total size
> >
> > An improvement of 2,554,330 bytes!!
> >
> > For gzip2
> > text8.txt.gz            33,048,243
> > outp8.txt.gz           27,768,161
> > dict8.txt.gz            554,332
> > total                      28,322,493
> >
> > Again an improvement of 4,725,756 bytes!!!
> >
> > with xml-wrt (-f180):
> > text8.txt.xwrt         37,082,250
> > text8.txt.xwrt.bz2   23,869,114 bytes (a little bit worse than
> > compressing the list of integers)
> >
> > Moreover, my opinion is that by assigning shorter codes to more
> > frequent words you'll increase the entropy of the file and,
> > consequently, compression performance will be affected.
> >
> > My programs can be downloaded from:
> > http://mosoi.lumina.ro/downloads/Chile/large/.

> Does this work on enwik8/9 also?  If so, send me some results and I
> will post them.

No it doesn't work on enwik8/9, only for text8/9.

Still, I can split enwik8/9 into two streams:
* first stream contains enough information to decode punctuation and
the location of words
* second stream contains only the words.

I can apply my work, including "skip", on the second stream. I did a
program to split the files into the streams, but i didn't do the
reverse program (so the results cannot be verified). If I will have
some notable results (like 15% improvement in compression ratio) I will
submit any result.

At the moment I want to concentrate only on compressing the stream of
words (text8/9 files). I assume that compressing human language (and
not xml/html tags) is your goal, too (correct me if I'm wrong).

The results I posted earlier shows that using a dictionary with all the
words is a very good idea. Not only that it boosts compression of other
programs, but also the transformation of words into integers (like a
perfect hash function) allows better text preprocessing by decreasesing
memory usage and increasing computation speed. Moreover "the hash
function" is also bijective, so integers can be converted easily into
words and vice-versa.

> Your "skip" program is like word-level LZP compression.

Oh, thanks for taking a look over it. I didn't know (until 1h ago)
about LZP algorithm (I didn't pay too much attention to any of LZ*
algorithms). The difference between LZP and "skip" is that LZP uses a
higher level to predict; however, I'm planning to implement a variable
length context for skip, too. It will be a tremendious work since I
will have to rewrite parts of the program to make it work with integers
(now that I decided to use a dictionary).

--
Alexandru Mosoi

0
Reply Alexandru 6/15/2006 9:35:38 PM


You can probably grab a few percent more by compressing your dictionary
using prefix compression (standard trick).

Using arithmetic coding to minimize the size of your code words will
definitely help as well.

There are interesting methods that try to find an optimal lexicon for
this sort of compression.  See Carl de Marcken's thesis about
segmentation of English.

If you use an entropic method for deriving your lexicon, it should
automagically include common multi-word phrases.  Matt's observation
that your method is similar to word-level LVP is correct and can be
extended if you use phrases to say that you are doing something like
word level LZ.

0
Reply Ted 6/15/2006 11:54:13 PM

Alexandru wrote:
> Matt Mahoney wrote:
> > Alexandru wrote:
> > > Alexandru wrote:
> > > > Matt Mahoney wrote:
> > > > > Alexandru wrote:
> > > > > > I have a suggestion for the data set:
> > > >
> > > > > > Assign each word in the text an unique 24-bit number and then transform
> > > > > > the text into a list of integers (a binary file). This way you'll force
> > > > > > people to think about grammar and word relations and not to try to
> > > > > > predict individual letters (or bits). Since the file will be much
> > > > > > smaller (I expect half of the current size, ie ~400MB) you may use the
> > > > > > whole wikipedia as a test to compress.
> > > >
> > > > > I am currently doing some experiments along these lines, writing my own
> > > > > text preprocessor similar to xml-wrt.  In experiments with xml-wrt,
> > > > > replacing the most common words with 1-2 byte tokens improves
> > > > > compression with almost all compressors.  However it is better not to
> > > > > replace every word, just the most common ones, because you have to
> > > > > compress the dictionary too.  With xml-wrt the optimal threshold for
> > > > > adding to the dictionary is about 180 (-f option) for enwik8 and 800
> > > > > for enwik9.  In both cases the dictionary size is several thousand
> > > > > words, so 16 bits is enough.
> > > >
> > > > My idea was to make the dictionary publicly available (with the option
> > > > to allow decompresors to use it or not) and to have an easy-to-read
> > > > file. I don't think anybody will try to mess with xml-wrt file format
> > > > or codes generated by a Huffman tree. You need a very simple format
> > > > whose main utility is not compression (although is compresses somehow),
> > > > but speed (eg. comparing two ints is faster than comparing two words)
> > > > and eficiency (eg. no memory needed to remember words, no parsing, and
> > > > it is *perfect hash function*).
> > > >
> > > > > I'm also experimenting with my own version of a text preprocessor that
> > > > > can be integrated into a context mixing compressor so I can also use
> > > > > semantic and syntactic modeling.  Currently, paq8h (the top compressor)
> > > > > is not improved by xml-wrt because it already is preprocessed with a
> > > > > static Engish dictionary and xml-wrt messes up its own internal word
> > > > > recognition models.
> > > >
> > > > I've made my self a primitive text preprocessor. It parses the text and
> > > > it tries to guess each word using the text already seen. It outputs the
> > > > word only if the guess was wrong. The input is a big list of words with
> > > > only one space between two consecutive tokens. The output is a list of
> > > > words with one or more spaces between two consecutive items. If there
> > > > is no word between two consecutive spaces then the preprocessor guessed
> > > > the word.
> > > >
> > > > eg.: (_ stands for space) "blah_mumu_bubu" -> "blah__bubu" (the
> > > > preprocessor guessed the word mumu).
> > > >
> > > > You can download the preprocessor from this location:
> > > > http://mosoi.lumina.ro/downloads/skip.tar.bz2
> > >
> > > The program moved here:
> > > http://mosoi.lumina.ro/downloads/Chile/large/skip.tar.bz2
> > >
> > > I did some programs that implement my idea (construct a dictionary and
> > > transform the list of words into a list of integers). The dictionary
> > > size (as a bzip2 archive): ~450k for text8.txt and ~1.5mb (not very
> > > big) for text9.txt.
> > >
> > > The results are (bzip2):
> > > text8.txt               100,000,000 bytes
> > > text8.txt.bz2         26,395,389 bytes
> > >
> > > dict8.txt                1,187,596 bytes
> > > outp8.txt               51,015,621 bytes      -> ~1/2 of original file
> > > (as I predicted)
> > >
> > > dict8.txt.bz2          458,295 bytes           -> dictionary file
> > > outp8.txt.bz2         23,382,764 bytes      -> the binary file with
> > > integers
> > > total                      23,841,059               -> total size
> > >
> > > An improvement of 2,554,330 bytes!!
> > >
> > > For gzip2
> > > text8.txt.gz            33,048,243
> > > outp8.txt.gz           27,768,161
> > > dict8.txt.gz            554,332
> > > total                      28,322,493
> > >
> > > Again an improvement of 4,725,756 bytes!!!
> > >
> > > with xml-wrt (-f180):
> > > text8.txt.xwrt         37,082,250
> > > text8.txt.xwrt.bz2   23,869,114 bytes (a little bit worse than
> > > compressing the list of integers)
> > >
> > > Moreover, my opinion is that by assigning shorter codes to more
> > > frequent words you'll increase the entropy of the file and,
> > > consequently, compression performance will be affected.
> > >
> > > My programs can be downloaded from:
> > > http://mosoi.lumina.ro/downloads/Chile/large/.
>
> > Does this work on enwik8/9 also?  If so, send me some results and I
> > will post them.
>
> No it doesn't work on enwik8/9, only for text8/9.
>
> Still, I can split enwik8/9 into two streams:
> * first stream contains enough information to decode punctuation and
> the location of words
> * second stream contains only the words.
>
> I can apply my work, including "skip", on the second stream. I did a
> program to split the files into the streams, but i didn't do the
> reverse program (so the results cannot be verified). If I will have
> some notable results (like 15% improvement in compression ratio) I will
> submit any result.

You could also have codes for punctuation characters, and a special
symbol meaning to capitalize the next letter.  Also, you could have
rules about when to insert a space before a word (like after another
word or "," but not after a linefeed or "(" ) and a special symbol to
override the rules.

> At the moment I want to concentrate only on compressing the stream of
> words (text8/9 files). I assume that compressing human language (and
> not xml/html tags) is your goal, too (correct me if I'm wrong).
>
> The results I posted earlier shows that using a dictionary with all the
> words is a very good idea. Not only that it boosts compression of other
> programs, but also the transformation of words into integers (like a
> perfect hash function) allows better text preprocessing by decreasesing
> memory usage and increasing computation speed. Moreover "the hash
> function" is also bijective, so integers can be converted easily into
> words and vice-versa.

Yes, but part of language modeling is dealing with the junk at the
lexical level like structure, formatting and misspelled words.
Wikipedia is high quality with few misspellings, but more commonly you
can improve compression of misspelled words using ngrams rather than
creating new dictionary entries.  The filtered text is good for testing
ideas, though.

> > Your "skip" program is like word-level LZP compression.
>
> Oh, thanks for taking a look over it. I didn't know (until 1h ago)
> about LZP algorithm (I didn't pay too much attention to any of LZ*
> algorithms). The difference between LZP and "skip" is that LZP uses a
> higher level to predict; however, I'm planning to implement a variable
> length context for skip, too. It will be a tremendious work since I
> will have to rewrite parts of the program to make it work with integers
> (now that I decided to use a dictionary).

It will be interesting to see what effect this has on language
modeling.  I guess if you take all the predictable parts out of text,
it will still be somewhat comprehensible but no longer grammatical.  It
might make compression worse for an n-gram model, but I wonder if it
would help a more sophisticated model like LSA.

> 
> --
> Alexandru Mosoi

-- Matt Mahoney

0
Reply Matt 6/16/2006 1:45:29 AM

> > At the moment I want to concentrate only on compressing the stream of
> > words (text8/9 files). I assume that compressing human language (and
> > not xml/html tags) is your goal, too (correct me if I'm wrong).
> >
> > The results I posted earlier shows that using a dictionary with all the
> > words is a very good idea. Not only that it boosts compression of other
> > programs, but also the transformation of words into integers (like a
> > perfect hash function) allows better text preprocessing by decreasesing
> > memory usage and increasing computation speed. Moreover "the hash
> > function" is also bijective, so integers can be converted easily into
> > words and vice-versa.
>
> Yes, but part of language modeling is dealing with the junk at the
> lexical level like structure, formatting and misspelled words.
> Wikipedia is high quality with few misspellings, but more commonly you
> can improve compression of misspelled words using ngrams rather than
> creating new dictionary entries.  The filtered text is good for testing
> ideas, though.

I made a distribution of new words over text9.txt (fil9). Check the
image:
http://mosoi.lumina.ro/downloads/Chile/large/words.jpg.

You may observe that the number of words added to the dictionary is
almost constant (there is a very low decrease over time). This happens
because wikipedia contains articles from many subjects, each subject
with it's own specific vocabulary (this explains also the huge number
of words in the dictionary).

Alexandru Mosoi

0
Reply Alexandru 6/16/2006 10:01:18 AM

Alexandru wrote:
> > > At the moment I want to concentrate only on compressing the stream of
> > > words (text8/9 files). I assume that compressing human language (and
> > > not xml/html tags) is your goal, too (correct me if I'm wrong).
> > >
> > > The results I posted earlier shows that using a dictionary with all the
> > > words is a very good idea. Not only that it boosts compression of other
> > > programs, but also the transformation of words into integers (like a
> > > perfect hash function) allows better text preprocessing by decreasesing
> > > memory usage and increasing computation speed. Moreover "the hash
> > > function" is also bijective, so integers can be converted easily into
> > > words and vice-versa.
> >
> > Yes, but part of language modeling is dealing with the junk at the
> > lexical level like structure, formatting and misspelled words.
> > Wikipedia is high quality with few misspellings, but more commonly you
> > can improve compression of misspelled words using ngrams rather than
> > creating new dictionary entries.  The filtered text is good for testing
> > ideas, though.
>
> I made a distribution of new words over text9.txt (fil9). Check the
> image:
> http://mosoi.lumina.ro/downloads/Chile/large/words.jpg.
>
> You may observe that the number of words added to the dictionary is
> almost constant (there is a very low decrease over time). This happens
> because wikipedia contains articles from many subjects, each subject
> with it's own specific vocabulary (this explains also the huge number
> of words in the dictionary).
>
> Alexandru Mosoi

Very interesting.  This distribution is consistent with Zipf's law.
Zipf's law says the n'th most frequent word has a frequency
proportional to 1/n.  At every point in the learning process, about
half of the words in the dictionary should occur only once.  The rate
at which novel words are encountered is nearly constant (after an
initial spurt).  This implies that natural languge is a nonstationary
source over large time scales.

If Zipf's law holds, then about 40,000 words have a frequency of 20 or
higher (since 800,000 have a frequency of 1).  This is significant
because a child has to read a word about 20 times to "learn" it, so
that it becomes part of the vocabulary.  The typical adult vocabulary
is 30,000 to 50,000 words.

The relatively flat section in the middle looks like it corresponds to
the region of low compression ratio in enwik9 observed with ppmd.

This is exactly the kind of research I was hoping this benchmark would
encourage.

-- Matt Mahoney

0
Reply Matt 6/16/2006 3:46:53 PM

Matt Mahoney wrote:
> Alexandru wrote:
> > > > At the moment I want to concentrate only on compressing the stream of
> > > > words (text8/9 files). I assume that compressing human language (and
> > > > not xml/html tags) is your goal, too (correct me if I'm wrong).
> > > >
> > > > The results I posted earlier shows that using a dictionary with all the
> > > > words is a very good idea. Not only that it boosts compression of other
> > > > programs, but also the transformation of words into integers (like a
> > > > perfect hash function) allows better text preprocessing by decreasesing
> > > > memory usage and increasing computation speed. Moreover "the hash
> > > > function" is also bijective, so integers can be converted easily into
> > > > words and vice-versa.
> > >
> > > Yes, but part of language modeling is dealing with the junk at the
> > > lexical level like structure, formatting and misspelled words.
> > > Wikipedia is high quality with few misspellings, but more commonly you
> > > can improve compression of misspelled words using ngrams rather than
> > > creating new dictionary entries.  The filtered text is good for testing
> > > ideas, though.
> >
> > I made a distribution of new words over text9.txt (fil9). Check the
> > image:
> > http://mosoi.lumina.ro/downloads/Chile/large/words.jpg.
> >
> > You may observe that the number of words added to the dictionary is
> > almost constant (there is a very low decrease over time). This happens
> > because wikipedia contains articles from many subjects, each subject
> > with it's own specific vocabulary (this explains also the huge number
> > of words in the dictionary).
> >
> > Alexandru Mosoi
>
> Very interesting.  This distribution is consistent with Zipf's law.
> Zipf's law says the n'th most frequent word has a frequency
> proportional to 1/n.  At every point in the learning process, about
> half of the words in the dictionary should occur only once.  The rate
> at which novel words are encountered is nearly constant (after an
> initial spurt).  This implies that natural languge is a nonstationary
> source over large time scales.
>
> If Zipf's law holds, then about 40,000 words have a frequency of 20 or
> higher (since 800,000 have a frequency of 1).  This is significant
> because a child has to read a word about 20 times to "learn" it, so
> that it becomes part of the vocabulary.  The typical adult vocabulary
> is 30,000 to 50,000 words.
>
> The relatively flat section in the middle looks like it corresponds to
> the region of low compression ratio in enwik9 observed with ppmd.
>
> This is exactly the kind of research I was hoping this benchmark would
> encourage.
>
> -- Matt Mahoney

I did an analysis of the unfiltered and filtered text (enwik9 and fil9)
to test if Zipf's law holds.  In both cases, I counted any sequence of
a-z as a word, ignoring case.  This is cruder than ZIpf's analysis.
Zipf analyzed text by hand (this was in 1935) and considered
morphology, so suffixes like -s, -ed, -ing etc. would be considered
separate words from their roots.  It turns out with my crude parsing
that the fit to Zipf's law is rather poor, with or without filtering.
Filtering just makes it a little worse.

Here are the words in each file ranked by frequency.  Zipf's law says
rank x frequency = constant (dictionary size).  Here is for enwik9.
The Zipf constant is 7-14 million for the top 10,000 words but only
about 1.2 million around the bottom.  (In case of words tied by
frequency, the rank is shown for the last word in the list).

1171106 types, 141176630 tokens
   Freq    Rank
7803966       1 the
4864784       2 of
3064252       3 and
2634604       4 in
2366965       5 a
2147564       6 to
1875373       7 quot
1638933       8 is
1423733       9 id
1379670      10 s
1153722      11 lt
1152134      12 gt
 994178      13 for
 919854      14 amp
 861329      15 as
 855885      16 are
 817812      17 was
 740085      18 by
 726845      19 with
 664439      20 from
....
 143452     100 most
....
  14368    1000 songs
....
    956   10002 ale arrows axioms cal charlemagne expenditures
paradigm...
....
     26  100665 aaaa aabybro aak aalsmeer abdolhossein abdoulaye
aberfoyle...
....
      5  277712 aaaba aabye aaca aacap aachtopf aafjes aafss aakal
aamer...
      4  326414 aaaaaaa aaaaahh aaaai aaaargh aaad aaasouth aabab
aabbff...
      3  403550 aaaab aaaar aaalac aaargh aabaa aababb aabadie aabrekk
aacc...
      2  577832 aaaaaaaa aaaaaaaaaah aaaabbbbccccddddeeeeffffgggg
aaaay...
      1 1171106 aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa...


Here is for fil9.  Zipf's constant is about 7-20 million for the top
10,000 words, then drops to less than 1 million.

833184 types, 124301826 tokens
   Freq    Rank
7446708       1 the
4453926       2 of
3776770       3 one
3085174       4 zero
2916968       5 and
2480552       6 in
2339802       7 two
2241744       8 a
2063649       9 nine
2028129      10 to
1594091      11 is
1489530      12 eight
1483148      13 three
1459622      14 four
1456363      15 five
1283963      16 six
1123114      17 seven
 938010      18 for
 840437      19 are
 829101      20 as
....
 109212     100 line
....
  11177    1000 route
....
    752   10007 angela bulls costly fare finch hulk hypothetical
inputs...
....
     18  101387 abbate abderus abdomens abides abigor abp abridge...
....
      5  218316 aaaaaa aabba aabf aachtopf aafss aaiyangar aakal
aakirkeby...
      4  253871 aaaaahh aaaba aabab aachener aadt aafjes aafk aafla
aagaard...
      3  309475 aaaaaaa aaaab aaaar aaaargh aaad aabaa aabc aabye
aadolf...
      2  425051 aaaaaaaa aaaabbbbccccddddeeeeffffgggg aaaay aaab
aaabb...
      1  833184 aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaarg...


-- Matt Mahoney

0
Reply Matt 6/16/2006 8:45:24 PM

Ted Dunning wrote:
> You can probably grab a few percent more by compressing your dictionary
> using prefix compression (standard trick).

Here are some compression benchmarks for a sorted list of words.
http://maximumcompression.com/data/dict.php?historic=true

In writing PAQ8 I saw a big jump in compression when I added an
indirect context model.  At the time I was writing it, I was not even
trying to tune it for sorted words, although it helped on almost every
test I tried.  Still, I was rather surprised by the effect.  The other
models that help are a text column model (using the column position and
character above as context), and an extra level of model indirection at
the bit level.  In other words, if the bit sequence 0001 is observed in
4 successive occurrences of some context, then the next bit prediction
depends on the 0/1 count in this indirect context based on what happens
after the sequence 0001 is observed in other contexts.  In a dictionary
the next bit will usually be 1 because the words are in ascending
order.  (01 is more common than 10).  This is an example of indirect
context modeling at the bit level, which is present in all versions of
PAQ7 and PAQ8.

PAQ6 and PAQAR use fixed tables, so are more likely to predict 0.
However there is also some improvement in PAQ7-8 due to using a neural
network to mix predictions instead of adaptive weighted linear mixing.
The main difference is that instead of a weighted average of what the
various models predict the next bit to be, the probabilities are first
transformed into the logistic domain, p -> ln(p/(1-p)) before the
weighed averaging, then transformed back: p -> 1/(1+e^-p)) prior to
arithmetic coding.

In PAQ8F and higher, I also added indirect models at the byte level.
This means that the sequence "AB...AC...AB...AC...AB...A?" uses the
history "BCBCB" (the sequence that occurred in context A) to predict
the next byte.  Note that B is always followed by C, so C is predicted
in the indirect contexts "B" and "CB".

Neither XML-WRT nor the external dictionary in PAQ8H are sorted, so I
think there might be some room for improvement here.

-- Matt Mahoney

0
Reply Matt 6/16/2006 9:28:16 PM

Ted Dunning wrote:
> You can probably grab a few percent more by compressing your dictionary
> using prefix compression (standard trick).

Already did this, but the trick doesn't work with paq7 (it works with
bzip and gzip). Paq7 does better if I leave the sorted list of words
untouched. I tried different tricks of which none improved compression
with paq7. I think paq8h does better, but I cannot run it on my
computer.

I kindly ask Matt to publish a Linux binary executable (with static
libraries) of paq8h so I can run tests on it. Currently I'm doing all
my tests with bzip2 because it's fast and has no model for English text
(which I'm working on), but I plan to try also paq from time to time.

0
Reply Alexandru 6/16/2006 11:11:27 PM

This introduction also occurs due to tokens such as numbers.  For
instance, in a novel, you expect to see at least one new token on each
of the later pages (the page numbers, at least).

Whether that sort of vocabulary growth is interesting is a debatable
issue.

As you get to very, very large corpora, the rate of new tokens which
are actually words in the sense used by most linguists decreases to
very nearly zero.  This is because Zipf's law doesn't precisely hold
for very large or very low frequency words, both are found in lesser
abundance than predicted.

Alexandru wrote:
> > > At the moment I want to concentrate only on compressing the stream of
> > > words (text8/9 files). I assume that compressing human language (and
> > > not xml/html tags) is your goal, too (correct me if I'm wrong).
> > >
> > > The results I posted earlier shows that using a dictionary with all the
> > > words is a very good idea. Not only that it boosts compression of other
> > > programs, but also the transformation of words into integers (like a
> > > perfect hash function) allows better text preprocessing by decreasesing
> > > memory usage and increasing computation speed. Moreover "the hash
> > > function" is also bijective, so integers can be converted easily into
> > > words and vice-versa.
> >
> > Yes, but part of language modeling is dealing with the junk at the
> > lexical level like structure, formatting and misspelled words.
> > Wikipedia is high quality with few misspellings, but more commonly you
> > can improve compression of misspelled words using ngrams rather than
> > creating new dictionary entries.  The filtered text is good for testing
> > ideas, though.
>
> I made a distribution of new words over text9.txt (fil9). Check the
> image:
> http://mosoi.lumina.ro/downloads/Chile/large/words.jpg.
>
> You may observe that the number of words added to the dictionary is
> almost constant (there is a very low decrease over time). This happens
> because wikipedia contains articles from many subjects, each subject
> with it's own specific vocabulary (this explains also the huge number
> of words in the dictionary).
> 
> Alexandru Mosoi

0
Reply Ted 6/16/2006 11:57:39 PM

Matt Mahoney wrote:
> Matt Mahoney wrote:
>
> I did an analysis of the unfiltered and filtered text (enwik9 and fil9)
> to test if Zipf's law holds.  ... It turns out with my crude parsing
> that the fit to Zipf's law is rather poor, with or without filtering.
> Filtering just makes it a little worse.
   ...
> The Zipf constant is 7-14 million for the top 10,000 words but only
> about 1.2 million around the bottom.  (In case of words tied by
> frequency, the rank is shown for the last word in the list).

These results are typical.  If you plot the results on a log-log graph,
you see a goodly stretch that is roughly linear with droop at each end.
 The drop that you see is not all that large.

Working with lemmatized text doesn't change the results all that much.
You still get droop at each end.

0
Reply Ted 6/17/2006 12:15:11 AM

Matt Mahoney wrote:

> ... The rate
> at which novel words are encountered is nearly constant (after an
> initial spurt).  This implies that natural languge is a nonstationary
> source over large time scales.

Au contraire.

A hyperbolic distribution with a very large number of unobserved
symbols does not imply non-stationarity.

In fact, word frequencies that vary by topic also doesn't imply
non-stationarity.  It just implies that the probability of the next
word can't be modeled by a stationary multi-nomial.  A stationary
hidden variable approach or a history based model might do just fine.

0
Reply Ted 6/17/2006 12:19:22 AM

Alexandru wrote:
> Ted Dunning wrote:
> > You can probably grab a few percent more by compressing your dictionary
> > using prefix compression (standard trick).
>
> Already did this, but the trick doesn't work with paq7 (it works with
> bzip and gzip). Paq7 does better if I leave the sorted list of words
> untouched. I tried different tricks of which none improved compression
> with paq7. I think paq8h does better, but I cannot run it on my
> computer.
>
> I kindly ask Matt to publish a Linux binary executable (with static
> libraries) of paq8h so I can run tests on it. Currently I'm doing all
> my tests with bzip2 because it's fast and has no model for English text
> (which I'm working on), but I plan to try also paq from time to time.

Unfortunately PAQ8H (and PAQ8G) have some Windows specific code in the
text preprocessor written by Przemyslaw Skibinski.  I think he plans to
replace the static dictionaries with xml-wrt in a future version of PAQ
which should compile under Linux like xml-wrt currently does.
Meanwhile you can use PAQ8F which does no dictionary preprocessing and
compiles under Linux.  However it lacks Alexander Ratushnyak's
improvement to the mixer so the compression is not quite as good.

-- Matt Mahoney

0
Reply Matt 6/17/2006 12:31:21 AM

Ted Dunning wrote:
> Matt Mahoney wrote:
>
> > ... The rate
> > at which novel words are encountered is nearly constant (after an
> > initial spurt).  This implies that natural languge is a nonstationary
> > source over large time scales.
>
> Au contraire.
>
> A hyperbolic distribution with a very large number of unobserved
> symbols does not imply non-stationarity.
>
> In fact, word frequencies that vary by topic also doesn't imply
> non-stationarity.  It just implies that the probability of the next
> word can't be modeled by a stationary multi-nomial.  A stationary
> hidden variable approach or a history based model might do just fine.

Yes, I guess you're right, you need more to show the nonstationarity of
text.

But even huge corpora like libraries, the web or usenet can be
organized hierarchically.  A depth-first traversal of this hierarchy
will show nonuniform statistics over the largest time scales.  You can
artificially make a corpus stationary by chopping it into small pieces
and shuffling them, but this does not remove the underlying
organization.  These pieces can be clustered to recover the hierarchy.

Even if you consider every word ever spoken or written, then this
source is nonstationary because language evolves over time.

-- Matt Mahoney

0
Reply Matt 6/17/2006 1:55:40 PM

Alexandru wrote:
> Alexandru wrote:
> > Matt Mahoney wrote:
> > > Alexandru wrote:
> > > > I have a suggestion for the data set:
> >
> > > > Assign each word in the text an unique 24-bit number and then transform
> > > > the text into a list of integers (a binary file). This way you'll force
> > > > people to think about grammar and word relations and not to try to
> > > > predict individual letters (or bits). Since the file will be much
> > > > smaller (I expect half of the current size, ie ~400MB) you may use the
> > > > whole wikipedia as a test to compress.
> >
> > > I am currently doing some experiments along these lines, writing my own
> > > text preprocessor similar to xml-wrt.  In experiments with xml-wrt,
> > > replacing the most common words with 1-2 byte tokens improves
> > > compression with almost all compressors.  However it is better not to
> > > replace every word, just the most common ones, because you have to
> > > compress the dictionary too.  With xml-wrt the optimal threshold for
> > > adding to the dictionary is about 180 (-f option) for enwik8 and 800
> > > for enwik9.  In both cases the dictionary size is several thousand
> > > words, so 16 bits is enough.
> >
> > My idea was to make the dictionary publicly available (with the option
> > to allow decompresors to use it or not) and to have an easy-to-read
> > file. I don't think anybody will try to mess with xml-wrt file format
> > or codes generated by a Huffman tree. You need a very simple format
> > whose main utility is not compression (although is compresses somehow),
> > but speed (eg. comparing two ints is faster than comparing two words)
> > and eficiency (eg. no memory needed to remember words, no parsing, and
> > it is *perfect hash function*).
> >
> > > I'm also experimenting with my own version of a text preprocessor that
> > > can be integrated into a context mixing compressor so I can also use
> > > semantic and syntactic modeling.  Currently, paq8h (the top compressor)
> > > is not improved by xml-wrt because it already is preprocessed with a
> > > static Engish dictionary and xml-wrt messes up its own internal word
> > > recognition models.
> >
> > I've made my self a primitive text preprocessor. It parses the text and
> > it tries to guess each word using the text already seen. It outputs the
> > word only if the guess was wrong. The input is a big list of words with
> > only one space between two consecutive tokens. The output is a list of
> > words with one or more spaces between two consecutive items. If there
> > is no word between two consecutive spaces then the preprocessor guessed
> > the word.
> >
> > eg.: (_ stands for space) "blah_mumu_bubu" -> "blah__bubu" (the
> > preprocessor guessed the word mumu).
> >
> > You can download the preprocessor from this location:
> > http://mosoi.lumina.ro/downloads/skip.tar.bz2
>
> The program moved here:
> http://mosoi.lumina.ro/downloads/Chile/large/skip.tar.bz2
>
> I did some programs that implement my idea (construct a dictionary and
> transform the list of words into a list of integers). The dictionary
> size (as a bzip2 archive): ~450k for text8.txt and ~1.5mb (not very
> big) for text9.txt.
>
> The results are (bzip2):
> text8.txt               100,000,000 bytes
> text8.txt.bz2         26,395,389 bytes
>
> dict8.txt                1,187,596 bytes
> outp8.txt               51,015,621 bytes      -> ~1/2 of original file
> (as I predicted)
>
> dict8.txt.bz2          458,295 bytes           -> dictionary file
> outp8.txt.bz2         23,382,764 bytes      -> the binary file with
> integers
> total                      23,841,059               -> total size
>
> An improvement of 2,554,330 bytes!!
>
> For gzip2
> text8.txt.gz            33,048,243
> outp8.txt.gz           27,768,161
> dict8.txt.gz            554,332
> total                      28,322,493
>
> Again an improvement of 4,725,756 bytes!!!
>
> with xml-wrt (-f180):
> text8.txt.xwrt         37,082,250
> text8.txt.xwrt.bz2   23,869,114 bytes (a little bit worse than
> compressing the list of integers)
>
> Moreover, my opinion is that by assigning shorter codes to more
> frequent words you'll increase the entropy of the file and,
> consequently, compression performance will be affected.
>
> My programs can be downloaded from:
> http://mosoi.lumina.ro/downloads/Chile/large/.

I checked also paq7 and paq8f. On paq7 there is no improvement, but for
paq8f:

text.txt.paq8f         18,554,164

dict.txt.paq8f         359,198
outp.txt.paq8f        17,940,311
total                     18,320,744   (the improvement is small,
though)

* Note: The words in the dictionary is not compressed with prefix
compression.

0
Reply Alexandru 6/17/2006 3:27:06 PM

Matt Mahoney wrote:
> I did an analysis of the unfiltered and filtered text (enwik9 and fil9)
> to test if Zipf's law holds.  In both cases, I counted any sequence of
> a-z as a word, ignoring case.  This is cruder than ZIpf's analysis.
> Zipf analyzed text by hand (this was in 1935) and considered
> morphology, so suffixes like -s, -ed, -ing etc. would be considered
> separate words from their roots.  It turns out with my crude parsing
> that the fit to Zipf's law is rather poor, with or without filtering.
> Filtering just makes it a little worse.
>
> Here are the words in each file ranked by frequency.  Zipf's law says
> rank x frequency = constant (dictionary size).  Here is for enwik9.
> The Zipf constant is 7-14 million for the top 10,000 words but only
> about 1.2 million around the bottom.  (In case of words tied by
> frequency, the rank is shown for the last word in the list).
>
> 1171106 types, 141176630 tokens
>    Freq    Rank
> 7803966       1 the
> 4864784       2 of
> 3064252       3 and
> 2634604       4 in
> 2366965       5 a
> 2147564       6 to
> 1875373       7 quot
> 1638933       8 is
> 1423733       9 id
> 1379670      10 s
> 1153722      11 lt
> 1152134      12 gt
>  994178      13 for
>  919854      14 amp
>  861329      15 as
>  855885      16 are
>  817812      17 was
>  740085      18 by
>  726845      19 with
>  664439      20 from
> ...
>  143452     100 most
> ...
>   14368    1000 songs
> ...
>     956   10002 ale arrows axioms cal charlemagne expenditures
> ...
>      26  100665 aaaa aabybro aak aalsmeer abdolhossein abdoulaye
> ...
>       5  277712 aaaba aabye aaca aacap aachtopf aafjes aafss aakal
>       4  326414 aaaaaaa aaaaahh aaaai aaaargh aaad aaasouth aabab
>       3  403550 aaaab aaaar aaalac aaargh aabaa aababb aabadie aabrekk
>       2  577832 aaaaaaaa aaaaaaaaaah aaaabbbbccccddddeeeeffffgggg
>       1 1171106 aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa...

Here is the distribution of enwik9, counting words only once per page
(article).  In enwik9, the string "<p" (start of the <page> tag) is
used to delimit articles.  This 2 character string occurs nowhere else.
 The distribution is again Zipf-like, but the droop at the ends is more
pronounced.  The product of rank x frequency peaks at around 8 million.

1171106 types, 52789779 tokens, 243426 pages
   Freq    Rank
 243427       2 page xml
 243426      12 contributor id preserve revision space t text
timestamp...
 205420      13 username
 190590      14 comment
 175772      15 of
 169458      16 the
 165157      17 to
 156329      18 in
 155269      19 and
 153842      20 a
 146587      21 category
 143877      22 s
 141985      23 is
 131662      24 for
 129792      25 as
 126046      26 from
 125279      27 with
 121777      28 it
 117671      29 an
 115512      30 by
....
  85080      50 no
....
  51218     100 some
....
  36246     200 being
....
  14389     500 george
....
   7574    1000 running
....
   4014    2001 edge seem
....
   1421    5001 introduce targets
....
    552   10009 covenant daytime dining drift exciting grip guarded
issuing...
....
    196   20061 acquaintances afc appleton barclay beers carriages
cayman...
....
     45   50213 abdur abrogation abruzzo abugida acci accueil
acidosis...
....
     15  100727 aacsb aalen aao aarons abacha abaddon abanet abcl
abdali...
....
     10  130964 aaai aadd aands aarseth aarti aaup abascal abaya
abbad...
....
      5  213602 aaah aabba aach aachener aacm aadolf aafa aalberg
aalge...
      4  253832 aaaaaaa aaasouth aabye aacap aachtopf aadl aafjes
aafrika...
      3  322435 aaaai aaad aaargh aabab aabc aabrekk aaca aacc
aachencath...
      2  475773 aaaaaaaaaah aaaab aaab aaaba aaade aaalac aaarrr
aabaa...
      1 1171106 aaaaaaaa aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa...

-- Matt Mahoney

0
Reply Matt 6/17/2006 3:49:04 PM

Alexandru wrote:
> Ted Dunning wrote:
> > You can probably grab a few percent more by compressing your dictionary
> > using prefix compression (standard trick).
>
> Already did this, but the trick doesn't work with paq7 (it works with
> bzip and gzip). Paq7 does better if I leave the sorted list of words
> untouched.

Score one for Matt's cleverness in defining the rules in such a way
that the compressor designers can focus on important issues.

0
Reply Ted 6/17/2006 9:50:06 PM

27 Replies
232 Views

(page loaded in 0.336 seconds)

Similiar Articles:








7/27/2012 8:31:54 AM


Reply: