|
|
Re: Ian's Compression Algorithm
Hi Ian,
This cross-posted from CreatingAI@googlegroups.com to comp.compression,
and Hutter Prize@googlegroups.com. It seems relevant.
Ian Parker wrote:
>
> Either natural language is compressible or it is not. You now seem to
> be saying it is compressible but not by very much - am I right?
You can think of it as an attempt to explain why text can't be
compressed as much as we expect if you like. In particular why we can't
compress it by as much as human estimates for the entropy of text
suggest we should (e.g. Matt Mahoney on Shannon.)
Something like... that the sum of human knowledge needed to compress a
given text that much, will always be bigger than the text itself.
We can understand this if we accept that knowledge is coded in the
different ways of ordering text (or the different ways of ordering any
sufficiently complex/random set.)
It will always take more information to code all the different ways of
ordering a random set than it takes to code the set itself. (It is only
a corollary that you can't use information about the different possible
orderings of a text to compress text, because your knowledge of all the
possible orders of a text will always be bigger than the text itself.)
It is important to note that while language is statistically random
(permitting the same tokens to specify more orders), language is not
indeterminate (c.f. Adrian Barnett's rather amusing "bitmap" Theory Of
Everything (http://www.abarnett.demon.co.uk/theories.html#BITMAP.) With
language the different orders are precisely determined by the set
itself. It is just many are equally likely.
Which IMHO is the insight we have been looking for to model AI (NLP
etc.)
-Rob
|
|
0
|
|
|
|
Reply
|
groups754 (1)
|
1/15/2007 10:14:13 AM |
|
I have no objections to cross posting under the new moderation scheme.
Hutter is in fact completely unmoderated.
It may well be that natural language is less compressible than Matt and
Co imagine. However it certainly is compressible. Obviously without a
definte benchmark you cannot tell how compressible.
I would like to make a number of remarks.
1) Rrimavera, Ressorte, Mamanthal. We know that English can be
translated into Spanish using a human translator. It would be hard for
me to concede therefore that techniques do not exist to eliminate at
least 2/3 of words using context. Primavera etc. implies log(2)3 bits
per word (at least).
2) This is a much more mathematical point. We know that the present
prize-winner has compressed to 17MB. If we allocate a number to a word
there must be a technique (we will ignore Kolmagorov for the moment)
which will compress our sequence to at most 17MB. We can at this stage
view punctuation marks as words. Algorithms like 7-zip take raw text
and compress using fragments as does the current prize-winner. Now if a
sequence is random to the extent that the probability of a word
occurring is simply the mean frequency of that word there cannot be any
better method than factorials, the Gosper approximation giving the
minimum compressed size. Now if V is our vector of words VP is going to
represent a fragment representation. We can get to fragments by a
simple matrix multiplication. VP = 17MB compressed. A matrix X can (and
has been) found which will give us VPX. The probability decomposition
of VPX MUST be 17MB. Of course multiplying one matrix by another simply
produces another matrix. Why not take the bull by the horns, apply LSA
and produce the 17MB minus matrix? The answer, of course, is
Kolmogorov. If I were not concerned with Hutter, but were simply
concerned with GALE and translating languages I would not care a damn
about Kolmogorov but would simply demand a larger training set.
This raises yet another point. We can get an initial semantic
assessment by simply associating fragments to words. This tends to be
what is done in assessing relationships between different languages,
cerradura does indeed have a very similar fragment structure to serrure
(French) arana, arraigne - Papers please! If we were to take the
prize-winning program and construct a matrix P linking it to a word
list. This could be a springboard to further compression.
- Ian Parker
|
|
0
|
|
|
|
Reply
|
ianparker2
|
1/16/2007 12:16:47 PM
|
|
For anyone who was following this thread in the Hutter Prize group,
here is my last reply to James Bowery. It seems to be taking a long
time to appear there.
jabowery@gmail.com wrote:
> Rob Freeman wrote:
> > So you feel justified in assuming grammar to make compression possible,
> > on the grounds that if compression is possible you will have a grammar?
> >
> > Surely your argument is circular.
>
> No I was merely using grammar as an example of a knowledge
> representation language and then you said it isn't powerful enough to
> compress natural language and then I said that there is an existence
> proof that it is: Sequitur with its hierarchical grammar output.
You need the existence of the compression you are seeking for an
existence proof. All this proves is that current compression uses a
grammar, not that further compression is possible using knowledge.
> > As a case in point, what are the equivalence classes which compress my
> > example string: "AX...DX...DB...AZ...YZ...YC"?
> >
> > In particular how do you capture the information that "in some ways A
> > is close to D but not Y, but in other ways A is close to Y but not D"?
>
> Sorry, but you're not making sense to me. Find some better way of
> defining "information" aka: algorithmic information theory, and we can
> continue this discussion here. Otherwise, keep it in comp.compression.
I was not defining information. You can use any definition for
information you like. What I am asking you to do is use your definition
of information to capture the relationships "in some ways A is close to
D but not Y, but in other ways A is close to Y but not D" in my example
string.
My point relates only to the use of AI for compression, so it is of
limited interest to comp.compression. I only posted in comp.compression
because Matt was making loud claims there that compression based on AI
was the solution to problems like speech recognition.
That is wrong, but it is so close to being right it is tantalizing.
Really, you have 99% of the solution. It is just you have it kind-of
inside-out. Because the problems matter to me I would rather not see
you, or anyone else wasting their time getting the solution only
_almost_ right.
It is clear I am not making sense to you. But I don't know what
information you lack. What do you find puzzling about my example? What
do you know about distributional analysis?
-Rob
|
|
0
|
|
|
|
Reply
|
Rob
|
1/19/2007 1:26:21 AM
|
|
|
2 Replies
85 Views
(page loaded in 0.069 seconds)
|
|
|
|
|
|
|
|
|