I would like for someone who is more knowledgeable in the area of
compression to see the flaw in my thinking.
My thought process goes something like this:
- the more tightly a file is compressed, the more the compressed file
begins to look like random data.
- thus if we expect random data as input, then calculating the
difference between the expected random data and the actual input data
will output a difference map which should be able to be compressed.
To test this out i created a simple program which modelled the "ideal"
random data stream by deriving the probability of the next bit in a
sequence being identical to the previous as P=0.5^(n+1) where n is the
run length of immediate previous identical bits.
I then used a PRNG to generate a number between 0 and 1 and if this
number is less than P then the guessed next bit in the sequence is
identical to the previous, if not it is the inverse of the previous
bit.
If the guessed bit is identical to the actual next bit then we write 0
to an output bitstream else we write 1.
repeat until end of input file.
To recreate the original sequence we perform almost the same steps,
however we now read those 'output' bits to tell us to use the guessed
bit or the inverse of the guessed bit.
The PRNG needs to have the same seed for both compressor and
decompressor.
But the output bitstream which is the difference map from the 'random'
sequence is does compressable. I expected there would be more 0's than
1's which would have made compression possible.
|
|
0
|
|
|
|
Reply
|
moogie
|
3/13/2008 11:44:41 AM |
|
moogie schrieb:
> - the more tightly a file is compressed, the more the compressed file
> begins to look like random data.
Here is your problem. "Looks like" is not a well-defined term. In fact,
the output is not at all "random" (for whatever this word means for you).
It is a valid input for a decompressor, which will return a usable file
as output. See my thread above to clean up your thoughts.
> - thus if we expect random data as input, then calculating the
> difference between the expected random data and the actual input data
> will output a difference map which should be able to be compressed.
What is "expect random data as input"? I actually wouldn't expect
"random data" (again, for whatever this means) to be a useful input for
a compressor, because I wouldn't expect it to be compressed.
> To test this out i created a simple program which modelled the "ideal"
> random data stream by deriving the probability of the next bit in a
> sequence being identical to the previous as P=0.5^(n+1) where n is the
> run length of immediate previous identical bits.
Why is that a good model? If your input stream is i.i.d. uniformly
distributed zero-order Markov (rand() does something that fits approximately
to this description), then the probability of a bit being zero is 0.5
(constant, because it is i.i.d) and independent from the previous bits
(because it is order zero markov) and 0.5 (because it is uniformly distributed).
Rather, what do you "expect" as input to make this compressible?
> I then used a PRNG to generate a number between 0 and 1 and if this
> number is less than P then the guessed next bit in the sequence is
> identical to the previous, if not it is the inverse of the previous
> bit.
>
> If the guessed bit is identical to the actual next bit then we write 0
> to an output bitstream else we write 1.
Now you have two independent random sources. If you apply an "==" operation,
which is just NOT XOR, and thus a group operation between the two inputs,
or just the addition modulo 2, the result will be again an i.i.d random source.
What is important is the independence of the two sources. Why should the
PRNG know anything about your first input data? Why should it be related?
> The PRNG needs to have the same seed for both compressor and
> decompressor.
Then the operation is reversible.
> But the output bitstream which is the difference map from the 'random'
> sequence is does compressable. I expected there would be more 0's than
> 1's which would have made compression possible.
Why? If you flip one coin, and "predict" its outcome with a second (independent)
coin-throw, why should it be more probable that both coins land on the
same side???
So long,
Thomas
|
|
0
|
|
|
|
Reply
|
Thomas
|
3/13/2008 12:03:50 PM
|
|
On Mar 13, 5:44 am, moogie <budgetan...@mystarship.com> wrote:
> I would like for someone who is more knowledgeable in the area of
> compression to see the flaw in my thinking.
>
> My thought process goes something like this:
>
> - the more tightly a file is compressed, the more the compressed file
> begins to look like random data.
.....
This is your first mistake. The fact is if you have a good compressor
that
is not adding control information or header information. When you run
compression hopefully the file is shorter thats all compression can
do.
The only reason it might look random is that there are more random
looking sort of files than nonrandom. But any file is possible
including
a file of all zeros or ones.
Many people wrongly assume that if the output is all zeros that the
compressor messed up and that you can compress again. True you can
compress again but then you would have a different compressor and
something else would go to the file of all zeros. Some files would be
longer and a few shorter. Compression is just a reordering of files
so
that hopefully more useful files are ordered first as you fill the
shorter
files inculding files that have random and nonrandom appearing
patterns.
I think people lose it when it comes to compression because when
done right which means lossless bijective compression. Each optimal
compressor is also an optimal decompressor for some criterian that
we be optimal for something. All each of these is, is no more than
a reordering of the infinite file space.
Even my BWTS is nothing more than part of a loop. Take any file
BWTS it enough times any your back to same starting file. Its just
that certain structured files the forward direction seems to make
useful compression better. But there exist files as measured by
some ceitera than UNBWTS would allow for better compression.
Since no compressor is best for all types of files we are chasing
our tails.
Hope this helps
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
|
|
0
|
|
|
|
Reply
|
biject
|
3/13/2008 2:05:44 PM
|
|
The flaw is that you think you are funny.
Marco
|
|
0
|
|
|
|
Reply
|
Marco
|
3/13/2008 4:10:27 PM
|
|
Thanks for your responses!
On Mar 13, 10:03 pm, Thomas Richter <t...@math.tu-berlin.de> wrote:
> moogie schrieb:
>
> > - the more tightly a file is compressed, the more the compressed file
> > begins to look like random data.
>
> Here is your problem. "Looks like" is not a well-defined term. In fact,
> the output is not at all "random" (for whatever this word means for you).
> It is a valid input for a decompressor, which will return a usable file
> as output. See my thread above to clean up your thoughts.
>
> > - thus if we expect random data as input, then calculating the
> > difference between the expected random data and the actual input data
> > will output a difference map which should be able to be compressed.
>
> What is "expect random data as input"? I actually wouldn't expect
> "random data" (again, for whatever this means) to be a useful input for
> a compressor, because I wouldn't expect it to be compressed.
>
> > To test this out i created a simple program which modelled the "ideal"
> > random data stream by deriving the probability of the next bit in a
> > sequence being identical to the previous as P=0.5^(n+1) where n is the
> > run length of immediate previous identical bits.
>
> Why is that a good model? If your input stream is i.i.d. uniformly
> distributed zero-order Markov (rand() does something that fits approximately
> to this description), then the probability of a bit being zero is 0.5
> (constant, because it is i.i.d) and independent from the previous bits
> (because it is order zero markov) and 0.5 (because it is uniformly distributed).
>
> Rather, what do you "expect" as input to make this compressible?
Sorry, my wording was a little vague on what i meant by random looking
data... in this circumstance i will define the random data as the set
of sequences of the same length which are not compressible by a given
compressor.
I chose a LZMA compressed archive as input for this test. When
plotting the probability of lengths (derived from frequency counts) of
identical bit runs I observed that the curve produced closely fit the
curve of 0.5^(n+1) which is why i used it as the model.
> Why? If you flip one coin, and "predict" its outcome with a second (independent)
> coin-throw, why should it be more probable that both coins land on the
> same side???
I do expect identical sequential coin flips (represented by the input
stream) will become less probable given the definition of the random
looking data i explained above.
All i expect the PRNG to do is to provide a list of uniformly
distributed values between 0..1 This list could be hard coded.
Perhaps some code of the simple test will help clear up what i am
doing:
int runLength=0;
boolean runBitType=false;
boolean actualBitType=false;
boolean guessedBitType=false;
int outZeroCount=0;
int inZeroCount=0;
long size=inFile.length()*8;
Random rand=new Random(0);
double probability=0;
for (int i=0;i<size;i++)
{
probability=Math.pow(0.5,runLength+1);
// read in the next bit from the stream
actualBitType=bis.read();
// generate the guessed next bit
guessedBitType=rand.nextDouble()<probability;
// if they are the same then output 0 to the out stream
if (actualBitType == guessedBitType)
{
bos.write(false);
}
// else they are different and therefore output 1 to the out stream
else
{
bos.write(true);
}
// if the next bit is the same as the current run bit then
increment the current run length
if (actualBitType==runBitType)
{
runLength++;
}
// else reset the run length
else
{
runLength=0;
runBitType=!runBitType;
}
}
bos.close();
bis.close();
|
|
0
|
|
|
|
Reply
|
moogie
|
3/13/2008 9:19:03 PM
|
|
On Mar 14, 12:05 am, biject <biject.b...@gmail.com> wrote:
> Many people wrongly assume that if the output is all zeros that the
> compressor messed up and that you can compress again. True you can
> compress again but then you would have a different compressor and
> something else would go to the file of all zeros. Some files would be
> longer and a few shorter.
> so
> that hopefully more useful files are ordered first as you fill the
> shorter
> files inculding files that have random and nonrandom appearing
> patterns.
This is exactly what i am thinking. by using a compressor (B) which
expects the compressed output from another compressor(A) to need
shorter code lengths to represent them than the non-compressed output
from compressor (A).
>
> I think people lose it when it comes to compression because when
> done right which means lossless bijective compression. Each optimal
> compressor is also an optimal decompressor for some criterian that
> we be optimal for something. All each of these is, is no more than
> a reordering of the infinite file space.
That does fit with my understanding of how compression works... an
algorithm which orders the set of possible combinations of a sequence
of a certain size. Compression is achieved by that algorithm ordering
the sequences such that the more common sequences in the input stream
we are attempting to compress occur closer to the beginning of the
ordered list of combinations.
|
|
0
|
|
|
|
Reply
|
moogie
|
3/13/2008 9:29:54 PM
|
|
moogie schrieb:
>>> - the more tightly a file is compressed, the more the compressed file
>>> begins to look like random data.
>> Here is your problem. "Looks like" is not a well-defined term. In fact,
>> the output is not at all "random" (for whatever this word means for you).
>> It is a valid input for a decompressor, which will return a usable file
>> as output. See my thread above to clean up your thoughts.
>>
>>> - thus if we expect random data as input, then calculating the
>>> difference between the expected random data and the actual input data
>>> will output a difference map which should be able to be compressed.
>> What is "expect random data as input"? I actually wouldn't expect
>> "random data" (again, for whatever this means) to be a useful input for
>> a compressor, because I wouldn't expect it to be compressed.
>>
>>> To test this out i created a simple program which modelled the "ideal"
>>> random data stream by deriving the probability of the next bit in a
>>> sequence being identical to the previous as P=0.5^(n+1) where n is the
>>> run length of immediate previous identical bits.
Ok, so P is the *conditioned* probability, i.e. the probability of
finding bit X if the last n bits are also given. That is, your P is
P(x_k | x_{k-1} ... x_{k-n}). Which would be, for i.i.d. data, simply
0.5, because the probability *does not* depend on the data before.
>> Why is that a good model? If your input stream is i.i.d. uniformly
>> distributed zero-order Markov (rand() does something that fits approximately
>> to this description), then the probability of a bit being zero is 0.5
>> (constant, because it is i.i.d) and independent from the previous bits
>> (because it is order zero markov) and 0.5 (because it is uniformly distributed).
>>
>> Rather, what do you "expect" as input to make this compressible?
>
> Sorry, my wording was a little vague on what i meant by random looking
> data... in this circumstance i will define the random data as the set
> of sequences of the same length which are not compressible by a given
> compressor.
>
> I chose a LZMA compressed archive as input for this test. When
> plotting the probability of lengths (derived from frequency counts) of
> identical bit runs I observed that the curve produced closely fit the
> curve of 0.5^(n+1) which is why i used it as the model.
And this is a *different* probability, namely the probability of finding
a run of n identical bits. Again, if you pick an i.i.d. sequence,
uniformly distributed, the probability of a run of n identical bits (or
actually, the probability of any other n-bit sequence) is (0.5)^n, where
the total run length is n. If you count the run as a run of n bits,
followed by one identical bit (which might be, according to what I read
above), then this is in fact the probability I just computed.
However, the *conditioned* probability is not the probability of the
run. It is the probability to extend the run by one bit, which is 1/2
(for uniform i.i.d. sources).
>> Why? If you flip one coin, and "predict" its outcome with a second (independent)
>> coin-throw, why should it be more probable that both coins land on the
>> same side???
>
> I do expect identical sequential coin flips (represented by the input
> stream) will become less probable given the definition of the random
> looking data i explained above.
What do you mean by that? Do you mean that the probability of finding a
run decreases with the run-length? This is certainly true, and you have
the likelyness of a run. In fact, your observed likeliness of the output
of the compressor matches that of a uniform i.i.d. source.
> All i expect the PRNG to do is to provide a list of uniformly
> distributed values between 0..1 This list could be hard coded.
>
> Perhaps some code of the simple test will help clear up what i am
> doing:
/* snip */
See above. The probability of extending a run of size n *is not*
identical to the probability of *finding* a run of size n. The former is
0.5, the latter 0.5^n. (again, i.i.d models understood).
In effect, what you're doing is to use one sequence, your compressed
stream, which can be well-described by a i.i.d. memoryless random source
in the absence of any additional knowledge, and EXOR it with another
random source, which is not i.i.d., but has a skewed probability
distribution. What do you expect to find?
Could it be that somehow you believe that the observed probability of
runs is "somewhat special" for compressed sequences? Actually, it's not.
It looks like you misunderstood something about conditional vs. absolute
probabilities.
So long,
Thomas
|
|
0
|
|
|
|
Reply
|
Thomas
|
3/14/2008 12:45:56 PM
|
|
Hello,
On Mar 13, 12:44 pm, moogie <budgetan...@mystarship.com> wrote:
> - thus if we expect random data as input, then calculating the
> difference between the expected random data and the actual input data
> will output a difference map which should be able to be compressed.
Well, the problem is that the expected random data for any bit,
independently from its context, 50/50. Also the input will have
probability 50/50 for any position, so any reversible operation you do
between these two streams will still obtain a random data stream.
> To test this out i created a simple program which modelled the "ideal"
> random data stream by deriving the probability of the next bit in a
> sequence being identical to the previous as P=0.5^(n+1) where n is the
> run length of immediate previous identical bits.
I agree with Thomas Richter on this: to model an ideal random data
stream you should impose P=0.5 in every case. If the context gives us
a probability about the next bit, then the data is not random.
Generating sequences like this you make long runs less probable than
they should be, and in fact in the stream you are creating with the
code below you take advantage of this fact by making an estimate of
the next bit with this information. Also using an arithmetic encoder
that uses this P for the estimation of the probability would compress
the sequence.
Imposing P=0.5 you will obtain that the probability of a run of ones
or zeros of length n is, in fact, 0.5^(n+1). Do not worry if in the
pseudo-random sequence you obtain long runs of zeros or ones, this
will necessary happen for big sequences. Even if the sequence may seem
compressible 'in that point', it will be globally uncompressible: any
compressor that compresses that sequence will expand the data
somewhere else.
Hope this helps,
Stefano
|
|
0
|
|
|
|
Reply
|
Stefano
|
3/14/2008 5:36:05 PM
|
|
Thanks for your in depth answers.
I cant say i fully understand yet but I will digest your answers and
read up on the specific terms i am unfamiliar with.
Cheers
|
|
0
|
|
|
|
Reply
|
moogie
|
3/14/2008 7:48:50 PM
|
|
Thomas Richter wrote:
>>>> To test this out i created a simple program which modelled the "ideal"
>>>> random data stream by deriving the probability of the next bit in a
>>>> sequence being identical to the previous as P=0.5^(n+1) where n is the
>>>> run length of immediate previous identical bits.
Correction to my respond above. Of course the probability of having a
run of n bits is 0.5^{n+1} for an i.i.d memoryless uniform binary
source. First, 0.5 for every (equal) symbol, and another 0.5 for finding
exactly the opposite symbol at last.
Sorry, my fault!
So long,
Thomas
|
|
0
|
|
|
|
Reply
|
Thomas
|
3/15/2008 7:46:51 PM
|
|
|
9 Replies
190 Views
(page loaded in 0.11 seconds)
|