Re: A Random Data Compression method -draft #4

  • Permalink
  • submit to reddit
  • Email
  • Follow

Another interesting idea of move to front is instead of encoding value's 
like zero's.... and then later passing huffman over it...

How about directly outputting codes...

Instead of a value table and position table and such... there would be an 
encoding table for the move to front...

the front positions would have the shortest codes... and the later positions 
longer codes...

Decoding happens in opposite... longer codes are looked up... the value 
outputted... and then moved to the front to get a shorter code next time...

Added benefit of this new idea is that outputted codes immediatly get 
shorter codes on the next try...

while with static huffman and perhaps even dynamic huffman it would still 
depend on the ultimate frequencies but not so with move to front...

Hmm kinda interesting... this could also be combined with bubble to front... 
for a quicker implementation but would probably be less compression 

Only problem remaining is what prefixes/codes to give the the element of the 
move to front ? that's where dynamic huffman uses a tree to adept to 

and that's why huffman works nicely normally...

With move to front the prefixes could be a simple list for starters:

always ends with a one.


Other idea:

reserve position 0 as a special pattern/compression pattern.

move to front position zero is outputted as 0.
move to front position x is outputted as 1 + value.

This way re-occuring patterns always get the short 0 code.

Other variations could be:

position 0 is outputted as 1
position 1 is outputted as 01
position 2 is outputted as 001
position 3 is outputted as 0001
position 4 is outputted as 00001
position 5 is outputted as 000001
position 6 is outputted as 0000001
position 7 and higher is outputted as 00000001 + value.

This Skybuck's Universal Code can be used for the prefixes.

Savings are achieved when positions 0 to 6 are hit.

Bits are doubled for 7 and higher.

For position 7 and higher cases... perhaps a huffman code could be used... 
but then again that might be used for position 1 to 6 as well... and 
scrapping the idea. and only using position 0.

Let's see original idea:

Random input:

1  2  3  4  5  6  7  8  9  10 11 12
11 10 10 01 01 10 11 01 01 01 01 01

2 bit
[0, 101, 110, 111]

First input:

First output:

Table changed:
[100,101,110, 0]

Remarkable discovery: move to front is not necessary to move table 
forward... why would you do that ? it apprently assumes values close to each 
other but for random probablt doesn't make sense... so I am just gonna swap 
;) :) like he says ;)
Now I am just gonna modify it... add a bit in front of it and zero the other 
one ;)

Second input:

Second output:

Table changed:

[100,101,0, 111]

Third input:

Third output:

Table stay same.

Fourth input:

Fouth output:

Table changed:
[100, 0, 110, 111]

Well it's easy to see where this goes:

Output5: 0
Output6: 110
Output7: 111
Output8: 101
Output9: 0
Output10: 0
Output11: 0
Output12: 0

Final result:

Compared to input:

No saving so far ;)

Perhaps table can be encoded more efficient:

0 is reserved for least occurence.
10 is for second
110 is for third
111 is for fourth...

using this new table... it should be possible to get some kind of saving.

Reading 0 means efficient code.
Reading 1 means read next, if next is zero stop.
Reading 1 means read next, if next is one continue.
Reading 1 means read next, if next is one continue, if next is zero stop.
Reading 1 means read next, if next is one continue, if next is one stop.

So decoding this without conflict should be possible.

Not gonna do a full example again... I've seen enough ;)

Maybe later I gotta go eat ;)


Reply Skybuck 1/10/2011 7:26:07 PM

See related articles to this posting

comp.compression 4580 articles. 14 followers. Post

0 Replies

Similar Articles

[PageSpeed] 36


Similar Artilces:

Re: Compressing data sets (was Re: [SAS-L]) #2 #4
It's not a big deal, but there is a difference between reading a file directly and having another program as an intermediary, and someone someday is going to get into trouble because they didn't take that into account. I'm not sure what you mean by "using SAS/Connect to read non-SAS files". -- Manager, Technical Development Metrics Department, First Health West Sacramento, California USA >>> "Bruce Johnson" <bjohnson@SOLUCIENT.COM> 01/06/2004 9:36 AM >>> But if another program is doing it, within SAS, wh...

Re: Random number generation
Ian, I, too, hope Ed enjoys his well earned vacation and my intent was definitely not to pick on Ed, but discover some additional tidbits like those you shared in your post. And, of course, I used a program to choose my seed. Don't miss the point though. Many on this list, including me, don't know many of the things which you and others may find intuitively obvious. Anything else you can add to my wish list? Art -------- On Wed, 4 Feb 2009 21:50:32 -0500, Ian Whitlock <iw1sas@GMAIL.COM> wrote: >Art, > >There is something funny here. There are exactly 12 4-digit s...

Re: Generating Random Data #4
> -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > Behalf Of Zai Saki > Sent: Wednesday, April 23, 2008 3:43 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Generating Random Data > > Thanks, all methods were helpful. > > However I wasn't able to get an exact 5% with these. Is there a way to > control for that. > > Thanks, > > Yes, there are probably as many ways as people willing to post a solution, but your original request did say "So if we generated 200 respondents we want **approximately** ten 1...

Re: Random data imputation #4
Dan=2C =20 Thank you for the clear explanation=2C now I get it and I have better under= standing about uniform distribution too. =20 Thanks=2C =20 Sophia> Date: Wed=2C 19 Nov 2008 12:58:47 -0800> From: NordlDJ@DSHS.WA.GOV>= Subject: Re: Random data imputation> To: SAS-L@LISTSERV.UGA.EDU> > > -----= Original Message-----> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA= ..EDU] On> > Behalf Of Suhong Tong> > Sent: Wednesday=2C November 19=2C 2008= 12:39 PM> > To: SAS-L@LISTSERV.UGA.EDU> > Subject: Re: Random data imputat= ion> >>...

Best possible compression method for random data?
What is the best known compression method for random data? I mean any random data, example in hex: it would be between 00-FF Size let's say 1 mega > ??? Posted Via Premium Usenet Newsgroup Services ---------------------------------------------------------- ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY ** ---------------------------------------------------------- Serge wrote: ) What is the best known compression method for random data? ) I mean any random data, example in hex: it would be between 00-FF ) Size let's s...

Re: Secure method of FTPing data #4 682027
Indeed, I've actually used PGP in the past to encrypt data using external calls (via the X command) to PGP's command line utility with the following macro: /*Encrypt the CSV File Created in Previous Step*/ %macro encrypt; x "pgp -tea +force c:\HF\SASExtracts\Extract&todaydate..txt <&recipient>"; run; %mend; %encrypt; %let; /*The public key for PGP encryption*/ %encrypt; Then, SAS automatically FTP's the encrypted file. It worked well and met NIST standards for encryption and transmission of data at the time. Mark J. Lamias SAIC...

Re: How to pick records randomly from a data sets #4
I don't know much about STAT, but why don't you try something like: data test; set sashelp.class sashelp.class sashelp.class sashelp.class sashelp.class ; if mod(_n_,round(ranuni(0)*30+1))=0; run; ?? If you want less output use 40 instead of 30 after RANUNI(0). Don't know, if that does what you need... Gerhard On Fri, 23 Mar 2007 06:12:56 -0400, Arild S <sko@KLP.NO> wrote: >On Fri, 23 Mar 2007 07:10:33 +0100, Rune Runnest?ne@FASTLANE.NO> wrote: > >>The point of choosing only 2 data points per year i simplicity. The ...

Re: Secure method of FTPing data #4 1560272
Thanks for the tip! Rgds. On Tue, 16 Jan 2007 05:22:07 -0800, MikeInWV <mikeslover@GMAIL.COM> wrote: >If possible, look into using PGP software to encrypt the file before >sending it to an FTP server. You can obtain it from > > >One note---the recipient would need to have PGP and provide you with >their public key. I have recently gone through this process and it >works well. > > >ben.powell@CLA.CO.UK wrote: >> On Mon, 15 Jan 2007 20:50:50 -0800, daniel_coppin@BRADYCORP.COM wrote: >> >> >I am using the following ...

Determine compression method from compressed data
Hi; Maybe one of you recognizes some characteristic bits in the stream described below. My goal is to find a way to compress a different stream so that the same given decoder will be able to decompress it, but I haven't yet been able to determine the compression method nor its parameters. There is no header (no gzip,arj,...). Data is stored in contiguous chunks; each prefixed with one big-endian 16-bit word specifying the expected length (in bytes) of the chunk when uncompressed, and a second word specifying the number of bytes following in the compressed chunk. The uncompressed length...

Re: How to randomly select a certain data from a large dataset #4
X-OriginalArrivalTime: 06 Nov 2009 03:45:55.0563 (UTC) FILETIME=[A50AC3B0:01CA5E93] X-Scanned-By: Digested by UGA Mail Gateway on <> MIME-Version: 1.0 --_92f3e422-05d6-4a43-9388-3f7d08fb68d3_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Hi=2C =20 The output of the following code is approximately 2% sample. =20 data sample=3B set yourdata=3B if ranuni(2)<=3D0.02=3B /*approximately 2%=2C e.g: if ranuni(2)<=3D0.1 fo= r 10%*/ run=3B =20 Hope ...

Re: How to randomly select records from a large data set? #4
Melody, Remove the semicolon after ...small on the first line in: proc surveyselect data=melody.melody_2 out=melody.melody2_small; sampsize=8500; run; as proc surveyselect data=melody.melody_2 out=melody.melody2_small sampsize=8500; run; and try. Muthia Kachirayan On Jan 22, 2008 10:14 AM, Melodyp <> wrote: > On Jan 21, 3:47 pm, Melodyp <> wrote: > > Hello! I have a question on how to select about 8500 records from a > > nearly one million records dataset. > > > > The 8500 records must be randomly sel...

Re: random numbers not random? #4
-> If you want to roll your own, you can take two different RNGs, call both of -> them and multiply the results, use MOD as necessary to reduce the magnitude -> to what is desired, for each random number. :-) Hmmm... If both numbers are integers, then their product will have a zero in the rightmost position if *either* (or both) of the initial numbers have a zero there. So the probability of the product having a zero there is nearly 20%, instead of 10% which would be the case in truly random numbers. Likewise, the probability would be 75% that the product wou...

Re: Random contents using data (newbie question) #4
kpalak@WP.PL wrote: > >Hello for the first time! > >How do I create a table using "data" instruction and populate it with >random records? I need to create some sales report containing random >records with amount of items sold, salesperson ID, item ID, date of >transaction etc. > >Thank you in advance for any hints. > > >-- >Regards, > >Iwanow In addition to the great advice you ahve already seen, let me add another point. If you have some starting set of variables in a SAS data set, you can sample from that to get working test data sets...

Re: Method to distinguish different periods in time series data #4
I think that some good graphing can say a whole lot, and does not get into the potentially problematic area of doing statistical inference without thinking hard about it - always a dangerous area. Graphs can be helpful in getting your ideas together about next steps, and if done ethically and precisely they can be great descriptive tools to support your other analysis. Now you mentioned that you have several variables. Do you have a single target variable and the others potentially affect this one, or are all of the variables of equal or unknown importance? Sam On 10/15/07, kansaskannan@g...

Re: COMPRESS function working differently in Data Step and Macro #4
Guido, The suggestion you provided using %BQUOTE got rid of the WARNING messages, but did have the desired result in naming the HTML files. The COMPRESS function only removed spaces, not punctuation - and the 2nd and 3rd parameters on the COMPRESS function became part of the HTML file name. Ex. nh_St.Mary's,'','p'.html nh_AnneArundel,'','p'.html Any additional suggestions are appreciated. Thanks. Jack Clark Research Analyst Center for Health Program Development and Management University of Maryland, Baltimore County -----Origina...

Re: Randomly Select Observations on Infile Step (Large Data Sets) #4
Wensui: Your proposal does not insure that every case has equal likelihood of being included in the random sample. In particular the cases near the end-of-file will not have the same probability of being chosed as the cases new the beginning. I believe this is a good case for hash table use. Process the entire raw data file, using the hash table to hold all the variables for the 50 cases with the highest value of the random number. Obviously the hash table will start out with the first 50 cases from the raw file, but as new cases with higher random numbers are identified, the entry with th...

Re: Los Angeles Times: Low-Tech Methods Used in Data Theft #4
Clark wrote: > Consumers have been able to challenge adverse entries in their reports > for years. Reporting companies are required to investigate and remove > said item if it can't be substantiated. Furthermore, the consumer is > required to be told when a credit report was used to as a basis of an > adverse decision and is entitled to request a copy of that report, > even if they have their free annual report allowance used already. From what I read in the newspapers, that protection doesn't work very well in practice. In other words, the consumer ends...

Re: Compressing data sets (was Re: [SAS-L])
On Mon, 5 Jan 2004 18:14:02 -0700, Jack Hamilton <JackHamilton@FIRSTHEALTH.COM> wrote: >OK, "you or a program on your behalf will have to decompress them before >use". > > > >-- > >Manager, Technical Development >Metrics Department, First Health >West Sacramento, California USA > >>>> "Richard Graham" <> 01/05/2004 5:04 PM >>>> >Actually you can compress(WINZIP, PKZIP) SAS data sets and use them in >the >compressed format. There is software...

Re: Data step question for longitudinal data #4
>>> "Gerstle, John" <yzg9@CDC.GOV> >>> wrote <<<< Peter, What's the analysis question that needs to be addressed by the data? From your description (granted it's generalized), it's hard to determine exactly what form the data is needed for analysis. I usually start at the question to determine what form the data needs to be in for the final analysis, then look at the data I have and then connect the dots. :) Sometimes this is straight forward, and sometimes not. >>>> INTERESTING CODE SNIPPED Eventually, I am going to be ...

Re: Interview questions re data management #4
Peter: I'd begin with questions about storage of media, file management, database versions, and data security. Ninety six CD's containing one file each would not overwhelm anyone, but management of corrected versions of data, passwords, statistical reports, and metadata adds up to a daunting inventory problem. Ask the candidate to sketch a plan for managing data receipts and deliveries. I would expect to see evidence that the candidate understands receipt control, data security concerns, and production processes. He or she should know to archive original media immediately after copying...

Re: combining data between different data sets #4
On Tue, 19 Jul 2005 07:35:18 -0700, rss <rslotpole@COMCAST.NET> wrote [in part]: >In English: Ibm, dec and hwp. We own 100 shares of each. They pay a >dividend. You multiple shares * dividend and then sum them to get the >total dollars received. We want to add this value to the cash value >and really don't want to associate it with every record in the dataset. > Ticker is by date and our cashvalues are also by date > >Open two different data sets; >By date; >If ticker (from first data set) = '*$$$' then > Endshare (from first data set) =...

Re: Random Numbers again : WAS RE: genetic inheritence in a #4
Subject: Randomness when switching seeds. #iw-value=1 I ran your code. the step took 61 seconds. (Thankful that I did not need 100,000 numbers.) I also ran data q ; seed = 0 ; do obs = 1 to 1000 ; call ranuni (seed,r) ; if r > .01 then seed + 1 ; r2 = ranuni ( 0 ) ; output ; end ; run ; This took .01 seconds (1/600 the time) and produced an approved random sequence R2 as well as one, which like yours kept switching random sequence streams. Finally, I made a format grouping by tenths and did freqs. Your sequence produced r ...

Re: Converting surveymonkey data into sas data #4
Dennis , Well that would depend on whether the character value is '3' '4' '5' or '3-5'. If it is discrete then yes: '0' = 1 '1' , '2' = 2 '3' , '4' , '5' = 3 Would be how I would put in a format. Otherwise I would have to go with what I wrote. Toby Dunn From: "Dennis G. Fisher" <dfisher@CSULB.EDU> Reply-To: "Dennis G. Fisher" <dfisher@CSULB.EDU> To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Converting surveymonkey data into sas data Date: Mon, 3 Apr 2006 09:46:02 -0700 Why wo...

Re: make a data set from an empty data? #4
OKay Kevin , So I posted the wrong code its been one heck of a day. I cant remember when I wrote so much code. My fingers are worn down to the nub and my mind what little I have is threatening a strike. try this: proc sql; create table tmplt (a char(5), b char(10), c num); data xx; If _N_ = 0 then set tmplt; a='abc'; b='xyz'; d=a||b; output; run; proc contents data = xx ; run ; proc print data = xx ; run ; Toby Dunn From: Kevin Roland Viel <kviel@EMORY.EDU> Reply-To: Kevin Roland Viel <kviel@EMORY.EDU> To: SAS-L@LISTSERV.UGA.EDU Subject: Re: make a...

Re: Help on data look up between two data set #4
Hi Mark, Thank you for the elaboration. My first impression about your code is: wow,= the code can be written in this way! I am pretty new to SQL ( 3 months user now) and never know 'between' can = be used in SQL, and didn't know the comparison expression can yield 1/ 0 an= d generate a new variable at the same time. This is very helpful and I alr= eady see how to apply this technique to my other data. Have a great weekend. Sophia > Subject: RE: Re: Help on data look up between two data set > Date: Fri, 2 Nov 2007 10:45:23 -0700 > From: ...