Random access to content of archived files without extraction - how? compression scheme? algorithm? source code? tutorial? software?

  • Follow


"Random access to content of archived files without extraction - how? 
compression scheme? algorithm? source code? tutorial? software?"

I tried already to Google e.g.

"Which compression software allows random access to compressed files 
without extracting them from archive first?"

and other keywords. I have also checked out the
   http://datacompression.info/
site, but

still failed to find a compression scheme which allows me to tell the 
extractor:
   get from file X.dat in archive Y.ext a 100 byte large section
   beginning at offset 0xFFEEDDCC

where file X.dat is e.g. a 10 GByte large file (with huge sections 
filled with same bytes) compressed into the archive Y.ext .

Is there any compression scheme supporting extraction of sections from 
files in archive without extracting the entire file first (transparent 
random access to content of archived file)?

Any hints are appreciated.

Claudio Grondi

0
Reply Claudio 6/29/2006 4:08:41 PM

Claudio Grondi wrote:
> Is there any compression scheme supporting extraction of sections from
> files in archive without extracting the entire file first (transparent
> random access to content of archived file)?

Yes, but this is usually a function of the archive format.  For
example, bzip2 handles data in 900K chunks, so if you know the data you
want is in the 200th chunk, you just seek 200*900K and decompress just
that chunk.

I'm not aware of any scheme that allows "transparent random access" to
the uncompressed data in a compressed file.  You could certainly write
a wrapper to anything, but the performance would take a massive hit (in
the simplest implementation, the entire file would be decompressed
anyway and stored in temporary cache to be accessed by the wrapper).

What do you wish to apply this behavior toward?

0
Reply Jim 6/29/2006 5:05:01 PM


Jim Leonard wrote:
> Claudio Grondi wrote:
> 
>>Is there any compression scheme supporting extraction of sections from
>>files in archive without extracting the entire file first (transparent
>>random access to content of archived file)?
> 
> 
> Yes, but this is usually a function of the archive format.  For
> example, bzip2 handles data in 900K chunks, so if you know the data you
> want is in the 200th chunk, you just seek 200*900K and decompress just
> that chunk.
If I understand it right, to know in which chunk the data is, it will be 
necessary to know about the exact distribution of compression ratio over 
the entire uncompressed file. Am I right here to learn from this, that 
bzip2 format is not suitable for what I want to achieve without the need 
of writing additional code putting marks into the original data (to make 
search for the appropriate compressed chunk possible) or adopting the 
bzip2 compression scheme by storing along each 900K chunk the offsets of 
start and end of the expanded section in the uncompressed file?

> I'm not aware of any scheme that allows "transparent random access" to
> the uncompressed data in a compressed file.  You could certainly write
> a wrapper to anything, but the performance would take a massive hit (in
> the simplest implementation, the entire file would be decompressed
> anyway and stored in temporary cache to be accessed by the wrapper).
> 
> What do you wish to apply this behavior toward?
> 
Storing e.g. a huge amount of highly redundant experimental data for 
later analysis can take hundreds of gigabytes, so it makes sense to 
compress them down to some tenths of gigabytes. Having many of such 
files on a hard disk there is no more free space available to extract 
any content prior to processing it.

What comes to my mind in this context is, that it is maybe possible to 
expand the entire file up to the requested section by letting the stream 
of uncompressed bytes provided by the archiver API go to nowhere while 
counting the passing bytes and catch only the bytes of the section which 
is to retrieve (that is probably the performance bottleneck you speak 
about as it is necessary to expand the entire content prior to the 
requested section from the archive anyway).

If there are no compression schemes capable of what I want to achieve, 
which free archiver with available source code capable of compressing 
files with sizes up to 400 GBytes (current standard hard drive size) 
provide an API I can use to let only count the streamed expanded bytes 
up to the section I need without keeping them on disk or in memory?
If I understand it right bzip2 is one of such compression schemes, 
right? Any other?

Claudio Grondi
0
Reply Claudio 6/29/2006 7:06:48 PM

> still failed to find a compression scheme which allows me to tell the
> extractor:
>    get from file X.dat in archive Y.ext a 100 byte large section
>    beginning at offset 0xFFEEDDCC
> where file X.dat is e.g. a 10 GByte large file (with huge sections
> filled with same bytes) compressed into the archive Y.ext .
(maybe a silly question) Does the offset known by the user is referred
to uncompressed data or to compressed data?
In the first case, that seem to me the more probable from the user's
point of view, you can simply write a very simple compression scheme
from the compression primitive that works at best on your data,
compressing a fixed size of input data to a size you store in a
variable that you write to the output file along with compressed data.
The compressed data may be structured in blocks like this:
|- compressed size in B (an uint) -|- compressed data block -|-etc..
When you need to recover data at a certain point (referred to the point
the data occupies in the original uncompressed stream) you simply have
to read the first field (the size of the block, of a fixed size that is
the size of the type of variable you use, in this example 4 byte) and
then read (and discard) the n bytes of the second field of the block
(the compressed data), then you have the first field of the next block
and so on.
In this way you can simply parse the first field data (and simply and
quickly discard the second field data) until you reach the block were
the desired offset (of uncompressed data) begins, then start
uncompressing only the blocks containing the data you need.
That happens because if the compression scheme doesn't keep a state
from one block to the another (doesn't learn from previous data), you
don't need to uncompress all previous block to uncompress a random
block in the sequence, you have simply to "walk" trough the compressed
strem using a scheme like this one and the expand only the bolck(s)
were your data is, saving a lot of computing time.
And the scheme doesn't pose a limit to output file size (but the
underlying filesystem will do!).
You could do it even more efficiently building an index of the blocks
ath the beginning of the compressed file (a zone where are saved all
the sizes that in the previus scheme was called the first field), that
will save you some time when searching for the block with the data you
desire (basically because you have few data to read from disk, that is
a seriuos bottleneck, maybe if the file is small enough to be stored in
memory the gain would not be so noticeable), but this is more tricky to
program.
Obviously that approach poses some drawbacks, since the compressor will
work only on single blocks without learning from previous data, so the
compression level will be reduced.
(ps, you will obviously also need a mechanism to control the data
against casual or malicious corruption, depending from your scenery,
but it's OT)

0
Reply giorgio 6/30/2006 7:29:05 AM

Claudio Grondi wrote:
> 
> "Random access to content of archived files without extraction - how? 
> compression scheme? algorithm? source code? tutorial? software?"
> 
> Any hints are appreciated.

> Storing e.g. a huge amount of highly redundant experimental data for 
> later analysis can take hundreds of gigabytes. Having many of such 
> files in compressed form on a hard disk, there is no free space 
> available to extract any content prior to processing it. 

First thanks to Jim and Giorgio for their replies.

Can I conclude from the two replies I have got (status: 2006-06-30 10:51 
GMT+1), that my (above mentioned) requirement for a compression scheme 
(transparent random access to file content of files in archive) could be 
with relatively small effort built into some (or even all?) of existing 
ones, but there is no archiving/compression software directly supporting it?

Is there something wrong with my approach?

Is there any reason for this situation I am (most probably) not aware of 
of?

Is it, that lack of storage space for expanding compressed files is not 
worth to be considered when developing compression software, especially 
because costs of directly accessible (i.e. not such for backup purposes) 
storage can be neglected from the point of view of commercial 
requirements? Should I just purchase some more large hard disks instead 
of trying to find appropriate compression software supporting what I 
want to achieve?

If it is as supposed above, why should I then use compression software 
and if at all, then what else as for backup and data trasmission over 
slow channels purposes for?

Claudio Grondi
0
Reply Claudio 6/30/2006 9:22:06 AM

> Is there something wrong with my approach?
I don't know if it's feasible in your case and if it will fit your
needs, so I'll not say you are using the wrong approach, but I would
not save the data as a single stream and then try to recover specifical
data from the offset position in it.
I'll rather insert a routine in the program that generate the data that
save the data of each day (or week, or hour, or when a given size is
reached...) as a separate file with progressive name.
If it's not possible, I'll use some file splitting tools, like split in
*x, or HJsplit in Windows, or my splitter of File Tools in both, to
split the data file in files of given size.
Then you can archive them (i.e. zip, ace or 7z archive) and you will be
able to extract the single file of the time or of the offset you need,
at the cost of some overhead in archive size due to the need to save
(at least) the name of the input files.
I'll use an archiver that archive and compress in a single pass; it's
inportant in order to don't have to read/write the data multiple times
form disk, like tar+gzip would do, because you will work on huge amount
of data and minimize disk operation is important for performance and
for disk occupation during the operation.
The important thing is NOT to create a solid archive, because in a
solid archive the compression algorithm "learn" from data from the
beginning to the end of the archive, so you will not able to extract a
single output file (however, the solid archive approach makes
definitely sense because allow to train better the comptression
algorithm and reach better compression ratio).

Otherwise you should develope your solution, and for that I'll look at
zcompress and zuncompress in zlib that are a great example in how to
compress buffers in memory to achieve the ability to decompress needed
buffer without the need to run the (de)compression algorithm on the
wole file.
If you implement something similar the scheme hinted in my previous
post, using zcompress and zuncompress, what you need to code would be
basically reduced to read a buffer of size, i.e. "s" of data from your
data stream, pass it to compress function of zcompress, save it to
output with his compressed size (to allow uncompress to work), i.e. n.
For decompressing, read the size variable n, fill a buffer with n
compressed bytes, pass the buffer to uncompress function of uncompress
(along with the expected uncompressed size s), write the s byte of
uncompressed data. Or, to skip a block, read the size variable n, skip
the next n bytes (that will bring you to the offset s of the
uncompressed data stream), read the next size variable... until you
reach a block you want to uncompress.

0
Reply giorgio 6/30/2006 10:46:33 AM

It's easy to construct a coding scheme that allows random acces to
files in O(N^0.5) with O(N^0.5) extra memory. In my language is called
"smenul lui Batog" (don't search google, you'll find no results).

N = is the length of the file (no symbols). Encrypt the file with
static huffman coding.
Lets have M = sqrt(N) (you can choose whatever value you want, but take
in account the overhead). Divide the file in blocks of M symbols and
save a pointer to each start of the block in the compressed file.
Accessing a block is O(1), but searching within the block is O(M). The
overhead is O(N/M) = O(M) because you have to remember the pointers in
the compressed file.

Random access in a file compressed with an adaptive scheme is very
difficult, because each symbol is encoded based on history. Moreover,
if AC is used as an entropy coder random access is almost impossible
because each symbol is coded with a fractionar number of bits.

Claudio Grondi wrote:
> "Random access to content of archived files without extraction - how?
> compression scheme? algorithm? source code? tutorial? software?"
>
> I tried already to Google e.g.
>
> "Which compression software allows random access to compressed files
> without extracting them from archive first?"
>
> and other keywords. I have also checked out the
>    http://datacompression.info/
> site, but
>
> still failed to find a compression scheme which allows me to tell the
> extractor:
>    get from file X.dat in archive Y.ext a 100 byte large section
>    beginning at offset 0xFFEEDDCC
>
> where file X.dat is e.g. a 10 GByte large file (with huge sections
> filled with same bytes) compressed into the archive Y.ext .
>
> Is there any compression scheme supporting extraction of sections from
> files in archive without extracting the entire file first (transparent
> random access to content of archived file)?
> 
> Any hints are appreciated.
> 
> Claudio Grondi

0
Reply Alexandru 6/30/2006 2:59:18 PM

try doctor dobbs journal archive on data compression
various utilities to carry out bzip2 like compression in stages.

you should be able to write a modified pipe script, with|
and make a small script which interleaves a datum compression ID every
1024 bytes or so for easy finding. then indexing into the file every
1KB would give the current file in compression, when a datum ID equal
or higher than the one you want is found the rewind iKB and start
decompression. The end of datum markers can be used to count upto the
required datum. and end of datum marker system could be replaced with a
compressed bits left in datum length indication, but you would also
need to have an extra to datum end count placed with the 1KB current
datum ID

if IDs are un ordered an index at the end of the file should provide an
optimal solution.

cheers, I do no have my KTree class working yet!!

0
Reply jacko 6/30/2006 4:26:26 PM

Claudio Grondi wrote:
> Is there any compression scheme supporting extraction of sections from
> files in archive without extracting the entire file first (transparent
> random access to content of archived file)?

It can certainly be done, but I don't know of any off-the-shelf code to
do it, at least not in the sense you describe.  I wrote some example
code in the zlib distribution showing how it can be done with zlib,
examples/zran.c, that you can look at.  For any scheme, what you do is
build a random access index that provides the information that allows
you to a) figure out about where an uncompressed offset is in the
compressed data, and b) allow you to start decompressing some short
distance before that location to get your data.  A parameter of the
process is how far apart the index points be and still give you the
speed of random access you want.

Such indices can be built while compressing, or after the fact.
Another trade is to either store in the index data the uncompressed
stream history needed for decompressing at that point, or simply throw
away the history at each index point and start compression over.  That
impacts the compression ratio to some degree depending on how frequent
the index points are.  For examples like you give, where you want
random access to 10 GB, resetting a compressor like zlib every 1 MB
will have very little effect on compression, and permit 10,000 index
points into the data without the need to store history data for each
index.

mark

0
Reply Mark 6/30/2006 5:31:15 PM

On Thu, 29 Jun 2006 18:08:41 +0200, Claudio Grondi <claudio.grondi@freenet.de> writes:

> "Random access to content of archived files without extraction - how? 
> compression scheme? algorithm? source code? tutorial? software?"

> Is there any compression scheme supporting extraction of sections from
> files in archive without extracting the entire file first (transparent
> random access to content of archived file)?

If the data is read-only, then squashfs under linux will probably do
it. I've never tried it on a 100gb filesystem, but it would be easy to
test, and it would work transparently.

http://squashfs.sourceforge.net/

Or, you can try to roll your own in the way that Alexandru described.

Scott
0
Reply Scott 6/30/2006 10:43:33 PM

Mark Adler wrote:

> Claudio Grondi wrote:
>> Is there any compression scheme supporting extraction of sections from
>> files in archive without extracting the entire file first (transparent
>> random access to content of archived file)?
> 
> It can certainly be done, but I don't know of any off-the-shelf code to
> do it, at least not in the sense you describe.  I wrote some example
> code in the zlib distribution showing how it can be done with zlib,
> examples/zran.c, that you can look at.  For any scheme, what you do is
> build a random access index that provides the information that allows
> you to a) figure out about where an uncompressed offset is in the
> compressed data, and b) allow you to start decompressing some short
> distance before that location to get your data.  A parameter of the
> process is how far apart the index points be and still give you the
> speed of random access you want.

Is it possible to do what zran does on the compression phase? If so, is
there any example code?

Paul
0
Reply Paul 7/1/2006 6:14:13 PM

Paul Marquess wrote:
> Is it possible to do what zran does on the compression phase? If so, is
> there any example code?

Actually, it would be much easier during compression.  zran was written
to handle the case where the compressed stream is provided with no
consideration for random access, and then needs to be prepared for
random access.

When you're in control of the compression, you would simply use the
Z_FULL_FLUSH option of zlib's deflate() function periodically, and keep
a record of the compression and uncompressed offsets for each flush.
Then you can start decompressing from any of those points.  As I
mentioned, an index spacing of about 1 MB would have almost no effect
on compression, and would provide relatively fast random access for
very large files.

There is no example code, but it would be straightforward to implement.

mark

0
Reply Mark 7/1/2006 11:42:06 PM

Mark Adler wrote:
> Paul Marquess wrote:
> 
>>Is it possible to do what zran does on the compression phase? If so, is
>>there any example code?
> 
> 
> Actually, it would be much easier during compression.  zran was written
> to handle the case where the compressed stream is provided with no
> consideration for random access, and then needs to be prepared for
> random access.
> 
> When you're in control of the compression, you would simply use the
> Z_FULL_FLUSH option of zlib's deflate() function periodically, and keep
> a record of the compression and uncompressed offsets for each flush.
> Then you can start decompressing from any of those points.  As I
> mentioned, an index spacing of about 1 MB would have almost no effect
> on compression, and would provide relatively fast random access for
> very large files.
> 
> There is no example code, but it would be straightforward to implement.
> 
> mark
> 
Warning, idea below not yet tested, but I think it should work.
The only problems I see with the approach below is, that the number of 
files in an archive can be somehow limited and the random access to 
files in archive can be much slower compared to the approach described 
above.
What are the limits of the known archiver like 7zip, zip, bzip2, etc. in 
terms of maximal number of files which could be stored in an archive?
Which archiver support random access to the archived files?

Having considered all of the currently known to me options, it appears, 
that the simplest approach almost without need for any programming is 
(it will be probably what I will use in case it will work for me as 
expected):

1. split a large file into chunks by some file splitting application 
storing the files with the chunks into a directory. The files have now 
names numbered by the splitting application.

2. compress the directory of the chunk files by a compression program 
supporting  random access to its files as it is the case for zip based 
compression, right?
By the way: which other archivers support random access to their files?

3. determine which file name(s) is(are) to use to extract the required 
chunk(s) of the slice from the original file.

4. extract all necessary files, put them together using the splitting 
application and seek to the requested slice

The advantage of this approach is, that all the steps above can be done 
easily even by hand or in a very simple script.

Who goes for the limits and uses the software API to access file content 
in an archive is expected to be able to solve his problems with random 
access to archived content himself.
Is that maybe the reason, why the archiving software does not cover this 
special case of use of compression?

Claudio Grondi
0
Reply Claudio 7/2/2006 8:57:56 AM

If you want random write acces too then squashfs or something similar
may or may not be of use, as im not sure if it has such a feature. a
new file system based arround the virual file system structure may be
of more benifit.

so you need to insert some bits or a new file/object then?

here goes ...
sector allocation is stored in special inode sectors which also pint to
directory sectors which contain the filename information.

the inode density limits the number of files/directories on the disk.
and is a trade of between wasted space and wasted space.

you'll need some kind of hash bucket system to locate the file/object
(decompressed) fast by name, and maybe some granularity of the file
itself. i suppose disk sectors compressed are the best object type.

sector method (including null for uncompressed), sector re-id to get
the real sector of the compressed virtual sector. now if two virtual
sectors are allocated per real sector then starting at the end or
begining of sector gives two know ends. Any space left between the two
parts of a sector if any is stored if a free complexity list showing
free bits in sector. migration of sector halves between cached sectors
gives resulting locality and any possible tighter fitting.

this should give 2:1 random full access. with vfs atop the virtual
sector system!!


nesting of the virtual sectoring system can be achived by having nested
compression with the inter part sector space bing treated as
compression maximal engineered algorythmic data.

good hunting :)

0
Reply jacko 7/2/2006 8:56:28 PM

if i had the will and time i'd do sector compression via a sector
burrows wheeler transform, followed by move to front followed by 256
symbol arithmetic coding. This keeps byte alignment, or is 16 bit word
alignment via 64K word arithmetic coding better?

i have afeeling that byte alignment loses 4 bits on the compression
tail, and gains 3 on the sector  compressed length. The compressed
length does not have to be stored, as decompression is to fixed size
from known start location.

so its down to a speed or few extra bits trade off.

cheers

0
Reply jacko 7/2/2006 9:46:10 PM

Come to think of it better compression would be achieved if the sectors
were fuctionally mapped, and no sector indirection information had to
be stored.

I'm sure the "occupied" or sectors with too small a capacity for
certain data could be considered relatively bad or virtually allocated.

This would present the grand trade off between speed and size of 2
virtual sector algorithm indicators per real sector used for virtual
sector storage. Highly compressable data could be held in nested
virtual sectors with "real" sector numbers exceeding mapping to real
sectors. ( this at 4 virtual sector algorithm indicators per real
sector, stored in virtual sectors.)

this is an easy functional mapping

the virtual sector algorithm indicators should be stored hash table
fashion ? or is that too much given away?

cheers

0
Reply jacko 7/2/2006 10:46:56 PM

Let me jump in here ;-)

Claudio Grondi schrieb:

> Can I conclude from the two replies I have got (status: 2006-06-30 10:51 
> GMT+1), that my (above mentioned) requirement for a compression scheme 
> (transparent random access to file content of files in archive) could be 
> with relatively small effort built into some (or even all?) of existing 
> ones, but there is no archiving/compression software directly supporting 
> it?

Into most compressors, IMO.
Every decompressor will use an internal buffer, for a more or less large 
part of an archive. A filter can be installed between this buffer and 
the output file, which can be instructed which part(s) of the entire 
stream really should be stored.

> Is there something wrong with my approach?

Nothing, so far, your requirements fit perfectly into my own archiver 
project :-)

> 
> Is there any reason for this situation I am (most probably) not aware of 
> of?

The implementation may require modifications to the decompressor source 
code, unless the decompressor output can be piped through that filter. 
For better performance modifications are required, too, so that the 
decompressor can skip over blocks that are discarded anyhow.


> Is it, that lack of storage space for expanding compressed files is not 
> worth to be considered when developing compression software, especially 
> because costs of directly accessible (i.e. not such for backup purposes) 
> storage can be neglected from the point of view of commercial 
> requirements? Should I just purchase some more large hard disks instead 
> of trying to find appropriate compression software supporting what I 
> want to achieve?

IMO you are better off with randomly accessible uncompressed data on a 
phyiscal disc (or RAID...). I'd also suggest to split your data into 
files (of fixed size), for several reasons:

- files can be backed up, deleted and restored randomly.
- multi-file archives (zip...) already allow for random access to parts 
of the entire information.
- adding an new file is easier, faster and more flexible than adding 
data to an existing archive.

You also may use DVDs to hold your data, instead of hard discs, provided 
that access time is acceptable, or that you can hold an "working set" of 
your data on a HD.

DoDi
0
Reply Hans 7/9/2006 3:59:31 AM

16 Replies
276 Views

(page loaded in 0.298 seconds)

Similiar Articles:








7/25/2012 2:20:40 AM


Reply: