ot: storage file formats

  • Follow


there is a common need to store data in files, and file formats are often 
ad-hoc and much effort is duplicated, but then again, there is not that much 
better. lack of common general purpose formats, endless need to recreate 
formats. lots of duplication.

traditionally there has been RIFF and IFF as storage formats, though they 
have well known limits (my biggest complaints have been fourcc only tags and 
serial-only access).

interesting was EBML:
http://www.matroska.org/technical/specs/index.html
which is imo in some ways both better and worse than RIFF (it gives up 
fourcc tags in favor of a less-expressive system).


recently I designed a format I called XLIFF:
http://bgb-sys.sourceforge.net/XLIFF_0_1_1_Draft_Spec.html
as a kind of binary xml like format.

I wrote a lib for it, and gained slight size reduction when compared with 
textual xml. it could offer some advantages over text xml, eg: storage of 
binary data, possibly later using schemas or such to reduce space 
requirements, ability to skip over sub-branches, ...

sadly, it is serial (along with the prior formats). non-serial access leads 
to other possibilities, and the ability to have read/write access allows 
more.
I don't know of any really common read/write formats though.


an idea that came up with one format was to use something like my memory 
manager, eg: the file is divided up into "chunks" and "cells" (and also 
"blocks"). it is intending to allow random read/write access.

this format is in much earlier phases, and is likely to represent a 
riff-like tree structure (since, eg, for something xml-like it would have a 
bit of extra space overhead). unlike riff it will allow either fourcc tags 
or string tags. as for string tags, the management of string tables and the 
hash mechanism are still being thought about.


eg: the file consists of a number of chunks (lower level). the file is 
extended with new chunks as new space is needed. similarly, all points in 
the file also have cell and block indices, and thus things need to be 
properly aligned (this is ok, since chunks are typically powers of 2 at or 
greater than the block size).

chunks use fourcc's to identify themselves, allowing for new types of chunks 
to be added later. similarly, the file header also sticks within the chunk 
layout. the eof is marked by a null chunk.

within cell data chunks, objects are allocated as runs of 16 byte cells, 
using a bitmap for the management of cells. this is intended for smaller 
objects (though, the 16 byte chunky effect is signifigant).

larger or variable size objects are allocated in "block chunks", which are 
basically arrays of 1KB blocks and a kind of fragment of a fat-table like 
structure.

any object is referred to by its file cell index, which points to the object 
header. for the case of blocks, it may be possibly to use the lower 6 bits, 
eg, for flags (for now they are all 0, given the addresses are still cell 
based).

some thoughts keep popping up as to how it might be possible to 
garbage-collect the format, but I am unsure. it is possible the file sizes 
may get large and garbage collection could be slow. ideas persist though, 
but I am imagining a gc would have to keep some context (eg: kind of an 
"chunk A references chunk B", "chunk B references C", "C references A and B" 
table, along with probably needing to name the referenced objects...).
retaining context could allow determining when objects are no longer used 
(if there are no incomming references, and no references within the chunk, 
and no references otherwise, then it is safe to free an object).

ok, so the garbage collector could use an lru cache or such and occasionally 
garbage collect chunks. also possible would be to just step sequentially.
this may be unnecessary however, maybe everything could be manual.


I could have used a b-tree, but b-trees would be more complicated here and I 
wasn't sure they could really offer better performance for this kind of use. 
a b-tree, however, could have probably saved space.


or such...



0
Reply cr88192 (24) 11/30/2004 2:30:14 AM

> recently I designed a format I called XLIFF:
> http://bgb-sys.sourceforge.net/XLIFF_0_1_1_Draft_Spec.html
> as a kind of binary xml like format.
>
> I wrote a lib for it, and gained slight size reduction when compared with 
> textual xml. it could offer some advantages over text xml, eg: storage of 
> binary data, possibly later using schemas or such to reduce space 
> requirements, ability to skip over sub-branches, ...
>
> sadly, it is serial (along with the prior formats). non-serial access 
> leads to other possibilities, and the ability to have read/write access 
> allows more.

But it is good as a serial format, it is relatively compact, and should be 
fast to read.
Writing could be simple or complex, DTD's or Schemas are used to create 
dictionaries, or dictionaries could be created from the actual XML. 
Producing optimally compacted data requires either schemas or a two pass 
afair.

> I don't know of any really common read/write formats though.

This is a challenge, random access, one off write is a less of a challenge 
than random access read/write, but not as useful although maybe easier.

> an idea that came up with one format was to use something like my memory 
> manager, eg: the file is divided up into "chunks" and "cells" (and also 
> "blocks"). it is intending to allow random read/write access.

Some form of container or chuncking is required.

> this format is in much earlier phases, and is likely to represent a 
> riff-like tree structure (since, eg, for something xml-like it would have 
> a bit of extra space overhead). unlike riff it will allow either fourcc 
> tags or string tags. as for string tags, the management of string tables 
> and the hash mechanism are still being thought about.
>
> eg: the file consists of a number of chunks (lower level). the file is 
> extended with new chunks as new space is needed. similarly, all points in 
> the file also have cell and block indices, and thus things need to be 
> properly aligned (this is ok, since chunks are typically powers of 2 at or 
> greater than the block size).

Chunks should have a size and grain defined in the header, these could be 
changed for compacted files.

> chunks use fourcc's to identify themselves, allowing for new types of 
> chunks to be added later. similarly, the file header also sticks within 
> the chunk layout. the eof is marked by a null chunk.
>
> within cell data chunks, objects are allocated as runs of 16 byte cells, 
> using a bitmap for the management of cells. this is intended for smaller 
> objects (though, the 16 byte chunky effect is signifigant).

I do not like the use of bitmaps, would prefer unused chunks to have chained 
pointers for both sequential location (maybe even bidirectional) and size. 
As it is in unused space it is not an overhead.

> larger or variable size objects are allocated in "block chunks", which are 
> basically arrays of 1KB blocks and a kind of fragment of a fat-table like 
> structure.

Your talking about two levels of allocation, course and fine grain ?

> any object is referred to by its file cell index, which points to the 
> object header. for the case of blocks, it may be possibly to use the lower 
> 6 bits, eg, for flags (for now they are all 0, given the addresses are 
> still cell based).

Absolute file offset addresses maybe better and faster requiring no 
calculations.

> some thoughts keep popping up as to how it might be possible to 
> garbage-collect the format, but I am unsure. it is possible the file sizes 
> may get large and garbage collection could be slow. ideas persist though, 
> but I am imagining a gc would have to keep some context (eg: kind of an 
> "chunk A references chunk B", "chunk B references C", "C references A and 
> B" table, along with probably needing to name the referenced objects...).
> retaining context could allow determining when objects are no longer used 
> (if there are no incomming references, and no references within the chunk, 
> and no references otherwise, then it is safe to free an object).
>
> ok, so the garbage collector could use an lru cache or such and 
> occasionally garbage collect chunks. also possible would be to just step 
> sequentially.
> this may be unnecessary however, maybe everything could be manual.

GC's not my strong point, but reference counting seems a bit of an overhead, 
some form of mark and sweep maybe better.

> I could have used a b-tree, but b-trees would be more complicated here and 
> I wasn't sure they could really offer better performance for this kind of 
> use. a b-tree, however, could have probably saved space.

b-tree's do not really fit the bill ?

> or such...

'or such....' ?

Aaron


0
Reply Aaron 12/4/2004 4:53:44 PM


"Aaron Gray" <angray.OFFSPAMOFF@beeb.net> wrote in message 
news:41b1ebe9$0$9360$ed2619ec@ptn-nntp-reader02.plus.net...
>> recently I designed a format I called XLIFF:
>> http://bgb-sys.sourceforge.net/XLIFF_0_1_1_Draft_Spec.html
>> as a kind of binary xml like format.
>>
>> I wrote a lib for it, and gained slight size reduction when compared with 
>> textual xml. it could offer some advantages over text xml, eg: storage of 
>> binary data, possibly later using schemas or such to reduce space 
>> requirements, ability to skip over sub-branches, ...
>>
>> sadly, it is serial (along with the prior formats). non-serial access 
>> leads to other possibilities, and the ability to have read/write access 
>> allows more.
>
> But it is good as a serial format, it is relatively compact, and should be 
> fast to read.
> Writing could be simple or complex, DTD's or Schemas are used to create 
> dictionaries, or dictionaries could be created from the actual XML. 
> Producing optimally compacted data requires either schemas or a two pass 
> afair.
>
yes.

>> I don't know of any really common read/write formats though.
>
> This is a challenge, random access, one off write is a less of a challenge 
> than random access read/write, but not as useful although maybe easier.
>
agreed.

>> an idea that came up with one format was to use something like my memory 
>> manager, eg: the file is divided up into "chunks" and "cells" (and also 
>> "blocks"). it is intending to allow random read/write access.
>
> Some form of container or chuncking is required.
>
yes, and that was done.
chunks have a variable power of 2 size greater than the size of a block.
a block has a power of 2 size greater than that of a cell.
cells also have a power of 2 size.

>> this format is in much earlier phases, and is likely to represent a 
>> riff-like tree structure (since, eg, for something xml-like it would have 
>> a bit of extra space overhead). unlike riff it will allow either fourcc 
>> tags or string tags. as for string tags, the management of string tables 
>> and the hash mechanism are still being thought about.
>>
>> eg: the file consists of a number of chunks (lower level). the file is 
>> extended with new chunks as new space is needed. similarly, all points in 
>> the file also have cell and block indices, and thus things need to be 
>> properly aligned (this is ok, since chunks are typically powers of 2 at 
>> or greater than the block size).
>
> Chunks should have a size and grain defined in the header, these could be 
> changed for compacted files.
>
well, for various resons the relations have to hold.
actually, the exact chunk size doesn't need to be specified as each chunk 
includes its own size, and there is a common header structure for all the 
chunks.
the file header is also a chunk, albeit a smaller one (still 1 block for 
alignment reasons).

>> chunks use fourcc's to identify themselves, allowing for new types of 
>> chunks to be added later. similarly, the file header also sticks within 
>> the chunk layout. the eof is marked by a null chunk.
>>
>> within cell data chunks, objects are allocated as runs of 16 byte cells, 
>> using a bitmap for the management of cells. this is intended for smaller 
>> objects (though, the 16 byte chunky effect is signifigant).
>
> I do not like the use of bitmaps, would prefer unused chunks to have 
> chained pointers for both sequential location (maybe even bidirectional) 
> and size. As it is in unused space it is not an overhead.
>
err, at 16 bytes, assuming 4 bytes/link, this is notable overhead.
for runs, it depends on the average size of objects and fragmentation.

my experience has been good with bitmaps, and they are commonly used in many 
filesystems as typically allocating blocks with bitmaps is faster, eg, than 
using a fat like structure (excluding an explicit free list or similar).

>> larger or variable size objects are allocated in "block chunks", which 
>> are basically arrays of 1KB blocks and a kind of fragment of a fat-table 
>> like structure.
>
> Your talking about two levels of allocation, course and fine grain ?
>
yes.

might need to do something here though, things started getting pretty ugly 
(mostly dealing with the issue of possibly fragmented data between blocks 
and cells).

I could simplify, eg, blocks can only be linked with blocks (except maybe in 
the case of end-packing). cell based objects could be made fixed size.

this would both eliminate complexity and reduce size overhead for objects.
this would, however, have an effect on semantics.

>> any object is referred to by its file cell index, which points to the 
>> object header. for the case of blocks, it may be possibly to use the 
>> lower 6 bits, eg, for flags (for now they are all 0, given the addresses 
>> are still cell based).
>
> Absolute file offset addresses maybe better and faster requiring no 
> calculations.
>
shifts are not that expensive.
anyways, the cost of seeking to part of a file and reading is far greater 
than the cost of a shift.

also, since there is shifting involved, the format will be able to go a bit 
above 4GB without needing to be modified, and I can boost the cell and block 
sizes after that to push it further (like with fat-16).

>> some thoughts keep popping up as to how it might be possible to 
>> garbage-collect the format, but I am unsure. it is possible the file 
>> sizes may get large and garbage collection could be slow. ideas persist 
>> though, but I am imagining a gc would have to keep some context (eg: kind 
>> of an "chunk A references chunk B", "chunk B references C", "C references 
>> A and B" table, along with probably needing to name the referenced 
>> objects...).
>> retaining context could allow determining when objects are no longer used 
>> (if there are no incomming references, and no references within the 
>> chunk, and no references otherwise, then it is safe to free an object).
>>
>> ok, so the garbage collector could use an lru cache or such and 
>> occasionally garbage collect chunks. also possible would be to just step 
>> sequentially.
>> this may be unnecessary however, maybe everything could be manual.
>
> GC's not my strong point, but reference counting seems a bit of an 
> overhead, some form of mark and sweep maybe better.
>
well, a single mark/sweep pass on possible hundreds of MB or GB of file data 
seems unreasonable. better to split the gc into a number of small operations 
so that it can be done gradually...

>> I could have used a b-tree, but b-trees would be more complicated here 
>> and I wasn't sure they could really offer better performance for this 
>> kind of use. a b-tree, however, could have probably saved space.
>
> b-tree's do not really fit the bill ?
>
not really.

b-trees would be better for searches of some sort (eg: look up object by 
name), or if I expect the object to move around a lot, or have sequential 
order, or whatever.

since I have none of there requirements, b-trees are unnecessary.
similarly, they also involve a lookup time, wheras with a raw index (that is 
also a file offset) I don't need to look things up.

however, sometimes moving objects would be nice, and would seem to demand a 
kind of index table. an index table, however, would have a cost of 4 
bytes/object, and would almost invalidate the current file structure.

maybe just variable-sized objects could be allocated indices in the table. 
annoyingly, the cost of each index would need to be at least 1 cell given 
the current way the format works. oh well...

or such.



0
Reply cr88192 12/5/2004 6:26:36 AM

I get the general idea but donot see the fine grain to any clarity as you do 
not really discuss it in any detail. Have you got a document like the XLIFF 
doc together ? I am still interested in reviewing your XLIFF doc and helping 
you clarify things in places and help tidy it up the overall format of the 
document for presentation purposes. I do have some time and interest in this 
area so will give feedback :)

Aaron


0
Reply Aaron 12/7/2004 12:16:20 AM

"Aaron Gray" <angray.OFFSPAMOFF@beeb.net> wrote in message 
news:41b4f694$0$9347$ed2619ec@ptn-nntp-reader02.plus.net...
>I get the general idea but donot see the fine grain to any clarity as you 
>do not really discuss it in any detail. Have you got a document like the 
>XLIFF doc together ? I am still interested in reviewing your XLIFF doc and 
>helping you clarify things in places and help tidy it up the overall format 
>of the document for presentation purposes. I do have some time and interest 
>in this area so will give feedback :)
>
the xliff doc has had a few minor changes since I had posted it, but has 
otherwise remained fairly stable.


as for the cell based format, I am in the process of redesigning it to not 
be such a mess (and simplifying and generalizing it). I am also redesigning 
the workings of indices, such that they no longer correspond to physical 
file offsets, but are more a kind of logical offset (and can vary wrt the 
amount of space they represent).
I figured this made sense, given:
I need to lookup the chunk for most accesses anyways;
for blocks it does not make sense to access them on 16 byte boundaries;
it allows better use of index space/denser packing;
....

I am also dropping the ability to build variable sized cell-packed objects, 
dropping the block size to 512 bytes, ...

also possible might be to drop the cell size to 8 bytes and the bitmap to 
4bits/cell (changes allowed cutting the requirements, and possible is 
merging the mark and reference count). this is likely not that necessary 
though, as space is not that expensive (otherwise I would probably have to 
do similar with memory...).


this is the first part of the spec. I cut it down as the rest was mostly 
stuff about the stuff built on top of this. the new spec is still in much 
earlier phases.
----
CPSF: Cell Packed Storage Format

Goals:
A fairly simple random access image format;
Will be as simple as is reasonable at the cost of some size;
Split into 2 distinct layers, one for basic storage, the other data and 
semantics.

Unless otherwise noted, all numbers are implicitly in little-endian 
ordering.

Rough Layout
A file will consist on a number of chunks. Each will be divided into cells, 
and each chunk will begin on a cell boundary. A chunk with a magic of 0 will 
flag the end of the file, and is required after the last valid chunk.

The default cell size will be 16 bytes, and the default data chunk size 1MB.

Within the size of the data chunk will be included the header and bitmaps.
Cell indices will be linear within a file, and will thus include any 
headers, bitmaps, or other data. Cells outside chunks, however, will not be 
covered by bitmaps.

FOURCC values will be like RIFF style FOURCC values, except that if 
<0x20000000 they are indices into the strings table.

GeneralChunkHead {
u32 magic; //chunk magic
u32 size; //chunk size, includes header
//any data for the chunk follows
}

At the start of the file will be the header. This will be a variable size.

FileHeadChunk {
u32 magic; //'CPSF'
u32 size; //size of header data, padded up to a power of 2 boundary>=1024
u16 flags; //header flags, 1&=dirty.
u8 version; //0
u8 cell_sizes; //4 bits: log2 cell size, 4 bits: log2 bitmap cell bits.
u32 file_magic; //file magic number
u32 root_magic; //root magic number
u32 root_idx; //root index
u32 strs_idx; //index of strings tables
}

Following the header will be data chunks:
A data chunk consists of a header, a certain amount of cell data, and a 
bitmap.
The cells comprising the header and bitmap should be marked as reserved.
The size of data chunks is to be a power of 2.

DataChunkHead {
u32 magic; //'CHK '
u32 size; //chunk size, includes header
byte cells_data[size-(size/CELLSZ)-8];
byte cells_bitmap[size/CELLSZ];
}

The data will be built out of objects, each of which may consist of 1 or 
more runs.
At the start of an object will be a header, and each run will also have a 
header.

RunHead {
u32 next_idx; //index for next run
}

ObjHead {
u32 next_idx; //index for next run in object
u32 size; //size of object data
u32 chain; //chained object (eg: next "fork")
u32 magic; //type magic for object
}

The bitmap will be placed at the end of the chunk, and have 8 bits/cell (or 
1/16 of the data).

The low 4 bits will be the type:
0=free cell;
1=start of object;
2=start of run;
3=data in run;
4=reserved cell;
5-15=reserved values.

The next 2 bits will be a ref count: 0/1/2, 3=many.
The next 2 bits will be a mark: white/black/grey/lock.

Larger Alloctions

For larger allocations blocks are to be allocated and bound together to form 
objects.
Chunk Magic 'BLKS'.

The chunk contents are an array of 1024 byte blocks.
By default, these chunks are 1MB.
The last 4 blocks are used as fat tables for any blocks found in the chunk.

Each entry is 32 bits and is the index of the next block (however, they are 
cell indices).
Special Values:
0=free block;
1=eof;
2=reserved;
3..63=reserved values.

The first block is reserved.
The first 2 enties (for the first 2 blocks), are unused (and taken up by the 
chunk header).

Objects defined in this manner have a header:
LargeObjectHead {
u32 flags; //low 6: 0..62=use count, 63=many, next 2: white/black/grey/lock.
u32 size; //size of data
u32 chain; //next fork
u32 magic; //magic for data
byte resv[48];
}



0
Reply cr88192 12/7/2004 1:51:00 AM

Thus rendering it completely useless for anything that I would be even 
remotely interested in doing.

Oh well.

cr88192 wrote:
> Unless otherwise noted, all numbers are implicitly in little-endian 
> ordering.
0
Reply Donald 12/7/2004 1:49:00 PM

"Donald F. McLean" <dmclean@stsci.edu> wrote in message 
news:cp4cb4$rfn$1@tomm.stsci.edu...
> Thus rendering it completely useless for anything that I would be even 
> remotely interested in doing.
>
> Oh well.
>
why does it make much difference?

one endianess is about as good as the other, and le is the one used by x86, 
which is the dominant cpu architecture anymore.

in my experience "native ordering" is more of a hassle than just one or the 
other, because instead of just needing a big or little endian reading 
function. one has to do a check and optionally swap the values, which is imo 
worse.

> cr88192 wrote:
>> Unless otherwise noted, all numbers are implicitly in little-endian 
>> ordering. 



0
Reply cr88192 12/7/2004 2:37:03 PM

I realize that endianness is an arbitrary choice and a great 
Green-Purple debate topic.

However, I do all of my programming in Java which uses BE on all 
machines, even x86.

cr88192 wrote:
> "Donald F. McLean" <dmclean@stsci.edu> wrote in message 
> news:cp4cb4$rfn$1@tomm.stsci.edu...
> 
>>Thus rendering it completely useless for anything that I would be even 
>>remotely interested in doing.
>>
>>Oh well.
> 
> why does it make much difference?
> 
> one endianess is about as good as the other, and le is the one used by x86, 
> which is the dominant cpu architecture anymore.
> 
> in my experience "native ordering" is more of a hassle than just one or the 
> other, because instead of just needing a big or little endian reading 
> function. one has to do a check and optionally swap the values, which is imo 
> worse.
> 
>>cr88192 wrote:
>>
>>>Unless otherwise noted, all numbers are implicitly in little-endian 
>>>ordering. 
0
Reply Donald 12/8/2004 2:08:07 PM

"Donald F. McLean" <dmclean@stsci.edu> wrote in message 
news:cp71qv$8bk$1@tomm.stsci.edu...
>I realize that endianness is an arbitrary choice and a great Green-Purple 
>debate topic.
>
> However, I do all of my programming in Java which uses BE on all machines, 
> even x86.
>
I largely use c, and I don't think it warrents much. dunno enough about java 
to know, eg, whether having le numbers makes that much difference (I would 
think at worst maybe a method would be needed somewhere to read them, dunno 
really though).

lots of common file formats are le, and I am on x86, so figured I would use 
le...

> cr88192 wrote:
>> "Donald F. McLean" <dmclean@stsci.edu> wrote in message 
>> news:cp4cb4$rfn$1@tomm.stsci.edu...
>>
>>>Thus rendering it completely useless for anything that I would be even 
>>>remotely interested in doing.
>>>
>>>Oh well.
>>
>> why does it make much difference?
>>
>> one endianess is about as good as the other, and le is the one used by 
>> x86, which is the dominant cpu architecture anymore.
>>
>> in my experience "native ordering" is more of a hassle than just one or 
>> the other, because instead of just needing a big or little endian reading 
>> function. one has to do a check and optionally swap the values, which is 
>> imo worse.
>>
>>>cr88192 wrote:
>>>
>>>>Unless otherwise noted, all numbers are implicitly in little-endian 
>>>>ordering. 



0
Reply cr88192 12/8/2004 3:10:29 PM

8 Replies
20 Views

(page loaded in 0.376 seconds)

10/22/2012 5:49:54 AM


Reply: