offset-based hash table for ASCII data

  • Follow


I'm looking for an offset-based data structure to hold character data. 
Background: I'm working with an app whose server is written in Java 
while the client is in C. The server needs to package up a bunch of data 
and ship it to the client where it will be traversed, read-only, many 
many times. There's actually a working implementation which sends the 
data in an XML format, but it's too slow and profiling reveals that it 
spends almost all its time in XML parsing, so I need to replace the XML 
with something more natural to the purpose.

The client starts thousands of child processes which each start by 
parsing the XML document in order to place its data in hash tables, 
after which it is accessed via the hash table only. So the XML is very 
much an intermediate format, and the obvious win would be to have the 
Java server package the data in a form which is directly usable by the 
clients. Presumably the data would arrive from the server and be stored 
in a file: each client would then map the file into its address space 
and treat it as a hash table (or similar data structure) right away.

The traditional issue with transportable data structures is that since 
the client can't reliably control what address the data is mapped to, 
all addressing must be made relative to the starting point. Does anyone 
know of an implementation of such a format which can be generated in 
Java and consumed in C code?

TIA,
RM
0
Reply rexm (52) 4/15/2008 12:15:06 PM

On Apr 15, 8:15 am, Rex Mottram <r...@not.here> wrote:
> I'm looking for an offset-based data structure to hold character data.
> Background: I'm working with an app whose server is written in Java
> while the client is in C. The server needs to package up a bunch of data
> and ship it to the client where it will be traversed, read-only, many
> many times. There's actually a working implementation which sends the
> data in an XML format, but it's too slow and profiling reveals that it
> spends almost all its time in XML parsing, so I need to replace the XML
> with something more natural to the purpose.
>
> The client starts thousands of child processes which each start by
> parsing the XML document in order to place its data in hash tables,
> after which it is accessed via the hash table only. So the XML is very
> much an intermediate format, and the obvious win would be to have the
> Java server package the data in a form which is directly usable by the
> clients. Presumably the data would arrive from the server and be stored
> in a file: each client would then map the file into its address space
> and treat it as a hash table (or similar data structure) right away.
>
> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does anyone
> know of an implementation of such a format which can be generated in
> Java and consumed in C code?
>
> TIA,
> RM

Can the client parse the XML and share it with the child processes
(thus avoiding having each child parse the XML)?

Also, if the XML parsing is only done once by each child process, why
is the parsing performance so important?
0
Reply shakahshakah (188) 4/15/2008 12:23:54 PM


shakahshakah@gmail.com wrote:
> Can the client parse the XML and share it with the child processes
> (thus avoiding having each child parse the XML)?

Yes, that's also an option but it doesn't change the underlying 
complication much, while leaving in an extra stage which would be better 
avoided.

The fundamental need is for an offset-based data structure since the 
child processes can't be guaranteed of being able to map the file at a 
constant location. This would still be a problem between the client and 
its children. So given that someone needs to place the data in an 
offset-based table, it would be far preferable to do the ADT work in 
Java and send it straight to the children. Yes, the server could send 
XML and the client could rework it to a native format and pass it on to 
the children but there's just more moving parts to break that way, not 
to mention more code written in C to core dump. In general I, along with 
I think most people in the situation, prefer to push as much coding as 
possible over to the Java side where you have garbage collection, 
exceptions, and a really nice debugger. Not to mention that the server 
process typically runs on a "server" (bigger, faster) machine.

> Also, if the XML parsing is only done once by each child process, why
> is the parsing performance so important?

The short and somewhat obnoxious answer would be that "why" doesn't 
matter - profiling has revealed that XML is the major cost and profiling 
doesn't lie. The more substantive response is that many of these child 
processes will end up doing either very little or nothing at all; 
basically they start up, parse the XML data, and compare current reality 
against what the server says it should be. In the majority of cases 
current reality is fine, in which case the process can exit immediately. 
Thus the constant factor of XML processing can become the dominant part 
of child performance when multiplied by thousands of them.

RM
0
Reply rexm (52) 4/15/2008 1:59:26 PM

On Apr 15, 9:59 am, Rex Mottram <r...@not.here> wrote:
> shakahsha...@gmail.com wrote:
> > Can the client parse the XML and share it with the child processes
> > (thus avoiding having each child parse the XML)?
>
> Yes, that's also an option but it doesn't change the underlying
> complication much, while leaving in an extra stage which would be better
> avoided.
>
> The fundamental need is for an offset-based data structure since the
> child processes can't be guaranteed of being able to map the file at a
> constant location. This would still be a problem between the client and
> its children. So given that someone needs to place the data in an
> offset-based table, it would be far preferable to do the ADT work in
> Java and send it straight to the children. Yes, the server could send
> XML and the client could rework it to a native format and pass it on to
> the children but there's just more moving parts to break that way, not
> to mention more code written in C to core dump. In general I, along with
> I think most people in the situation, prefer to push as much coding as
> possible over to the Java side where you have garbage collection,
> exceptions, and a really nice debugger. Not to mention that the server
> process typically runs on a "server" (bigger, faster) machine.
>
> > Also, if the XML parsing is only done once by each child process, why
> > is the parsing performance so important?
>
> The short and somewhat obnoxious answer would be that "why" doesn't
> matter - profiling has revealed that XML is the major cost and profiling
> doesn't lie. The more substantive response is that many of these child
> processes will end up doing either very little or nothing at all;
> basically they start up, parse the XML data, and compare current reality
> against what the server says it should be. In the majority of cases
> current reality is fine, in which case the process can exit immediately.
> Thus the constant factor of XML processing can become the dominant part
> of child performance when multiplied by thousands of them.
>
> RM

Fair enough.

I guess I read your OP as describing a client launching many long-
lived (and busy) children, where the initial XML parse wouldn't matter
so much.
0
Reply shakahshakah (188) 4/15/2008 2:03:40 PM

Rex Mottram <rexm@not.here> writes:

> I'm working with an app whose server is written in
> Java while the client is in C.
<snip>
> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does
> anyone know of an implementation of such a format which can be
> generated in Java and consumed in C code?

This is done a lot by Remote Procedure Call systems (though your
format may be more complex than they can handle) but I would look
using some form of RPC, or at least the library that marshals the
data.  Since Sun used RPC a lot for its networking, I'd image there is
good Java support/bindings (but that is just a wild guess).

If you fancy standards, there is always ASN.1.  Old, and it used to
thought of as rather "heavy" but that was before everyone started
using XML for everything.

-- 
Ben.
0
Reply ben.usenet (6515) 4/15/2008 3:07:56 PM

May be a binary XML format would be the simplest way to improve the 
performance. For example there is wbxml4j for java and libbxml for C. I 
never used them, but it seems worth a try.
If your XML structure is not too complex you could also implement your 
own simple binary format. So that the clients just reads a kind of 
nested structs from shared memory that use offsets to point to childs an 
siblings.
0
Reply til (2) 4/15/2008 3:43:16 PM

Rex Mottram wrote:
> I'm working with an app whose server is written in Java 
> while the client is in C. The server needs to package up a bunch of data 
> and ship it to the client where it will be traversed, read-only, many 
> many times. There's actually a working implementation which sends the 
> data in an XML format, but it's too slow and profiling reveals that it 
> spends almost all its time in XML parsing, so I need to replace the XML 
> with something more natural to the purpose.

Have you considered JSON?

-- 
RGB
0
Reply RedGrittyBrick (364) 4/15/2008 3:53:20 PM

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Rex Mottram <rexm@not.here> writes:
>> I'm working with an app whose server is written in
>> Java while the client is in C.
> <snip>
>> The traditional issue with transportable data structures is that since
>> the client can't reliably control what address the data is mapped to,
>> all addressing must be made relative to the starting point. Does
>> anyone know of an implementation of such a format which can be
>> generated in Java and consumed in C code?
>
> This is done a lot by Remote Procedure Call systems (though your
> format may be more complex than they can handle) but I would look
> using some form of RPC, or at least the library that marshals the
> data.

The easy way to solve such a problem is

	a) get over the completely insane idea that designing data
	exchange formats would be something network programmers should
        not care about.

        b) defined a (presumably simple) binary message format
        containing the information you want to communicate.

Regarding b, if you want to make your life easy, avoid numbers not
composed of an integral number of octets and try to preserve 'natural
alignment' of the members, ie let four-octet quantities (32-bit
integers) start on a four-octet boundary relative to the start of the
message.

For a structure which absolutely must contain a lot of pointers,
introducing a 'fixup' step could be sensible: Store the offsets of all
contained pointers in some array, traverse this array and add the
start address of the buffer the message is contained in to the value
stored at the location start + <current offset> in the buffer.
        
> If you fancy standards, there is always ASN.1.  Old, and it used to
> thought of as rather "heavy" but that was before everyone started
> using XML for everything.

ASN.1 means 'abstract syntax notation one' and it is (mostly) a
standardized high-level language for declaring structured data types.
It is not a definition of an actual 'wire formats' of data. That would
be the purpose of some set of encoding rules for ASN.1, eg BER (basic
encoding rules) or PER (packed encoding rules). BER encoding is
already fairly byzantine and that it transmits and encoded length and
an encoded type for each 'information quantity' is useless for
applications where producer and consumer 'know' how the messages are
supposed to be structured.

According to one of the people responsible for the PER-rules, these
were initially conceived by a couple of frustrated people busy with
getting drunk, with the frustration resulting from the fact that too
many members of some gremium had provided negative, constructive
feedback regarding their previous attempt at defining an 'improved
BER'. PER was designed to address this problem and the person I am
referring to (author of a freely downloadable book on ASN.1)
recommends that people don't even try to understand the
specification.
0
Reply rweikusat (2679) 4/15/2008 4:28:29 PM

Rex Mottram wrote:
> I'm looking for an offset-based data structure to hold character data. 

I'm not sure what you mean by "offset based data structure."  I think I 
missed that chapter when doing my school work, where the official offset 
based data structure that all languages inter-operate with was described.

It sounds like you want just plain old binary data.  This is a huge 
mistake, imo, since your client and server combination is going to be a 
lot less modifiable than just one system by itself.  Better to send the 
XML to the client, parse the whole thing once into some binary format, 
then spawn the child processes.  Getting two systems to work together in 
a binary format is going to be very hard to maintain.

But anyway, look at DataOutputStream for Java.  Send the info to a 
network stream, or you can buffer it to memory by hooking the DataOutput 
Stream to a ByteArrayOutputStream, and then send the buffer at your leisure.

Think very hard about how you will modify that binary data in the future 
when you implement this.

If we knew a little more about what was being parsed here, we might be 
able to help you further.  My advice is "don't" but you seem wedded to 
the idea, so hopefully we can save you from disaster.  It's still really 
unclear to me how sending data in binary is going to present some huge 
savings on the client.  Networks are slow, CPUs are fast.  Servers also 
tend to be heavily loaded.  Parsing once in C on the client seems like 
the obvious answer and much less headache prone.

Anyway, good luck.


> The traditional issue with transportable data structures is that since 
> the client can't reliably control what address the data is mapped to, 
> all addressing must be made relative to the starting point. Does anyone 
> know of an implementation of such a format which can be generated in 
> Java and consumed in C code?

Re-reading this, I'm still totally unclear on what you are actually 
asking here.  You need "addressing?"  Are you saying you don't know what 
indirection is in C?  Ever hear of a look-up table?  How about a hash table?

That may come off as condescending, but that's what your description 
sounds like to me.  Maybe a clearer explanation of how you think the 
data will be parsed/accessed in binary will help.  I think part of the 
problem is you are a bit unclear yourself.
0
Reply markspace (954) 4/15/2008 6:03:22 PM

Mark Space <markspace@sbc.global.net> writes:
> Rex Mottram wrote:
>> I'm looking for an offset-based data structure to hold character
>> data.
>
> I'm not sure what you mean by "offset based data structure."

This fairly obviously means 'replacing pointers by offsets relative to
the start of the data structure' (because pointer values usually
cannot even be meaningfully communicated between different processes
running on the same machine, let alone processes running on different
machines). 

[...]

> It sounds like you want just plain old binary data.

That was indeed what was intended: Use a data format which can be
'readily prepared' on the server for use as-is, because the overhead
of parsing an XML-steganogram of a structured message into a
structured message had been _proven_ to be to expensive. 

> This is a huge mistake, imo, since your client and server
> combination is going to be a lot less modifiable than just one
> system by itself.

At worst, it will need twice the effort for modifying one system.

> Better to send the XML to the client, parse the whole thing once
> into some binary format, then spawn the child processes.  Getting
> two systems to work together in a binary format is going to be very
> hard to maintain.

Why would it? The code itself isn't structurally different: For the
most general case, you have a 'native' representation on system #1,
which is encoded into a transport representation by an encoding
routine, transmitted to system #2 and then decoded into a native
representation by a decoding routine.

[...]

> If we knew a little more about what was being parsed here, we might be
> able to help you further.  My advice is "don't" but you seem wedded to
> the idea, so hopefully we can save you from disaster.  It's still
> really unclear to me how sending data in binary is going to present
> some huge savings on the client.

Did it occur to you that your lack of understanding in this respect
means that you are a less-than-ideal person regarding having a
sensible opinion on the topic?

The original problem was that _measurements_ had proven that the need
to parse an XML-representation was too expensive for this particular
application. Assuming this as true, not parsing the data on the
client, or at least decoding something signficantly less noisy than
XML (eg a binary interchange format or a simpler text format) is an
obvious solution. It will reduce the load on the server, too
(constructing XML is not cheaper than deconstructing XML) and will
even lead to less bandwidht waste during transmission, assuming the
'other message format' is designed with a better SDU/ PDU than XML can
provide (which will be easy[*]).

	[*] No, compression is not the solution: It's a ressource
	intense workaround for the original problem.
0
Reply rweikusat (2679) 4/15/2008 6:30:56 PM

On Apr 15, 2:30 pm, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>
> The original problem was that _measurements_ had proven that the need
> to parse an XML-representation was too expensive for this particular
> application. ...

Slight quibble here in that while measurements may have
shown parsing to be too expensive for a parse-in-every-child
implementation, the same might not hold
true in an implementation where parsing is done once on the
client and then shared with the children (particularly
if those are child processes in the UNIX sense). The existence
of "a working implementation which sends the data
in an XML format" (implying existing C parsing code and internal
representation) would only make me look further in that direction.
0
Reply shakahshakah (188) 4/15/2008 6:46:24 PM

Rainer Weikusat wrote:
> Mark Space <markspace@sbc.global.net> writes:
>> Rex Mottram wrote:
>>> I'm looking for an offset-based data structure to hold character
>>> data.
>> I'm not sure what you mean by "offset based data structure."
> 
> This fairly obviously means 'replacing pointers by offsets relative to
> the start of the data structure' (because pointer values usually
> cannot even be meaningfully communicated between different processes
> running on the same machine, let alone processes running on different
> machines). 

Is that all you need? The Java version should be a tad easier, actually. 
  Note the last loop uses nothing but offsets from the buffer to print 
out the data.

If you where hoping for the magic library that does this for you, I 
think it's called "hands on keyboard."


lut_test a 12 longer_string_test C

Total buffer size: 63
Numer of entries: 5
String 0: filename
String 1: a
String 2: 12
String 3: longer_string_test
String 4: C

/*
  * File:   lut_test.c
  *
  * Created on April 15, 2008, 12:43 PM
  */

#include <stdio.h>
#include <stdlib.h>

struct pre_parsed {
   int size;
   int length;
   int indexes[];
};

int main(int argc, char** argv) {

     struct pre_parsed *buffer;
     char ** argv_copy = argv;

     if( argc < 2 )
     {
         fprintf( stderr, "Usage: lut_test outfile string_list\n");
     }
     size_t total_string_size = 0;

     while( *++argv )
     {
         size_t len = strlen( *argv );
         total_string_size += len+1;
     }

     // Test some values in debugger

     size_t struct_size = sizeof (struct pre_parsed);
     size_t int_size = sizeof (int);
     size_t array_size = sizeof (int) * (argc-1);

     // Build the buffer

     buffer = malloc( sizeof (struct pre_parsed) + sizeof (int) *
             (argc-1) + total_string_size );

     (*buffer).length = argc -1;
     (*buffer).size = sizeof (struct pre_parsed) + sizeof (int) *
             (argc-1) + total_string_size;
     int index = 0;
     size_t offset = sizeof (struct pre_parsed)
             + sizeof (int) * (argc-1);
     while( *++argv_copy )
     {
         (*buffer).indexes[index] = offset;
         strcpy( (char*)(buffer + offset), *argv_copy );
         offset += strlen( *argv_copy ) + 1;
         index++;
     }

     // Read back from the buffer

     printf( "Total buffer size: %d\n", (*buffer).size );
     printf( "Numer of entries: %d\n", (*buffer).length );
     int i;
     for( i = 0; i < (*buffer).length; i++ )
     {
         printf( "String %d: %s\n", i,
                 (buffer + (*buffer).indexes[i]) );
     }

     return (EXIT_SUCCESS);
}
0
Reply markspace (954) 4/15/2008 9:09:42 PM

On Tue, 15 Apr 2008 08:15:06 -0400, Rex Mottram <rexm@not.here> wrote:

>I'm looking for an offset-based data structure to hold character data. 
>Background: I'm working with an app whose server is written in Java 
>while the client is in C. The server needs to package up a bunch of data 
>and ship it to the client where it will be traversed, read-only, many 
>many times. 
If it is read-only then why do you need to parse it many times?  Parse
it once at the client end and let all the child processes access the
result.

>There's actually a working implementation which sends the 
>data in an XML format, but it's too slow and profiling reveals that it 
>spends almost all its time in XML parsing, so I need to replace the XML 
>with something more natural to the purpose.
Is that actually CPU use in parsing or in network IO?  If IO then
compression might be faster - probably a long shot.


>
>The client starts thousands of child processes which each start by 
>parsing the XML document in order to place its data in hash tables, 
>after which it is accessed via the hash table only. So the XML is very 
>much an intermediate format, and the obvious win would be to have the 
>Java server package the data in a form which is directly usable by the 
>clients. Presumably the data would arrive from the server and be stored 
>in a file: each client would then map the file into its address space 
>and treat it as a hash table (or similar data structure) right away.
Why are you hashing to memory addresses (or pointers)?  Why not hash
to array indexes which immediately makes the whole structure
relocatable in memory since everything is indexed from the start of
the array.  You would need fixed-length data for this.  Would
fixed-length data be a problem?  A C union could handle otherwise
different types in a single array.

Would comma separated be simpler to parse than XML?  It does not have
as much overhead and is generally less verbose.

>
>The traditional issue with transportable data structures is that since 
>the client can't reliably control what address the data is mapped to, 
>all addressing must be made relative to the starting point. Does anyone 
>know of an implementation of such a format which can be generated in 
>Java and consumed in C code?
Heap, sorted array + binary search, hash to an array address are all
possibilities.

I suspect that the biggest saving would be parse once rather than
parse many.  If you only parse once then you can do a lot of
complicated setup at the client end and the work is amortised over the
many children.

rossum

>
>TIA,
>RM

0
Reply rossum48 (643) 4/15/2008 9:11:42 PM

rossum wrote:

> Would comma separated be simpler to parse than XML?  It does not have
> as much overhead and is generally less verbose.

That's actually a pretty good idea.  Much simpler and more portable.
0
Reply markspace (954) 4/15/2008 10:41:43 PM

Mark Space wrote:
> Rex Mottram wrote:
>> I'm looking for an offset-based data structure to hold character data. 
> It sounds like you want just plain old binary data.

Possibly you missed the phrase "ASCII data" cleverly hidden in the 
subject line, and likewise "character data" in the first sentence? :-)

Anyway, in response to you and a few other posters, yes, parsing the XML 
once into some other format at the client end would indeed solve the 
problem mentioned. But although you argue that would be more flexible, I 
think it would result in a Rube Goldberg type of system with two 
different representations needing to be kept in sync. Not very robust 
programming practice IMHO.

Also, again in response to you and a few others, network bandwidth is 
not an issue. I.e. moving the XML document, even if it's large (and it 
often is) is not a significant cost, and for the record it's already 
transported in compressed form. It's the multi-parsing that's a problem.

I could of course write up my own index-based ADT in C, but would much 
prefer not to. There are a number of free ADT packages out there - I'm 
currently using one called Kazlib which is very nice and I have notes on 
6-10 others which also look good. But all these use pointers, with the 
cited problem.

There actually an open source C library which was written to be 
offset-based. I remember seeing it a few years ago. I haven't gone 
digging for it yet because I was hoping to find something I could build 
up from the Java side. Will go see if I can find that next.

RM
0
Reply rexm (52) 4/15/2008 11:49:02 PM

Rex Mottram wrote:

> I could of course write up my own index-based ADT in C, but would much 
> prefer not to. There are a number of free ADT packages out there - I'm 
> currently using one called Kazlib which is very nice and I have notes on 
> 6-10 others which also look good. But all these use pointers, with the 
> cited problem.


Ah ha.  And specifically you want to serialize and deserialize your 
ADTs.  Between C and Java.  Nothing like picking the implementation 
first then going looking for solutions. ;)

What you're asking for is called pointer swizzling, btw, and the general 
case seems to be very hard, which may be why you aren't find any libraries.

And no I don't know any Java packages for that.  Java programmers seem 
to prefer general solutions that always work (XML) to light weight 
solutions that may impose some restrictions.  Sorry about that.

If you find anything interesting, let us know.

You might try some custom de-/serialization on the Java side, so you can 
control the format, and produce something easier to parse in C.  Just a 
thought.
0
Reply markspace (954) 4/16/2008 1:38:54 AM

Rex Mottram wrote:
> I'm looking for an offset-based data structure to hold character data. 
> Background: I'm working with an app whose server is written in Java 
> while the client is in C. The server needs to package up a bunch of data 
> and ship it to the client where it will be traversed, read-only, many 
> many times. There's actually a working implementation which sends the 
> data in an XML format, but it's too slow and profiling reveals that it 
> spends almost all its time in XML parsing, so I need to replace the XML 
> with something more natural to the purpose.

Let me see if I can understand the problem statement.  It sounds like
you do not need a hash table specifically, but what you actually need
is a way to map string values to other string values.  As far as I know,
you do not say you need to be able to iterate over the keys, although
it's a common requirement, so you might need it.

If it is true that that is all you actually need, removing the artificial
constraint of a hash might open up some implementation possibilities.

For example, you could send a sorted list of pairs of C strings concatenated
together, by which I mean 8-bit bytes terminated by null bytes.  The client
side would be very easy to implement:

   * load the entire image (i.e. message) into a contiguous area of memory
   * make a pass through it to count pairs of strings (which is equivalent
     to counting the zero bytes in the block of memory, then dividing by two)
   * allocate an array of pairs of pointers (of type "const char *")
   * make another pass and write the memory address of every string into
     the array of pointers
   * do key lookup via binary search

Note that this approach does not require encoding any integers or offsets
at all into the message.  It is O(N) to parse it, but that is no worse
asymptotically than the time required to receive it over the network.

Another alternative would be to build a trie on the server side, where
states in the trie's state machine (I think of a trie as a finite state
machine that recognizes strings) are represented by byte offsets from the
beginning of the image.  The client could then do lookups on the structure
directly (except that to move to the next state, it would have to add an
offset, but that is no big deal).

   - Logan

0
Reply lshaw-usenet (926) 4/16/2008 1:45:35 AM

Mark Space wrote:

> You might try some custom de-/serialization on the Java side, so you can 
> control the format, and produce something easier to parse in C.  Just a 
> thought.

Oh, and the hash table itself in a Java hash map isn't serializable, 
iirc.  So if you're expecting to send the hash table over directly, 
rather than just the objects it contains, you'll have to roll your own 
implementation or do something similar.
0
Reply markspace (954) 4/16/2008 1:50:18 AM

Mark Space wrote:
> Mark Space wrote:
> 
>> You might try some custom de-/serialization on the Java side, so you 
>> can control the format, and produce something easier to parse in C.  
>> Just a thought.
> 
> Oh, and the hash table itself in a Java hash map isn't serializable, iirc.  

You don't.

public class HashMap<K,V>
extends AbstractMap<K,V>
implements Map<K,V>, Cloneable, Serializable

> So if you're expecting to send the hash table over directly, 
> rather than just the objects it contains, you'll have to roll your own 
> implementation or do something similar.

Or just use the HashMap built-in implementation of Serializable.

Every listed subclass of AbstractMap except WeakHashMap is Serializable.

-- 
Lew
0
Reply lew (2143) 4/16/2008 4:30:09 AM

Lew <lew@lewscanon.com> writes:

> Mark Space wrote:

>> Oh, and the hash table itself in a Java hash map isn't serializable,
>> iirc.
>
> You don't.
>
> public class HashMap<K,V>
> extends AbstractMap<K,V>
> implements Map<K,V>, Cloneable, Serializable

I believe the point was that was that serializing a HashMap will not
serialize the internal table used, the data structure, but will
instead just serialize the data it contains.
The deserialization will deserialize the data and rebuild a data 
structure.

>> So if you're expecting to send the hash table over directly, rather
>> than just the objects it contains, you'll have to roll your own
>> implementation or do something similar.
>
> Or just use the HashMap built-in implementation of Serializable.

Which does exactly what Mark Space wrote - serialize the objects
it contains, not the table.

/L
-- 
Lasse Reichstein Nielsen  -  lrn@hotpop.com
 DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
  'Faith without judgement merely degrades the spirit divine.'
0
Reply lrn (235) 4/16/2008 4:55:34 AM

Lasse Reichstein Nielsen wrote:
> Lew <lew@lewscanon.com> writes:
> 
>> Mark Space wrote:
> 
>>> Oh, and the hash table itself in a Java hash map isn't serializable,
>>> iirc.
>> You don't.
>>
>> public class HashMap<K,V>
>> extends AbstractMap<K,V>
>> implements Map<K,V>, Cloneable, Serializable
> 
> I believe the point was that was that serializing a HashMap will not
> serialize the internal table used, the data structure, but will
> instead just serialize the data it contains.
> The deserialization will deserialize the data and rebuild a data 
> structure.
> 
>>> So if you're expecting to send the hash table over directly, rather
>>> than just the objects it contains, you'll have to roll your own
>>> implementation or do something similar.
>> Or just use the HashMap built-in implementation of Serializable.
> 
> Which does exactly what Mark Space wrote - serialize the objects
> it contains, not the table.

Actually, HashMap does serialize the internal table, by "serializ[ing] the 
objects it contains".  I do not see the distinction you are trying to make.

Furthermore, you don't "have to roll your own" serialization, HashMap does it 
just fine.  Even if you did, you'd probably wind up doing pretty much exactly 
the same thing HashMap already does.  So even if you say that the "internal 
table" is somehow not serialized, even though it is, you still don't have the 
same conclusion.

-- 
Lew
0
Reply lew (2143) 4/16/2008 5:34:20 AM

Rex Mottram <rexm@not.here> writes:
> Mark Space wrote:
>> Rex Mottram wrote:
>>> I'm looking for an offset-based data structure to hold character
>>> data.
>> It sounds like you want just plain old binary data.
>
> Possibly you missed the phrase "ASCII data" cleverly hidden in the
> subject line, and likewise "character data" in the first sentence?
> :-)

Possibly, you missed the 'hash table' cleverly hidden in the subject
line. This means 'plain old binary data', ie a table of pointers.

> Anyway, in response to you and a few other posters, yes, parsing the
> XML once into some other format at the client end would indeed solve
> the problem mentioned. But although you argue that would be more
> flexible, I think it would result in a Rube Goldberg type of system
> with two different representations needing to be kept in sync.

If it would, then this would be the type of system you already have,
because you already have one representation on the server, a transport
representation and a representation on the client. That you
(presently) construct identical client representations x times in
parallell and manage to overwhelm the 'client machine' with this
nonsense doesn't change the structure of the application, just its
"simple-mindedness". Parsing the XML-steganogram once and re-using it
x - 1 times would be by far the simplest solution from a coding
standpoint (as someone else already wrote).

> Also, again in response to you and a few others, network bandwidth is
> not an issue. I.e. moving the XML document, even if it's large (and it
> often is) is not a significant cost,

.... because, as you need to understand, this is MY PRIVATE GIG-E LAN
which is been solely provided to transport MY XML (and it has excess
bandwidth for that) ...

It is not possible to assess the significance of the 'cost' of a
particular communication isolated. Networks are usually used by
multiple (and often, many) computers for multiple (often many)
applications in parallell.

> and for the record it's already transported in compressed form.

If network usage is not of concern, why do you bother to compress the
XML? Not doing so would be another easy way to make the processing on
both client and server much simpler. 

> It's the multi-parsing that's a problem.

Well, then don't. 
0
Reply rweikusat (2679) 4/16/2008 11:05:26 AM

Rainer Weikusat wrote:
> If it would, then this would be the type of system you already have,
> because you already have one representation on the server, a transport
> representation and a representation on the client. That you
> (presently) construct identical client representations x times in
> parallell and manage to overwhelm the 'client machine' with this
> nonsense doesn't change the structure of the application, just its
> "simple-mindedness".

You may have me (the OP) mixed up with someone else. I write not to 
defend the current implementation but to improve it.

> Parsing the XML-steganogram once and re-using it
> x - 1 times would be by far the simplest solution from a coding
> standpoint (as someone else already wrote).

Yes, with the single exception of the alternative, which is to deliver 
it in pre-parsed form and thus remove one more intermediate representation.

N < N + 1. QED.

>> Also, again in response to you and a few others, network bandwidth is
>> not an issue. I.e. moving the XML document, even if it's large (and it
>> often is) is not a significant cost,
> 
> ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
> which is been solely provided to transport MY XML (and it has excess
> bandwidth for that) ...

You are swinging (and missing) especially wildly today. The 
insignificance of transport is primarily due to the fact that it's a 
fairly rare event which occurs just once per client/server transaction, 
while the client fires children, as noted, potentially thousands of 
times per transaction.

The largest XML file I've measured to date is around 5 megabytes. 
Because XML is so wordy it compresses quite well, and thus the gzipped 
version which travels over the wire is around 150K. This may happen 
about once every half hour. I don't know what your network looks like 
but this is well within the noise factor for mine.

More than that, the raison d'etre of this application is to keep the 
clients from having to run unnecessary jobs. I.e. it's an optimizer. So 
whatever network traffic it generates will be paid back by orders of 
magnitude in other traffic which doesn't happen later.

> If network usage is not of concern, why do you bother to compress the
> XML? Not doing so would be another easy way to make the processing on
> both client and server much simpler. 

Arg. I do it because it's trivially easy and the right thing to do. On 
the server side it's as easy as constructing a "new GZIPOutputStream" 
object. The client side uses libcurl which has built-in knowledge of 
compression so it's as easy as adding an Accept-Encoding header.

It really couldn't be simpler than that. The XML parser doesn't have to 
worry about compression because it's uncompressed by then. See, it's 
compressed, sent over the wire, uncompressed, then parsed. All with two 
extra lines of code, one in each language.

>> It's the multi-parsing that's a problem.
> 
> Well, then don't. 

Arg. If you didn't spend so much energy looking for someone to bully, 
you might have more fun. I know we would.

RM
0
Reply rexm (52) 4/16/2008 2:10:16 PM

Rex Mottram <rexm@not.here> writes:
> Rainer Weikusat wrote:
>> If it would, then this would be the type of system you already have,
>> because you already have one representation on the server, a transport
>> representation and a representation on the client. That you
>> (presently) construct identical client representations x times in
>> parallell and manage to overwhelm the 'client machine' with this
>> nonsense doesn't change the structure of the application, just its
>> "simple-mindedness".
>
> You may have me (the OP) mixed up with someone else.

No, I was being sarcastic. 

>> Parsing the XML-steganogram once and re-using it
>> x - 1 times would be by far the simplest solution from a coding
>> standpoint (as someone else already wrote).
>
> Yes, with the single exception of the alternative, which is to deliver
> it in pre-parsed form and thus remove one more intermediate
> representation.

Assuming a UNIX(*)-like 'application structure', the necessary changes
would roughly amount to forking after the XML has been parsed instead of
before it has parsed. 

>>> Also, again in response to you and a few others, network bandwidth is
>>> not an issue. I.e. moving the XML document, even if it's large (and it
>>> often is) is not a significant cost,
>> ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
>> which is been solely provided to transport MY XML (and it has excess
>> bandwidth for that) ...
>
> You are swinging (and missing) especially wildly today.
>  The insignificance of transport is primarily due to the fact that it's a
> fairly rare event which occurs just once per client/server
> transaction,

Less creative editing could help to aid your understanding greatly:

,----
| It is not possible to assess the significance of the 'cost' of a
| particular communication isolated. Networks are usually used by
| multiple (and often, many) computers for multiple (often many)
| applications in parallell.
`----

See. This time, I even explained the sarcasm. Note that this was a
general remark regarding the (common) misconception of the 'isolated,
insignificant event', which really isn't an 'isolated event' but a
contributing factor.

[...]

>> If network usage is not of concern, why do you bother to compress the
>> XML? Not doing so would be another easy way to make the processing on
>> both client and server much simpler.
>
> Arg. I do it because it's trivially easy and the right thing to do.

Something which is 'trivially easy' for you to code is not necessarily
'trivially easy' for a computer to execute. In this particular
situation, it is certain that the process is computionally expensive.
You (again) stated that 'network usage' is of no concern to you, while
you (again) stated that 'CPU usage on the client' is of concern to
you. Yet you try to save 'network bandwidth' at the expense of
'processing time' by using compression.

This doesn't exactly make sense together.

[...]

>>> It's the multi-parsing that's a problem.
>> Well, then don't.
>
> Arg. If you didn't spend so much energy looking for someone to bully,
> you might have more fun. I know we would.

That you know that some or all of you would have more fun if some or
all of you spent less energy looking for someone to bully means
nothing about me.
0
Reply rweikusat (2679) 4/16/2008 2:41:23 PM

Rainer Weikusat wrote:
> Assuming a UNIX(*)-like 'application structure', the necessary changes
> would roughly amount to forking after the XML has been parsed instead of
> before it has parsed. 

No, the "necessary changes" would include being able to remove an entire 
XML-processing state machine, many associated static variables, remove 
dependence on an external XML parsing library, not to mention (again) 
the extra representation. These are very significant differences. I 
truly think you've switched your POV here, either without realizing it 
or out of pure orneriness.

  > Less creative editing could help to aid your understanding greatly:
> 
> ,----
> | It is not possible to assess the significance of the 'cost' of a
> | particular communication isolated. Networks are usually used by
> | multiple (and often, many) computers for multiple (often many)
> | applications in parallell.
> `----
> 
> See. This time, I even explained the sarcasm. Note that this was a
> general remark regarding the (common) misconception of the 'isolated,
> insignificant event', which really isn't an 'isolated event' but a
> contributing factor.

Speaking of creative editing, I gather you chose to remove the parts of 
my response which addressed this. Unlike you I know things about how my 
network is used and have done measurements. Sending the XML compressed 
is, empirically, an insignificant cost. Sending it uncompressed is 
probably insignificant too. The only reason I said anything about the 
transport cost and compressed vs not is that some respondents mistakenly 
thought that was the problem and I wanted to set them straight.

There is no possible way that the cost of transporting this data packet 
is a significant fraction of the overall network burden, or network 
profile, of this application or the environment it operates in. You may 
repeat mantras like the above to your heart's content, but though they 
are certainly true in isolation they are not applicable here. This whole 
network bandwidth discussion is a red herring.

>>> If network usage is not of concern, why do you bother to compress the
>>> XML? Not doing so would be another easy way to make the processing on
>>> both client and server much simpler.
>> Arg. I do it because it's trivially easy and the right thing to do.
> 
> Something which is 'trivially easy' for you to code is not necessarily
> 'trivially easy' for a computer to execute. In this particular
> situation, it is certain that the process is computionally expensive.
> You (again) stated that 'network usage' is of no concern to you, while
> you (again) stated that 'CPU usage on the client' is of concern to
> you. Yet you try to save 'network bandwidth' at the expense of
> 'processing time' by using compression.
> 
> This doesn't exactly make sense together.

For you to accuse me of creative editing and in the *same post* put a 
phrase in quotes, attributed directly to me, which I did not say is 
truly breathtaking. Search for the phrase 'CPU usage on the client' - 
its first appearance is when you "quoted" it.

The point here is that all the one-time costs are swamped by the 
aggregate costs of parsing. This applies equally to the compression, 
transmission, and decompression of the XML, and many other constant 
costs. They are just not measurable. I don't know why you insist on 
talking about them.

RM
0
Reply rexm (52) 4/16/2008 3:40:00 PM

On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <rexm@not.here> wrote:

>The point here is that all the one-time costs are swamped by the 
>aggregate costs of parsing.
Then change the parsing from an aggregate cost to a one-time cost.
Parse once at the receiving end and let all the child processes have
access to the read-only parsed version.  It would not be too complex
to set up version numbering or reference counting if required.

I am not clear as to why you cannot do this, it seems such an obvious
solution.

rossum


>This applies equally to the compression, 
>transmission, and decompression of the XML, and many other constant 
>costs. They are just not measurable. I don't know why you insist on 
>talking about them.

0
Reply rossum48 (643) 4/16/2008 4:01:18 PM

Mark Space wrote:
> If you find anything interesting, let us know.

I apologize for letting other parts of this thread devolve into a 
useless squabble. But for those of you with a substantive interest, let 
me report back that I'm getting moderately interested in/excited about 
the CDB ("constant database") family.

Having realized that what I'm looking at can be thought of as a constant 
  (meaning write once, read many) database, I did some searching and 
found CDB. The original CDB package is C code which is unchanged since 
2000 but still works fine. There's also a "TinyCDB" fork of the project 
which is newer and ported to Windows and still format-compatible with 
the original.

Most interestingly, there are APIs for a number of languages including a 
jar file for Java (this version is called sg-cdb). So it looks like I 
can use the sg-cdb API to write a cdb-format file directly from Java and 
send it to the client(s), which can link with libcdb.a and navigate it 
directly.

Whether I can encode my data into the key-value style of CDB (which is a 
dbm-like model) is an open question but that's my problem. For the sake 
of readers here and subsequent archive searches I thought I'd mention 
that the combination of sg-cdb on the server side and tinycdb on the 
client side may make a lot of sense for certain applications.

Licensing is also simple: TinyCDB is in the public domain and sg-cdb is 
BSD licensed.

RM
0
Reply rexm (52) 4/16/2008 4:01:24 PM

rossum wrote:
> On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <rexm@not.here> wrote:
> 
>> The point here is that all the one-time costs are swamped by the 
>> aggregate costs of parsing.
> Then change the parsing from an aggregate cost to a one-time cost.
> Parse once at the receiving end and let all the child processes have
> access to the read-only parsed version.  It would not be too complex
> to set up version numbering or reference counting if required.
> 
> I am not clear as to why you cannot do this, it seems such an obvious
> solution.

I thought I had explained this in excruciating detail but will try once 
more. There are two problems with this approach, not necessarily 
unsolvable but still valid problems:

1. It means double complexity, as the data has to be extracted from 
POJOs on the server side and converted to XML ("format A"), then parsed 
and placed into an alternate representation ("format B") on the client 
side. The preferable plan would be to have the server place the data 
directly into format B. I am not clear on why this is complex or 
controversial; it seems beyond argument to me. Particularly since I've 
already found one API for doing so (CDB) as mentioned elsewhere in the 
thread.

2. In what format would that "read-only parsed version" be? It cannot be 
a traditional hash table or other ADT since they can't be shared between 
processes, as explained previously.

The whole point is that I'm looking for a format in which the data can 
be shared. It looks like the CDB format might be a candidate, which 
allows me to turn the question around: whyever would I want to write the 
data to XML on the server and convert it to CDB on the client when 
there's an API for writing the CDB format directly on the server? I can 
see no argument at all for using two steps when one would suffice.

RM
0
Reply rexm (52) 4/16/2008 4:18:33 PM

rossum <rossum48@coldmail.com> writes:
> On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <rexm@not.here> wrote:
>
>>The point here is that all the one-time costs are swamped by the 
>>aggregate costs of parsing.
> Then change the parsing from an aggregate cost to a one-time cost.
> Parse once at the receiving end and let all the child processes have
> access to the read-only parsed version.  It would not be too complex
> to set up version numbering or reference counting if required.
>
> I am not clear as to why you cannot do this, it seems such an obvious
> solution.

The short answer is that the guy is a complete moron.
0
Reply rweikusat (2679) 4/16/2008 4:33:52 PM

Lasse Reichstein Nielsen wrote:

> I believe the point was that was that serializing a HashMap will not
> serialize the internal table used, the data structure, but will
> instead just serialize the data it contains.
> The deserialization will deserialize the data and rebuild a data 
> structure.

That's what I meant, even if I didn't say it correctly.

The OP wants the whole data structure, right down to the pointer and the 
values used for hash codes.  I don't think that's a great idea either, 
but that's what he's asking for.  Apparently, there are a few libraries 
out there that will swizzle some basic data structures.  He's looking 
for a Java version.
0
Reply markspace (954) 4/16/2008 4:45:30 PM

Rex Mottram wrote:
> Mark Space wrote:
>> If you find anything interesting, let us know.

> me report back that I'm getting moderately interested in/excited about 
> the CDB ("constant database") family.

> Most interestingly, there are APIs for a number of languages including a 
> jar file for Java (this version is called sg-cdb). So it looks like I 

Thanks for the report.  I learned some things this thread, which is 
always valuable to me, and this project you found will be an cool tool I 
can keep in my back pocket in case I ever need such a thing.

I hope this project goes well for you.
0
Reply markspace (954) 4/16/2008 5:07:30 PM

On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:


> 2. In what format would that "read-only parsed version" be? It cannot be
> a traditional hash table or other ADT since they can't be shared between
> processes, as explained previously.

No need to parse it.
Just a native OS-flatfile. Since your data + keys is ASCII,
this flatfile is an ASCII file.

> The whole point is that I'm looking for a format in which the data can
> be shared. It looks like the CDB format might be a candidate, which
> allows me to turn the question around: whyever would I want to write the
> data to XML on the server and convert it to CDB on the client when
> there's an API for writing the CDB format directly on the server? I can
> see no argument at all for using two steps when one would suffice.

You can store an array of {offset, hash,(key)len,next} in 
another flatfile.
This second flatfile needs to be binary, but can be created on the target 
platform. The next-field in the struct is not a pointer but an index into 
the array itself. (for the overflow chain). (NB use -1 or >=N for 
sentinel value. zero wont work ;-)

Using mmap on the target-platform will cause the lookup extremely fast
(typically 2 page faults). Allocating two big arrays and reading them in 
at program startup will be faster (or equally fast), but the startup will 
be slower.

Combinining the two files is a possible extension, too.



HTH,
AvK
0
Reply root32 (398) 4/16/2008 6:23:30 PM

Mark Space <markspace@sbc.global.net> writes:
>The OP wants the whole data structure, right down to the
>pointer and the values used for hash codes.

  The experimental GPL-library �ram.jar� contains a serializer 
  (not tested, so it still might contain bugs):

public class Main
{ public static void main( final java.lang.String[] args ) 
  { final java.util.HashMap<java.lang.String,java.lang.Integer> hashMap 
    = new java.util.HashMap<java.lang.String,java.lang.Integer>();
    hashMap.put( "a", 1 );
    hashMap.put( "b", 2 );
    java.lang.System.out.println
    ( new de.dclj.ram.notation.junobjects.Junobjects().dump( hashMap )); }}

< &objectmap 
  object =
  < &java.util.HashMap 
    entrySet =
    < >
    loadFactor =
    < &float 0.75 >
    modCount =
    < &int 2 >
    size =
    < &int 2 >
    table =
    < &[java.util.HashMap.Entry[]] 
      < null >
      < null >
      < null >
      < null >
      < &java.util.HashMap.Entry zz0 >
      < null >
      < null >
      < &java.util.HashMap.Entry zz1 >
      < null >
      < null >
      < null >
      < null >
      < null >
      < null >
      < null >
      < null >>
    threshold =
    < &int 12 >>
  zz0 =
  < &java.util.HashMap.Entry 
    hash =
    < &int 100 >
    key =
    < &java.lang.String zz2 >
    next =
    < >
    value =
    < &java.lang.Integer zz3 >>
  zz1 =
  < &java.util.HashMap.Entry 
    hash =
    < &int 103 >
    key =
    < &java.lang.String zz4 >
    next =
    < >
    value =
    < &java.lang.Integer zz5 >>
  zz2 =
  < &java.lang.String 
    count =
    < &int 1 >
    hash =
    < &int 98 >
    offset =
    < &int 0 >
    value =
    < &[char[]] b >>
  zz3 =
  < &java.lang.Integer 
    value =
    < &int 2 >>
  zz4 =
  < &java.lang.String 
    count =
    < &int 1 >
    hash =
    < &int 97 >
    offset =
    < &int 0 >
    value =
    < &[char[]] a >>
  zz5 =
  < &java.lang.Integer 
    value =
    < &int 1 >>>

                               �Your wish is not granted unless it's a fish,
                               Your wish is not granted unless it's a dish,
                               A fish on a dish, is that what you wish?�
                               -- The Incredible String Band 

0
Reply ram (2827) 4/16/2008 10:46:09 PM

On Wed, 16 Apr 2008 16:41:23 +0200, Rainer Weikusat wrote:

> Rex Mottram <rexm@not.here> writes:
>> Arg. I do it because it's trivially easy and the right thing to do.
> 
> Something which is 'trivially easy' for you to code is not necessarily
> 'trivially easy' for a computer to execute. In this particular
> situation, it is certain that the process is computionally expensive.
> You (again) stated that 'network usage' is of no concern to you, while
> you (again) stated that 'CPU usage on the client' is of concern to you.
> Yet you try to save 'network bandwidth' at the expense of 'processing
> time' by using compression.
> 
> This doesn't exactly make sense together.
> 

Maybe it is faster for the client to decompress the little file, then it 
would take to download the uncompressed version?
0
Reply stonerfish (284) 4/17/2008 4:45:31 AM

jellybean stonerfish <stonerfish@geocities.com> writes:
> On Wed, 16 Apr 2008 16:41:23 +0200, Rainer Weikusat wrote:
>> Rex Mottram <rexm@not.here> writes:
>>> Arg. I do it because it's trivially easy and the right thing to do.
>> 
>> Something which is 'trivially easy' for you to code is not necessarily
>> 'trivially easy' for a computer to execute. In this particular
>> situation, it is certain that the process is computionally expensive.
>> You (again) stated that 'network usage' is of no concern to you, while
>> you (again) stated that 'CPU usage on the client' is of concern to you.
>> Yet you try to save 'network bandwidth' at the expense of 'processing
>> time' by using compression.
>> 
>> This doesn't exactly make sense together.
>
> Maybe it is faster for the client to decompress the little file,
> then it would take to download the uncompressed version?

It would be much faster to just not download any file ;-), ie the code
on the client (presumably) doesn't do something because it is 'faster'
than doing something else, but to accomplish some useful purpose. The
same way 'the network' is shared among all applications trying to use
it, the host CPU is shared among all applications running on a
particular host. Download of a large file could go on as background
task, using very little host ressources, more of which would then
be available to 'other active tasks', while downloading a smaller file
faster would have a more 'bursty' host ressource usage pattern and
decompressing a small file to a large file is a ressource-intense
operation, although the absolute time needed by it may be short enough
that measuring its influence could be difficult.

This is all, of course, mostly 'thin air' without much more
information about the actual application. But a statement like 'I do
not care about transmitting much more data than I actually need,
except that I compress these data to make it smaller, but anyway, my
actual problem is that preprocessing of the downloaded data is too
expensive for a particular situation' doesn't make much sense
altogether. To me, it sounds a lot like 'I am trading off processor
time for transmission time and now my programs run to slow'. Duh.

If doing it all 'in the right way' doesn't work, then it isn't the
right way.


0
Reply rweikusat (2679) 4/17/2008 6:16:43 AM

moi <root@invalid.address.org> writes:
> On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
>> 2. In what format would that "read-only parsed version" be? It cannot be
>> a traditional hash table or other ADT since they can't be shared between
>> processes, as explained previously.
>
> No need to parse it. Just a native OS-flatfile. Since your data +
> keys is ASCII, this flatfile is an ASCII file.

[...]

> You can store an array of {offset, hash,(key)len,next} in 
> another flatfile. This second flatfile needs to be binary, but can
> be created on the target platform.

Which requires the data to be parsed on the target platform to create
the index. It just (presumably) requires less complicated parsing
than parsing the XML file. The most sensible way to do this would
still be to preprocess the transport format once and share the result
among all client instances. The OP has repeatedly went ballistic at
the mere suggestion that this could be a sensible strategy.

Not to mention that a non-XML transport format would still need to be
designed. But this would (likely) lead to reducing the amount of data
transmitted, so, again cause Mr Rex' mind to blow up in anger.


0
Reply rweikusat (2679) 4/17/2008 9:56:03 AM

On Thu, 17 Apr 2008 11:56:03 +0200, Rainer Weikusat wrote:

> moi <root@invalid.address.org> writes:
>> On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
>>> 2. In what format would that "read-only parsed version" be? It cannot
>>> be a traditional hash table or other ADT since they can't be shared
>>> between processes, as explained previously.
>>
>> No need to parse it. Just a native OS-flatfile. Since your data + keys
>> is ASCII, this flatfile is an ASCII file.
> 
> [...]
> 
>> You can store an array of {offset, hash,(key)len,next} in another
>> flatfile. This second flatfile needs to be binary, but can be created
>> on the target platform.
> 
> Which requires the data to be parsed on the target platform to create
> the index. It just (presumably) requires less complicated parsing than
> parsing the XML file. The most sensible way to do this would still be to
> preprocess the transport format once and share the result among all
> client instances. The OP has repeatedly went ballistic at the mere
> suggestion that this could be a sensible strategy.

I am afraid you are right. As an experiment, I used /usr/dict ~5MB, >500K 
words, and created the hashtable in approx 2 seconds walltime. Once the 
hashtable is present on disk, searching one key takes 2 ms or so (most of 
it program startup, of course) All using mmap.
Creating a binary file for another architecture is not impossible to do 
(even dd has options to swap words). Most hardware is little-endian, 
anyway these days, and most people don't get beyond x86, anyway.

Agreed, my parser is simple (just one sweep, looking for space and CR; an 
XML parse may need more than one pass, and a lot more state. But programs 
like these are governed by I/O cost, and one can easily perform 10 scans 
over each buffer being read in. (once it *is* read in ;-)

> 
> Not to mention that a non-XML transport format would still need to be
> designed. But this would (likely) lead to reducing the amount of data
> transmitted, so, again cause Mr Rex' mind to blow up in anger.

XML is the problem. Sticking to it is more than a problem. It is a 
disease.

AvK
0
Reply root32 (398) 4/18/2008 12:05:20 AM

Rex Mottram <rexm@not.here> writes:

> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does
> anyone know of an implementation of such a format which can be
> generated in Java and consumed in C code?

IIUC, what you need is a hash table.  Traditionally, a hash table is
implemented as an array of pointers, where the pointers point to linked
lists of objects that hash to the same value, and that value is the
index into the array.

So the fundamental problem is that pointers on the server contain
addresses that are meaningless on the client.

I've often seen linked lists implemented using array indices instead of
pointers.  For example:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct list {
  int next;
  const char *data;
};

static struct list pool[1024];
static int freeidx;

static int pool_alloc(void)
{
  int tmp = freeidx;

  if (freeidx != -1)
    freeidx = pool[freeidx].next;

  return tmp;
}

static void pool_free(int idx)
{
  pool[idx].next = freeidx;
  freeidx = idx;
}

static int hashtable[256];

const char *strings[] = {
  "hello",
  "world",
  "this",
  "is",
  "a",
  "hashtable",
  "example"
};

static unsigned char hashfunc(const char *str)
{
  unsigned int hash = 0;

  for (hash = 0; *str; ++str)
    hash += *str;

  return hash & 0xff;
}

static void hash_insert(const char *str)
{
  int hidx = hashfunc(str);
  int pidx = pool_alloc();

  if (pidx == -1) {
    fputs("out of memory\n", stderr);
    exit(1);
  }

  pool[pidx].data = str;
  pool[pidx].next = hashtable[hidx];
  hashtable[hidx] = pidx;
}

static int hash_find(const char *str)
{
  int hidx = hashfunc(str);
  int iter;

  for (iter = hashtable[hidx]; iter != -1; iter = pool[iter].next)
    if (strcmp(str, pool[iter].data) == 0)
      break;

  return iter != -1;
}

int main(int argc, char *argv[])
{
  int i;

  /* initialize the pool */
  for (i=0; i<sizeof(pool)/sizeof(pool[0]); i++)
    pool[i].next = i+1;
  pool[i-1].next = -1;

  /* initialize the hash table */
  memset(hashtable, -1, sizeof(hashtable));

  for (i=0; i<sizeof(strings)/sizeof(strings[0]); i++)
    hash_insert(strings[i]);

  for (i=1; i<argc; i++)
    printf ("%s%s found in table\n", argv[i],
	    (hash_find(argv[i]) ? "" : " not"));

  return 0;
}


-- 
Charles M. "Chip" Coldwell
"Turn on, log in, tune out"
GPG Key ID: 852E052F
GPG Key Fingerprint: 77E5 2B51 4907 F08A 7E92  DE80 AFA9 9A8F 852E 052F
0
Reply coldwell7 (79) 4/20/2008 12:17:49 PM

On 16.04.2008 18:01, Rex Mottram wrote:
> Mark Space wrote:
>> If you find anything interesting, let us know.
> 
> I apologize for letting other parts of this thread devolve into a 
> useless squabble. But for those of you with a substantive interest, let 
> me report back that I'm getting moderately interested in/excited about 
> the CDB ("constant database") family.
> 
> Having realized that what I'm looking at can be thought of as a constant 
>  (meaning write once, read many) database, I did some searching and 
> found CDB. The original CDB package is C code which is unchanged since 
> 2000 but still works fine. There's also a "TinyCDB" fork of the project 
> which is newer and ported to Windows and still format-compatible with 
> the original.
> 
> Most interestingly, there are APIs for a number of languages including a 
> jar file for Java (this version is called sg-cdb). So it looks like I 
> can use the sg-cdb API to write a cdb-format file directly from Java and 
> send it to the client(s), which can link with libcdb.a and navigate it 
> directly.
> 
> Whether I can encode my data into the key-value style of CDB (which is a 
> dbm-like model) is an open question but that's my problem. For the sake 
> of readers here and subsequent archive searches I thought I'd mention 
> that the combination of sg-cdb on the server side and tinycdb on the 
> client side may make a lot of sense for certain applications.
> 
> Licensing is also simple: TinyCDB is in the public domain and sg-cdb is 
> BSD licensed.

I tried to find a post of yours in the thread where you explain what 
kind of data structure you need but could not find one.  The term 
"offset based hash table for ASCII data" is far too generic for me to 
get a concrete idea.  Can you elaborate a bit or show a sample XML if it 
is not too large?

Kind regards

	robert
0
Reply shortcutter (5766) 4/20/2008 12:27:58 PM

Charles Coldwell wrote:
> Rex Mottram <rexm@not.here> writes:
> 
>> The traditional issue with transportable data structures is that since
>> the client can't reliably control what address the data is mapped to,
>> all addressing must be made relative to the starting point. Does
>> anyone know of an implementation of such a format which can be
>> generated in Java and consumed in C code?
> 
> IIUC, what you need is a hash table.  Traditionally, a hash table is
> implemented as an array of pointers, where the pointers point to linked
> lists of objects that hash to the same value, and that value is the
> index into the array.
> 
> So the fundamental problem is that pointers on the server contain
> addresses that are meaningless on the client.

Almost, but not quite. While it's true that pointers on the server don't 
map to pointers on the client, the problem is worse than that: you 
cannot share pointers between processes *even on the same system*.

In order to share a hash table it would be necessary to mmap() the file 
containing it (or use a shared or malloc-ed memory region but the issues 
would be exactly the same). Unfortunately you cannot reliably ask for 
any of these addresses to be mapped at a preset place, which makes 
perfect sense when you think about it - what if something else is 
already mapped at that location? Thus the documentation sets for mmmap 
and the equivalent Win32 APIs have clear warnings to the effect that 
"you may request mapping at a particular virtual address but you cannot 
count on that request being honored".

So the problem is not just between server and client but between all 
processes whether on server or client (and in my case there are multiple 
client-side processes sharing the data). This is where the "ship the 
data to the client in XML, then parse it into a hash table and share it 
around" proposal made by many people falls down.

Anyway, a good solution has been found which I will describe in another 
sub-thread.

RM
0
Reply rexm (52) 4/20/2008 5:19:26 PM

On Sun, 20 Apr 2008 13:19:26 -0400, Rex Mottram wrote:

> Charles Coldwell wrote:

> In order to share a hash table it would be necessary to mmap() the file
> containing it (or use a shared or malloc-ed memory region but the issues
> would be exactly the same). Unfortunately you cannot reliably ask for
> any of these addresses to be mapped at a preset place, which makes

No. you don't *need* to mmap the array to a fixed address.
Just make sure all it's members are offsets /indexes; not pointers. 
period.
[ see my previous posts (and others) in comp.programmig ]

> perfect sense when you think about it - what if something else is
> already mapped at that location? Thus the documentation sets for mmmap
> and the equivalent Win32 APIs have clear warnings to the effect that
> "you may request mapping at a particular virtual address but you cannot
> count on that request being honored".
> 
> So the problem is not just between server and client but between all
> processes whether on server or client (and in my case there are multiple
> client-side processes sharing the data). This is where the "ship the
> data to the client in XML, then parse it into a hash table and share it
> around" proposal made by many people falls down.

The problem is that you failed to describe your problem.

> Anyway, a good solution has been found which I will describe in another
The solution is trivial. Presenting the problem was much harder.

AvK
0
Reply root32 (398) 4/20/2008 5:34:42 PM

Robert Klemme wrote:
> I tried to find a post of yours in the thread where you explain what 
> kind of data structure you need but could not find one.  The term 
> "offset based hash table for ASCII data" is far too generic for me to 
> get a concrete idea.  Can you elaborate a bit or show a sample XML if it 
> is not too large?

I don't know how many other people will be interested in reading to this 
depth but for the record, here's a (long) synopsis. There are 3 
elemental concepts: Activities, States, and Transactions. Activities are 
operations which read some States and modify others. Every time an 
Activity takes place, all affected states are logged (in a server-side 
database) as a Transaction.

Now, these Activities can be pretty slow/expensive to run. Therefore we 
look for opportunities to skip them. Specifically, if a previous 
Transaction was run on the identical set of input States, we can skip 
the new Activity and instead log a pointer back to the previous 
Transaction. This is a _really_ big win when possible.

Since XML must be parsed sequentially and feeling that a DOM model would 
be too expensive, we went to some lengths to define a schema which could 
accommodate the above logic in a single (SAX)  pass. The result was 
something like the following 3 stanzas (naturally this will have lots of 
angle brackets and I can't promise it will render correctly in HTML 
readers):

<transactions>
   <tx id="1" ... />
   <tx id="2" ... />
	.
	.
	.
</transactions>


<activities>
   <activity id="1" ... />
   <activity id="2" ... />
	.
	.
	.
</activities>


<states>
   <state id="1" ... />
   <state id="2" ... />
	.
	.
	.
</states>


This leaves out many details but the basic structure is fairly clear. 
The client code parses through the Transaction section, storing a 
reference to each transaction in a local hash table known as the TX 
table. Then it walks through the set of activities, doing a 
(necessarily) linear search for the current Activity. Once that is 
matched it sets a "found" flag and chews through the rest of the 
Activity stanza until reaching the States stanza.

Next, each State is compared vs current reality. Whenever a State 
mismatch is encountered, the Transactions associated with the mismatched 
State are deleted from the TX table. If we reach the end of the States 
and the TX table is not empty, what remains are matches. If at any 
intermediate point we find that the TX table has become empty, we can 
abort the parse and just go run the Activity.

What makes this preferable to a DOM design is the opportunity to abort 
the parse as soon as potential matches are exhausted; DOM would have 
required a full parse followed by some memory accesses.

This is the current model and it works quite stably, but unfortunately 
not fast enough. XML parsing proves to be the dominant time cost. Since 
the Transaction stanza is small relative to the other two, which can 
grow quite large, it seems quite likely that it's the linear search 
aspect of the latter two stanzas that's the problem.

Thus the key requirement for the "XML replacement technology" is that it 
be random access. A hash table is the obvious candidate, and that's 
exactly what CDB turned out to be, but a tree-based structure such as a 
red-black tree might have been fine too. Anyway, CDB turned out to be 
everything I was looking for.

RM
0
Reply rexm (52) 4/20/2008 6:36:15 PM

Rex Mottram wrote:
> Having realized that what I'm looking at can be thought of as a constant 
> (meaning write once, read many) database, I did some searching and 
> found CDB. The original CDB package is C code which is unchanged since 
> 2000 but still works fine. There's also a "TinyCDB" fork of the project 
> which is newer and ported to Windows and still format-compatible with 
> the original.
> 
> Most interestingly, there are APIs for a number of languages including a 
> jar file for Java (this version is called sg-cdb). So it looks like I 
> can use the sg-cdb API to write a cdb-format file directly from Java and 
> send it to the client(s), which can link with libcdb.a and navigate it 
> directly.
> 
> Whether I can encode my data into the key-value style of CDB (which is a 
> dbm-like model) is an open question but that's my problem. For the sake 
> of readers here and subsequent archive searches I thought I'd mention 
> that the combination of sg-cdb on the server side and tinycdb on the 
> client side may make a lot of sense for certain applications.
> 
> Licensing is also simple: TinyCDB is in the public domain and sg-cdb is 
> BSD licensed.

Epilogue: CDB has worked out beautifully. In essence it's exactly what 
the subject line specified: a hash table contained within a file which 
can be mapped and used at any memory location because it doesn't rely on 
pointers. The fact that it can be written via a pure Java API (sg-cdb) 
and accessed via a simple C API (TinyCDB) which handles its own file 
mapping logic is a nice bonus. As are the easy license terms.

RM
0
Reply rexm (52) 4/20/2008 6:44:14 PM

moi wrote:
>> Charles Coldwell wrote:
> 
>> In order to share a hash table it would be necessary to mmap() the file
>> containing it (or use a shared or malloc-ed memory region but the issues
>> would be exactly the same). Unfortunately you cannot reliably ask for
>> any of these addresses to be mapped at a preset place, which makes

Charles Coldwell did not write that. I did.

> No. you don't *need* to mmap the array to a fixed address.
> Just make sure all it's members are offsets /indexes; not pointers. 
> period.

Oh. So I should have been looking for an offset-based data structure all 
along? Thanks. Next time I'll remember to specify that in the subject line.

RM
0
Reply rexm (52) 4/20/2008 6:50:10 PM

On 20.04.2008 20:36, Rex Mottram wrote:
> Robert Klemme wrote:
>> I tried to find a post of yours in the thread where you explain what 
>> kind of data structure you need but could not find one.  The term 
>> "offset based hash table for ASCII data" is far too generic for me to 
>> get a concrete idea.  Can you elaborate a bit or show a sample XML if 
>> it is not too large?
> 
> I don't know how many other people will be interested in reading to this 
> depth but for the record, here's a (long) synopsis.

Thanks for elaborating!

> There are 3 
> elemental concepts: Activities, States, and Transactions. Activities are 
> operations which read some States and modify others. Every time an 
> Activity takes place, all affected states are logged (in a server-side 
> database) as a Transaction.

With affected states do you mean prior or post activity (or both)?  Is 
the transaction list you transfer to clients always the complete history?

> Now, these Activities can be pretty slow/expensive to run. Therefore we 
> look for opportunities to skip them. Specifically, if a previous 
> Transaction was run on the identical set of input States, we can skip 
> the new Activity and instead log a pointer back to the previous 
> Transaction. This is a _really_ big win when possible.

So you need a quick match based on (activity, (input state1, ...input 
stateN)), do you?

> Since XML must be parsed sequentially and feeling that a DOM model would 
> be too expensive, we went to some lengths to define a schema which could 
> accommodate the above logic in a single (SAX)  pass. The result was 
> something like the following 3 stanzas (naturally this will have lots of 
> angle brackets and I can't promise it will render correctly in HTML 
> readers):
> 
> <transactions>
>   <tx id="1" ... />
>   <tx id="2" ... />
>     .
>     .
>     .
> </transactions>

You mention that transactions involve state but I do not see it here.

> <activities>
>   <activity id="1" ... />
>   <activity id="2" ... />
>     .
>     .
>     .
> </activities>
> 
> 
> <states>
>   <state id="1" ... />
>   <state id="2" ... />
>     .
>     .
>     .
> </states>
> 
> 
> This leaves out many details but the basic structure is fairly clear. 
> The client code parses through the Transaction section, storing a 
> reference to each transaction in a local hash table known as the TX 
> table. Then it walks through the set of activities, doing a 
> (necessarily) linear search for the current Activity.

How do you determine from the structure above which is the current one? 
  There must be something (I am) missing.

> Once that is 
> matched it sets a "found" flag and chews through the rest of the 
> Activity stanza until reaching the States stanza.
> 
> Next, each State is compared vs current reality.

So you have an additional input, current state(s)?

> Whenever a State 
> mismatch is encountered, the Transactions associated with the mismatched 
> State are deleted from the TX table. If we reach the end of the States 
> and the TX table is not empty, what remains are matches. If at any 
> intermediate point we find that the TX table has become empty, we can 
> abort the parse and just go run the Activity.

In other words: if after looking at all states the TX table is non 
empty, you do not need to run the activity (and save a lot of time).

> What makes this preferable to a DOM design is the opportunity to abort 
> the parse as soon as potential matches are exhausted; DOM would have 
> required a full parse followed by some memory accesses.

I find DOM rarely the best choice.  Even for other types of XML 
processing (that do not involve early exit) SAX is usually a much better 
(faster and memory savvier) choice.

> This is the current model and it works quite stably, but unfortunately 
> not fast enough. XML parsing proves to be the dominant time cost. Since 
> the Transaction stanza is small relative to the other two, which can 
> grow quite large, it seems quite likely that it's the linear search 
> aspect of the latter two stanzas that's the problem.
> 
> Thus the key requirement for the "XML replacement technology" is that it 
> be random access. A hash table is the obvious candidate, and that's 
> exactly what CDB turned out to be, but a tree-based structure such as a 
> red-black tree might have been fine too. Anyway, CDB turned out to be 
> everything I was looking for.

And apparently you solved the encoding issue.  That's good to hear.

(reading CDB)

My recommendations would have gone into a similar direction: define a 
binary format that satisfies your lookup requirements and create read 
write code in Java and read code in C.  If files are small enough you 
can memory map them, otherwise you have to use an approach using 
indexing techniques as CDB does.  But since you found your solution we 
don't need to waste more brain cycles.

Thanks again.  And please let us know the outcome (aka speed 
improvement) or any issues you find along the way.  That way we can all 
learn a bit. :-)

Kind regards

	robert
0
Reply shortcutter (5766) 4/20/2008 10:44:00 PM

Robert Klemme wrote:
> With affected states do you mean prior or post activity (or both)?  Is 
> the transaction list you transfer to clients always the complete history?

We _store_ both precondition states and result states of each 
transaction. But of course when looking for a prior match we need only 
compare preconditions.

> So you need a quick match based on (activity, (input state1, ...input 
> stateN)), do you?

Precisely.

> You mention that transactions involve state but I do not see it here.

I did say I was leaving out many details ... yes, in the XML there are 
attributes mapping each transaction to a list of states. In fact many 
attributes (represented by "...") were left out.

>> This leaves out many details but the basic structure is fairly clear. 
>> The client code parses through the Transaction section, storing a 
>> reference to each transaction in a local hash table known as the TX 
>> table. Then it walks through the set of activities, doing a 
>> (necessarily) linear search for the current Activity.
> 
> How do you determine from the structure above which is the current one? 
>  There must be something (I am) missing.

Yes, I left that out as an implementation detail. The client determines 
its current activity and then chews through the Activities stanza 
looking for a match.

> So you have an additional input, current state(s)?

Both current States and current Activity are discoverable by the client 
on its own, so whether that's considered an input or not is debatable 
(but not worth debating).

> In other words: if after looking at all states the TX table is non 
> empty, you do not need to run the activity (and save a lot of time).

Yes.

> Thanks again.  And please let us know the outcome (aka speed 
> improvement) or any issues you find along the way.  That way we can all 
> learn a bit. :-)

Will do.

RM
0
Reply rexm (52) 4/20/2008 11:04:47 PM

Rex Mottram wrote:

> 
> I don't know how many other people will be interested in reading to this 
> depth but for the record, here's a (long) synopsis. There are 3 

I'm still reading this thread too.  I hope I don't need anything besides 
an XML parser for exchanging data, but if I do I'm glad to know there 
are other solutions out there.

Laziness is a virtue.  Never code your own solution when you can 
download one instead.
0
Reply markspace (954) 4/21/2008 1:17:22 AM

Rex Mottram wrote:
>> Thanks again.  And please let us know the outcome (aka speed 
>> improvement) or any issues you find along the way.  That way we can 
>> all learn a bit. :-)
> 
> Will do.

As promised, the final results:

A. I've had no problems with CDB in either of the versions I'm using; it 
works quite well, is clearly documented, etc. It took a few minutes to 
port TinyCDB to Windows but no big deal, and I've offered the diffs to 
the author.

B. In the particular test case I was profiling, a typical run of the XML 
version averaged around 19-20 minutes. The CDB version is around *2* 
minutes. I consider this a rousing success.

I was a little surprised, FWIW, to find that the CDB file was typically 
no smaller than the XML file. Given how "wordy" XML is I'd have expected 
almost any other format to be smaller, but CDB came in around the same 
size. Of course these details will vary considerably with many factors 
so this is just one data point.

RM
0
Reply rexm (52) 4/25/2008 4:05:05 PM

> From: Rex Mottram <r...@not.here>
> I'm looking for an offset-based data structure to hold character data.

I'm not sure what you mean by "offset-based data structure".
Do you mean that there are links (pointers) from one place to
another, such as a link from a *place* within one block of data to
the *front* of another block of data? For example:
   Block1: imhihmgihiothmctio[ptr]pfoevpogsip
                               |
           /-------------------/
           V
   Block2: erjevpigihgipgipughipru
And instead of [ptr] showing the absolute address of Block2, it
shows the difference between the address of Block2 and the address
of the pointer itself?

I definitely don't know what you mean by "character data" in this
context. Except for pointers linking a place within one block to
the start of another block (or whatever else you might have in mind
that I can't guess), is the rest of the content of each block a
sequence of characters? Are these characters fixed-size, such as
US-ASCII or Latin-1 or UCS-2, or variable size such as UTF-8?

> Background: I'm working with an app whose server is written in
> Java while the client is in C. The server needs to package up a
> bunch of data and ship it to the client where it will be traversed,
> read-only, many many times.

Java is certainly powerful enough to build a data-structure with
genuine pointers between the various pieces, then compile each
atomic block of data (not including any links) into a pure array of
data, then create an abstract layout into a linear sequence which
includes handles to each atomic block and tokens where each link
from one place to another block is needed, then measure the sizes
of all those atomic blocks and use that information to compute
where each block and each link-place would start if laid out per
that linear model, hence compute the offset for each link, and
finally allocate a large block and copy all the atomic blocks and
link-offsets into the large block per the linear model. And then
it's trivial to write that array out to a file or communication
channel verbatim.

Can you please give me a small example of a structure that you
might need to process in this way, perhaps three to five atomic
blocks of data, nor more than ten or so bytes of data within each
atomic block, with two to four (N-1) links connecting them into a
structure, or maybe more than N-1 links if you allow some blocks to
have more than one link to each? We could use that example for
illustration of the linear-model algorithm, and eventually as test
data for any code that would be written.

> There's actually a working implementation which sends the data in
> an XML format, but it's too slow and profiling reveals that it
> spends almost all its time in XML parsing, so I need to replace the
> XML with something more natural to the purpose.

When you say "IT" is too slow, and when you say "IT" spends almost
all its time in XML parsing, does "IT" mean the C code to receive
the XML data and convert to internal data structure?

> The client starts thousands of child processes which each start
> by parsing the XML document in order to place its data in hash
> tables, after which it is accessed via the hash table only.

Are all the child processes on one machine, or distributed over
many hosts all over the InterNet?

I can see how parsing XML more than once for copies of exactly the
same data structure would be a wasteful practice.

> So the XML is very much an intermediate format, and the obvious
> win would be to have the Java server package the data in a form
> which is directly usable by the clients.

You need to tell us what forms of data *are* usable by the clients.
For example, if a C program were given the total size of all the
atomic blocks of data plus all the machine-absolute pointers needed
for all the links from a place within a non-atomic block to the
start of another block, would the C program be able to malloc a
single block of that size, then sub-allocate to build all the
blocks and links between them? Then would the rest of the program
be able to run using that in-RAM layout of data? If so, it seems to
me the following protocol would suffice:
- Transmit an integer telling the total size of the data block needed.
- Transmit a block of bytes of that size, containing all the atomic
   blocks of data in correct relative position, with zero bytes for
   all the places where pointers are needed.
- Transmit an integer telling the total number of links.
- Transmit a block consisting of that many (place,target) pairs,
   each relative to the start of the overall block.
The C client then reads that first integer, and mallocs a block of
that size, then array-reads the block of that size into that block,
then reads the second integer, then computes the size of array for
that many pointers and allocates that size block, then block-reads
a block of that size into that second block, then steps through
that second block fetching each (relPlace,relTarget) pair and
adding each value to the address of the start of the first block to
yield a (absPlace,absTarget) and then executing *absPlace=absTarget,
and finally that second block is freed. So that's 6 lines of code
to do all the input, a program loop of 3 lines of code (or one line
of code with nested expression) to convert and patch all the links,
and finally one line of code to free that second array.

> Presumably the data would arrive from the server and be stored in
> a file: each client would then map the file into its address space

You seem to be implying that all the clients are running on just
one computer, possibly multiple CPUs sharing RAM, whereby it's
possible to map one block of RAM into *all* the CPUs?

> and treat it as a hash table (or similar data structure) right
> away.


Why a hash table??? Why do you need a hash table???
You need to explain the abstract model of your data structure
better so that we can decide whether a simple linked structure (as
I assumed above), or a hash table is most appropriate for your
needs, or possibly multiple independent linked structures with
hash-table entries for each, or possibly multiple linked structures
that share some of their structure with a hash table that points
into various blocks within the overall structure, is most
appropriate.

> The traditional issue with transportable data structures is that
> since the client can't reliably control what address the data is
> mapped to, all addressing must be made relative to the starting
> point.

Agreed, but as soon as you have a hash table you can't even say
what the relative locations are between various parts of the
overall data corpus. If a hash table is used *only* for
key-to-location lookup by the user interface, and all the corpus of
data is pure linked-structure, then all you need is one large block
of binary data (per the model I described earlier) plus a list of
(key,location) pairs which can then be installed in the hash table.
relLocation values would be transmitted, and the client would
convert each to absLocation.

I think you definitely need to create a *complete* toy example of
what kind of data corpus structure you wish to transmit from server
to client and build into disk file and thence into RAM on client,
so that we can see what you are really wanting.

> Does anyone know of an implementation of such a format which can
> be generated in Java and consumed in C code?

Before that can be answered, we need a better idea precisely what
kind of data corpus structure you wish to transmit from server to
client. It would do no good for me to spend one hour writing the
server code and five minutes to write the client code, for the
model I described above, if that wouldn't serve your needs.
0
Reply lisp1.3.CalRobert (59) 4/26/2008 3:18:20 AM

On 25.04.2008 18:05, Rex Mottram wrote:
> Rex Mottram wrote:
>>> Thanks again.  And please let us know the outcome (aka speed 
>>> improvement) or any issues you find along the way.  That way we can 
>>> all learn a bit. :-)
>>
>> Will do.
> 
> As promised, the final results:

Thank you for the summary!

> A. I've had no problems with CDB in either of the versions I'm using; it 
> works quite well, is clearly documented, etc. It took a few minutes to 
> port TinyCDB to Windows but no big deal, and I've offered the diffs to 
> the author.

Thank you, that way the community can benefit as well.

> B. In the particular test case I was profiling, a typical run of the XML 
> version averaged around 19-20 minutes. The CDB version is around *2* 
> minutes. I consider this a rousing success.

Indeed!

> I was a little surprised, FWIW, to find that the CDB file was typically 
> no smaller than the XML file. Given how "wordy" XML is I'd have expected 
> almost any other format to be smaller, but CDB came in around the same 
> size. Of course these details will vary considerably with many factors 
> so this is just one data point.

IIRC CDB uses hashing.  At least in memory hash tables are usually 
larger than the number of entries in there so that might be the case 
here as well.  I am sure a closer look at the documentation of the file 
format will reveal the reason but since you said that bandwidth was not 
an issue I guess it's not worthwhile bothering.  Still a good hint for 
others.

Kind regards

	robert
0
Reply shortcutter (5766) 4/27/2008 10:11:55 AM

49 Replies
55 Views

(page loaded in 0.476 seconds)

Similiar Articles:

















7/26/2012 8:02:36 PM


Reply: