Ping Jim Janney - sizing your snippet store

Jim,

Since i spouted off on cljp about how key-value stores were better than 
RDBMSs, i thought i ought to put my money where my mouth is. I'm going to 
put together a simple demo/benchmark for storing key-value data, 
implemented on top of Tokyo Cabinet and JDBC, which will hopefully 
demonstrate that TC is easier to use and faster than a database (i can try 
Derby, H2 and a leading commercial database whose license prohibits the 
posting of performance comparisons, and so which will have to go nameless, 
lest the bearded, megalomaniacal CEO of its manufacturer get upset).

So, it would be good to know what sort of problem i'm actually trying to 
show that KV stores are good for. Jim, you said:

> I need to maintain a data base of small text snippets keyed by arbitrary
> strings, without the overhead of a full SQL relational database.  We
> will have several people putting data into it so it needs to support
> concurrent access over a network.

Could i trouble you to expand a little on sizing? In particular:

- How many entries are there?
- How big are the keys? What sort of things are they?
- How big are the values?
- What's the workload mix? Read vs write vs delete?
- Are writes mostly of new entries, or overwrites of existing ones?
- How skewed is the access towards a few hot entries?
- How many users are using it in parallel?
- What is the request rate like?
- How much RAM can be used? How much disk space?

Answers to any of those, no matter how rough, would be useful.

I'll post to cljp if/when i get this done.

tom

-- 
limited to concepts that are meta, generic, abstract and philosophical --
IEEE Standard Upper Ontology Working Group
Reply twic (2083) 4/2/2010 6:27:45 PM

Tom Anderson <twic@urchin.earth.li> writes:

> Jim,
>
> Since i spouted off on cljp about how key-value stores were better
> than RDBMSs, i thought i ought to put my money where my mouth is. I'm
> going to put together a simple demo/benchmark for storing key-value
> data, implemented on top of Tokyo Cabinet and JDBC, which will
> hopefully demonstrate that TC is easier to use and faster than a
> database (i can try Derby, H2 and a leading commercial database whose
> license prohibits the posting of performance comparisons, and so which
> will have to go nameless, lest the bearded, megalomaniacal CEO of its
> manufacturer get upset).
>
> So, it would be good to know what sort of problem i'm actually trying
> to show that KV stores are good for. Jim, you said:
>
>> I need to maintain a data base of small text snippets keyed by arbitrary
>> strings, without the overhead of a full SQL relational database.  We
>> will have several people putting data into it so it needs to support
>> concurrent access over a network.
>
> Could i trouble you to expand a little on sizing? In particular:
>
> - How many entries are there?
> - How big are the keys? What sort of things are they?
> - How big are the values?
> - What's the workload mix? Read vs write vs delete?
> - Are writes mostly of new entries, or overwrites of existing ones?
> - How skewed is the access towards a few hot entries?
> - How many users are using it in parallel?
> - What is the request rate like?
> - How much RAM can be used? How much disk space?
>
> Answers to any of those, no matter how rough, would be useful.
>
> I'll post to cljp if/when i get this done.
>
> tom

We want our customers to be able to add their own annotations to
fields on our application's screens, sort of like sticky notes.  (This
is why I was fooling with tooltips earlier).  Which means that every
time the app displays a screen it needs to check for an annotation on
pretty much every field on the screen.  Most of the time there won't
be any, or only very few, but it has to check anyway, and we have some
very complicated screens with nested tabbed panes.  Many fields appear
in more than one screen.  For fields tied directly to a column in a
database table the key is the table name plus the column name.  For
derived fields we assign a more or less arbitrary key; these can't be
allowed to change or the sticky notes get mixed up.

The app is in Java, started through JNLP, communicates with an AS/400
through a LAN (each customer has its own).  Some customers have
branches attached through slow connections.

So:
   Number of entries probably < 1000 (but who knows what people will do?)
   Keys are strings, 20 to 30 characters.
   Values are strings, maybe 50 to 500 characters.
   Frequent reads, occasional writes, very few deletes.
   Maybe 5 to 20 users.
   Every time someone displays a screen, 20 to 100 reads need to happen
   in well under a second.  Most of these will be misses.

As I said in another message, the real problem here is probably
network latency, and switching databases won't help that.  I think
I'll have to abandon the brute force approach and try something more
clever.

-- 
Jim Janney
Reply Jim 4/3/2010 3:20:26 AM


On Fri, 2 Apr 2010, Jim Janney wrote:

> So:
>   Number of entries probably < 1000 (but who knows what people will do?)
>   Keys are strings, 20 to 30 characters.
>   Values are strings, maybe 50 to 500 characters.
>   Frequent reads, occasional writes, very few deletes.
>   Maybe 5 to 20 users.
>   Every time someone displays a screen, 20 to 100 reads need to happen
>   in well under a second.  Most of these will be misses.

My gut feeling is that this is not a big ask, and that an RDBMS will be 
able to deal with this without any trouble. The first performance-related 
thing i'd do (possibly even upfront, before i had hard data that 
performance was actually a problem) would be to batch the reads: collect 
together all the keys you're going to need, send them over the wire all at 
once, and have the far end do a query like:

SELECT key, value
FROM annotation
WHERE key IN (?, ?, ?, ?)

Plugging each key into one of the parameters in the IN-set.

There's a question over how many parameters you put in that IN-set, and 
what you do if your number of keys is different. I think i'd construct the 
PreparedStatement on the fly with exactly the right number of parameters, 
so the question is moot. Since there will only be a small number of 
different numbers of parameters (not more than one per form in your app), 
the database's PreparedStatement cache should happily hold all of them, 
and you won't be incurring a lot of preparation overhead.
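
For what it's worth, here is a rough sketch of building that statement on the fly in Java. The class and method names (AnnotationBatchReader, buildInQuery, fetch) are made up for illustration; the table and column names are the ones from the query above. Note that some databases treat KEY or VALUE as reserved words, so the columns may need quoting or renaming. String.join needs Java 8 or later.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: batches all annotation lookups for one screen.
class AnnotationBatchReader {

    // Builds "... WHERE key IN (?, ?, ..., ?)" with exactly n placeholders.
    static String buildInQuery(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one key");
        }
        String placeholders = String.join(", ", Collections.nCopies(n, "?"));
        return "SELECT key, value FROM annotation WHERE key IN (" + placeholders + ")";
    }

    // Fetches the annotations for all the given keys in one round trip.
    // Keys with no annotation are simply absent from the returned map.
    static Map<String, String> fetch(Connection conn, List<String> keys) throws SQLException {
        Map<String, String> found = new HashMap<>();
        if (keys.isEmpty()) {
            return found; // nothing to ask for, so skip the query entirely
        }
        try (PreparedStatement ps = conn.prepareStatement(buildInQuery(keys.size()))) {
            for (int i = 0; i < keys.size(); i++) {
                ps.setString(i + 1, keys.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    found.put(rs.getString(1), rs.getString(2));
                }
            }
        }
        return found;
    }
}
```

Since most lookups are misses, the caller would treat any key missing from the map as "no annotation".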

> As I said in another message, the real problem here is probably network 
> latency, and switching databases won't help that.  I think I'll have to 
> abandon the brute force approach and try something more clever.

Probably. Robert Klemme's suggestion of building a toy version to get some 
data on the performance numbers is an excellent one; in my part of the 
world, we call this a 'spike':

http://www.extremeprogramming.org/rules/spike.html

tom

-- 
It's never too late to change the future.
Reply Tom 4/3/2010 2:00:08 PM

Tom Anderson <twic@urchin.earth.li> writes:

> On Fri, 2 Apr 2010, Jim Janney wrote:
>
>> So:
>>   Number of entries probably < 1000 (but who knows what people will do?)
>>   Keys are strings, 20 to 30 characters.
>>   Values are strings, maybe 50 to 500 characters.
>>   Frequent reads, occasional writes, very few deletes.
>>   Maybe 5 to 20 users.
>>   Every time someone displays a screen, 20 to 100 reads need to happen
>>   in well under a second.  Most of these will be misses.
>
> My gut feeling is that this is not a big ask, and that an RDBMS will
> be able to deal with this without any trouble. The first
> performance-related thing i'd do (possibly even upfront, before i had
> hard data that performance was actually a problem) would be to batch
> the reads: collect together all the keys you're going to need, send
> them over the wire all at once, and have the far end do a query like:
>
> SELECT key, value
> FROM annotation
> WHERE key IN (?, ?, ?, ?)
>
> Plugging each key into one of the parameters in the IN-set.
>
> There's a question over how many parameters you put in that IN-set,
> and what you do if your number of keys is different. I think i'd
> construct the PreparedStatement on the fly with exactly the right
> number of parameters, so the question is moot. Since there will only
> be a small number of different numbers of parameters (not more than
> one per form in your app), the database's PreparedStatement cache
> should happily hold all of them, and you won't be incurring a lot of
> preparation overhead.
>
>> As I said in another message, the real problem here is probably
>> network latency, and switching databases won't help that.  I think
>> I'll have to abandon the brute force approach and try something more
>> clever.
>
> Probably. Robert Klemme's suggestion of building a toy version to get
> some data on the performance numbers is an excellent one; in my part
> of the world, we call this a 'spike':
>
> http://www.extremeprogramming.org/rules/spike.html

It occurred to me this morning that if I put a timestamp in each
record, I could 

    SELECT * FROM annotation WHERE timestamp > ?

to find any changes not in the local cache.  Most of the time that
would return nothing, presumably fairly quickly.  It would miss
deletes, but that's probably acceptable.
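
A minimal sketch of the client side of that idea, in Java. Everything here (AnnotationCache, Change, apply, lastSeen) is a hypothetical name, not existing code. It assumes the timestamps are assigned by the server, and that the rows returned by the query above are handed to apply(), with lastSeen() supplying the query parameter on the next poll.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local cache that is refreshed from "changed since" queries.
class AnnotationCache {

    // One row from the timestamp query (assumed shape: key, value, timestamp).
    static final class Change {
        final String key;
        final String value;
        final long timestamp;

        Change(String key, String value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, String> entries = new HashMap<>();
    private long lastSeen = 0L; // highest timestamp applied so far

    // The value to bind to the "?" in the next poll.
    long lastSeen() {
        return lastSeen;
    }

    // Merges the changed rows into the cache and advances the high-water mark.
    void apply(List<Change> changes) {
        for (Change c : changes) {
            entries.put(c.key, c.value);
            if (c.timestamp > lastSeen) {
                lastSeen = c.timestamp;
            }
        }
    }

    // Returns the cached annotation, or null if there is none.
    String get(String key) {
        return entries.get(key);
    }
}
```

One caveat with a strict "greater than" comparison: a row written later with exactly the same timestamp as the last one seen would be missed, so the timestamp resolution matters.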

-- 
Jim Janney
Reply Jim 4/3/2010 2:50:01 PM

On Sat, 3 Apr 2010, Jim Janney wrote:

> Tom Anderson <twic@urchin.earth.li> writes:
>
>> On Fri, 2 Apr 2010, Jim Janney wrote:
>>
>>> So:
>>>   Number of entries probably < 1000 (but who knows what people will do?)
>>>   Keys are strings, 20 to 30 characters.
>>>   Values are strings, maybe 50 to 500 characters.
>>>   Frequent reads, occasional writes, very few deletes.
>>>   Maybe 5 to 20 users.
>>>   Every time someone displays a screen, 20 to 100 reads need to happen
>>>   in well under a second.  Most of these will be misses.
>
> It occurred to me this morning that if I put a timestamp in each
> record, I could
>
>    SELECT *  FROM annotation WHERE timestamp > ?
>
> to find any changes not in the local cache.  Most of the time that would 
> return nothing, presumably fairly quickly.  It would miss deletes, but 
> that's probably acceptable.

Good idea. And you could do deletes by setting the value to NULL and 
leaving the record in the database.
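
On the client, that could look something like the following Java sketch, where a NULL value coming back from the timestamp query is treated as a delete marker. TombstoneCache and applyChange are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical local cache that understands NULL-valued "deleted" rows.
class TombstoneCache {

    private final Map<String, String> entries = new HashMap<>();

    // Applies one changed row; a null value means the annotation was deleted.
    void applyChange(String key, String value) {
        if (value == null) {
            entries.remove(key);
        } else {
            entries.put(key, value);
        }
    }

    // Returns the cached annotation, or null if there is none.
    String get(String key) {
        return entries.get(key);
    }
}
```

The deleted rows stay in the table, so with fewer than 1000 entries and very few deletes they should not add up to much.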

This starts to sound like what you really want is some sort of file 
replication, though.

tom

-- 
Civis Britannicus sum.
Reply Tom 4/5/2010 12:05:11 AM
