Mapping with contiguous ranges of keys

I have some key:value data where the keys often are found in contiguous
ranges with identical values. For example:

{1: "foo",
 2: "foo",
 3: "foo",
 # same for keys 4 through 99
 100: "foo",
 101: "bar",
 102: "bar",
 103: "foobar",
 104: "bar",
 105: "foo", 
}


So in this case, the keys 1 through 100, plus 105, all have the same value.

I have about a million or two keys, with a few hundred or perhaps a few
thousand distinct values. The size of each contiguous group of keys with
the same value can vary from 1 to perhaps a hundred or so.

Has anyone dealt with data like this and could give a recommendation of the
right data structure to use?

The obvious way is to explicitly list each key, in a regular dict. But I
wonder whether there are alternatives which may be better (faster, more
memory efficient)?


Thanks,


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve
12/15/2016 5:06:25 PM
comp.lang.python

On 2016-12-15 12:06 PM, Steve D'Aprano wrote:
> I have about a million or two keys, with a few hundred or perhaps a few
> thousand distinct values. The size of each contiguous group of keys with
> the same value can vary from 1 to perhaps a hundred or so.

There isn't enough info in your post to be sure, but if those values are 
constant then you might be able to subclass dict and write a new 
__getitem__ that checks for specific ranges and calls the superclass 
only if the key is not in the known ranges.  For example:

class MyDict(dict):
    def __getitem__(self, key):
        if isinstance(key, int) and 1 <= key <= 100:
            return "foo"
        return dict.__getitem__(self, key)

Obviously that middle section can be as complex as you need.

-- 
D'Arcy J.M. Cain
System Administrator, Vex.Net
http://www.Vex.Net/ IM:darcy@Vex.Net
VoIP: sip:darcy@Vex.Net
D
12/15/2016 5:24:51 PM
On 12/15/2016 09:06 AM, Steve D'Aprano wrote:
> Has anyone dealt with data like this and could give a recommendation of the
> right data structure to use?
>

I haven't dealt with a data structure exactly like this, but it's 
basically a sparse array. (I'm sure it has a name somewhere in the 
academic literature, but I couldn't find it with a quick search right 
now...)

My solution to what you're asking for would be to have a list of 
key-value pairs, only adding a key to the list if it "changes" the 
value. I.e. your data structure would be this:

l = [
     (1, "foo"),
     (101, "bar"),
     (103, "foobar"),
     (104, "bar"),
     (105, "foo"),
]

Then the only thing you need to do is define the lookup function. I 
would basically just do a binary search on the first values in the 
tuples. I.e. if "n" is your integer, compare n against the first 
element of the middle pair of the list l, then split l in two 
accordingly and repeat. Do this until you either find n itself, or 
you find a pair whose key is less than n while the next pair's key is 
greater than n. Then return the second element of that pair.

There might be better ways to find the keys, but I think this approach 
is probably your best bet.
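A minimal sketch of that lookup (my addition, not from the post; the stdlib bisect module already implements the binary search, so there is no need to hand-roll it):

```python
import bisect

# Pairs of (first key of run, value), sorted by key, as described above.
l = [(1, "foo"), (101, "bar"), (103, "foobar"), (104, "bar"), (105, "foo")]
starts = [k for k, _ in l]  # first keys of each run, already sorted

def lookup(n):
    # Rightmost run whose start is <= n.  Note that keys beyond the last
    # run also land in that final run, since no upper bound is recorded.
    i = bisect.bisect_right(starts, n) - 1
    if i < 0:
        raise KeyError(n)
    return l[i][1]
```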

Cheers,
Thomas
Thomas
12/15/2016 5:27:45 PM
Steve D'Aprano wrote:

> I have some key:value data where the keys often are found in contiguous
> ranges with identical values. For example:
> 
> {1: "foo",
>  2: "foo",
>  3: "foo",
>  # same for keys 4 through 99
>  100: "foo",
>  101: "bar",
>  102: "bar",
>  103: "foobar",
>  104: "bar",
>  105: "foo",
> }
> 
> 
> So in this case, the keys 1 through 100, plus 105, all have the same
> value.
> 
> I have about a million or two keys, with a few hundred or perhaps a few
> thousand distinct values. The size of each contiguous group of keys with
> the same value can vary from 1 to perhaps a hundred or so.
> 
> Has anyone dealt with data like this and could give a recommendation of
> the right data structure to use?
> 
> The obvious way is to explicitly list each key, in a regular dict. But I
> wonder whether there are alternatives which may be better (faster, more
> memory efficient)?

Use a list, either directly if there are no big holes

>>> import sys
>>> r = range(10**5)
>>> sys.getsizeof(list(r))/sys.getsizeof(dict(zip(r, r)))
0.14306676635590074


or indirectly:

ranges_list = [
    1,
    101,
    103,
    104,
    105,
    106,
]

index_to_value = {
    1: "foo",
    2: "bar",
    3: "foobar",
    4: "bar",
    5: "foo",
}

import bisect

def get(key, default="missing"):
    index = bisect.bisect(ranges_list, key)
    return index_to_value.get(index, default)


Peter
12/15/2016 5:54:48 PM
On 12/15/2016 12:06 PM, Steve D'Aprano wrote:
> I have some key:value data where the keys often are found in contiguous
> ranges with identical values. For example:
>
> {1: "foo",
>  2: "foo",
>  3: "foo",
>  # same for keys 4 through 99
>  100: "foo",
>  101: "bar",
>  102: "bar",
>  103: "foobar",
>  104: "bar",
>  105: "foo",
> }

For this particular example (untested):

def lookup(n):
    F, B, FB = 'foo', 'bar', 'foobar'  # Ensure value objects reused
    if 1 <= n <= 100:
        return F
    elif 101 <= n <= 105:
        return (B, B, FB, B, F)[n - 101]
    else:
        raise ValueError('n must be in range(1, 106)')

> So in this case, the keys 1 through 100, plus 105, all have the same value.
>
> I have about a million or two keys, with a few hundred or perhaps a few
> thousand distinct values. The size of each contiguous group of keys with
> the same value can vary from 1 to perhaps a hundred or so.

> Has anyone dealt with data like this and could give a recommendation of the
> right data structure to use?
>
> The obvious way is to explicitly list each key, in a regular dict. But I
> wonder whether there are alternatives which may be better (faster, more
> memory efficient)?

If, as in the example, the keys comprise a contiguous sequence of ints, 
the 'obvious' way to me is sequence of values.  A tuple of constants 
gets compiled and will load quickly from a .pyc file.  4 or 8 bytes per 
entry times 1 or 2 million entries is usually tolerable on a gigabyte 
machine.

Or trade time for space with binary search, or search in an explicit binary 
tree.  Or combine binary search and indexed lookup as I did above.


-- 
Terry Jan Reedy

Terry
12/15/2016 8:42:00 PM
On 12/15/2016 12:27 PM, Thomas Nyberg wrote:
> On 12/15/2016 09:06 AM, Steve D'Aprano wrote:
>> Has anyone dealt with data like this and could give a recommendation
>> of the
>> right data structure to use?
>>
>
> I haven't dealt with a data structure exactly like this, but it's
> basically a sparse array.

A sparse array has at least half missing values.  This one has none on 
the defined domain, but contiguous duplicates.

> (I'm sure it has a name somewhere in the
> academic literature, but I couldn't find it with a quick search right
> now...)

'Sparse array' is the name for sparse arrays.  I cannot think of one for 
blocks of duplicates.

-- 
Terry Jan Reedy

Terry
12/15/2016 8:48:22 PM
On 12/15/2016 12:48 PM, Terry Reedy wrote:
> On 12/15/2016 12:27 PM, Thomas Nyberg wrote:
>>
>> I haven't dealt with a data structure exactly like this, but it's
>> basically a sparse array.
>
> A sparse array has at least half missing values.  This one has none on
> the defined domain, but contiguous duplicates.
>

I'm sorry for devolving into semantics, but there certainly isn't a 
single definition of "sparse array" out there. For example, the 
definition in wikipedia (https://en.wikipedia.org/wiki/Sparse_array) 
doesn't agree with you:

"In computer science, a sparse array is an array in which most of the 
elements have the default value (usually 0 or null)."

Personally my usage of sparse arrays in scipy has _always_ had all 
defined values it's just that the default value was 0. I never deal with 
"missing" values.

>> (I'm sure it has a name somewhere in the
>> academic literature, but I couldn't find it with a quick search right
>> now...)
>
> 'Sparse array' is the name for sparse arrays.  I cannot think of one for
> blocks of duplicates.
>

Yeah that's why I said "basically a sparse array". It has the same 
defining characteristics. It is a situation where you have a large 
number of keys associated with a small number of values in such a way 
that you can fairly easily describe the association without needing to 
simply write out all pairs. For regular sparse arrays it's easy because 
you associate them to a single value by default and only specify the 
ones that aren't. In this case that doesn't work, but due to the fact 
that you have long runs of repeated values you can use that to encode 
and compress the mapping.

Still not sure what to call it. I like D'Arcy's idea of subclassing a 
dict in combination with Peter's idea of using bisect (I had no idea 
that module existed!) so maybe I'll change my own not so great 
terminology to "basically a sparse map".

Cheers,
Thomas
Thomas
12/15/2016 9:30:20 PM
On Fri, 16 Dec 2016 04:06:25 +1100, Steve D'Aprano
<steve+python@pearwood.info> declaimed the following:

>I have some key:value data where the keys often are found in contiguous
>ranges with identical values. For example:
>
>{1: "foo",
> 2: "foo",
> 3: "foo",
> # same for keys 4 through 99
> 100: "foo",
> 101: "bar",
> 102: "bar",
> 103: "foobar",
> 104: "bar",
> 105: "foo", 
>}
>
>
>So in this case, the keys 1 through 100, plus 105, all have the same value.
>
>I have about a million or two keys, with a few hundred or perhaps a few
>thousand distinct values. The size of each contiguous group of keys with
>the same value can vary from 1 to perhaps a hundred or so.
>
>Has anyone dealt with data like this and could give a recommendation of the
>right data structure to use?
>
>The obvious way is to explicitly list each key, in a regular dict. But I
>wonder whether there are alternatives which may be better (faster, more
>memory efficient)?
>

	Invert the structure...

invert = { "foo" : [1, 2, 3, ..., 100, 105],
		"bar" : [101, 102, 104],
		"foobar" : [103] }

Provide a helper function to look up the value:

def lookup(needle, default=None):
    for value, keys in invert.items():
        if needle in keys:
            return value
    return default


	Presumption is that the total number of "k" entries is much less than
the total number of list entries, so the time lost in sequential search of
the inverted data is regained by not having to examine as many items -- and
the space used might be less.

	You might even consider converting it to a list of lists

invert = [ ["foo", [1, 2, 3, ..., 100, 105]],
		["bar", [101, 102, 104]],
		["foobar", [103]] ]

since you can do a reverse sort on the lists using the length of the
internal list as the criterion. That way, the most likely hits when
searching will appear first. The dictionary doesn't allow for that.
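A sketch of that list-of-lists variant (my addition, in Python 3; the run contents are abbreviated, and I use sets rather than lists so the membership test is O(1)):

```python
# Inverted structure: each entry is [value, keys that map to it].
invert = [
    ["foo", set(range(1, 101)) | {105}],
    ["bar", {101, 102, 104}],
    ["foobar", {103}],
]
# Reverse sort by group size so the most likely hits are checked first.
invert.sort(key=lambda pair: len(pair[1]), reverse=True)

def lookup(needle, default=None):
    # Sequential scan over the few distinct values, not the many keys.
    for value, keys in invert:
        if needle in keys:
            return value
    return default
```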
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

Dennis
12/16/2016 1:41:36 AM
On 12/15/2016 4:30 PM, Thomas Nyberg wrote:
> On 12/15/2016 12:48 PM, Terry Reedy wrote:
>> On 12/15/2016 12:27 PM, Thomas Nyberg wrote:
>>>
>>> I haven't dealt with a data structure exactly like this, but it's
>>> basically a sparse array.
>>
>> A sparse array has at least half missing values.  This one has none on
>> the defined domain, but contiguous duplicates.
>>
>
> I'm sorry for devolving into semantics, but there certainly isn't a
> single definition of "sparse array" out there. For example, the
> definition in wikipedia (https://en.wikipedia.org/wiki/Sparse_array)
> doesn't agree with you:

I think it does ;-).

> "In computer science, a sparse array is an array in which most of the
> elements have the default value (usually 0 or null)."

Let's devolve to a memory-based language like C. A physical array 
consisting of sequential memory locations must have *some* bit pattern 
in every byte and hence in every multibyte block.  If I remember 
correctly, C malloc initializes bytes to all 0 bits, which is an int 
value of 0 also.  If there is no meaningful value, there must be a 
default value that means 'missing'.  0 may mean 'missing'. Null values 
are by definition non-values.  Python uses None as a Null.  If the 
example had been populated largely with Nones, I would have called it 
'sparse'.

 > Personally my usage of sparse arrays in scipy has _always_ had all
 > defined values it's just that the default value was 0. I never deal
 > with "missing" values.

Lucky you.  In statistics and analysis of real, experimental data, 
missing values are often a possibility.  They are always a nuisance, 
sometimes a major one.  When the data are represented by arrays, rather 
than by some compacted form, some value has to be chosen to represent 
the absence of data.

-- 
Terry Jan Reedy

Terry
12/16/2016 7:57:40 AM

On Fri, 16 Dec 2016 04:06 am, Steve D'Aprano wrote:

> I have some key:value data where the keys often are found in contiguous
> ranges with identical values.
[...]


Thank you to everyone who gave suggestions!

I have experimented with two solutions, and hope somebody might be able to
suggest some improvements. Attached is the test code I ran, suggestions for
improving the performance will be appreciated.

I decided on these two data structures:

(1) The naive solution: don't worry about duplicate values, or the fact that
keys come in contiguous groups, and just store each individual key and
value in a dict; this is faster but uses more memory.

(2) The clever solution: use a pair of lists, one holding the starting value
of each group of keys, and the other holding the common values. This saves
a lot of memory, but is slower.

A concrete example might help. Suppose I have 15 keys in five groups:

D = {0: 10,
     1: 20, 2: 20,
     3: 30, 4: 30, 5: 30,
     6: 40, 7: 40, 8: 40, 9: 40,
     10: 50, 11: 50, 12: 50, 13: 50, 14: 50}

(Remember, in reality I could have as many as a million or two keys. This is
just a simple toy example.)

Instead of using a dict, I also set up a pair of lists:

L = [0, 1, 3, 6, 10, 15]  # starting value of each group
V = [10, 20, 30, 40, 50]  # value associated with each group

Note that the first list has one extra item, namely the number one past the
final group.

I can do a key look-up using either of these:

    D[key]

    V[bisect.bisect_right(L, i) - 1]


I tested the memory consumption and speed of these two solutions with
(approximately) one million keys. I found:

- the dict solution uses a lot more memory, about 24 bytes per key, compared
to the pair of lists solution, which is about 0.3 bytes per key;

- but on my computer, dict lookups are about 3-4 times faster.


Any suggestions for improving the speed of the binary search version, or the
memory consumption of the dict?

By the way: the particular pattern of groups in the sample code (groups of
size 1, 2, 3, ... up to 50, then repeating from 1, 2, 3, ... again) is just
a demo. In my real data, the sizes of the groups are all over the place, in
an unpredictable pattern.



Thanks in advance.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

[attachment: dict_vs_binary_search.py]

import bisect
import random
import sys

from timeit import default_timer as timer

def make_keys():
    """Yield (start, end) values suitable for passing to range().

    Distance between start and end increases from 1 to 50, then
    repeats, for a total of 800*25*51 = 1020000 values all together.
    """
    n = 1
    for j in range(800):
        for i in range(50):
            yield (n, n+i+1)
            n += i+1

def populate():
    D = {}
    L = []
    V = []
    value = 1
    last_b = 1
    for a, b in make_keys():
        assert a == last_b
        for i in range(a, b):
            D[i] = value
        L.append(a)
        V.append(value)
        last_b = b
    L.append(last_b)
    return (D, L, V)


class Stopwatch:
    """Context manager for timing long running chunks of code."""

    def __enter__(self):
        self._start = timer()
        return self

    def __exit__(self, *args):
        elapsed = timer() - self._start
        del self._start
        if elapsed < 0.01:
            print("Elapsed time is very small, consider using timeit instead.")
        print('time taken: %f seconds' % elapsed)



D, L, V = populate()
assert len(D) == 800*25*51
assert len(L) == len(V) + 1
print("size of dict", sys.getsizeof(D))
print("size of two lists", sys.getsizeof(L) + sys.getsizeof(V))


# Confirm that values are the same whether using dict lookup
# or binary search.
for i in range(1, 800*25*51 + 1):
    index = bisect.bisect_right(L, i) - 1
    assert D[i] == V[index]


# Simulate a large number of lookups in random order.
nums = list(range(1, 800*25*51 + 1))
random.shuffle(nums)

with Stopwatch():
    for i in nums:
        x = D[i]

with Stopwatch():
    for i in nums:
        x = V[bisect.bisect_right(L, i) - 1]




Steve
12/16/2016 2:27:43 PM
On 12/15/2016 11:57 PM, Terry Reedy wrote:
> On 12/15/2016 4:30 PM, Thomas Nyberg wrote:
>> On 12/15/2016 12:48 PM, Terry Reedy wrote:
>>> A sparse array has at least half missing values.  This one has none on
>>> the defined domain, but contiguous duplicates.
>>>
>>
>> I'm sorry for devolving into semantics, but there certainly isn't a
>> single definition of "sparse array" out there. For example, the
>> definition in wikipedia (https://en.wikipedia.org/wiki/Sparse_array)
>> doesn't agree with you:
>
> I think it does ;-).
>
>> "In computer science, a sparse array is an array in which most of the
>> elements have the default value (usually 0 or null)."
>
> ...
 >
>> Personally my usage of sparse arrays in scipy has _always_ had all
>> defined values it's just that the default value was 0. I never deal
>> with "missing" values.
>
> Lucky you.  In statistics and analysis of real, experimental data,
> missing values are often a possibility.  They are always a nuisance,
> sometimes a major one.  When the data are represented by arrays,  rather
> than by some compacted form, some value has to be chosen to represent
> the absence of of data.
>

I thought that the definitions were different because yours said "at 
least half" and the other said "most" in addition to your saying 
"missing values" and the other said "default value". Ignoring the first 
minor point, if by "missing value" you basically mean "default value", 
then yes I agree that the definition is the same. Also I didn't mean 
that I was "lucky" to never have dealt with missing values (strictly 
speaking I have, I just do it rarely), but my point was that I deal with 
sparse matricies/arrays/structures quite often, but I rarely deal with 
"missing values" like nulls/nans/Nones. But that point is moot given 
that you apparently meant the same thing with "missing value" as I did 
with "default value" so I don't think we disagree here.

Taking a step back, the real point of sparse anything is: can you 
represent it in a space/speed efficient manner which is "stable" under 
whatever operations you care about. I.e. if you have a default value of 
0, then you have the property that 0 + x = x and 0 * x = 0 which means 
that sparse matrices with default value 0 are stable under addition and 
multiplication. If you have a default value of nan (or None or null) you 
usually have the convention that nan * x = nan (possibly depends on the 
sign of x) and nan + x = nan which means that a sparse matrix with nans 
as default is also stable under addition and multiplication. If you 
chose a default value of (say) 1 you would run into issues with the 
stability of these operations. It wouldn't mean it's not "sparse", but 
it would mean that the operations you care about might not work as well.

The extension to the thread at hand is just that now you have multiple 
default values, and the fact that they are not assigned at random but 
are instead runs of constant values means that you can put a sparse 
structure on this (with a suitable definition of "sparse").

 > Let's devolve to a memory-based language like C. An physical array
 > consisting of sequential memory locations must have *some* bit pattern
 > in every byte and hence in every multibyte block.  If I remember
 > correctly, C malloc initialized bytes to all 0 bits, which is an int
 > value of 0 also.  If there is no meaningful value, there must be a
 > default value that means 'missing'.  0 may mean 'missing'. Null values
 > are by definition non-values.  Python uses None as a Null.  If the
 > example had been populated largely with Nones, I would have called it
 > 'sparse'.

It's not really super pertinent to this discussion, but malloc does not 
guarantee that the values are zeroed. That is guaranteed by calloc though:

	http://man7.org/linux/man-pages/man3/malloc.3.html
	http://man7.org/linux/man-pages/man3/calloc.3p.html

Also the point with a sparse representation is that your default value 
wouldn't exist in memory anywhere and that instead the operations would 
understand that it exists by other factors. For example, a sparse matrix 
with all 0s might* be represented by rows of the form

	i,j,v

where i and j are the row and column indices and v is the value at the 
position. So in this representation we would have as many rows as we 
would non-default values.

*I say might because there are usually more compact forms like this. 
This isn't the internal representation of scipy.sparse.csr_matrix, for 
example.

Regardless, I think I wrote too much to say basically that I don't think 
we're really disagreeing except possibly slightly on perspective.

Cheers,
Thomas
Thomas
12/16/2016 6:04:05 PM
On Thursday, 15 December 2016 17:06:39 UTC, Steve D'Aprano  wrote:
> I have some key:value data where the keys often are found in contiguous
> ranges with identical values. For example:
> 
> {1: "foo",
>  2: "foo",
>  3: "foo",
>  # same for keys 4 through 99
>  100: "foo",
>  101: "bar",
>  102: "bar",
>  103: "foobar",
>  104: "bar",
>  105: "foo", 
> }
> ...
> -- 
> Steve

All the answers seem to rely on in-memory solutions.  But isn't the problem a classic data design problem (cf Codd) with two tables?

CREATE TABLE keys   (id    INTEGER NOT NULL PRIMARY KEY,
                     kkey  INTEGER,
                     UNIQUE (kkey) );
                        ## eg id = 999, kkey=101
CREATE TABLE "values" (id  INTEGER NOT NULL PRIMARY KEY,
                     k_id  INTEGER,
                     value VARCHAR,
                     UNIQUE (k_id, value),
                     FOREIGN KEY (k_id) REFERENCES keys(id));
                        ## eg k_id = 999, value = "bar"

For example, Python/SQLite can parse the list of key:value pairs.  key is looked up in keys -- and a keys row is added if the key is new.  The keys id is saved.

k_id--value pair is looked up -- and a row is added if the pair is new.

Some of this can be simplified by relying on SQL to handle non-UNIQUE errors.
This approach may be slower than in-memory processing, but it has almost no database size limits.
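A rough sqlite3 sketch of that design (my addition; the helper names are illustrative, and "values" is quoted in the SQL because VALUES is an SQL keyword):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE keys     (id    INTEGER NOT NULL PRIMARY KEY,
                       kkey  INTEGER,
                       UNIQUE (kkey));
CREATE TABLE "values" (id    INTEGER NOT NULL PRIMARY KEY,
                       k_id  INTEGER,
                       value VARCHAR,
                       UNIQUE (k_id, value),
                       FOREIGN KEY (k_id) REFERENCES keys(id));
""")

def insert(key, value):
    # Add a keys row only if the key is new, then attach the value.
    con.execute("INSERT OR IGNORE INTO keys (kkey) VALUES (?)", (key,))
    (k_id,) = con.execute("SELECT id FROM keys WHERE kkey = ?",
                          (key,)).fetchone()
    con.execute('INSERT OR IGNORE INTO "values" (k_id, value) VALUES (?, ?)',
                (k_id, value))

def lookup(key, default=None):
    row = con.execute(
        'SELECT value FROM "values" JOIN keys ON k_id = keys.id'
        ' WHERE kkey = ?', (key,)).fetchone()
    return row[0] if row else default

for k, v in [(1, "foo"), (101, "bar"), (103, "foobar")]:
    insert(k, v)
```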

mbg1708
12/16/2016 6:39:34 PM
On Fri, Dec 16, 2016 at 4:06 AM, Steve D'Aprano
<steve+python@pearwood.info> wrote:
> I have about a million or two keys, with a few hundred or perhaps a few
> thousand distinct values. The size of each contiguous group of keys with
> the same value can vary from 1 to perhaps a hundred or so.

Going right back to the beginning here: I think that "a million or
two" is a perfectly doable figure for a straight-forward list or dict.
You get immediately-available lookups in a straight-forward way, at
the cost of maybe 16MB of memory (if you use the same strings for the
values, the cost is just the pointers).

ChrisA
Chris
12/16/2016 9:28:29 PM
On 16/12/2016 14:27, Steve D'Aprano wrote:
> (2) The clever solution: use a pair of lists, one holding the starting value
> of each group of keys, and the other holding the common values. This saves
> a lot of memory, but is slower.
>
> A concrete example might help. Suppose I have 15 keys in five groups:
>
> D = {0: 10,
>       1: 20, 2: 20,
>       3: 30, 4: 30, 5: 30,
>       6: 40, 7: 40, 8: 40, 9: 40,
>       10: 50, 11: 50, 12: 50, 13: 50, 14: 50}
>
> (Remember, in reality I could have as many as a million or two keys. This is
> just a simple toy example.)
>
> Instead of using a dict, I also set up a pair of lists:
>
> L = [0, 1, 3, 6, 10, 15]  # starting value of each group
> V = [10, 20, 30, 40, 50]  # value associated with each group
I misread the problem (skipped the comment about keys 4 - 99) and 
assumed there might be gaps between the contiguous blocks so thought of 
the list structure
[  ( (firstKeyN, lastKeyN), "value" ),
     ...
]
At the cost of more memory, keeping first and last key numbers in tuples 
in the L list would mean there is only one L lookup at the expense of 
two additional tuple lookups. Are small tuple lookups quicker than one 
large list lookup? If not, stick with the simple L and V you show above.

I've never used any form of tree structure, perhaps someone else could 
comment on the performance of balanced trees as compared to simple 
lists. Would the insertion cost be too high in keeping the tree balanced?

As to speeding up access with only L and V lists the binary search must 
be optimal unless specialised knowledge about the distribution of the 
size of the contiguous groups is made use of. But you say there is no 
pattern to the group sizes.

Another thought is to have a smaller pre-index list, and search this to 
find the range of L indexes the key is in. If the pre-index list had a 
1000 entries then each entry would cover 1/1000 of the L list, which 
narrows the binary search space in L considerably. The cost of something 
like this is keeping the pre-index list up to date when new keys are 
added, and extra time to code and test it.
preList struct [ (firstKeyN, lastKeyN), (firstLindex, lastLindex),
                 ...
               ]
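A rough sketch of that two-level search (my addition; STEP and the toy data are illustrative). Note that bisect over the full list is already O(log n), so any gain here would mainly come from the small pre-index staying hot in cache:

```python
import bisect

L = list(range(0, 1_000_000, 7))   # toy run-start list (sorted)
V = list(range(len(L)))            # toy value per run

STEP = 1000
# Pre-index: every 1000th run start, so the main bisect is confined
# to a slice of at most 1000 entries of L.
pre = L[::STEP]

def lookup(key):
    block = bisect.bisect_right(pre, key) - 1        # which slice of L
    lo = block * STEP
    hi = min(lo + STEP, len(L))
    index = bisect.bisect_right(L, key, lo, hi) - 1  # narrowed search
    return V[index]
```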

Reminds me of Jackson's first two rules on optimisation: 1 - don't do 
it, 2 - don't do it yet.
Thanks for an interesting problem.
> Note that the first list has one extra item, namely the number one past the
> final group.
>
> I can do a key look-up using either of these:
>
>      D[key]
>
>      V[bisect.bisect_right(L, i) - 1]
>
>
> I tested the memory consumption and speed of these two solutions with
> (approximately) one million keys. I found:
>
> - the dict solution uses a lot more memory, about 24 bytes per key, compared
> to the pair of lists solution, which is about 0.3 bytes per key;
>
> - but on my computer, dict lookups are about 3-4 times faster.
>
>
> Any suggestions for improving the speed of the binary search version, or the
> memory consumption of the dict?
>
> By the way: the particular pattern of groups in the sample code (groups of
> size 1, 2, 3, ... up to 50, then repeating from 1, 2, 3, ... again) is just
> demo. In my real data, the sizes of the groups are all over the place, in
> an unpredictable pattern.
>
>
>
> Thanks in advance.
>
>

John
12/16/2016 11:15:11 PM
Steve D'Aprano wrote:

> I have experimented with two solutions, and hope somebody might be able to
> suggest some improvements. Attached is the test code I ran, suggestions
> for improving the performance will be appreciated.

If there was an attachment to this text -- that didn't make it to the 
mailing list or the news group.

> I decided on these two data structures:
> 
> (1) The naive solution: don't worry about duplicate values, or the fact
> that keys come in contiguous groups, and just store each individual key
> and value in a dict; this is faster but uses more memory.
> 
> (2) The clever solution: use a pair of lists, one holding the starting
> value of each group of keys, and the other holding the common values. This
> saves a lot of memory, but is slower.
> 
> A concrete example might help. Suppose I have 15 keys in five groups:
> 
> D = {0: 10,
>      1: 20, 2: 20,
>      3: 30, 4: 30, 5: 30,
>      6: 40, 7: 40, 8: 40, 9: 40,
>      10: 50, 11: 50, 12: 50, 13: 50, 14: 50}
> 
> (Remember, in reality I could have as many as a million or two keys. This
> is just a simple toy example.)
> 
> Instead of using a dict, I also set up a pair of lists:
> 
> L = [0, 1, 3, 6, 10, 15]  # starting value of each group
> V = [10, 20, 30, 40, 50]  # value associated with each group
> 
> Note that the first list has one extra item, namely the number one past
> the final group.
> 
> I can do a key look-up using either of these:
> 
>     D[key]
> 
>     V[bisect.bisect_right(L, i) - 1]
> 
> 
> I tested the memory consumption and speed of these two solutions with
> (approximately) one million keys. I found:
> 
> - the dict solution uses a lot more memory, about 24 bytes per key,
> compared to the pair of lists solution, which is about 0.3 bytes per key;
> 
> - but on my computer, dict lookups are about 3-4 times faster.

Only three to four times? You basically get that from a no-op function call:

$ python3 -m timeit -s 'd = {1: 2}; k = 1' 'd[k]'
10000000 loops, best of 3: 0.0935 usec per loop
$ python3 -m timeit -s 'd = {1: 2}; k = 1' 'd[int(k)]'
1000000 loops, best of 3: 0.304 usec per loop

Even adding a dummy value to V to go from

>     V[bisect.bisect_right(L, i) - 1]

to

V[bisect.bisect_right(L, i)]

might be visible in your benchmark.

> Any suggestions for improving the speed of the binary search version, or
> the memory consumption of the dict?

Depending on the usage pattern you might try bisect combined with a LRU 
cache. 
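For instance (my sketch, not from the post; using Steve's toy L and V from his follow-up, functools.lru_cache memoizes the bisect lookup for repeated keys, trading some memory back for speed):

```python
import bisect
from functools import lru_cache

L = [0, 1, 3, 6, 10, 15]   # starting key of each group (toy data)
V = [10, 20, 30, 40, 50]   # value associated with each group

@lru_cache(maxsize=2**16)
def lookup(key):
    # Cached binary search; repeated keys skip the bisect entirely.
    return V[bisect.bisect_right(L, key) - 1]
```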

If runs of four or more nearby keys are common you can remember the current 
span:

# sketch; assumes L[0] <= key < L[-1], and note that after
# bisect_right we have L[index - 1] <= key < L[index]
start = stop = 0
for key in data:
    if start <= key < stop:
        pass  # reuse cached value
    else:
        index = bisect.bisect_right(L, key)
        start, stop = L[index - 1:index + 1]
        value = V[index - 1]

    # use value

(You might also fill V with (value, start, stop) tuples))

> By the way: the particular pattern of groups in the sample code (groups of
> size 1, 2, 3, ... up to 50, then repeating from 1, 2, 3, ... again) is
> just demo. In my real data, the sizes of the groups are all over the
> place, in an unpredictable pattern.
> 
> 
> 
> Thanks in advance.
> 
> 


Peter
12/17/2016 9:31:40 AM
On Sat, 17 Dec 2016 08:31 pm, Peter Otten wrote:

> Steve D'Aprano wrote:
> 
>> I have experimented with two solutions, and hope somebody might be able
>> to suggest some improvements. Attached is the test code I ran,
>> suggestions for improving the performance will be appreciated.
> 
> If there was an attachment to this text -- that didn't make it to the
> mailing list or the news group.

I see it on comp.lang.python but not the mailing list or gmane. That's
annoying because it's a *text* attachment and should be allowed. I'll attach
it again, inline this time, at the end of this post.


[...]
>> I tested the memory consumption and speed of these two solutions with
>> (approximately) one million keys. I found:
>> 
>> - the dict solution uses a lot more memory, about 24 bytes per key,
>> compared to the pair of lists solution, which is about 0.3 bytes per key;
>> 
>> - but on my computer, dict lookups are about 3-4 times faster.
> 
> Only three to four times? You basically get that from a no-op function
> call:
[...]
> Even adding a dummy value to V to go from
> 
>>     V[bisect.bisect_right(L, i) - 1]
> 
> to
> 
> V[bisect.bisect_right(L, i)]
> 
> might be visible in your benchmark.

Good thinking! I'll try that.

And... it doesn't appear to make any real difference.


>> Any suggestions for improving the speed of the binary search version, or
>> the memory consumption of the dict?
> 
> Depending on the usage pattern you might try bisect combined with a LRU
> cache.

Adding a cache will probably eat up the memory savings, but I may play
around with that.

And here's the code, this time inline. I've added the dummy value to the
values list V as suggested.

# --- cut ---

import bisect
import random
import sys

from timeit import default_timer as timer

def make_keys():
    """Yield (start, end) values suitable for passing to range().

    Distance between start and end increases from 1 to 50, then
    repeats, for a total of 800*25*51 = 1020000 values all together.
    """
    n = 1
    for j in range(800):
        for i in range(50):
            yield (n, n+i+1)
            n += i+1

def populate():
    D = {}
    L = []
    V = [None]
    value = 1
    last_b = 1
    for a, b in make_keys():
        assert a == last_b
        for i in range(a, b):
            D[i] = value
        L.append(a)
        V.append(value)
        last_b = b
    L.append(last_b)
    return (D, L, V)


class Stopwatch:
    """Context manager for timing long running chunks of code."""

    def __enter__(self):
        self._start = timer()
        return self

    def __exit__(self, *args):
        elapsed = timer() - self._start
        del self._start
        if elapsed < 0.01:
            print("Elapsed time is very small, consider using timeit instead.")
        print('time taken: %f seconds' % elapsed)



D, L, V = populate()
assert len(D) == 800*25*51
assert len(L) == len(V)
print("size of dict", sys.getsizeof(D))
print("size of two lists", sys.getsizeof(L) + sys.getsizeof(V))


# Confirm that values are the same whether using dict lookup
# or binary search.
for i in range(1, 800*25*51 + 1):
    index = bisect.bisect_right(L, i)
    assert D[i] == V[index]


# Simulate a large number of lookups in random order.
nums = list(range(1, 800*25*51 + 1))
random.shuffle(nums)

with Stopwatch():
    for i in nums:
        x = D[i]

with Stopwatch():
    for i in nums:
        x = V[bisect.bisect_right(L, i)]

# --- cut ---



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve
12/17/2016 11:24:25 AM