


Cache memory and its effect on list searching

Alternatively...why you should definitely use binary searches:

Python 3.5.2+ (default, Aug 30 2016, 19:08:42)
[GCC 6.2.0 20160822] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import timeit
>>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
>>> sortedhashes = sorted(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
1.9478233020054176
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
18.001392804995703

I thought this was curious behavior.  I created a list of random-looking strings, then made a sorted copy.  I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!

At first I was puzzled: both lists are the same size, and the 'in' test is a linear search, so shouldn't they take the same amount of time?  Even if there were some trickery with branch prediction happening, that would have benefited the sorted list.  Then I remembered how lists work in Python.  The original list is going to be mostly contiguous in memory, making the memory cache quite effective.  When I create the sorted copy, I'm creating a list of references to strings that are scattered all over memory, causing tons of cache misses.
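
A rough way to convince yourself of this in CPython (where id() happens to be the object's address) is to compare how far apart consecutive elements' addresses are in each list.  This is only a sketch, reusing hashes and sortedhashes from the session above:

def mean_gap(lst):
    # Average distance between the addresses of consecutive elements.
    # Bigger gaps mean a linear scan keeps jumping around in memory.
    return sum(abs(id(b) - id(a)) for a, b in zip(lst, lst[1:])) / (len(lst) - 1)

print('original:   ', mean_gap(hashes))        # roughly allocation order, smaller gaps
print('sorted copy:', mean_gap(sortedhashes))  # references scattered, much larger gaps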

Of course, the best solution was to implement a binary search.  That turned the membership check into a 300-700 microsecond operation.
sohcahtoa82
12/17/2016 2:20:00 AM

On Sat, Dec 17, 2016 at 1:20 PM,  <sohcahtoa82@gmail.com> wrote:
> I thought this was curious behavior.  I created a list of random-looking strings, then made a sorted copy.  I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!
>

My numbers replicate yours (though my computer's faster). But my
conclusion is different:

Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47)
[GCC 6.2.0 20161027] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import timeit
>>> hashes =  [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
>>> sortedhashes = sorted(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
0.8167107938788831
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
5.029693723190576
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
0.855183657258749
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
5.0585526106879115
>>> sethashes = set(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)
6.13601878285408e-06

You want containment checks? Use a set :)

ChrisA
Chris
12/17/2016 2:27:01 AM
On Friday, December 16, 2016 at 6:27:24 PM UTC-8, Chris Angelico wrote:
> On Sat, Dec 17, 2016 at 1:20 PM,  <sohcahtoa82@gmail.com> wrote:
> > I thought this was curious behavior.  I created a list of random-looking strings, then made a sorted copy.  I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!
> >
> 
> My numbers replicate yours (though my computer's faster). But my
> conclusion is different:
> 
> Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47)
> [GCC 6.2.0 20161027] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import hashlib
> >>> import timeit
> >>> hashes =  [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
> >>> sortedhashes = sorted(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.8167107938788831
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.029693723190576
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.855183657258749
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.0585526106879115
> >>> sethashes = set(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)
> 6.13601878285408e-06
> 
> You want containment checks? Use a set :)
> 
> ChrisA

Ah, I forgot about the optimizations of a set.  The only time I ever find myself using set is when I want to remove all the duplicates in a list.  I convert it to a set and then back.
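
(For the dedupe case I mean something like this -- the OrderedDict variant is just a hypothetical extra if you also need to keep the original order:)

from collections import OrderedDict

items = ['b', 'a', 'b', 'c', 'a']
unique = list(set(items))                           # fast, but the order is lost
unique_ordered = list(OrderedDict.fromkeys(items))  # keeps first-seen order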

For fun, I ran an experiment:

### BEGIN listsearch.py
import hashlib 
import timeit
import sys

from bisect import bisect_left

def bisect_search(a, x, lo=0, hi=None):   # can't use a to specify default for hi
    # From http://stackoverflow.com/questions/212358/binary-search-bisection-in-python
    hi = hi if hi is not None else len(a) # hi defaults to len(a)
    pos = bisect_left(a,x,lo,hi)          # find insertion position
    return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end

def bin_search(haystack, needle):
    start = 0
    end = len(haystack) - 1
    while True:
        if start > end:
            return False
        middle = (start + end) // 2
        if needle < haystack[middle]:
            end = middle - 1
            continue
        elif needle > haystack[middle]:
            start = middle + 1
            continue
        elif needle == haystack[middle]:
            return True

print('Python version: ', sys.version_info)
print('building hashes...')
hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
print('sorting...')
sortedhashes = sorted(hashes)
print('creating set...')
sethashes = set(hashes)
print('Unsorted list:', timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10))
print('Sorted:', timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10))
print('set:', timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10))
print('binary search:',  timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bin_search}, number=10))
print('binary search with bisect:',  timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bisect_search}, number=10))
### END listsearch.py

> python3 listsearch.py
Python version:  sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
building hashes...
sorting...
creating set...
Unsorted list: 1.7384763684627569
Sorted: 9.248799958145042
set: 1.4614161294446149e-06
binary search: 0.00010902164328108199
binary search with bisect: 1.7829276782066472e-05

Yup.  set is definitely the way to go!
sohcahtoa82
12/17/2016 4:44:12 AM
On Sat, Dec 17, 2016 at 3:44 PM,  <sohcahtoa82@gmail.com> wrote:
>> python3 listsearch.py
> Python version:  sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
> building hashes...
> sorting...
> creating set...
> Unsorted list: 1.7384763684627569
> Sorted: 9.248799958145042
> set: 1.4614161294446149e-06
> binary search: 0.00010902164328108199
> binary search with bisect: 1.7829276782066472e-05
>
> Yup.  set is definitely the way to go!

More than that: the lists are searched in linear time, the binary
search runs in logarithmic time, but the set lookup is constant time.
Doesn't matter how big your set is, you can test for membership with
one hash lookup.

ChrisA
Chris
12/17/2016 4:55:01 AM
On Sat, 17 Dec 2016 03:55 pm, Chris Angelico wrote:

> More than that: the lists are searched in linear time, the binary
> search runs in logarithmic time, but the set lookup is constant time.
> Doesn't matter how big your set is, you can test for membership with
> one hash lookup.

To be pedantic: it is constant time on average.

Best case is one hash and one equality test. Worst case is if all the set
elements have colliding hashes, in which case it degenerates to a linear
search.
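
A contrived illustration of that worst case (a made-up class, not
something you'd ever write outside a demo):

class BadHash:
    # Every instance hashes to the same value, so a set of these
    # degenerates into one long collision chain scanned linearly.
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

s = {BadHash(i) for i in range(1000)}
print(BadHash(999) in s)   # True, but found via ~1000 equality tests, not one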

There is a class of denial of service attacks where the attacker can specify
the keys in a dict in such a way that lookups collide, and can push new
colliding items into the dictionary (or set) faster than Python can perform
the lookups. That's why Python now has hash randomisation:

http://bugs.python.org/issue13703
https://www.python.org/dev/peps/pep-0456/
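
(You can see the randomisation directly, assuming PYTHONHASHSEED isn't
pinned in your environment: running

    python3 -c 'print(hash("some string"))'

twice prints two different numbers, because each interpreter process
picks a fresh random seed at startup.  Within a single process the hash
is stable, which is all that dicts and sets need.)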


But outside of contrived or hostile examples, where elements of the set are
designed to collide, typically you would expect hash collisions to be rare,
and long chains of colliding elements even rarer.  For random elements,
most will require only a single hash and a single equality test, a small
number might require two tests, three would be even rarer, and so forth.

So strictly speaking, and I realise I'm being exceedingly pedantic here, a
*sufficiently large* dict or set MUST have colliding elements: hash values
are fixed-width integers, so by the pigeonhole principle any set with more
elements than there are distinct hash values has to contain collisions. How
large is "sufficiently large"? It's at least 2**31, more than two billion, so
while you are right that *in practice* set/dict lookups require only a
single hash + equality, in principle (and sometimes in practice too)
collisions can be significant.

Nevertheless, I mention this only for completeness. In practice, you almost
never need to care about hash collisions except for the most pathological
cases.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve
12/17/2016 6:12:03 AM
On Sat, Dec 17, 2016 at 5:12 PM, Steve D'Aprano
<steve+python@pearwood.info> wrote:
> On Sat, 17 Dec 2016 03:55 pm, Chris Angelico wrote:
>
>> More than that: the lists are searched in linear time, the binary
>> search runs in logarithmic time, but the set lookup is constant time.
>> Doesn't matter how big your set is, you can test for membership with
>> one hash lookup.
>
> To be pedantic: it is constant time on average.

True. And technically, you could say that a linear or binary search
has a best-case of constant time (if you find it right at the
beginning or middle of the list). I was talking about the average.

> But outside of contrived or hostile examples, where elements of the set are
> designed to collide, typically you would expect hash collisions to be rare,
> and long chains of colliding elements even rarer.  For random elements,
> most will require only a single hash and a single equality test, a small
> number might require two tests, three would be even rarer, and so forth.

Yep. Normal case, it's going to be one or two tests - and that doesn't
depend on the length of the list. The degenerate case does, but the
normal case doesn't.

> So strictly speaking, and I realise I'm being exceedingly pedantic here, a
> *sufficiently large* dict or set MUST have colliding elements. How large
> is "sufficiently large"? Its at least 2**31, more than two billion, so
> while you are right that *in practice* set/dict lookups require only a
> single hash + equality, in principle (and sometimes in practice too)
> collisions can be significant.

That assumes the hash has a finite size. If you have enough memory to
store that many unique objects, you can probably afford to use a
hashing algorithm that allows enough room. However, the chances of not
having collisions depend on the capacity. And the chances of having no
collisions at all are pretty low... as a rule of thumb, I estimate on
a 50-50 chance of a collision when capacity is the square of usage. So
if you have a thousand entries but capacity for a million, you have a
50% chance of having at least one collision. (The numbers aren't quite
right, but it's a good rule of thumb.)
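
For anyone who wants the actual numbers, the standard birthday-problem
approximation gets close enough -- just a back-of-the-envelope sketch,
nothing from CPython itself:

import math

def p_collision(n, m):
    # Probability of at least one collision when n items hash uniformly
    # into m possible slots (birthday-problem approximation).
    return 1 - math.exp(-n * (n - 1) / (2 * m))

print(p_collision(1000, 1000000))   # ~0.39 for a thousand entries, capacity a million

About 39%, so the 50-50 rule of thumb above is in the right ballpark.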

> Nevertheless, I mention this only for completeness. In practice, you almost
> never need to care about hash collisions except for the most pathological
> cases.

Indeed. That's why hash tables are so prevalent, despite their
appalling worst-case.

ChrisA
Chris
12/17/2016 6:34:03 AM