Alternatively...why you should definitely use binary searches:

Python 3.5.2+ (default, Aug 30 2016, 19:08:42)
[GCC 6.2.0 20160822] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import timeit
>>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
>>> sortedhashes = sorted(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
1.9478233020054176
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
18.001392804995703

I thought this was curious behavior. I created a list of random-looking strings, then made a sorted copy. I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!

At first, I thought: since both lists are the same size, and the 'in' test is a linear search, shouldn't they take the same amount of time? Even if there were some trickery with branch prediction happening, that would have benefited the sorted list. Then I remembered how lists work in Python. The original list is mostly contiguous in memory, making the memory cache quite effective. When I create the sorted copy, I'm creating a list of references to strings that are scattered all over memory, causing tons of cache misses.

Of course, the best solution was to implement a binary search. That turned the membership check into a 300-700 microsecond operation.


12/17/2016 2:20:00 AM
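The binary search mentioned above can be done with the standard library's bisect module; here is a minimal sketch (the helper name in_sorted is an illustrative choice, not from the post; the thread's own, fuller benchmark appears further down):

from bisect import bisect_left

def in_sorted(haystack, needle):
    # O(log n) membership test; assumes haystack is already sorted.
    pos = bisect_left(haystack, needle)  # leftmost insertion point
    return pos < len(haystack) and haystack[pos] == needle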

On Sat, Dec 17, 2016 at 1:20 PM, <sohcahtoa82@gmail.com> wrote:
> I thought this was curious behavior. I created a list of random-looking strings, then made a sorted copy. I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!

My numbers replicate yours (though my computer's faster). But my conclusion is different:

Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47)
[GCC 6.2.0 20161027] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import timeit
>>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
>>> sortedhashes = sorted(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
0.8167107938788831
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
5.029693723190576
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
0.855183657258749
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
5.0585526106879115
>>> sethashes = set(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)
6.13601878285408e-06

You want containment checks? Use a set :)

ChrisA


12/17/2016 2:27:01 AM
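One caveat the numbers above gloss over: building the set is itself a linear pass over the data, so it only pays off when amortized over repeated lookups. A rough sketch of how one might measure that trade-off (the sizes and names here are illustrative, not from the thread):

import timeit

# The one-time set build is O(n); each later lookup is, on average,
# constant time, so repeated membership tests amortize the build cost.
items = [str(i) for i in range(10**6)]

build = timeit.timeit('set(items)', globals={'items': items}, number=1)
lookup = timeit.timeit('"x" in s', globals={'s': set(items)}, number=1000)
print('build once: %.3fs, 1000 lookups: %.6fs' % (build, lookup))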

On Friday, December 16, 2016 at 6:27:24 PM UTC-8, Chris Angelico wrote:
> On Sat, Dec 17, 2016 at 1:20 PM, <sohcahtoa82@gmail.com> wrote:
> > I thought this was curious behavior. I created a list of random-looking strings, then made a sorted copy. I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!
>
> My numbers replicate yours (though my computer's faster). But my
> conclusion is different:
>
> Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47)
> [GCC 6.2.0 20161027] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import hashlib
> >>> import timeit
> >>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
> >>> sortedhashes = sorted(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.8167107938788831
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.029693723190576
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.855183657258749
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.0585526106879115
> >>> sethashes = set(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)
> 6.13601878285408e-06
>
> You want containment checks? Use a set :)
>
> ChrisA

Ah, I forgot about the optimizations of a set. The only time I ever find myself using a set is when I want to remove all the duplicates in a list. I convert it to a set and then back.

For fun, I ran an experiment:

### BEGIN listsearch.py
import hashlib
import timeit
import sys
from bisect import bisect_left

def bisect_search(a, x, lo=0, hi=None):  # can't use a to specify default for hi
    # From http://stackoverflow.com/questions/212358/binary-search-bisection-in-python
    hi = hi if hi is not None else len(a)  # hi defaults to len(a)
    pos = bisect_left(a, x, lo, hi)  # find insertion position
    return (pos if pos != hi and a[pos] == x else -1)  # don't walk off the end

def bin_search(haystack, needle):
    # Hand-rolled binary search; assumes haystack is sorted.
    start = 0
    end = len(haystack) - 1
    while start <= end:
        middle = (start + end) // 2
        if needle < haystack[middle]:
            end = middle - 1     # search the lower half
        elif needle > haystack[middle]:
            start = middle + 1   # search the upper half
        else:
            return True          # found it
    return False

print('Python version: ', sys.version_info)
print('building hashes...')
hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
print('sorting...')
sortedhashes = sorted(hashes)
print('creating set...')
sethashes = set(hashes)
print('Unsorted list:', timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10))
print('Sorted:', timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10))
print('set:', timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10))
print('binary search:', timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bin_search}, number=10))
print('binary search with bisect:', timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bisect_search}, number=10))
### END listsearch.py

> python3 listsearch.py
Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
building hashes...
sorting...
creating set...
Unsorted list: 1.7384763684627569
Sorted: 9.248799958145042
set: 1.4614161294446149e-06
binary search: 0.00010902164328108199
binary search with bisect: 1.7829276782066472e-05

Yup. set is definitely the way to go!


12/17/2016 4:44:12 AM
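A side note on the dedup idiom mentioned above ("convert it to a set and then back"): the round trip does not preserve the list's original order. When order matters, dict.fromkeys is a common order-preserving alternative (insertion order is an implementation detail in CPython 3.6 and guaranteed from Python 3.7). A minimal sketch:

# Deduplicate while preserving first-seen order (guaranteed from Python 3.7,
# where dict preserves insertion order).
items = ['b', 'a', 'b', 'c', 'a']

print(list(set(items)))            # order is arbitrary, e.g. ['c', 'a', 'b']
print(list(dict.fromkeys(items)))  # ['b', 'a', 'c'] - first-seen order kept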

On Sat, Dec 17, 2016 at 3:44 PM, <sohcahtoa82@gmail.com> wrote:
>> python3 listsearch.py
> Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
> building hashes...
> sorting...
> creating set...
> Unsorted list: 1.7384763684627569
> Sorted: 9.248799958145042
> set: 1.4614161294446149e-06
> binary search: 0.00010902164328108199
> binary search with bisect: 1.7829276782066472e-05
>
> Yup. set is definitely the way to go!

More than that: the lists are searched in linear time, the binary search runs in logarithmic time, but the set lookup is constant time. Doesn't matter how big your set is, you can test for membership with one hash lookup.

ChrisA


12/17/2016 4:55:01 AM
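To see those growth rates in practice, a small illustrative sketch (the sizes are arbitrary): as n doubles, the list 'in' time should roughly double, while the set time stays flat.

import timeit

# List membership grows linearly with n; set membership stays roughly flat.
for n in (10**5, 2 * 10**5, 4 * 10**5):
    data = [str(i) for i in range(n)]
    as_set = set(data)
    t_list = timeit.timeit('"x" in d', globals={'d': data}, number=10)
    t_set = timeit.timeit('"x" in d', globals={'d': as_set}, number=10)
    print('n=%d: list %.4fs, set %.2es' % (n, t_list, t_set))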

On Sat, 17 Dec 2016 03:55 pm, Chris Angelico wrote:

> More than that: the lists are searched in linear time, the binary
> search runs in logarithmic time, but the set lookup is constant time.
> Doesn't matter how big your set is, you can test for membership with
> one hash lookup.

To be pedantic: it is constant time on average. Best case is one hash and one equality test. Worst case is if all the set elements have colliding hashes, in which case it degenerates to a linear search.

There is a class of denial of service attacks where the attacker can specify the keys in a dict in such a way that lookups collide, and can push new colliding items into the dictionary (or set) faster than Python can perform the lookups. That's why Python now has hash randomisation:

http://bugs.python.org/issue13703
https://www.python.org/dev/peps/pep-0456/

But outside of contrived or hostile examples, where elements of the set are designed to collide, typically you would expect hash collisions to be rare, and long chains of colliding elements even rarer. For random elements, most will require only a single hash and a single equality test, a small number might require two tests, three would be even rarer, and so forth.

So strictly speaking, and I realise I'm being exceedingly pedantic here, a *sufficiently large* dict or set MUST have colliding elements. How large is "sufficiently large"? It's at least 2**31, more than two billion, so while you are right that *in practice* set/dict lookups require only a single hash + equality, in principle (and sometimes in practice too) collisions can be significant.

Nevertheless, I mention this only for completeness. In practice, you almost never need to care about hash collisions except for the most pathological cases.

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse.


12/17/2016 6:12:03 AM
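The degenerate case described above is easy to reproduce on purpose: force every element to share a hash and set lookups collapse into a linear chain of equality tests. A contrived sketch (the Collider class is invented for illustration):

# Worst case on purpose: every element hashes to the same bucket, so a
# set lookup degenerates into a linear chain of == tests.
class Collider:
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42  # all instances collide
    def __eq__(self, other):
        return isinstance(other, Collider) and self.value == other.value

colliding = {Collider(i) for i in range(1000)}  # quadratic time to build!
print(Collider(999) in colliding)               # True, but via ~1000 == tests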

On Sat, Dec 17, 2016 at 5:12 PM, Steve D'Aprano <steve+python@pearwood.info> wrote:
> On Sat, 17 Dec 2016 03:55 pm, Chris Angelico wrote:
>
>> More than that: the lists are searched in linear time, the binary
>> search runs in logarithmic time, but the set lookup is constant time.
>> Doesn't matter how big your set is, you can test for membership with
>> one hash lookup.
>
> To be pedantic: it is constant time on average.

True. And technically, you could say that a linear or binary search has a best case of constant time (if you find it right at the beginning or middle of the list). I was talking about the average.

> But outside of contrived or hostile examples, where elements of the set are
> designed to collide, typically you would expect hash collisions to be rare,
> and long chains of colliding elements even rarer. For random elements,
> most will require only a single hash and a single equality test, a small
> number might require two tests, three would be even rarer, and so forth.

Yep. Normal case, it's going to be one or two tests - and that doesn't depend on the length of the list. The degenerate case does, but the normal case doesn't.

> So strictly speaking, and I realise I'm being exceedingly pedantic here, a
> *sufficiently large* dict or set MUST have colliding elements. How large
> is "sufficiently large"? It's at least 2**31, more than two billion, so
> while you are right that *in practice* set/dict lookups require only a
> single hash + equality, in principle (and sometimes in practice too)
> collisions can be significant.

That assumes the hash has a finite size. If you have enough memory to store that many unique objects, you can probably afford to use a hashing algorithm that allows enough room. However, the chances of not having collisions depend on the capacity. And the chances of having no collisions at all are pretty low... as a rule of thumb, I estimate a 50-50 chance of a collision when capacity is the square of usage. So if you have a thousand entries but capacity for a million, you have a 50% chance of having at least one collision. (The numbers aren't quite right, but it's a good rule of thumb.)

> Nevertheless, I mention this only for completeness. In practice, you almost
> never need to care about hash collisions except for the most pathological
> cases.

Indeed. That's why hash tables are so prevalent, despite their appalling worst case.

ChrisA


12/17/2016 6:34:03 AM
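The 50-50 rule of thumb above is the birthday problem in disguise: with n random hashes spread over m slots, the probability of at least one collision is approximately 1 - exp(-n**2 / (2*m)). Plugging in ChrisA's example (a sketch of the arithmetic, using the standard approximation):

from math import exp

# Birthday-problem approximation for n random hashes over m slots.
# For n=1000, m=1000000 this gives ~0.39: close to, but a bit under,
# the 50-50 rule of thumb quoted above ("the numbers aren't quite right").
def collision_probability(n, m):
    return 1 - exp(-n * n / (2 * m))

print(collision_probability(1000, 10**6))  # ~0.393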