match if a number is within a series of ranges

  • Follow


Hi All,

We're building a niche search engine (using nutch).  One of the checks 
we are putting in place is whether a site matches a db of IP ranges.  If 
it's in the IP range, we add it to the search engine, otherwise we 
discard it.

Right now we're running this in mysql.  Unfortunately given the volume, 
it's dead slow.  So we're planning on stripping this out and creating a 
seperate module that will return whether an IP address is in one of the 
ranges, with an emphasis on speed.

Does anyone have any suggestions one some general approaches to consider 
to tackle this problem?  I'd imagine there are algorithms or techniques 
to do this, I just don't know what they are.

My initial two thoughts are:
1) keep the list of 'white' IP's as ranges, and find some fast lookup 
method that deals with deciding if a number is within a range.
2) build some sort of tree, given that an IP address consists of 
XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the 
next three etc.

I'm also wondering if there might not be some elegance to converting the 
IP addresses to binary or hex and working with them that way.....

Thanks for your comments.
0
Reply glenn42 (3) 12/22/2005 9:18:52 AM

Come to my forum to get fast help: http//www.wizardsolutionsusa.com

0
Reply csheppard91 (79) 12/22/2005 4:16:34 PM


"gcook" <glenn@glenn.ca> wrote in message 
news:43aab6a0_1@news.cybersurf.net...
> Hi All,
>
> We're building a niche search engine (using nutch).  One of the checks we 
> are putting in place is whether a site matches a db of IP ranges.  If it's 
> in the IP range, we add it to the search engine, otherwise we discard it.
>
> Right now we're running this in mysql.  Unfortunately given the volume, 
> it's dead slow.  So we're planning on stripping this out and creating a 
> seperate module that will return whether an IP address is in one of the 
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider 
> to tackle this problem?  I'd imagine there are algorithms or techniques to 
> do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup 
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of 
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the 
> next three etc.
>
> I'm also wondering if there might not be some elegance to converting the 
> IP addresses to binary or hex and working with them that way.....

IP numbers as we know them are actually 4 single byte values and therefore 
can be optimally represented by just four single bytes (range 0..255). That 
would be the first improvement: compare four values instead of a string of 
variable length numbers.
Second: the 4 bytes fit into a single word (depending on your architecture 
and programming platform!), so comparing values for a range boils down to 
comparing just one single value.
Third: either the string of digits or its binary rep (in 4 single bytes or 
as one long word) can be stored and retrieved quite efficiently in a binary 
tree.

Caveats: you may want to include wildcards, to allow f.e. an IP number of 
123.45.67.* -- where the '*' can be any number. The most efficient way to do 
that is to store a bitmask along with the IP, e.g., for the above it is 
255.255.255.0. Another method (though I'm not sure if it is used in IP 
addressing) would be allowed lowest-highest values for any of the four 
digits, e.g., '20-100'. This can't _always_ be represented by a bitmask 
(cases where it can are actually exceptions based on its binary rep).
Final caveat: look to the future! This is based so far on IP numbers 
consisting of 4 bytes each, but a new (and hopefully improved) version is 
being designed now: IP version 6 (IPV6).

[Jongware] 


0
Reply sorry25 (67) 12/22/2005 4:36:48 PM

gcook wrote:
> Hi All,
>
> We're building a niche search engine (using nutch).  One of the checks
> we are putting in place is whether a site matches a db of IP ranges.  If
> it's in the IP range, we add it to the search engine, otherwise we
> discard it.
>
> Right now we're running this in mysql.  Unfortunately given the volume,
> it's dead slow.  So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider
> to tackle this problem?  I'd imagine there are algorithms or techniques
> to do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the
> next three etc.
>
> I'm also wondering if there might not be some elegance to converting the
> IP addresses to binary or hex and working with them that way.....

I might do it like this:
* Find out if it's IP4 or IP6
* If IP4 convert it into 4 bytes/octets, use these as a key into a hash
table
* If it's IP6 then hash on the address as it is (that is as a string).

If you have reasonably big hash tables I'd expect it would work quite
quickly.

Ranges of IPs may pose a problem.  Lets say you want to allow
111.111.111.0 to 111.111.111.255.  If you enter a lot of ranges like
this your hash table could get quite big, though it may be tolerable.
If you envisage it getting very big then hash only the first parts, the
111.111.111, then deal with the small numbers using a normal range
test.

As a further optimization you could store the most recently accessing
IP address (or addresses) in buffer, and check this buffer first before
doing any further work.

If you need to get really fancy you could extend the treatment of
ranges by having several hash tables, say one for xxx.xxx.xxx.0 -
xxx.xxx.xxx.255 and another for 0.xxx.xxx.xx - 255.xxx.xxx.xxx.

You may also be able to use an ADFA (acyclic deterministic finite
automata). There are numerous other possiblities.

0
Reply robert.thorpe (1138) 12/22/2005 6:04:02 PM

gcook wrote:
> Hi All,
>
> We're building a niche search engine (using nutch).  One of the checks
> we are putting in place is whether a site matches a db of IP ranges.  If
> it's in the IP range, we add it to the search engine, otherwise we
> discard it.
>
> Right now we're running this in mysql.  Unfortunately given the volume,
> it's dead slow.  So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider
> to tackle this problem?  I'd imagine there are algorithms or techniques
> to do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the
> next three etc.

How about a binary search tree (or trie) where the elements
are [left, right] intervals and you order the elements by
left? To search for IP address x, find the smallest left
endpoint >= x and check the corresponding interval.

If you need to handle overlapping intervals (or check
whether a range of IP address intersects something in
the db), consider using an interval search tree.

0
Reply googmeister (247) 12/22/2005 7:48:30 PM

"gcook" <glenn@glenn.ca> wrote in message
news:43aab6a0_1@news.cybersurf.net...
> We're building a niche search engine (using nutch).  One of the checks
> we are putting in place is whether a site matches a db of IP ranges.  If
> it's in the IP range, we add it to the search engine, otherwise we
> discard it.
>
> Right now we're running this in mysql.  Unfortunately given the volume,
> it's dead slow.  So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.

How many ranges (order of magnitude)?
How often are updates needed?
What is the desired performance (queries/second)?
What are the memory requirements?

If you only want to support IPv4 and have plenty of RAM, a simple and fast
solution would be to memory map a 512MB file where each bit corresponds to
an address.

Alex


0
Reply Alex 12/22/2005 8:46:53 PM

"gcook" <glenn@glenn.ca> wrote in message 
news:43aab6a0_1@news.cybersurf.net...
> Hi All,
>
> We're building a niche search engine (using nutch).  One of the checks we 
> are putting in place is whether a site matches a db of IP ranges.  If it's 
> in the IP range, we add it to the search engine, otherwise we discard it.
>
> Right now we're running this in mysql.  Unfortunately given the volume, 
> it's dead slow.  So we're planning on stripping this out and creating a 
> seperate module that will return whether an IP address is in one of the 
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider 
> to tackle this problem?  I'd imagine there are algorithms or techniques to 
> do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup 
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of 
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the 
> next three etc.
>
> I'm also wondering if there might not be some elegance to converting the 
> IP addresses to binary or hex and working with them that way.....

IP addresses /are/ 32-bit numbers. They're conventionally represented as 
four dot-separated decimal bytes, but that's only for human convenience.

If you want a really (really) fast mechanism, you could use a 512 megabyte 
table, with a single bit reserved for each possible IP address. If you can't 
afford that much memory, run-length encode the bitmap.

--
Roger


0
Reply rkww (271) 12/24/2005 12:00:44 AM

Roger Willcocks wrote:
> "gcook" <glenn@glenn.ca> wrote in message
> news:43aab6a0_1@news.cybersurf.net...
> > Hi All,
> IP addresses /are/ 32-bit numbers. They're conventionally represented as
> four dot-separated decimal bytes, but that's only for human convenience.
>
> If you want a really (really) fast mechanism, you could use a 512 megabyte
> table, with a single bit reserved for each possible IP address.

That method may not be the fastest.  If a program hits arbitrary places
in that much memory then it will not use the cache effectively.  These
days it can take many hundreds of cycles to access main memory, if a
hash function can be calculated in less time and references the data
cache, then it will perform better.

It's certainly possible that a more memory friendly method will take a
similar time to using a table.

0
Reply robert.thorpe (1138) 12/24/2005 5:33:14 PM

7 Replies
29 Views

(page loaded in 0.252 seconds)


Reply: