Hi All,
We're building a niche search engine (using nutch). One of the checks
we are putting in place is whether a site matches a db of IP ranges. If
it's in the IP range, we add it to the search engine, otherwise we
discard it.
Right now we're running this in mysql. Unfortunately given the volume,
it's dead slow. So we're planning on stripping this out and creating a
seperate module that will return whether an IP address is in one of the
ranges, with an emphasis on speed.
Does anyone have any suggestions one some general approaches to consider
to tackle this problem? I'd imagine there are algorithms or techniques
to do this, I just don't know what they are.
My initial two thoughts are:
1) keep the list of 'white' IP's as ranges, and find some fast lookup
method that deals with deciding if a number is within a range.
2) build some sort of tree, given that an IP address consists of
XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the
next three etc.
I'm also wondering if there might not be some elegance to converting the
IP addresses to binary or hex and working with them that way.....
Thanks for your comments.
|
|
0
|
|
|
|
Reply
|
glenn42 (3)
|
12/22/2005 9:18:52 AM |
|
Come to my forum to get fast help: http//www.wizardsolutionsusa.com
|
|
0
|
|
|
|
Reply
|
csheppard91 (79)
|
12/22/2005 4:16:34 PM
|
|
"gcook" <glenn@glenn.ca> wrote in message
news:43aab6a0_1@news.cybersurf.net...
> Hi All,
>
> We're building a niche search engine (using nutch). One of the checks we
> are putting in place is whether a site matches a db of IP ranges. If it's
> in the IP range, we add it to the search engine, otherwise we discard it.
>
> Right now we're running this in mysql. Unfortunately given the volume,
> it's dead slow. So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider
> to tackle this problem? I'd imagine there are algorithms or techniques to
> do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the
> next three etc.
>
> I'm also wondering if there might not be some elegance to converting the
> IP addresses to binary or hex and working with them that way.....
IP numbers as we know them are actually 4 single byte values and therefore
can be optimally represented by just four single bytes (range 0..255). That
would be the first improvement: compare four values instead of a string of
variable length numbers.
Second: the 4 bytes fit into a single word (depending on your architecture
and programming platform!), so comparing values for a range boils down to
comparing just one single value.
Third: either the string of digits or its binary rep (in 4 single bytes or
as one long word) can be stored and retrieved quite efficiently in a binary
tree.
Caveats: you may want to include wildcards, to allow f.e. an IP number of
123.45.67.* -- where the '*' can be any number. The most efficient way to do
that is to store a bitmask along with the IP, e.g., for the above it is
255.255.255.0. Another method (though I'm not sure if it is used in IP
addressing) would be allowed lowest-highest values for any of the four
digits, e.g., '20-100'. This can't _always_ be represented by a bitmask
(cases where it can are actually exceptions based on its binary rep).
Final caveat: look to the future! This is based so far on IP numbers
consisting of 4 bytes each, but a new (and hopefully improved) version is
being designed now: IP version 6 (IPV6).
[Jongware]
|
|
0
|
|
|
|
Reply
|
sorry25 (67)
|
12/22/2005 4:36:48 PM
|
|
gcook wrote:
> Hi All,
>
> We're building a niche search engine (using nutch). One of the checks
> we are putting in place is whether a site matches a db of IP ranges. If
> it's in the IP range, we add it to the search engine, otherwise we
> discard it.
>
> Right now we're running this in mysql. Unfortunately given the volume,
> it's dead slow. So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider
> to tackle this problem? I'd imagine there are algorithms or techniques
> to do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the
> next three etc.
>
> I'm also wondering if there might not be some elegance to converting the
> IP addresses to binary or hex and working with them that way.....
I might do it like this:
* Find out if it's IP4 or IP6
* If IP4 convert it into 4 bytes/octets, use these as a key into a hash
table
* If it's IP6 then hash on the address as it is (that is as a string).
If you have reasonably big hash tables I'd expect it would work quite
quickly.
Ranges of IPs may pose a problem. Lets say you want to allow
111.111.111.0 to 111.111.111.255. If you enter a lot of ranges like
this your hash table could get quite big, though it may be tolerable.
If you envisage it getting very big then hash only the first parts, the
111.111.111, then deal with the small numbers using a normal range
test.
As a further optimization you could store the most recently accessing
IP address (or addresses) in buffer, and check this buffer first before
doing any further work.
If you need to get really fancy you could extend the treatment of
ranges by having several hash tables, say one for xxx.xxx.xxx.0 -
xxx.xxx.xxx.255 and another for 0.xxx.xxx.xx - 255.xxx.xxx.xxx.
You may also be able to use an ADFA (acyclic deterministic finite
automata). There are numerous other possiblities.
|
|
0
|
|
|
|
Reply
|
robert.thorpe (1138)
|
12/22/2005 6:04:02 PM
|
|
gcook wrote:
> Hi All,
>
> We're building a niche search engine (using nutch). One of the checks
> we are putting in place is whether a site matches a db of IP ranges. If
> it's in the IP range, we add it to the search engine, otherwise we
> discard it.
>
> Right now we're running this in mysql. Unfortunately given the volume,
> it's dead slow. So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider
> to tackle this problem? I'd imagine there are algorithms or techniques
> to do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the
> next three etc.
How about a binary search tree (or trie) where the elements
are [left, right] intervals and you order the elements by
left? To search for IP address x, find the smallest left
endpoint >= x and check the corresponding interval.
If you need to handle overlapping intervals (or check
whether a range of IP address intersects something in
the db), consider using an interval search tree.
|
|
0
|
|
|
|
Reply
|
googmeister (247)
|
12/22/2005 7:48:30 PM
|
|
"gcook" <glenn@glenn.ca> wrote in message
news:43aab6a0_1@news.cybersurf.net...
> We're building a niche search engine (using nutch). One of the checks
> we are putting in place is whether a site matches a db of IP ranges. If
> it's in the IP range, we add it to the search engine, otherwise we
> discard it.
>
> Right now we're running this in mysql. Unfortunately given the volume,
> it's dead slow. So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.
How many ranges (order of magnitude)?
How often are updates needed?
What is the desired performance (queries/second)?
What are the memory requirements?
If you only want to support IPv4 and have plenty of RAM, a simple and fast
solution would be to memory map a 512MB file where each bit corresponds to
an address.
Alex
|
|
0
|
|
|
|
Reply
|
Alex
|
12/22/2005 8:46:53 PM
|
|
"gcook" <glenn@glenn.ca> wrote in message
news:43aab6a0_1@news.cybersurf.net...
> Hi All,
>
> We're building a niche search engine (using nutch). One of the checks we
> are putting in place is whether a site matches a db of IP ranges. If it's
> in the IP range, we add it to the search engine, otherwise we discard it.
>
> Right now we're running this in mysql. Unfortunately given the volume,
> it's dead slow. So we're planning on stripping this out and creating a
> seperate module that will return whether an IP address is in one of the
> ranges, with an emphasis on speed.
>
> Does anyone have any suggestions one some general approaches to consider
> to tackle this problem? I'd imagine there are algorithms or techniques to
> do this, I just don't know what they are.
>
> My initial two thoughts are:
> 1) keep the list of 'white' IP's as ranges, and find some fast lookup
> method that deals with deciding if a number is within a range.
> 2) build some sort of tree, given that an IP address consists of
> XXX.XXX.XXX.XXX, i.e. check the first 3 digits, it it matches, on to the
> next three etc.
>
> I'm also wondering if there might not be some elegance to converting the
> IP addresses to binary or hex and working with them that way.....
IP addresses /are/ 32-bit numbers. They're conventionally represented as
four dot-separated decimal bytes, but that's only for human convenience.
If you want a really (really) fast mechanism, you could use a 512 megabyte
table, with a single bit reserved for each possible IP address. If you can't
afford that much memory, run-length encode the bitmap.
--
Roger
|
|
0
|
|
|
|
Reply
|
rkww (271)
|
12/24/2005 12:00:44 AM
|
|
Roger Willcocks wrote:
> "gcook" <glenn@glenn.ca> wrote in message
> news:43aab6a0_1@news.cybersurf.net...
> > Hi All,
> IP addresses /are/ 32-bit numbers. They're conventionally represented as
> four dot-separated decimal bytes, but that's only for human convenience.
>
> If you want a really (really) fast mechanism, you could use a 512 megabyte
> table, with a single bit reserved for each possible IP address.
That method may not be the fastest. If a program hits arbitrary places
in that much memory then it will not use the cache effectively. These
days it can take many hundreds of cycles to access main memory, if a
hash function can be calculated in less time and references the data
cache, then it will perform better.
It's certainly possible that a more memory friendly method will take a
similar time to using a table.
|
|
0
|
|
|
|
Reply
|
robert.thorpe (1138)
|
12/24/2005 5:33:14 PM
|
|
|
7 Replies
29 Views
(page loaded in 0.252 seconds)
|