Efficient algorithm for finding duplicates in 56-bit number

Hello,

I'm searching for a memory-efficient (and hopefully fast)
algorithm for detecting duplicate 56-bit numbers.

The input consists of millions of lines like this:

     0X01234567890abcdef01234567890abcd

Each line is a textual representation of a 56-bit hex number.

The file should be read into memory and stored there in a
memory-efficient way.

Afterwards new numbers are read from somewhere else, and
the program needs to know if the number exists already
or if it is a new, unique one.

What would be a good data-structure for this purpose?

"gzip" shows that there's a lot of redundancy in the text file
(factor >20). However, keeping the compressed data in
memory is no option, because searching it would be too slow.

Heiner
-- 
  ___ _
/ __| |_ _____ _____ _ _     Heiner STEVEN <heiner.steven@nexgo.de>
\__ \  _/ -_) V / -_) ' \    Shell Script Programmers: visit
|___/\__\___|\_/\___|_||_|   http://www.shelldorado.com/
0
Reply heiner.steven2 (56) 1/29/2005 12:32:13 AM

A "hash" data structure, accessed for example on the first n bits of
each number, would store only unique numbers, and, as you have found
from gzip, there is a lot of redundancy. Look up "hashing".

0
Reply spinoza1111 (3247) 1/29/2005 12:50:52 AM

spinoza1111@yahoo.com wrote:
> A "hash" data structure, accessed for example on the first n bits of
> each number, would store only unique numbers, and, as you have found
> from gzip, there is a lot of redundancy. Look up "hashing".

I know how hashing works. But in that case I would have to
store all numbers + a hashing table in memory. Storing the
numbers in a more concise format would be better.

Additionally, hashing would be fast, but the hashing table would take
additional memory.

Heiner
-- 
  ___ _
/ __| |_ _____ _____ _ _     Heiner STEVEN <heiner.steven@nexgo.de>
\__ \  _/ -_) V / -_) ' \    Shell Script Programmers: visit
|___/\__\___|\_/\___|_||_|   http://www.shelldorado.com/
0
Reply heiner.steven2 (56) 1/29/2005 1:05:00 AM

Heiner Steven wrote:
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
> 
> The input consists of millions of lines like this:
> 
>      0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.
> 
> The file should be read into memory and stored there in a
> memory-efficient way.
> 
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
> 
> What would be a good data-structure for this purpose?

A hashtable.  You just happen to be in luck, because I have
published a free hashtable implementation under GPL.  If you happen
to want it for commercial purposes that can be discussed.  It is
available, with examples and tests, at:

  <http://cbfalconer.home.att.net/download/hashlib.zip>

You can decide for yourself whether or not to convert your input
strings into binary numbers internally.  I would suggest making
your initial tries by storing the actual string and not worrying
about it yet.  Since they are hex the conversions will be trivial.

-- 
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson


0
Reply cbfalconer (19194) 1/29/2005 4:08:47 AM

Heiner Steven wrote:
> I know how hashing works. But in that case I would have to
> store all numbers + a hashing table in memory. Storing the
> numbers in a more concise format would be better.
>
> Additionally, hashing would be fast, but the hashing table would take
> additional memory.
>
> Heiner

Perhaps a trie? The idea is that you create a special sort of tree with
a node-type that has 16 slots and a 16-bit flag item at the top.

struct trie16 {
    unsigned short itemPointer;
    /* ^ each bit is a boolean value telling us whether or not each below
       pointer is a pointer to a string (1) or a pointer to another trie16
       structure */

    void* _0;
    void* _1;
    void* _2;
    void* _3;
    void* _4;
    void* _5;
    void* _6;
    void* _7;
    void* _8;
    void* _9;
    void* _a;
    void* _b;
    void* _c;
    void* _d;
    void* _e;
    void* _f;
};

You then use each of your strings as a "map" to guide you through the
trie. So if you start with just one empty node, then when you add
0X01234567890abcdef01234567890abcd, you will put a pointer to it in the
_0 slot and set the flag for that item to 1. If we then put in
0x11111111111111111111111111111111, this will go in the _1 slot. Then
we try to add 0x00000000000000000000000000000000, which we will want to
put in _0, but the slot is already filled.
First we compare the strings. If they are exactly the same, then we can
already consider it as stored. Otherwise we create a new node and move
0x01234... into the _1 slot of the second node and 0x00000... into the
_0 slot. The _0 slot of the top node will then point to this new node,
with its flag bit set to 0. Note that if the two values were identical for
several characters then we might have had to create three or four more
nodes to extend the trie out to where the difference began.
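
In code, the insertion just described might look roughly like this. This
is only a sketch in C++ (the names Trie16, hexval, newnode and insert are
mine, not from the post); it assumes the caller passes a pointer to the
first hex digit (i.e. past the leading "0X") and that the stored strings
stay alive for the lifetime of the trie:

#include <cstring>
#include <cstdlib>

struct Trie16 {
    unsigned short isLeaf;   /* bit i set: slot[i] points to a string, not a child node */
    void *slot[16];
};

static int hexval(char c)    /* '0'..'9', 'a'..'f', 'A'..'F' -> 0..15 */
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

static Trie16 *newnode(void)
{
    return (Trie16 *)calloc(1, sizeof(Trie16));
}

/* Insert the string s, looking at its hex digit at position `depth`.
   Returns 1 if s was already present (duplicate), 0 if newly added. */
static int insert(Trie16 *t, const char *s, int depth)
{
    int d = hexval(s[depth]);

    if (t->slot[d] == 0) {                       /* empty slot: store the string itself */
        t->slot[d] = (void *)s;
        t->isLeaf |= 1u << d;
        return 0;
    }
    if (t->isLeaf & (1u << d)) {                 /* slot already holds another string */
        const char *old = (const char *)t->slot[d];
        if (strcmp(old, s) == 0)
            return 1;                            /* exact duplicate */
        Trie16 *child = newnode();               /* split: push both strings one level down */
        t->slot[d] = child;
        t->isLeaf &= ~(1u << d);
        insert(child, old, depth + 1);
        return insert(child, s, depth + 1);
    }
    return insert((Trie16 *)t->slot[d], s, depth + 1);   /* descend into the child node */
}

Duplicate detection is then a call like insert(root, line + 2, 0) per input line.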

I haven't done enough playing around with structures like this nor
hashes to know what the relative speed/memory usage rates are like.
Most programmers advocate hashes and I generally have to assume that
they know what they are talking about...but just in case here is one
other thing to look up and try =)

-Chris

0
Reply thesagerat (67) 1/29/2005 7:04:50 AM

And best not to preoptimize, but if you really wanted to get
memory-stingy with your trie, one trick is to drop as many of the
hexadecimal digits as you have already mapped when you store the
original string. But you would need to make sure that when you have to
move that string to a new, deeper node you increment the saved pointer
so that 56 - (node depth) == remaining number of hex digits the pointer
points to.

And of course don't save your hex values as 8-bit ASCII strings =)
-Chris

0
Reply thesagerat (67) 1/29/2005 7:39:00 AM

Heiner Steven wrote:
> spinoza1111@yahoo.com wrote:
> > A "hash" data structure, accessed for example on the first n bits
of
> > each number, would store only unique numbers, and, as you have
found
> > from gzip, there is a lot of redundancy. Look up "hashing".
>
> I know how hashing works. But in that case I would have to
> store all numbers + a hashing table in memory. Storing the
> numbers in a more concise format would be better.

I apologize for the patronizing tone of my reply.

First of all, you don't have to store an extra hashing table in memory
(cf. Cormen et al. 2000: "Introduction to Algorithms", MIT Press). Open
hashing, which is just as efficient and far more elegant than links,
simply sets the target entry to the number.

IMO, you need to impose "separation of concerns" on hashing versus "a
concise format". Use hashing to search and store. If conciseness is
needed, select a good way to pack the values. If they are, for example,
56-byte strings restricted to letters and digits, as appears to be the
case, they are in fact base 36 values...which can be represented by a
rather small array of long, 8-byte integers: far less storage than a
56-character string at each location.
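
To make the packing step concrete, here is a minimal sketch (assuming the
input really is "0X" followed by 32 hex digits, as in the example;
pack_hex and hexval are names I made up, and uint64 is the same typedef
Daniel uses elsewhere in this thread):

typedef unsigned long long uint64;

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

/* Pack the 32 hex digits following "0X" into two 64-bit words,
   most significant half into out[0]: 16 bytes instead of 34+.  */
static void pack_hex(const char *digits, uint64 out[2])
{
    out[0] = out[1] = 0;
    for (int i = 0; i < 16; i++)
        out[0] = (out[0] << 4) | (uint64)hexval(digits[i]);
    for (int i = 16; i < 32; i++)
        out[1] = (out[1] << 4) | (uint64)hexval(digits[i]);
}

A call such as pack_hex(line + 2, packed) then gives the 16-byte binary form.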

Your library may even have an "extra precision" handler.
>
> Additionally, hashing would be fast, but the hashing table would take
> additional memory.

Nope. 'Twon't. I haven't used links to implement hashing since 1974.
Open addressing is just as fast. For a clearer discussion of the
practicalities, see Ch 8 of my book (Build Your Own .Net Language and
Compiler, Edward Nilges 2004, Apress).

The worst case for both styles of hashing is pretty bad, because in the
worst case, all your values hash to THE SAME LOCATION, and the hash
table is just a random table using linear search. But what this means
is you've selected an improper hash algorithm!
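
For the record, the open-addressing scheme being described comes down to
something like this sketch (linear probing, the table slot holds the packed
value itself, zero meaning "empty"; the fixed table size and the assumption
that zero never occurs as a real key are simplifications of mine):

#include <cstddef>

typedef unsigned long long uint64;

enum { TABLE_SIZE = 1 << 22 };        /* must stay larger than the number of keys */
static uint64 table_[TABLE_SIZE];     /* all zero at start: 0 == empty slot       */

/* Returns 1 if v was already in the table, 0 if it was just inserted. */
static int insert(uint64 v)
{
    size_t i = (size_t)(v % TABLE_SIZE);     /* hash on the low bits, mod table size */
    for (;;) {
        if (table_[i] == 0) { table_[i] = v; return 0; }
        if (table_[i] == v) return 1;        /* duplicate */
        i = (i + 1) % TABLE_SIZE;            /* probe forward, wrapping at the end   */
    }
}

A real version would size the table from the data and keep the load factor
well below 1, or grow and rehash as discussed below.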


>
> Heiner
> --
>   ___ _
> / __| |_ _____ _____ _ _     Heiner STEVEN <heiner.steven@nexgo.de>
> \__ \  _/ -_) V / -_) ' \    Shell Script Programmers: visit
> |___/\__\___|\_/\___|_||_|   http://www.shelldorado.com/

0
Reply spinoza1111 (3247) 1/29/2005 1:08:32 PM

My recommendation, storage of the strings as a small array of Long
integers, does mean that to "compare" values during your hash search
(or other form of search) you have to compare two small arrays (the
input and the entry in the hash table) in a small loop.

But if you arrange the Long integers in little-endian fashion, then that
loop isn't constant K time, it is K/2 time on average.

It might be actually better to check "right to left" in the K/2 compare
because if memory serves, many of the values have common front ends.

This also means that your hash function can be efficiently based on the
last digits, perhaps the least significant integer, mod the size of the
hash table.

Also, the size of the hash table should not be considered a constant.
Painfully, while processing a lot of numbers you can even expand the
hash table on the fly by creating a new table and re-hashing each
entry...assuming that your hash function is something mod the size of
the hash table.

The benefit is that your code will almost never snarf out as the size
of the data base increases.

I don't think this expensive reorganization, which is an advanced
feature, can be avoided for the simple reason that IF the hash function
(which converts an entire number to a number in the range of the table)
does not depend on the input number, then there is no reason why
identical numbers would map to the same place...which the entire scheme
sorta depends on.

There are also subtleties, documented in the reference to Cormen, about
your selection of the size of the hash table. Knuth's Sorting and
Searching volume in his series on The Art of Computer Programming
discusses these more clearly, but Cormen has more recent research
results.

In fact, recent research has discovered all sorts of minimal functions
but my experience, based on a combination of pragmatism and laziness,
is to take the end of the input strings as being the "most random" in
practice. The beginning in the case of symbols in a programming
language is "least random" almost universally, especially if the code
uses Hungarian notation.

What you want is a random-appearing distribution of the hash targets,
not strange clusters and blobs, for with each strange cluster and each
blob there are more and more linear searches. One reason I still use
VB is that it's easy to create a nifty visual demo of clustering.

But the central magic of hashing works as long as the key is input to
the hash function and the relation of the key to possible positions is
one-many. Indeed, the elegance of open hashing is an attractive
combination of precision (hash function applied to key goes to same
place) followed by apparent messiness.

It amuses me to think of an expensive and precise machine tossing
values as the wild sheep defecate yet to recover them precisely. It is
for me an emotional deconstruction of the rhetorical presentation of
digital computers as necessarily rigid.

There's a subtlety in open hashing if you delete entries (doesn't sound
like you do). It is that you must use a special marker for a deleted
entry! This is because an EMPTY entry proves that a key does not exist,
whereas a match may be found in the search (starting from the hash
location, and as needed wrapping, to stop at the original entry) after
the deleted entry.
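
In sketch form, the deleted-entry marker works like this (EMPTY and DELETED
are reserved values I picked for illustration; they must never occur as
real keys):

#include <cstddef>

typedef unsigned long long uint64;

enum { TABLE_SIZE = 1 << 22 };
static const uint64 EMPTY   = 0;                      /* assumed never to be a real key */
static const uint64 DELETED = ~(uint64)0;             /* tombstone, likewise reserved   */
static uint64 table_[TABLE_SIZE];                     /* starts out all EMPTY           */

static int find(uint64 v)
{
    size_t i = (size_t)(v % TABLE_SIZE);
    for (size_t probes = 0; probes < TABLE_SIZE; probes++) {
        if (table_[i] == v) return 1;
        if (table_[i] == EMPTY) return 0;   /* EMPTY proves absence...               */
        i = (i + 1) % TABLE_SIZE;           /* ...but a DELETED slot does not: go on */
    }
    return 0;                               /* probed the whole table without success */
}

static void remove_key(uint64 v)
{
    size_t i = (size_t)(v % TABLE_SIZE);
    for (size_t probes = 0; probes < TABLE_SIZE; probes++) {
        if (table_[i] == v) { table_[i] = DELETED; return; }
        if (table_[i] == EMPTY) return;     /* not present */
        i = (i + 1) % TABLE_SIZE;
    }
}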

Of course, today, many hash implementations such as are found in .Net
encapsulate good praxis. The problem is that other efforts encapsulate
Stupid Hash Tricks. It is also that completing a good implementation
after actually reading Cormen (reading Nilges is sadly not enough,
since my book develops a compiler in which we can assume a good hasher)
is itself rewarding in a Promethean sense. In fact, it is a rather
small amount of code.

0
Reply spinoza1111 (3247) 1/29/2005 1:35:03 PM

The hash table is far more easily explained...a strong argument in its
favor. Its downside is that its worst case behavior sucks and is
equivalent to linear search. Even before approaching worst case,
unexpectedly patterned input will degrade the hash table.

Of course, a self-conscious "hash object", being an object whose state
can readily contain performance information, can "notice" when strange
clusters and blobs start to appear, by maintaining the maximum
"reasonable" search length in object state.

This length should be a percent of the space available so that the code
exhibits sensitivity to the nonconstant and arbitrary nature of the max
size of the table.

When a blob appears, the table can be expanded and reorganized. The
trouble is the time it takes to reorg, but this only occurs as needed.

From Cormen's august standpoint, as the author of the formidable
Introduction to Algorithms, this is ad hoccery...bricolage.

In my experience with early releases of data bases, such bricolage
created strange pauses in the processing of user data.

Therefore, the hash algorithm's applicability needs to be considered
carefully.

0
Reply spinoza1111 (3247) 1/29/2005 1:42:10 PM

Heiner Steven wrote:

> Hello,
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
> 
> The input consists of millions of lines like this:
> 
>      0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.
> 
> The file should be read into memory and stored there in a
> memory-efficient way.
> 
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
> 
> What would be a good data-structure for this purpose?
> 
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.
> 
> Heiner

Here's a Perl script to test hashing speed:

############################################
%names = ();

$n = "0X01234567890abcdef01234567890abcd";
for ($x=0; $x<4000000; $x++)
{
        $names{$n} += 1;
}

print "Found $n ", $names{$n}, " times\n";
############################################

Hashing your example string 4 million times 
takes approximately 19 seconds on my 66 MHz
Pentium II.

Hashing in Perl is pretty efficient. I 
don't think you will find any implementations
significantly faster than this.

0
Reply marc_a_poulin (9) 1/29/2005 3:13:05 PM

On Sat, 29 Jan 2005 01:32:13 +0100, Heiner Steven wrote:

> Hello,
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
> 
> The input consists of millions of lines like this:
> 
>      0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.
> 
> The file should be read into memory and stored there in a
> memory-efficient way.
> 
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
> 
> What would be a good data-structure for this purpose?
> 
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.

If you are using C++, you can try the std::set container. Something like
this:

#include <algorithm>
#include <iterator>
#include <iostream>
#include <vector>
#include <cstdlib>   //! for srand() and rand()
#include <ctime>
#include <set>

using namespace std;

//! for gcc:
typedef unsigned long long uint64;
//! for VC:
//typedef unsigned __int64 uint64;

struct generator
{
  vector<uint64>& m_subset;   //! (not mutable: reference members may not be declared mutable)

  generator(vector<uint64>& v) : m_subset(v) { }

  //! called by generate_n
  uint64 operator()() const
  {
    uint64 num = uint64(rand()) << 32 | rand();
    if( !(rand()%100) )
      m_subset.push_back(num);
    return num;
  }
};

//! this is a function object that acts as a counter
struct counter
{
  //! keep a reference to the list of unique numbers
  const set<uint64>& m_v;
  static int numbers;
  counter(const set<uint64>& v) : m_v(v) { }

  //! called by for_each
  //! counts the numbers that exist in m_v
  void operator()(uint64 number) const
  {
    if( m_v.find(number)!=m_v.end() )
      numbers ++;
  }
};

int counter::numbers = 0;

int main()
{
  //! seed random number generator
  srand(time(0));

  //! this container holds unique numbers
  //! the number is both key and data
  set<uint64> unique;
  
  //! put one million numbers into container
  //! keep 1% of generated numbers here:
  vector<uint64> subset;
  generator g(subset);
  generate_n(inserter(unique, unique.begin()), 1000000, g);

  //! now time the lookup of the saved subset
  time_t t = time(0);
  counter c(unique);
  cout << "Size of subset: " << subset.size() << endl;
  for_each(subset.begin(), subset.end(), c);
  cout << "Lookup took " << time(0)-t << "s" << endl;
  cout << "Found " << counter::numbers << " numbers" << endl;

  return 0;
}

On my computer (3GHz P4) the output is this:
Size of subset: 9952
Lookup took 0s
Found 9952 numbers

-- 
Daniel

0
Reply someone2 (821) 1/29/2005 4:41:57 PM

Heiner Steven wrote:
> Hello,
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
> 
> The input consists of millions of lines like this:
> 
>     0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.
> 
> The file should be read into memory and stored there in a
> memory-efficient way.

Read the numbers in and convert to 7 or 8 byte values, whichever
floats your boat.  Put them in an array and sort them.  Find them
using binary search.
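
A sketch of that approach using the standard library, assuming the values
have already been packed into 64-bit words (query is just a placeholder
value):

#include <algorithm>
#include <vector>

typedef unsigned long long uint64;

int main()
{
    std::vector<uint64> values;
    /* ... read the file, pack each line and push_back into `values` ... */

    std::sort(values.begin(), values.end());

    /* duplicates inside the file are now adjacent: */
    bool file_has_dups =
        std::adjacent_find(values.begin(), values.end()) != values.end();

    /* membership test for a number that arrives later: */
    uint64 query = 0x01234567890abcdeULL;   /* placeholder */
    bool seen = std::binary_search(values.begin(), values.end(), query);

    (void)file_has_dups; (void)seen;
    return 0;
}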

Matt Gregory
0
Reply email69 (74) 1/29/2005 4:49:19 PM

| Matt Gregory wrote:
|> Heiner Steven wrote:
|> I'm searching for an memory-efficient (and hopefully fast)
|> algorithm for detecting duplicate 56-bit numbers.
|> The input consists of millions of lines like this:
|>     0X01234567890abcdef01234567890abcd
|> Each line is a textual representation of a 56-bit hex number.
|> The file should be read into memory and stored there in a
|> memory-efficient way.

| Read the numbers in and convert to 7 or 8 byte values, whichever
| floats your boat.  Put them in an array and sort them.  Find them
| using binary search.

As an aside, I'd be interested to see how long  (wall clock and
processor time)  it would take to sort millions of those values
(and I assume that 2 million is quite a bit less than what Heiner had
in mind when he said millions).

If the OP is looking for duplicates, all one has to do is just
sort the file and see if any of the records match the previous
one.  No need to do a binary search.   This boils the problem
down to having some utility do the work sorting the file, and
creating a new file if the old one is to be kept intact.
Presumably, let the utility run overnight or at some other
"slack" time. __________________________________________Gerard S. 


0
Reply gerard46 (68) 1/29/2005 5:01:13 PM

gerard46 wrote:
) As an aside, I'd be interested to see how long  (wall clock and
) processor time)  it would take to sort millions of those values
) (and I assume that 2 million is quite less than what Heiner had
) in mind when he said millions).
)
) If the OP is looking for duplicates, all one has to do is just
) sort the file and see if any of the records match the previous
) one.  No need to do a binary search.   This boils the problem
) down to having some utility do the work sorting the file, and
) creating a new file if the old one is to be kept intact.
) Presumably, let the utility run overnight or at some other
) "slack" time. __________________________________________Gerard S. 

I think you're overestimating the amount of time needed to sort
millions of numbers.  Basically, if it fits in memory, sorting time
is in the order of minutes, or maybe even seconds.

If it doesn't, you're best off with a sort that has very good locality
of reference, or even tailor-make one that sorts in chunks (where each
 chunk fits in memory) and then merges the chunks.
You can merge the chunks in one go, if you read those in smaller chunks
so it fits in memory again, and use maybe a heap structure of pointers
to chunks to always output the smallest value.
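
Sketched out, that merge step could look like this (the chunks are kept as
in-memory vectors here purely for brevity; in the real thing each one would
be streamed from its own temporary file):

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

typedef unsigned long long uint64;

std::vector<uint64> merge_chunks(const std::vector< std::vector<uint64> > &chunks)
{
    typedef std::pair<uint64, size_t> Item;              /* (value, chunk it came from) */
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    std::vector<size_t> pos(chunks.size(), 0);           /* next unread index per chunk */
    std::vector<uint64> out;

    for (size_t c = 0; c < chunks.size(); c++)
        if (!chunks[c].empty())
            heap.push(Item(chunks[c][0], c));

    while (!heap.empty()) {
        Item top = heap.top(); heap.pop();                /* smallest remaining value */
        out.push_back(top.first);
        size_t c = top.second;
        if (++pos[c] < chunks[c].size())
            heap.push(Item(chunks[c][pos[c]], c));        /* refill from that chunk */
    }
    return out;
}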

I seem to remember there's a sorting challenge where you have to sort
like billions or trillions of numbers on big hardware.


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
0
Reply willem (1478) 1/29/2005 5:09:36 PM

| Willem wrote:
|)  gerard46 wrote:
|) As an aside, I'd be interested to see how long  (wall clock and
|) processor time)  it would take to sort millions of those values
|) (and I assume that 2 million is quite less than what Heiner had
|) in mind when he said millions).
|)
|) If the OP is looking for duplicates, all one has to do is just
|) sort the file and see if any of the records match the previous
|) one.  No need to do a binary search.   This boils the problem
|) down to having some utility do the work sorting the file, and
|) creating a new file if the old one is to be kept intact.
|) Presumably, let the utility run overnight or at some other
|) "slack" time. __________________________________________Gerard S.

| I think you're overestimating the amount of time needed to sort
| mollions of numbers.  Basically, if it fits in memory, sorting time
| is in the order of minutes, or maybe even seconds.

Yes, and that's a heck of a big IF  (if it fits in memory).  Millions
(let's pick ten as a "good" number) of numbers, each requiring 56 bytes
(as I understood the OP's numbers being 56-byte hex numbers) would
require 1/2 Gigabyte or so of memory.  Even if you could read all those
numbers into real memory, I don't think that it can be done in an order
of minutes, and certainly not seconds.  But, I could be wrong, of
course. __________________________________________________Gerard S.


| If it doesn't, you're best off with a sort that has very good locality
| of reference, or even tailor-make one that sorts in chunks (where each
| chunk fits in memory) and then merges the chunks.
| You can merge the chunks in one go, if you read those in smaller chunks
| so it fits in memory again, and use maybe a heap structure of pointers
| to chunks to always output the smallest value.
|
| I seem to remember there's a sorting challenge where you have to sort
| like billions or trillions of numbers on big hardware.

What size numbers, 32 bits ? _____________________________Gerard S.



0
Reply gerard46 (68) 1/29/2005 5:32:50 PM

Willem wrote:
> 
.... snip ...
> 
> I think you're overestimating the amount of time needed to sort
> mollions of numbers.  Basically, if it fits in memory, sorting time
> is in the order of minutes, or maybe even seconds.

The problem doesn't require sorting.  Here is a test run on my
hashlib package, run on a ten year old 486/80 machine.  The values
to insert came from a 32 bit Mersenne Twister random generator. 
Somewhere around 250,000 entries the system starts to use virtual
memory, IIRC.  I would expect most modern machinery to execute the
same runs in about 1 or 2 seconds.

[1] c:\c\hashlib>timerun hashtest 4 100001
Timer 3 on: 16:24:27
HASHLIB test04

New table, inserting 100001 values
Status: Entries=99999, Deleted=0, Probes=461067, Misses=246947

Walking returned 0
       0 items were inserted 0 times
   99997 items were inserted 1 times
       2 items were inserted 2 times
       0 items were inserted 3 times
       0 items were inserted 4 times
       0 items were inserted 5 times
       0 items were inserted 6 times
       0 items were inserted 7 times
       0 items were inserted 8 times
       0 items were inserted 9 or more times
Suppressing hshkill()

Timer 3 off: 16:24:32  Elapsed: 0:00:04.62

[1] c:\c\hashlib>timerun hashtest 4 1000001
Timer 3 on: 16:24:38
HASHLIB test04

New table, inserting 1000001 values
Status: Entries=999877, Deleted=0, Probes=5861138, Misses=3026865

Walking returned 0
       0 items were inserted 0 times
  999753 items were inserted 1 times
     124 items were inserted 2 times
       0 items were inserted 3 times
       0 items were inserted 4 times
       0 items were inserted 5 times
       0 items were inserted 6 times
       0 items were inserted 7 times
       0 items were inserted 8 times
       0 items were inserted 9 or more times
Suppressing hshkill()

Timer 3 off: 16:25:50  Elapsed: 0:01:12.01

The runs avoided the final freeing of the data entry storage, which
is an O(N*N) operation on many systems.  Finding this out inspired
my nmalloc package for DJGPP, which doesn't have this fault. 
hashtest ? gives help and shows how to avoid this.

Here is a test run showing the effect that free operation can
have.  One run uses the nmalloc package, and one uses the
original.  The time went from 4 seconds to well over 4 minutes. 
The inputs are identical because the random package was not seeded.


[1] c:\c\hashlib>timerun hshtstm 4 100000
Timer 3 on: 16:42:38
HASHLIB test04

New table, inserting 100000 values
Status: Entries=99998, Deleted=0, Probes=461060, Misses=246941

Walking returned 0
       0 items were inserted 0 times
   99996 items were inserted 1 times
       2 items were inserted 2 times
       0 items were inserted 3 times
       0 items were inserted 4 times
       0 items were inserted 5 times
       0 items were inserted 6 times
       0 items were inserted 7 times
       0 items were inserted 8 times
       0 items were inserted 9 or more times

Timer 3 off: 16:42:44  Elapsed: 0:00:05.55

[1] c:\c\hashlib>timerun hashtest 4 100000
Timer 3 on: 16:43:07
HASHLIB test04

New table, inserting 100000 values
Status: Entries=99998, Deleted=0, Probes=461060, Misses=246941

Walking returned 0
       0 items were inserted 0 times
   99996 items were inserted 1 times
       2 items were inserted 2 times
       0 items were inserted 3 times
       0 items were inserted 4 times
       0 items were inserted 5 times
       0 items were inserted 6 times
       0 items were inserted 7 times
       0 items were inserted 8 times
       0 items were inserted 9 or more times

Timer 3 off: 16:47:28  Elapsed: 0:04:20.67


-- 
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson

0
Reply cbfalconer (19194) 1/29/2005 10:41:06 PM

CBFalconer wrote:
) Willem wrote:
)> 
) ... snip ...
)> 
)> I think you're overestimating the amount of time needed to sort
)> millions of numbers.  Basically, if it fits in memory, sorting time
)> is in the order of minutes, or maybe even seconds.
)
) The problem doesn't require sorting.  Here is a test run on my
) hashlib package, run on a ten year old 486/80 machine.  The values
) to insert came from a 32 bit Mersenne Twister random generator. 
) Somewhere around 250,000 entries the system starts to use virtual
) memory, IIRC.  I would expect most modern machinery to execute the
) same runs in about 1 or 2 seconds.

Sorting, when done properly, has a lot better locality of reference.
This means that on problem sizes that vastly outgrow available memory,
the sorting solution should outperform the hashing solution significantly.

If I read your test correctly, you were running with a mere million values.
Now, I don't know how many values the OP has to contend with, but if it's
in the order of billions, I think you're going to run out of core on most
modern desktop machines.

As an aside, have you done test runs with a sorting algorithm on the same
test data sets, to get a comparison ?



SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
0
Reply willem (1478) 1/29/2005 11:19:17 PM

On Sat, 29 Jan 2005 01:32:13 +0100
Heiner Steven <heiner.steven@nexgo.de> wrote:

> Hello,
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
> 
> The input consists of millions of lines like this:
> 
>      0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.
> 
> The file should be read into memory and stored there in a
> memory-efficient way.
> 
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
> 
> What would be a good data-structure for this purpose?
> 
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.
> 
> Heiner
> -- 
>   ___ _
> / __| |_ _____ _____ _ _     Heiner STEVEN <heiner.steven@nexgo.de>
> \__ \  _/ -_) V / -_) ' \    Shell Script Programmers: visit
> |___/\__\___|\_/\___|_||_|   http://www.shelldorado.com/


Any technique used has to compare favourably with the "braindead" approach: load all of the 56-bit entries into an array, sort them and then use something akin to a binary search.

The notable thing here is the memory usage - it's only 56 bits per entry. Compare this to any technique which has to store a value or pointer per entry: that's another 32 bits. Also compare to any tree-like data structure which tries to exploit sparseness: if you store 64 bits per entry at the leaf nodes, and only one entry per leaf node is likely (which it is, given the sparseness of the data you have), then it's already worse than the braindead approach.

The best I've been able to find, playing around in my head, is a small modification to the braindead approach which exploits the large amount of redundancy in the first few bits.

If you have a data structure which does a simple lookup into a table on the value of the first, say, 8 bits, pointing to sorted arrays of 48-bit entries, you can reduce the amount of storage and improve the time to search for an entry. We could do the same process again on the 48-bit entries...

However, if you do this all the way down to the last 20 bits or so, you suddenly start using more memory per entry than the braindead approach again. It would appear that any kind of tree structure applied all the way to the bottom few bits will require more memory than the braindead approach no matter how it's organised.

The clue to the point where it's worth stopping is in the sparseness of the data structure. For each 8-bit lookup table we have 256 pointers. For 'n' entries which are evenly distributed over the whole of the 0 to (2^56)-1 range we can expect n/256 entries for each array off the first 8-bit lookup table, which with 32-bit pointers uses 1KB of memory, and then 6*n octets to store the remaining 48 bits of each entry. We can expect n/65536 entries if we look up the first 16 bits, the lookup table requiring 256KB, then 5*n octets for the remaining 40 bits.

Or, in other words... if we look up the first 'b' bits of the entry, we require 4*2^b + ((56-b)*n)/8 octets of storage. If n = 1 million...

b	storage (octets)
0	7000000	
4	6500064
8	6001024
12	5516384
16	5262144
20	4194304+4500000 ... which is big
24	is going to be much bigger..
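
The table above is just the formula evaluated for n = 1 million; a couple
of lines of code reproduce it if anyone wants to try other values of n or b:

#include <cstdio>

int main()
{
    const double n = 1000000.0;                  /* number of entries */
    for (int b = 0; b <= 24; b += 4) {
        double table = 4.0 * (1u << b);          /* 32-bit pointers in the lookup table */
        double rest  = (56.0 - b) * n / 8.0;     /* remaining (56-b) bits of each entry */
        printf("b=%2d  storage = %.0f octets\n", b, table + rest);
    }
    return 0;
}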

Hopefully my random witterings here might be useful. :P

Ian Woods
0
Reply newspub22 (20) 1/30/2005 12:59:26 AM

The practical problem with the braindead approach is that the guy has
millions of records, and a non-braindead sorting algorithm such as
quick or shell sort has to be used to realistically load the table.

Quick and shell sorting are more psychologically complicated, perhaps,
than open hash.

But if this is only my prejudice, the use of a sort AFTER all entries
are loaded creates a pragmatic issue.

This is that even using a good algorithm there will be a strange pause
after load during sort whereas open hash allows each entry to be loaded
with a time footprint that only degrades slowly.

In the "braindead" approach, you have to wait constant (but large) time
proportional to the size of the data set order 1 while the data set
loads, and a roughly equivalent time during sort until you are
presented with the first useful results.

Whereas an object approach using open hash has the multithread
capability to access items before any drama, before the load is
complete. During the load you can start work because some keys will
already exist, and the purpose of computers is to work us all to death,
isn't it, in parallel.

I am not merely trying to be droll: I am not at all trying to be a
troll. "Brain dead" approaches can be less maintainable and extensible
in reality, even though they are sold as such, because the code, small
and elegant AS CODE, is AS A DYNAMIC PROCESS hard to adapt to real
requirements, such as a sudden rage on the user's part to access the
table while waiting for it to be fully present.

In the brain dead approach, the table must be read-locked until it is
fully loaded and sorted. No such rule applies to open hashing.

Problem breakdown is a Good Thing. But one of the most primitive ways
to implement "separation of concerns" is to create a USELESS data set,
pipe that puppy down the Fordist assembly line to the finisher process,
and only then have something USEFUL at the end of the working day.

Many pipe mechanisms are of course asynchronous, but not this one.

I call this Fordist advisedly although my lawyer wants me to stop :-),
because you cannot let the customer drive the car before finishing it
but data structures are different.

Open hashing creates the final and useful product, lacking only closure
in the form of all entries, from the beginning and that is why it MAY
be useful here. It stores the minimal amount of crap (e.g., zero crap)
per entry.

One negative remains: there is no requirement for deletion apart from
not storing duplicate keys. This means that you do have a strong case
for brain death here.

But my prejudice against "batch" and the step by step production of
Useless Things remains precisely because I recall, with less than
fondness, eternal vigils in mainframe computer rooms, glazedly watching
der blinkenlights.

Of course, real users can switch over to porn sites today or even do
useful work while a "brain dead" approach pipes garbage, only at the
end accomplishing the alchemical step.

But the argument for brain death as simplicity itself has to give way
to the realization that processes also need to be "as simple as
possible, but no simpler".

0
Reply spinoza1111 (3247) 1/30/2005 6:00:36 AM

Sorting, even when done properly, does not have better locality of
reference. In the initial phases of most sorting algorithms, the
algorithms will compare distant entries.

Whereas open hash has high locality UNTIL the table is almost full. In
moving forward and then "wrapping" mod the size of the table, it stays
as close as possible, at all times, to the starting point. Even in a
nearly full hashtable, I believe, the "locality of reference", if
quantified, would prove to be best-fit.

Which means that sorting solutions will cause page faults early on for
large data sets, while open hashing will cause page faults only when
blobs and clusters exist of values with close keys.

But this problem means that the hashing algorithm was probably poorly
chosen with respect to the key values whereas the problem in most
sorting algorithms is ineradicable. Most start with comparing distant
keys.

In fact, that's how the faster sort algorithms get their power, isn't
it, while the most "brain dead" algorithm has in fact high locality. It
causes page faults, but in a regular way.

0
Reply spinoza1111 (3247) 1/30/2005 1:43:36 PM

Chris Williams wrote:
> Heiner Steven wrote:
> 
>>I know how hashing works. But in that case I would have to
>>store all numbers + a hashing table in memory. Storing the
>>numbers in a more concise format would be better.

[...]
> Perhaps a trie? The idea is that you create a special sort of tree with
> a node-type that has 16 slots and a 16-bit flag item at the top.

We already experimented with a trie, but although it reduced
the number of elements stored by a factor of 2.3, the large data
structure overhead nullified the effect.

We used a data structure with 9 bytes per Trie node, and used
one node for each character of a stored item.

> struct trie16 {
> unsigned short itemPointer;
> /* ^ each bit is a boolean value telling us whether or not each below
> pointer is a pointer to a string (1) or a pointer to another trie16
> structure */
> 
> void* _0;
> void* _1;
> void* _2;
> void* _3;
> void* _4;
> void* _5;
> void* _6;
> void* _7;
> void* _8;
> void* _9;
> void* _a;
> void* _b;
> void* _c;
> void* _d;
> void* _e;
> void* _f;
> }
> 
> You then use each of your strings as a "map" to guide you through the
> trie. So if you start with just one empty node, then when you add
> 0X01234567890abcdef01234567890abcd, you will put a pointer to it in the
> _0 slot and set the flag for that item to 1. If we then put in
> 0x11111111111111111111111111111111, this will go in the _1 slot. Then
> we try to add 0x00000000000000000000000000000000, which we will want to
> put in _0, but the slot is already filled.
> First we compare the strings. If they are exactly the same, then we can
> already consider it as stored. Otherwise we create a new node and move
> 0x01234... into the _1 slot of the second note and 0x00000... into the
> _0 slot. The _0 slot of the top node will be pointed to this new node
> and the flag set to 0. Note that is the two values were identical for
> several characters then we might have had to create three or four more
> nodes to extend the trie out to where the difference began.

I see, this would decrease the number of nodes used to store the
data, but use 16*4+2=66 bytes (instead of 9) per node. We need to check
whether this pays off using real data.

> I haven't done enough playing around with structures like this nor
> hashes to know what the relative speed/memory usage rates are like.
> Most programmers advocate hashes and I generally have to assume that
> they know what they are talking about...but just in case here is one
> other thing to look up and try =)

Thank you for your help, every good idea is welcome ;-)

Heiner

0
Reply heiner.steven2 (56) 1/31/2005 10:51:13 AM

spinoza1111@yahoo.com wrote:

> The hash table is far more easily explained...a strong argument in its
> favor. Its downside is that its worst case behavior sucks and is
> equivalent to linear search. Even before approaching worst case,
> unexpectedly patterned input will degrade the hash table.

A hash table can provide fast access, but it doesn't help
with the original data, which consumes too much memory space.

My focus is on a smaller memory footprint while still keeping
reasonable search times.

I'd like to find a way to keep all data in memory.

> Of course, a self-conscious "hash object", being an object whose state
> can readily contain performance information, can "notice" when strange
> clusters and blobs start to appear, by maintaining the maximum
> "reasonable" search length in object state.
> 
> This length should be a percent of the space available so that the code
> exhibits sensitivity to the nonconstant and arbitrary nature of the max
> size of the table.
> 
> When a blob appears, the table can be expanded and reorganized. The
> trouble is the time it takes to reorg, but this only occurs as needed.
> 
> From Cormen's august standpoint, as the author of the formidable
> Introduction to Algorithms, this is ad hoccery...bricolage.
> 
> In my experience with early releases of data bases, such bricolage
> created strange pauses in the processing of user data.
> 
> Therefore, the hash algorithm's applicability needs to be considered
> carefully.

Heiner
0
Reply heiner.steven2 (56) 1/31/2005 10:59:46 AM

Matt Gregory wrote:

> Heiner Steven wrote:
> 
>> Hello,
>>
>> I'm searching for an memory-efficient (and hopefully fast)
>> algorithm for detecting duplicate 56-bit numbers.
>>
>> The input consists of millions of lines like this:
>>
>>     0X01234567890abcdef01234567890abcd
>>
>> Each line is a textual representation of a 56-bit hex number.
>>
>> The file should be read into memory and stored there in a
>> memory-efficient way.
> 
> 
> Read the numbers in and convert to 7 or 8 byte values, whichever
> floats your boat.  Put them in an array and sort them.  Find them
> using binary search.

We already converted them to an internal representation, but still
search for a way to store the numbers in a more concise format.
There is much redundancy in the numbers.

Heiner
0
Reply heiner.steven2 (56) 1/31/2005 12:19:26 PM

Heiner wrote:
) We already converted them to an internal representation, but still
) search for a way to store the numbers in a more concise format.
) There is much redundancy in the numbers.

You want something like a skip list that only stores the differences?


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
0
Reply willem (1478) 1/31/2005 12:52:23 PM

Heiner Steven wrote:
> Matt Gregory wrote:
>> Heiner Steven wrote:
>>>
>>> I'm searching for an memory-efficient (and hopefully fast)
>>> algorithm for detecting duplicate 56-bit numbers.
>>>
>>> The input consists of millions of lines like this:
>>>
>>>     0X01234567890abcdef01234567890abcd
>>>
>>> Each line is a textual representation of a 56-bit hex number.
>>>
>>> The file should be read into memory and stored there in a
>>> memory-efficient way.
>>
>> Read the numbers in and convert to 7 or 8 byte values, whichever
>> floats your boat.  Put them in an array and sort them.  Find them
>> using binary search.
> 
> We already converted them to an internal representation, but still
> search for a way to store the numbers in a more concise format.
> There is much redundancy in the numbers.

Why are you still searching?  I gave you a method, and even an
example of using it.  It seemed to meet all of your concerns.  It
is certainly portable.

-- 
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson

0
Reply cbfalconer (19194) 1/31/2005 1:20:15 PM

CBFalconer wrote:
> Heiner Steven wrote:
> 
>>Matt Gregory wrote:
>>
>>>Heiner Steven wrote:
>>>
>>>>I'm searching for an memory-efficient (and hopefully fast)
>>>>algorithm for detecting duplicate 56-bit numbers.
>>>>
>>>>The input consists of millions of lines like this:
>>>>
>>>>    0X01234567890abcdef01234567890abcd
>>>>
>>>>Each line is a textual representation of a 56-bit hex number.
>>>>
>>>>The file should be read into memory and stored there in a
>>>>memory-efficient way.
>>>
>>>Read the numbers in and convert to 7 or 8 byte values, whichever
>>>floats your boat.  Put them in an array and sort them.  Find them
>>>using binary search.
>>
>>We already converted them to an internal representation, but still
>>search for a way to store the numbers in a more concise format.
>>There is much redundancy in the numbers.
> 
> 
> Why are you still searching?  I gave you a method, and even an
> example of using it.  It seemed to meet all of your concerns.  It
> is certainly portable.

A hashtable does not try to compress the original data.
The binary format of the number needs 16 bytes (not just 56 bits,
as I erroneously said). Since I have literally millions of them,
it would be excellent if there was a way to store them in a more
concise format.

Heiner
0
Reply heiner.steven2 (56) 1/31/2005 3:50:37 PM

Heiner Steven wrote:
> CBFalconer wrote:
>
.... snip ...
>>
>> Why are you still searching?  I gave you a method, and even an
>> example of using it.  It seemed to meet all of your concerns.  It
>> is certainly portable.
> 
> A hashtable does not try to compress the original data.
> The binary format of the number needs 16 bytes (not just 56 bits,
> as I erroneously said). Since I have literally millions of them,
> it would be excellent if there was a way to store them in a more
> concise format.

IIRC you said the datum was a string of hex chars.  Those are
trivially compressed to binary, at 4 bits per char.  If you aren't
capable of that step, I think you are in the wrong profession.  The
interesting part is the detection and ignoring (or counting) of
duplicates in the overall mass.  I demonstrated an efficient way of
doing that.  I have a feeling you never paid any attention, never
downloaded, never ran anything, etc.

-- 
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson

0
Reply cbfalconer (19194) 1/31/2005 4:56:54 PM

Your example number string is 128 bits long but you wrote
that you want to work with 56 bit numbers.  I didn't notice
this at first so the code below was quickly written for
56 byte strings containing ASCII 0-9 and A-F but it shouldn't
take much to convert it to whatever you need.

At any rate, this is probably your best solution. ...


First (for sake of speed and memory) start by packing the
strings (compressing) from 2 bytes to 1.  If you have nothing
but 0-9 + a-f, then you can pack each symbol in 4 bits and
achieve 2:1.

Generate a simple hash function to create a 32 bit value from
the string (now 28 bytes long in packed format).

create a structure as follows

#include <cstring>   /* for memcpy and memcmp */

typedef struct MyStruct
{
    MyStruct(char * data, unsigned int hashedID)
    {
        memcpy(m_value, data, 28);
        m_hashedID = hashedID;
        m_leftChild = m_rightChild = 0;
    }
    char                m_value[28];
    unsigned int        m_hashedID;
    struct MyStruct *   m_leftChild;
    struct MyStruct *   m_rightChild;
} MyStruct;


search code is ...

MyStruct * Find(char * value, MyStruct * & myTree, bool & noFailAdd)
{
    unsigned int bitMask = 1;
    unsigned int valueHashID = MakeHashID(value);
    MyStruct * cur = myTree, * last = 0;

    while (cur)
    {
        if (cur->m_hashedID == valueHashID)
            if (memcmp(value, cur->m_value, 28) == 0)
            {
                // memcmp, not strncmp: the packed data can contain zero bytes.
                // Change to a faster compare if you want.
                noFailAdd = false;
                return cur;
            }
        last = cur;
        if (valueHashID & bitMask)
            cur = cur->m_leftChild;
        else
            cur = cur->m_rightChild;
        bitMask <<= 1;
        if (!bitMask)
            bitMask = 1;
    }

    if (noFailAdd)
    {
        MyStruct * newStruct = new MyStruct(value, valueHashID);
        if (!myTree)
            myTree = newStruct;
        else
        {
            bitMask >>= 1;
            if (bitMask == 0)
                bitMask = 0x80000000;
            if (valueHashID & bitMask)
                last->m_leftChild = newStruct;
            else
                last->m_rightChild = newStruct;
        }
        return newStruct;
    }

    return 0;
}


NOTE: code is out of my head so it may have a bug or two.

Code returns pointer to found record or newly created record
if noFailAdd is true.  Otherwise returns 0 if not found.

This, assuming you have a good hash function, will have an
expected search/add time which is logarithmic.  You won't
get much faster than this.  Just override the new to get
MyStruct structures from a pre-allocated heap and you will
have very good speed.
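
One way to do the "pre-allocated heap" part just mentioned is a little pool
that hands out raw node-sized chunks from big slabs; the NodePool name and
slab size below are my own invention, and the caller constructs nodes in
place with placement new:

#include <cstddef>
#include <new>
#include <vector>

class NodePool {
    std::vector<char *> slabs;
    size_t used;                          /* chunks handed out from the current slab */
    size_t nodeSize;
    enum { PER_SLAB = 4096 };             /* nodes per slab (arbitrary) */
public:
    explicit NodePool(size_t sz) : used(PER_SLAB), nodeSize(sz) {}
    ~NodePool() { for (size_t i = 0; i < slabs.size(); i++) delete [] slabs[i]; }
    void *get()
    {
        if (used == PER_SLAB) {                       /* current slab exhausted */
            slabs.push_back(new char[nodeSize * PER_SLAB]);
            used = 0;
        }
        return slabs.back() + nodeSize * used++;
    }
};

/* usage with the MyStruct above:
       NodePool pool(sizeof(MyStruct));
       MyStruct * n = new (pool.get()) MyStruct(value, valueHashID);
   Nothing is freed one node at a time, which also sidesteps the slow
   teardown mentioned elsewhere in the thread.                          */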


Good luck,
- Michael A Maniscalco


Heiner Steven wrote:
> Hello,
>
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
>
> The input consists of millions of lines like this:
>
>      0X01234567890abcdef01234567890abcd
>
> Each line is a textual representation of a 56-bit hex number.
>
> The file should be read into memory and stored there in a
> memory-efficient way.
>
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
>
> What would be a good data-structure for this purpose?
>
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.
>
> Heiner
> --
>   ___ _
> / __| |_ _____ _____ _ _     Heiner STEVEN <heiner.steven@nexgo.de>
> \__ \  _/ -_) V / -_) ' \    Shell Script Programmers: visit
> |___/\__\___|\_/\___|_||_|   http://www.shelldorado.com/

0
Reply michael191 (12) 1/31/2005 4:58:59 PM

Heiner Steven wrote:
> Chris Williams wrote:
> > Heiner Steven wrote:
> >
> >>I know how hashing works. But in that case I would have to
> >>store all numbers + a hashing table in memory. Storing the
> >>numbers in a more concise format would be better.
>
> [...]
> > Perhaps a trie? The idea is that you create a special sort of tree with
> > a node-type that has 16 slots and a 16-bit flag item at the top.
>
> We already experimented with a trie, but although it reduced
> the number of elements stored by a factor 2.3, the large data
> structure overhead nullified the effect.
>
> We used a data structure with 9 bytes per Trie node, and used
> one node for each character of a stored item.

If you stored one symbol per node you still only need 5 bytes per
symbol, no?  Also, if the numbers are largely redundant, you are
better off storing the whole number in one node.  In your example
you have a 128 bit (16 byte) string containing characters 0-9 and
A-F.  So you can pack them down to 8 bytes total very easily.
If you use a binary tree method that's 16 bytes per unique number.
(20 with the fast solution I posted earlier).
Even if you have a million unique numbers, that's really not a
terribly large memory requirement (16 MB).  But I guess that depends
on your environment.

How many strings will you be handling?  What is the
percentage of repeats?  These are very big factors in developing
the correct solution.

- Michael A. Maniscalco

>
> > struct trie16 {
> > unsigned short itemPointer;
> > /* ^ each bit is a boolean value telling us whether or not each below
> > pointer is a pointer to a string (1) or a pointer to another trie16
> > structure */
> >
> > void* _0;
> > void* _1;
> > void* _2;
> > void* _3;
> > void* _4;
> > void* _5;
> > void* _6;
> > void* _7;
> > void* _8;
> > void* _9;
> > void* _a;
> > void* _b;
> > void* _c;
> > void* _d;
> > void* _e;
> > void* _f;
> > }
> >
> > You then use each of your strings as a "map" to guide you through the
> > trie. So if you start with just one empty node, then when you add
> > 0X01234567890abcdef01234567890abcd, you will put a pointer to it in the
> > _0 slot and set the flag for that item to 1. If we then put in
> > 0x11111111111111111111111111111111, this will go in the _1 slot. Then
> > we try to add 0x00000000000000000000000000000000, which we will want to
> > put in _0, but the slot is already filled.
> > First we compare the strings. If they are exactly the same, then we can
> > already consider it as stored. Otherwise we create a new node and move
> > 0x01234... into the _1 slot of the second node and 0x00000... into the
> > _0 slot. The _0 slot of the top node will then point to this new node,
> > with its flag bit set to 0. Note that if the two values were identical for
> > several characters then we might have had to create three or four more
> > nodes to extend the trie out to where the difference began.
>
> I see, this would decrease the number of nodes used to store the
> data, but use 16*4+2=66 bytes (instead of 9) per node. We need to
check
> whether this pays off using real data.
>
> > I haven't done enough playing around with structures like this nor
> > hashes to know what the relative speed/memory usage rates are like.
> > Most programmers advocate hashes and I generally have to assume that
> > they know what they are talking about...but just in case here is one
> > other thing to look up and try =)
>
> Thank you for your help, every good idea is welcome ;-)
> 
> Heiner

0
Reply michael191 (12) 1/31/2005 6:19:08 PM

On Sat, 29 Jan 2005 17:41:57 +0100, Daniel Lidström wrote:
> 
> On my computer (3GHz P4) the output is this:
> Size of subset: 9952
> Lookup took 0s
> Found 9952 numbers

Update: I can store about 35 million 8-byte numbers in 800Mb of ram.
Looking up 350000 (existing) numbers took negligible time. I'd like to see
how this can be beaten.

-- 
Daniel

0
Reply someone2 (821) 1/31/2005 8:21:42 PM

spinoza1111@yahoo.com wrote:
> The practical problem with the braindead approach is that the guy has
> millions of records, and a non-braindead sorting algorithm such as
> quick or shell sort has to be used to realistically load the table.

Indeed. I implemented a prototype that converted the 32-byte ASCII
format into the corresponding 16-byte binary format. Afterwards
1.7 million numbers (the data of one day) were read into RAM
and sorted using QuickSort. Afterwards I searched the list
using binary search, which is quite fast.

> Quick and shell sorting are more psychologically complicated, perhaps,
> than open hash.
> 
> But if this is only my prejudice, the use of a sort AFTER all entries
> are loaded creates a pragmatic issue.
> 
> This is that even using a good algorithm there will be a strange pause
> after load during sort whereas open hash allows each entry to be loaded
> with a time footprint that only degrades slowly.

The pause for sorting is indeed noticeable, lasting about
5 seconds. But this is only the data of *one day*, and
certainly will get larger.

> In the "braindead" approach, you have to wait constant (but large) time
> proportional to the size of the data set order 1 while the data set
> loads, and a roughly equivalent time during sort until you are
> presented with the first useful results.
> 
> Whereas an object approach using open hash has the multithread
> capability to access items before any drama, before the load is
> complete. During the load you can start work because some keys will
> already exist, and the purpose of computers is to work us all to death,
> isn't it, in paralell.

I agree with you in general, but in this particular case I
will have to read all data before starting to work with it.
"work" in my case means: find duplicates.

[...excursion about the advantages of parallel programming omitted...]

> Open hashing creates the final and useful product, lacking only closure
> in the form of all entries, from the beginning and that is why it MAY
> be useful here. It stores the minimal amount of crap (eg., zero crap)
> per entry.
> 
> One negative remains the case that there is no requirement for deletion
> apart from not storing duplicate keys. This means that you do have a
> strong case for brain death here.
> 
> But my prejudice against "batch" and the step by step production of
> Useless Things remains precisely because I recall, with less than
> fondness, eternal vigils in mainframe computer rooms, glazedly watching
> der blinkenlights.
> 
> Of course, real users can switch over to porn sites today or even do
> useful work while a "brain dead" approach pipes garbage, only at the
> end accomplishing the alchemical step.
> 
> But the argument for brain death as simplicity itself has to give way
> to the realization that processes also need to be "as simple as
> possible, but no simpler".

I agree with you here. As a computer scientist I appreciate
an elegant (even complex) solution to a problem. But
the other, pragmatic half of me knows that a simple solution
can be more (cost or time) efficient.

I'm trying to solve a real-world problem here, and therefore
considerations like the time needed to implement it, or the ease
of maintenance come into play. I consider a solution to be
"good" for our purpose when it enables us to process the large
amounts of data in time.

The longer the discussion goes, the more I get the feeling that
taking the redundancy out of our input data is too expensive
in terms of development time.

But just for the fun of it, did anybody consider solving this
problem by interpreting the number as a 128-bit unsigned
integer, and representing the set of numbers seen using
a sparse bit array?
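
Just to make the thought experiment concrete, a "sparse bit array" over a
128-bit index could store only those 64-bit blocks that actually contain a
set bit, for example keyed in a std::map (all names here are mine):

#include <map>
#include <utility>

typedef unsigned long long uint64;

class SparseBits128 {
    typedef std::pair<uint64, uint64> BlockId;   /* (high word, low word >> 6) */
    std::map<BlockId, uint64> blocks;            /* block id -> 64-bit mask    */
public:
    /* Set the bit for the value (hi, lo); returns true if it was already set. */
    bool test_and_set(uint64 hi, uint64 lo)
    {
        BlockId id(hi, lo >> 6);
        uint64 mask = (uint64)1 << (lo & 63);
        uint64 &block = blocks[id];
        bool seen = (block & mask) != 0;
        block |= mask;
        return seen;
    }
};

Whether this saves anything depends entirely on how clustered the values
are: if most blocks end up holding a single bit, the per-block overhead
dominates and it degenerates into an ordinary map of the values.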

Heiner
0
Reply heiner.steven2 (56) 1/31/2005 8:24:10 PM

Daniel Lidström wrote:
> On Sat, 29 Jan 2005 17:41:57 +0100, Daniel Lidström wrote:
> >
> > On my computer (3GHz P4) the output is this:
> > Size of subset: 9952
> > Lookup took 0s
> > Found 9952 numbers
>
> Update: I can store about 35 million 8-byte numbers in 800Mb of ram.
> Looking up 350000 (existing) numbers took negligible time. I'd like
to see
> how this can be beaten.
>
> --
> Daniel

Ah, the gauntlet is thrown!  What we need is a standard test set:
a file containing the numbers to store.  By the way, my solution
uses 20N bytes, so that's 667.57 MB for 35 million unique numbers, and
it provides logarithmic lookup and insertion times.  I have a
solution I think will be superior still, but it's more complex and
based on PPM models of symbols and contexts.

If anyone is up for the challenge, I'll generate a test set and we
can benchmark the various solutions.  I have a 1GB RAM machine to
run the apps on and will compile them using my C++ compiler to
make sure that the test is fair.  Then I will post the results of
the various solutions under various cases.
Anyone interested in some bragging rights?

- Michael A Maniscalco

0
Reply michael191 (12) 1/31/2005 8:34:01 PM

Also (and I don't know if this is a factor), most STL implementations
of sets use red-black trees, and their performance begins to degrade
when thread-safe insertion is required, because the whole tree might
need to be shifted around to maintain performance.
- Michael A Maniscalco

0
Reply michael191 (12) 1/31/2005 8:41:16 PM

On Sat, 29 Jan 2005 01:32:13 +0100, Heiner Steven
<heiner.steven@nexgo.de> wrote:

>Hello,
>
>I'm searching for an memory-efficient (and hopefully fast)
>algorithm for detecting duplicate 56-bit numbers.
>
>The input consists of millions of lines like this:
>
>     0X01234567890abcdef01234567890abcd
>
>Each line is a textual representation of a 56-bit hex number.
>
>The file should be read into memory and stored there in a
>memory-efficient way.
>
>Afterwards new numbers are read from somewhere else, and
>the program needs to know if the number exists already
>or if it is a new, unique one.
>
>What would be a good data-structure for this purpose?
>
>"gzip" shows that there's a lot of redundancy in the text file
>(factor >20). However, keeping the compressed data in
>memory is no option, because searching it would be too slow.
>
>Heiner

Just a random thought.  How dense are the numbers over the allowed
range?  If the numbers are more than 50% of the range then it would be
easier to store the numbers that are missing, rather than the numbers
that are present.

If they are dense, but not up to 50%, then the suggestion of just
holding differences between numbers is worth investigating.

rossum

--

The ultimate truth is that there is no Ultimate Truth
0
Reply rossum48 (719) 1/31/2005 10:52:50 PM

Heiner Steven wrote:
> 
.... snip ...
> 
> I'm trying to solve a real-world problem here, and therefore
> considerations like the time needed to implement it, or the ease
> of maintenance come into play. I consider a solution to be
> "good" for our purpose when it enables us to process the large
> amounts of data in time.
> 
> The longer the discussion goes, the more I get the feeling that
> taking the redundancy out of our input data is too expensive
> in terms of development time.

I will undertake to produce a program, in portable standard C, that
will input a file of up to 1,000,000 lines of up to 32 hexadecimal
chars and output a similar file with all duplicate lines removed. 
The running time will not exceed 8 times the input/output time
to/from disk files on common file systems (and I expect much
less).  I will charge you 1 day's work at $200 USD per hour, or 1600
USD.  I will deliver the program, in source form, via e-mail,
within one week of the fee being placed in escrow.  You are
entirely responsible for any expenses involved in the escrow
process.  No especial order is required for either the input or
output files.  Note: all strings are to be the same length,
otherwise leading 0 characters are to be ignored.

Does that solve your real world problem?

-- 
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson

0
Reply cbfalconer (19194) 2/1/2005 12:10:15 AM

On Tue, 01 Feb 2005 00:10:15 GMT
CBFalconer <cbfalconer@yahoo.com> wrote:

> Heiner Steven wrote:
> > 
> ... snip ...
> > 
> > I'm trying to solve a real-world problem here, and therefore
> > considerations like the time needed to implement it, or the ease
> > of maintenance come into play. I consider a solution to be
> > "good" for our purpose when it enables us to process the large
> > amounts of data in time.
> > 
> > The longer the discussion goes, the more I get the feeling that
> > taking the redundancy out of our input data is too expensive
> > in terms of development time.
> 
> I will undertake to produce a program, in portable standard C, that
> will input a file of up to 1,000,000 lines of up to 32 hexadecimal
> chars and output a similar file with all duplicate lines removed. 
> The running time will not exceed 8 times the input/output time
> to/from disk files on common file systems (and I expect much
> less).  I will charge you 1 days work at $200 USD per hour, or 1600
> USD.  I will deliver the program, in source form, via e-mail,
> within one week of the fee being placed in escrow.  You are
> entirely responsible for any expenses involved in the escrow
> process.  No especial order is required for either the input or
> output files.  Note: all strings are to be the same length,
> otherwise leading 0 characters are to be ignored.
> 
> Does that solve your real world problem?

Here's some useless information, showing how the timing works out. The 8x I/O-time bound is an interesting one: I'd like to see a sort take that long!

$ time make56bit | load56bit

real    0m2.651s
user    0m2.045s
sys     0m0.046s

-@artanis ~
$ time make56bit | loadandsort56bit

real    0m2.807s
user    0m2.295s
sys     0m0.047s

(I've run these a few dozen times, and these values are typical - sometimes the "real" (i.e. elapsed) time fluctuates as my machine starts doing something else at the same time.)

I did it this way to avoid disk caching considerations, so this is just a pure memory-based I/O path. Any disk activity to read the data will slow things down, but the point I'm trying to make here is that even 10 minutes of programming yields a reasonable implementation of the simple load-and-sort approach that works on a few million 56-bit numbers.

The programs were written in ANSI C. "make56bit" spews out 10 million random 56-bit numbers as plaintext strings preceded by "0x". "load56bit" loads 10 million numbers from standard input into an array of 64-bit numbers. "loadandsort56bit" does the same operation as "load56bit" but then uses qsort (which, on the platform I ran this on, is the glibc-2.x implementation; if memory serves, it is based on an O(n log n) worst-case sort like merge sort rather than quicksort).

Admittedly, my hardware is a bit nippier than most (Athlon64), but even if the sort takes quite a bit more than 0.25 seconds, the I/O time for reading the data will be much larger. This lends itself to a "read a block, sort it, then merge into a large array" process quite nicely, if you have asynchronous I/O... but even this seems overkill to cut the load-and-sort time by less than 10%.

Ian Woods
0
Reply newspub22 (20) 2/1/2005 1:40:06 AM

On Mon, 31 Jan 2005 21:21:42 +0100
Daniel Lidström <someone@microsoft.com> wrote:

> On Sat, 29 Jan 2005 17:41:57 +0100, Daniel Lidström wrote:
> >
> > On my computer (3GHz P4) the output is this:
> > Size of subset: 9952
> > Lookup took 0s
> > Found 9952 numbers
>
> Update: I can store about 35 million 8-byte numbers in 800Mb of ram.
> Looking up 350000 (existing) numbers took negligible time. I'd like to see
> how this can be beaten.
>
> --
> Daniel

Using a flat array of 7-byte values I can store 119 million entries in
800MB. Searching on this is O(log n), which isn't too shoddy, but there are
certainly ways the constant can be reduced. Theoretically though, because
the data is stored quite compactly, the number of actual reads from memory
could very well be lower, and performance improved.

Anyone willing to spend the 15 minutes or so to write the 40 line C program
to do this? :)

Ian Woods
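
A rough sketch of that program, under the original 56-bit assumption (Heiner
later corrects the width to 128 bits): each value is packed into 7 big-endian
bytes so that a plain memcmp() gives numeric ordering, and a lookup is one
bsearch() over the packed, sorted table. The names are illustrative, not from
an actual posted program.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define KEYLEN 7   /* 56 bits, packed */

/* pack the low 56 bits of v into 7 big-endian bytes */
static void pack56(uint64_t v, unsigned char *out)
{
    int i;
    for (i = KEYLEN - 1; i >= 0; i--) {
        out[i] = (unsigned char)(v & 0xFF);
        v >>= 8;
    }
}

/* big-endian packing means byte order == numeric order */
static int cmp7(const void *a, const void *b)
{
    return memcmp(a, b, KEYLEN);
}

/* table: n sorted keys of KEYLEN bytes each, laid out back to back */
static int contains56(const unsigned char *table, size_t n, uint64_t v)
{
    unsigned char key[KEYLEN];
    pack56(v, key);
    return bsearch(key, table, n, KEYLEN, cmp7) != NULL;
}

Building the table is just pack56() per input value followed by one
qsort(table, n, KEYLEN, cmp7); at 7 bytes per entry, 119 million values come
to roughly 800 MB, which matches the figure above.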



0
Reply newspub22 (20) 2/1/2005 1:49:56 AM

Ian Woods wrote:
> CBFalconer <cbfalconer@yahoo.com> wrote:
>> Heiner Steven wrote:
>>>
>> ... snip ...
>>>
>>> I'm trying to solve a real-world problem here, and therefore
>>> considerations like the time needed to implement it, or the ease
>>> of maintenance come into play. I consider a solution to be
>>> "good" for our purpose when it enables us to process the large
>>> amounts of data in time.
>>>
>>> The longer the discussion goes, the more I get the feeling that
>>> taking the redundancy out of our input data is too expensive
>>> in terms of development time.
>>
>> I will undertake to produce a program, in portable standard C, that
>> will input a file of up to 1,000,000 lines of up to 32 hexadecimal
>> chars and output a similar file with all duplicate lines removed.
>> The running time will not exceed 8 times the input/output time
>> to/from disk files on common file systems (and I expect much
>> less).  I will charge you 1 days work at $200 USD per hour, or 1600
>> USD.  I will deliver the program, in source form, via e-mail,
>> within one week of the fee being placed in escrow.  You are
>> entirely responsible for any expenses involved in the escrow
>> process.  No especial order is required for either the input or
>> output files.  Note: all strings are to be the same length,
>> otherwise leading 0 characters are to be ignored.
>>
>> Does that solve your real world problem?
> 
> Here's some useless information, showing how the timing works out.
> The 8x the I/O time is an interesting one: I'd like to see a sort
> take that long!

That's to allow for the fact that his memory may all be virtual and
thrashing.  Remember, I told him how to do it, where the critical
software was, and gave him a timing demonstration of something very
close on my own machine.  He chose to respond with some put-down
and hand waving.  Some quotation about money and mouths goes here.

-- 
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson


0
Reply cbfalconer (19194) 2/1/2005 3:37:08 AM

"Heiner Steven" <heiner.steven@nexgo.de>, haber iletisinde sunlari
yazdi:41FAD98D.7090407@nexgo.de...
> Hello,
>
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
>
> The input consists of millions of lines like this:
>
>      0X01234567890abcdef01234567890abcd
>
> Each line is a textual representation of a 56-bit hex number.
>
> The file should be read into memory and stored there in a
> memory-efficient way.
>
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
>
> What would be a good data-structure for this purpose?
>
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.
>
> Heiner

Sounds like you need something called a Patricia tree. It is both
memory-efficient and fast for searching.
You can store a byte (256 different values) or a hex digit (16 different
values) in each node.
Nodes that have only one child can be collapsed for memory efficiency.
Since you mention the redundancy, it can be expected that there would be
many collapsed nodes.

I would prefer hex digits in each node. Then you need to visit up to 14
nodes if there are no collapsed nodes. I would also consider using a
bitmap for the last node, i.e. a bitmap for the 16 possible values, which
can be stored in 2 bytes. That would also help to store them more
efficiently. In that case you would visit up to 13 nodes and then check
the bitmap.
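
A guess at what that node layout could look like, with the collapsing of
single-child paths left out for brevity; the struct and constants below are
illustrative only, and NIBBLES = 14 follows the 56-bit reading used in this
post (a 128-bit key would be fed in as 32 nibbles instead):

#include <stdint.h>
#include <stdlib.h>

#define NIBBLES 14                  /* hex digits per key (56-bit reading) */

typedef struct tnode {
    struct tnode *child[16];        /* one slot per hex digit */
    uint16_t leaf_bits;             /* bitmap over the final digit */
} tnode;

/* insert the key; returns 1 if it was already present (a duplicate) */
static int trie_insert(tnode *root, uint64_t key)
{
    tnode *n = root;
    uint16_t bit;
    int i;

    for (i = NIBBLES - 1; i >= 1; i--) {       /* all digits but the last */
        unsigned d = (unsigned)(key >> (4 * i)) & 0xF;
        if (n->child[d] == NULL) {
            n->child[d] = calloc(1, sizeof *n);
            if (n->child[d] == NULL)
                abort();                       /* out of memory */
        }
        n = n->child[d];
    }
    bit = (uint16_t)(1u << (key & 0xF));       /* last digit -> bitmap */
    if (n->leaf_bits & bit)
        return 1;                              /* duplicate */
    n->leaf_bits |= bit;
    return 0;
}

Note that at 16 pointers per node this layout is still fairly pointer-heavy,
which is the overhead Heiner reports for his trie elsewhere in the thread;
the bitmap only saves the last level.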


0
Reply aslanski2002 (62) 2/1/2005 9:39:21 AM

CBFalconer wrote:
) Heiner Steven wrote:
)> A hashtable does not try to compress the original data.
)> The binary format of the number needs 16 bytes (not just 56 bits,
)> as I erroneously said). Since I have literally millions of them,
)> it would be excellent if there was a way to store them in a more
)> concise format.
)
) IIRC you said the datum was a string of hex chars.  Those are
) trivially compressed to binary, at 4 bits per char.  If you aren't
) capable of that step, I think you are in the wrong profession.  The
) interesting part is the detection and ignoring (or counting) of
) duplicates in the overall mass.  I demonstrated an efficient way of
) doing that.  I have a feeling you never paid any attention, never
) downloaded, never ran anything, etc.

The string of hex chars that the OP originally quoted happened to
have 32 digits.  Guess how many bytes that takes in binary format.

He's quite clearly saying that there is too much data to hold in memory
comfortably, so anything that adds extra data, such as a hash table, is
even worse.

Something like a trie, where you only store the bits relevant to the tree
depth in the node, would be a start.  Although you need extra space for the
pointers, you save space by joining all the common prefixes together.

A skip list where you only store the deltas could also be a way to do it.

And if you have the list beforehand and want all the duplicates in one go,
sorting the entire list is also a good option, because you can do that in
chunks that do fit in memory comfortably.


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
0
Reply willem (1478) 2/1/2005 11:21:20 AM

rossum wrote:

> Just a random thought.  How dense are the numbers over the allowed
> range?  If the numbers are more than 50% of the range then it would be
> easier to store the numbers that are missing, rather than the numbers
> that are present.

Well, since the 128-bit numbers are in the range from

     0...340282366920938463463374607431768211455 (= 2^128 - 1)

     ~= 3.4 * 10^38

we can safely assume that there will be large "holes" in the
value range ;-)

> If they are dense, but not up to 50%, then the suggestion of just
> holding differences between numbers is worth investigating.

Agreed. This would result in something similar to a linked
list, where each element is specified by its difference from
the previous one. In order not to have to search the table
from the beginning, some additional work would be required,
e.g. a table of entry points.

Heiner
0
Reply heiner.steven2 (56) 2/1/2005 11:52:33 AM

Willem wrote:
> Heiner wrote:
> ) We already converted them to an internal representation, but still
> ) search for a way to store the numbers in a more concise format.
> ) There is much redundancy in the numbers.
> 
> You want something like, a skip list that only stores the differences ?

That certainly would reduce the amount of memory used. The
tradeoff, of course, is search speed.

Heiner
0
Reply heiner.steven2 (56) 2/1/2005 11:54:47 AM

Willem wrote:
> CBFalconer wrote:
> ) Heiner Steven wrote:
> )> A hashtable does not try to compress the original data.
> )> The binary format of the number needs 16 bytes (not just 56 bits,
> )> as I erroneously said). Since I have literally millions of them,
> )> it would be excellent if there was a way to store them in a more
> )> concise format.
> )
> ) IIRC you said the datum was a string of hex chars.  Those are
> ) trivially compressed to binary, at 4 bits per char.  If you aren't
> ) capable of that step, I think you are in the wrong profession.  The
> ) interesting part is the detection and ignoring (or counting) of
> ) duplicates in the overall mass.  I demonstrated an efficient way of
> ) doing that.  I have a feeling you never paid any attention, never
> ) downloaded, never ran anything, etc.
> 
> The string of hex chars that the OP originally quoted happened to
> have 32 digits.  Guess how many bytes that takes in binary format.

Indeed, my initial posting stating 56-bit numbers was wrong(*).
In fact we have 128-bit numbers -> 16 bytes.

> He's quite clearly saying that there is too much data to hold in memory
> comfortably, so anything that adds extra data, such as a hash table, is
> even worse.

Sorry if I implied that there's too much data to hold
in memory. The goal is to keep all numbers in memory, but in
a very memory-efficient way.

> Something like a trie, where you only store the bits relevant to the tree
> depth in the node, would be a start.  Although you need extra space for the
> pointers, you save space by joining all the common prefixes together.

We already implemented a trie, but it didn't save memory
because the size of the node data structures exceeded the savings
from the prefix compression. Of course there may be more efficient
ways of implementing the trie. We used one trie node per character,
where each node identified a common prefix.

> A skip list where you only store the deltas could also be a way to do it.

Agreed.

> And if you have the list beforehand and want all the duplicates in one go,
> sorting the entire list is also a good option, because you can do that in
> chunks that do fit in memory comfortably.

The numbers are read from a file (which serves as a "database"),
and are then present in memory. Afterwards, newly arriving
numbers are looked up in memory one at a time, to ensure they
have not been seen before.


Heiner



(*) one part of the number is a 56-bit time stamp
0
Reply heiner.steven2 (56) 2/1/2005 12:03:36 PM

michael wrote:
> Heiner Steven wrote:
> 
>>Chris Williams wrote:
>>
>>>Heiner Steven wrote:
>>>
>>>
>>>>I know how hashing works. But in that case I would have to
>>>>store all numbers + a hashing table in memory. Storing the
>>>>numbers in a more concise format would be better.
>>
>>[...]
>>
>>>Perhaps a trie? The idea is that you create a special sort of tree with
>>>a node-type that has 16 slots and a 16-bit flag item at the top.
>>
>>We already experimented with a trie, but although it reduced
>>the number of elements stored by a factor 2.3, the large data
>>structure overhead nullified the effect.
>>
>>We used a data structure with 9 bytes per Trie node, and used
>>one node for each character of a stored item.
> 
> If you stored one symbol per node you still only need 5 bytes per
> symbol, no?

Our node data structure looked like this:

#pragma pack(1)
typedef struct node_s {
     unsigned char c;     /* one character of the key */
     struct node_s *next; /* next node in the same level */
     struct node_s *cont; /* first node of the next level */
} node_t;

These are 32-bit pointers.

 > Also, if the numbers are largely redundant, you are
> better off storing the whole number in one node.

That would be a good way to optimize the data structure,
because, as we know, there is much redundancy.

 > In your example
> you have a 128 bit (16 byte) string containing characters 0-9 and
> A-F.  So you can pack them down to 8 bytes total very easily.

We have 32-byte hex strings, which we can convert to 16-byte binary
strings.

> If you use a binary tree method that's 16 bytes per unique number.
> (20 with the fast solution I posted earlier).
> Even if you have a million of unique numbers, that really not a
> terribly large memory requirement (16 MB).  But I guess that depends
> on your environment.
> 
> How many stringw will you be handling?  What is the
> percentage of repeats?  These are very big factors in developing
> the correct solution.

The number of strings may well be >> 50 * 10^6. They are almost
never repeated (in fact they are defined to be unique), but
there *are* duplicates from time to time, and these cause many
problems during later processing.

Heiner
0
Reply heiner.steven2 (56) 2/1/2005 12:14:37 PM

* Heiner Steven:
> Hello,
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
> 
> The input consists of millions of lines like this:
> 
>      0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.

To me that looks like a 128 bit number, i.e. either the example
is rubbish, or the statement is rubbish, or critical information
has been left unstated.

If the latter, what is the relationship between the text file
specification of a number (128 bits), and the number it specifies
(56 bits)?

56 bits leads my thoughts to DES encryption keys.


> The file should be read into memory and stored there in a
> memory-efficient way.

Trivial.  Store the bits.

 
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
> 
> What would be a good data-structure for this purpose?

As an anonymous someone else already remarked, a hash table
is one option.

You then replied

<quote>
   I know how hashing works. But in that case I would have to
   store all numbers + a hashing table in memory. Storing the
   numbers in a more concise format would be better.
</quote>

Well that's rubbish, the overhead is about 70% or so (less
memory overhead = more execution time overhead).

If you don't need to add the new numbers to the collection,
or the number of new numbers is relatively small, you can as
an alternative simply _sort_ the original number set, and then
use a binary search for each new number.

 
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.

Well that's rubbish, you just need to "compress" by converting
the textual specification to binary numbers.

Given the combination of rubbish statements and probable connection
to DES encryption keys, this looks like a teenage cracking effort.

Is it?

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
0
Reply alfps (7389) 2/1/2005 12:15:14 PM

Alf P. Steinbach wrote:
> * Heiner Steven:

[...]
>>I'm searching for an memory-efficient (and hopefully fast)
>>algorithm for detecting duplicate 56-bit numbers.
>>
>>The input consists of millions of lines like this:
>>
>>     0X01234567890abcdef01234567890abcd
>>
>>Each line is a textual representation of a 56-bit hex number.
> 
> To mee that looks like a 128 bit number, i.e. either the example
> is rubbish, or the statement is rubbish, or critical information
> has been left unstated.

My mistake, it *is* a 128 bit number.

> Trivial.  Store the bits.
> 
>>Afterwards new numbers are read from somewhere else, and
>>the program needs to know if the number exists already
>>or if it is a new, unique one.
>>
>>What would be a good data-structure for this purpose?
> 
> As an anonymous someone else already remarked, a hash table
> is one option.
> 
> You then replied
> 
> <quote>
>    I know how hashing works. But in that case I would have to
>    store all numbers + a hashing table in memory. Storing the
>    numbers in a more concise format would be better.
> </quote>
> 
> Well that's rubbish, the overhead is about 70% or so (less
> memory overhead = more execution time overhead).

The list can become quite large (50*10^6 entries), and
therefore consume much memory. On the other hand there's much
redundancy in the numbers: "gzip" can compress the file
containing the numbers to 1/26th of its original size.
This leads me to the conclusion that there must be
memory-efficient ways to store the numbers in memory.
If access times were not a concern, I could just keep the
numbers gzip-compressed in memory.

> If you don't need to add the new numbers to the collection,
> or the number of new numbers is relatively small, you can as
> an alternative simply _sort_ the original number set, and then
> use a binary search for each new number.

We discussed this already; it is in fact a valid solution,
although it does not take advantage of the redundancy in the
numbers to store them more memory-efficiently.

>>"gzip" shows that there's a lot of redundancy in the text file
>>(factor >20). However, keeping the compressed data in
>>memory is no option, because searching it would be too slow.
> 
> Well that's rubbish, you just need to "compress" by converting
> the textual specification to binary numbers.
> 
> Given the combination of rubbish statements and probable connection
> to DES encryption keys, this looks like a teenage cracking effort.

I've been called many things, but being called "teenage" was quite
a long time ago :-)

Additionally, this is no cracking effort; the goal is to ensure
that the 128-bit numbers that are supposed to be unique really
*are* unique and do not break parts of our processing chain.

Heiner
0
Reply heiner.steven2 (56) 2/1/2005 12:41:41 PM

Heiner Steven wrote:
> Chris Williams wrote:
> > Perhaps a trie? The idea is that you create a special sort of tree
with
> > a node-type that has 16 slots and a 16-bit flag item at the top.
>
> We already experimented with a trie, but although it reduced
> the number of elements stored by a factor 2.3, the large data
> structure overhead nullified the effect.
>
> We used a data structure with 9 bytes per Trie node, and used
> one node for each character of a stored item.

Assuming that your data is fairly evenly spread (fairly random that is)
which is what your other posts lead me to believe, we can figure out
fairly precisely how much memory you will need with the trie I
suggested.

To fill one node you will need to insert 16 values (evenly spread). To
fill a second layer of nodes you will need 256. This goes up in the
sequence:

16
256
4096
65536
1048576

So assuming a million values, we will need about 5 layers worth of
nodes. This will be 69905 nodes at 66 bytes each giving us 4,613,730
bytes. In reality there will be some waste so let's make that 5.5
million.
Now, even though some branches might not make it to five and others
might make it to 6 or 7, I would say that the odds that the first four
layers of nodes aren't completely filled are very low. So much so that
we might as well pre-create them and start filling in values from the
fourth row--which also means that we can drop the first four nibbles
(two bytes) of each value. So to store the million values we will need:
(32 - 2) * 1,000,000 = 30,000,000

So with the trie I suggested and a million values we can estimate that
the total memory that should be needed will be 35,500,000 bytes. I note
that this is not a radical change from the blanket 32,000,000 bytes
required to put the full values in memory without the structure.

-Chris

0
Reply thesagerat (67) 2/1/2005 1:12:22 PM

Heiner wrote:
) Willem wrote:
)> Heiner wrote:
)> ) We already converted them to an internal representation, but still
)> ) search for a way to store the numbers in a more concise format.
)> ) There is much redundancy in the numbers.
)> 
)> You want something like, a skip list that only stores the differences ?
)
) That certainly would reduce amount of memory used. The tradeoff
) here of course is the searching speed.

With a skip list you should have pretty decent searching speed.

The idea of a skip list is to have several levels of lists: the lowest
level stores all the nodes in order, and higher levels store every Nth
node (roughly). When searching, you go through the highest level until
you've passed the target, then go down a level and repeat until you
find the number.

So each node, except on the lowest level, has two pointers: one to the
next node on that level, and one to the corresponding node one level
lower.

For memory efficiency, the lowest level could be a flat Huffman-encoded
list of differences, although that would make insertions quite expensive.
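
A stripped-down, static, two-level version of that idea, just to show the
search path (64-bit keys for brevity; a real skip list has more levels, would
delta-code the bottom level as described above, and supports insertion). The
names are illustrative:

#include <stdint.h>
#include <stddef.h>

#define STRIDE 256                   /* one top-level entry per STRIDE keys */

typedef struct {
    const uint64_t *keys;            /* bottom level: all keys, sorted */
    size_t n;
} skipidx;

/* returns 1 if v is present */
static int skip_contains(const skipidx *s, uint64_t v)
{
    size_t block = 0, i, end;

    /* walk the sparse top level: the first key of each stride */
    while ((block + 1) * STRIDE < s->n && s->keys[(block + 1) * STRIDE] <= v)
        block++;

    /* then scan one stride of the bottom level */
    i = block * STRIDE;
    end = (i + STRIDE < s->n) ? i + STRIDE : s->n;
    for (; i < end && s->keys[i] <= v; i++)
        if (s->keys[i] == v)
            return 1;
    return 0;
}

Delta-coding the bottom level would only change the inner scan: instead of
indexing keys[i] directly, you would start from the stride's first (fully
stored) key and add the stored differences as you go.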


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
0
Reply willem (1478) 2/1/2005 1:46:47 PM

* Heiner Steven:
> The list can become quite large (50*10^6 entries), and
> therefore consume much memory. On the other hand there's much
> reduncancy in the numbers, "gzip" can compress the file
> containing the numbers resulting in a file 1/26th of the
> original size. This leads me to the conclusion that there
> must be memory-efficient ways to store the numbers in
> memory. If access times were not a concern, I could just
> keep the numbers gzip-compressed in memory.
> 
> > If you don't need to add the new numbers to the collection,
> > or the number of new numbers is relatively small, you can as
> > an alternative simply _sort_ the original number set, and then
> > use a binary search for each new number.
> 
> We discussed this already, it's in fact a valid solution.
> Although it does not try to take advantage of the redundancy
> in the numbers to store them more memory-efficiently.

Use a B-tree (that's not a binary tree, but a B-tree).  Ungood
to fill up memory because that'll just use disk, via virtual
memory, in a very inefficient way.  Therefore, B-tree: manage
that disk storage yourself, optimized by what you keep in RAM.

Uhm, perhaps this is what databases are for?  Perhaps they even
do that B-tree thing automatically?  Or use even better structures?

Just a random thought.

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
0
Reply alfps (7389) 2/1/2005 3:26:15 PM

Heiner Steven writes:

> The list can become quite large (50*10^6 entries), and
> therefore consume much memory. On the other hand there's much
> reduncancy in the numbers, "gzip" can compress the file
> containing the numbers resulting in a file 1/26th of the
> original size. This leads me to the conclusion that there
> must be memory-efficient ways to store the numbers in
> memory. If access times were not a concern, I could just
> keep the numbers gzip-compressed in memory.

Then you just have to accept uncompressed data.  Compression
(of the zip nature) depends on context and sliding dictionaries.
That's how the compression is achieved--by recognizing and using
redundancy--but it only works given a *set* of data.  There's
no practical way to just make 128-bit numbers smaller, since
you are "using" the entire number.

You can resort to the techniques others have suggested involving
interesting ways of storing the data, but don't assume that just
because you can create a smaller zip file, this translates to
the ability to make the data smaller in some *other* fashion.

0
Reply Chris7 (2511) 2/1/2005 7:28:59 PM

Heiner Steven wrote:
> Hello,
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.

128-bit, according to a later post in this thread.
> The input consists of millions of lines like this:
> 
>     0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.

there are >> 50e6 records.

> The file should be read into memory and stored there in a
> memory-efficient way.
> 
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.

Not clear if the table has to be updated; I'm assuming that either it
does not, or the number of updated values is small compared to the table
and can be handled by keeping the updated values in a separate structure
which is searched separately.

> What would be a good data-structure for this purpose?
>
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.
> 
> Heiner

Obviously, the most memory-efficient way, assuming no compression, is to
store the binary representation of the numbers in sorted order, and
then do a binary search.

But it seems that there is some scope for compression. Unfortunately, it
is not clear where the redundancy is. Based on the posting about
tries, it seems that the redundancy is NOT at the prefix level. It might
be the case that different components of the numbers have different
kinds of redundancy, and should be considered separately. Also, note
that gzip works on 32KB blocks, so temporally adjacent blocks may
have redundancy which may be lost if either the whole file is
considered, or sorted numbers are considered.

In the absence of further information, I would do the following:

Sort the file
Split into blocks of K numbers, say 2000.
Compress each block using a standard algorithm [say gzip].
Build an array of <first number in block, pointer to compressed block>, 
sorted on the first number.

When presented with a number, use a binary search to find the compressed 
block that would contain that number.
Uncompress the block, and do a binary search on the uncompressed values.

You may need to play with the comparison function used to sort the file; 
straightforward numerical ordering may not give the same compressing 
blocks compared to some other function.
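
A rough sketch of that lookup path, assuming zlib's compress2()/uncompress()
as the per-block codec and the 16-byte keys held as pairs of uint64_t; the
structure and function names are illustrative and error handling is minimal:

#include <stdint.h>
#include <stdlib.h>
#include <zlib.h>

#define K 2000                          /* numbers per block */

typedef struct { uint64_t hi, lo; } num128;

typedef struct {
    num128 first;                       /* smallest key in the block */
    unsigned char *zdata;               /* compressed keys */
    uLongf zlen;                        /* compressed size */
    unsigned count;                     /* keys in this block (<= K) */
} cblock;

static int cmp128(const void *pa, const void *pb)
{
    const num128 *a = pa, *b = pb;
    if (a->hi != b->hi) return a->hi < b->hi ? -1 : 1;
    if (a->lo != b->lo) return a->lo < b->lo ? -1 : 1;
    return 0;
}

/* compress one sorted run of 'count' keys into *b */
static int make_block(cblock *b, const num128 *keys, unsigned count)
{
    uLong srclen = (uLong)count * sizeof *keys;

    b->zlen = compressBound(srclen);
    b->zdata = malloc(b->zlen);
    if (b->zdata == NULL)
        return -1;
    if (compress2(b->zdata, &b->zlen, (const Bytef *)keys, srclen,
                  Z_BEST_SPEED) != Z_OK)
        return -1;
    b->first = keys[0];
    b->count = count;
    return 0;
}

/* binary-search the block directory, inflate one block, search it */
static int lookup(const cblock *dir, size_t nblocks, const num128 *v)
{
    num128 buf[K];
    uLongf dstlen = sizeof buf;
    size_t lo = 0, hi = nblocks, i;

    while (lo < hi) {                   /* last block whose first key <= v */
        size_t mid = (lo + hi) / 2;
        if (cmp128(&dir[mid].first, v) <= 0) lo = mid + 1; else hi = mid;
    }
    if (lo == 0)
        return 0;                       /* smaller than everything stored */
    i = lo - 1;

    if (uncompress((Bytef *)buf, &dstlen, dir[i].zdata, dir[i].zlen) != Z_OK)
        return 0;
    return bsearch(v, buf, dir[i].count, sizeof *v, cmp128) != NULL;
}

With K = 2000 a block is 32,000 bytes of uncompressed keys, roughly one gzip
window's worth.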
0
Reply ctips (287) 2/1/2005 9:26:26 PM

On Mon, 31 Jan 2005 12:34:01 -0800, michael wrote:

> If anyone is up for the challenge, I'll generate a test set and we
> can benchmark the various solutions.  I have a 1GB RAM machine to
> run the apps on and will compile them using my C++ compiler to
> make sure that the test is fair.  Then I will post the results of
> the various solutions under various cases.
> Anyone interested in some bragging rights?

Well, I'm up for the challenge. Probably won't come up with anything
fancy, but I'll use what I know of stl.

-- 
Daniel

0
Reply someone2 (821) 2/1/2005 9:36:48 PM

On Tue, 01 Feb 2005 13:14:37 +0100, Heiner Steven wrote:

> The number of strings may well be >> 50 * 10^6. They are almost
> never repeated (in fact they are defined to be unique), but
> there *are* duplicates from time to time that cause much problems
> during later processing

Do you really need real-time access? Otherwise just store the existing
numbers in files, with as many numbers in each file as can be kept in
memory at one time. Perhaps you can hold 10 million in RAM? Then you'll
have 10 files or so for 100 million numbers. A number is unique if it
doesn't exist in any of the files. Would that do?

-- 
Daniel

0
Reply someone2 (821) 2/1/2005 9:43:25 PM

Heiner Steven wrote:
> Hello,
> 

> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
> 

If you could accumulate this input into "batches", you could
sort this input, and reduce the problem to "walking two ordered lists".

BTW your "wording" of the problem is a bit strange, IMHO.
I think it has nothing to do with duplicates and could be rephrased as:
find whether a set of numbers exists in a (larger) set.
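
Assuming both the existing set and an accumulated batch of new numbers are
already sorted arrays of 16-byte values, that walk might look roughly like
this (types and names are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

typedef struct { uint64_t hi, lo; } num128;

static int cmp128(const num128 *a, const num128 *b)
{
    if (a->hi != b->hi) return a->hi < b->hi ? -1 : 1;
    if (a->lo != b->lo) return a->lo < b->lo ? -1 : 1;
    return 0;
}

/* one linear pass over both sorted arrays; prints every number of the
   batch that is already present in the existing set */
static void report_existing(const num128 *old, size_t nold,
                            const num128 *batch, size_t nbatch)
{
    size_t i = 0, j = 0;

    while (i < nold && j < nbatch) {
        int c = cmp128(&old[i], &batch[j]);
        if (c < 0)
            i++;
        else if (c > 0)
            j++;
        else {
            printf("already seen: %016llx%016llx\n",
                   (unsigned long long)batch[j].hi,
                   (unsigned long long)batch[j].lo);
            i++;
            j++;
        }
    }
}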

>
> Heiner

HTH,
AvK
0
Reply moi 2/2/2005 10:44:58 AM

CTips wrote:

> Sort the file
> Split into blocks of K numbers, say 2000.
> Compress each block using a standard algorithm [say gzip].
> Build an array of <first number in block, pointer to compressed block>, 
> sorted on the first number.
> 
> When presented with a number, use a binary search to find the compressed 
> block that would contain that number.
> Uncompress the block, and do a binary search on the uncompressed values.
> 
> You may need to play with the comparison function used to sort the file; 
> straightforward numerical ordering may not give the same compressing 
> blocks compared to some other function.

<gag> that last sentence should read 'may not give the same 
_compression_ _ratio_ compared to some other function'

Anyway, I did some measurements using compress2/uncompress from zlib as 
the compression engine. I filled in an array with a random byte sequence 
then repeated it 11 times to get ~10x compression. I then decompressed 
the sequence 100,000 times. Decompression times are variable.

Results:
-  For a 32KB block, I observed a peak time of 115us/block.
-  It scales sub-linearly with block sizes below 32KB
	* 16 KB - 67us
	* 8KB  - 38 us
- It scales linearly with block sizes above 32 KB

I'm using a dual processor 3.06GHz Xeon with 2GB of memory
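
For what it's worth, a guess at what such a measurement loop might look like
(not the code actually used above): build a compressible 32KB buffer, compress
it once with compress2(), then time repeated uncompress() calls with clock():

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

#define BLOCK 32768
#define REPS  100000

int main(void)
{
    static unsigned char src[BLOCK], dst[BLOCK];
    static unsigned char zbuf[BLOCK + BLOCK / 10 + 64];
    uLongf zlen = sizeof zbuf;
    clock_t t0, t1;
    int i;

    /* one random 1/11th of the block, repeated, gives ~10x compressibility */
    srand(1);
    for (i = 0; i < BLOCK / 11; i++)
        src[i] = (unsigned char)(rand() & 0xFF);
    for (i = BLOCK / 11; i < BLOCK; i++)
        src[i] = src[i % (BLOCK / 11)];

    if (compress2(zbuf, &zlen, src, BLOCK, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;

    t0 = clock();
    for (i = 0; i < REPS; i++) {
        uLongf dlen = sizeof dst;
        if (uncompress(dst, &dlen, zbuf, zlen) != Z_OK)
            return 1;
    }
    t1 = clock();

    printf("%.1f us per %d-byte block\n",
           (double)(t1 - t0) * 1e6 / CLOCKS_PER_SEC / REPS, BLOCK);
    return 0;
}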

BTW: another thought - if you read in the numbers unsorted, then sort 
them, then compress them, your peak memory usage will not change; you 
should sort the file using a separate program, read in the sorted file 
block-by-block and compress each block before reading in the next one.

It would be interesting to get answers to the following questions:
- how much time can you afford per search?
- have you ever tried to take a file, sort it, convert it into 16 byte 
numbers (i.e. not characters) and then compressed it? What compression 
did you achieve?
0
Reply ctips (287) 2/2/2005 3:18:58 PM


Heiner Steven wrote:
> spinoza1111@yahoo.com wrote:
> > The practical problem with the braindead approach is that the guy
has
> > millions of records, and a non-braindead sorting algorithm such as
> > quick or shell sort has to be used to realistically load the table.
>
> Indeed. I implemented a prototype that converted the 32 byte ascii
> format into the corresponding 16 byte binary format. Afterwards
> 1,7 Million numbers (the data of one day) was read into RAM,
> and sorted using QuickSort. Afterwards I searced the list
> using binary search, which is quite fast.
>
> > Quick and shell sorting are more psychologically complicated, perhaps,
> > than open hash.
> >
> > But if this is only my prejudice, the use of a sort AFTER all entries
> > are loaded creates a pragmatic issue.
> >
> > This is that even using a good algorithm there will be a strange pause
> > after load during sort whereas open hash allows each entry to be loaded
> > with a time footprint that only degrades slowly.
>
> The pause for sorting is indeed noticable, lasting about
> 5 seconds. But this is only the data of *one day*, and
> certainly will get larger.

Hmm. I suppose you could in future scroll ads or something. Seriously,
this was my concern, but it is your call if the time is acceptable. It
does seem unbounded by anything in particular.
>
> > In the "braindead" approach, you have to wait constant (but large) time
> > proportional to the size of the data set order 1 while the data set
> > loads, and a roughly equivalent time during sort until you are
> > presented with the first useful results.
> >
> > Whereas an object approach using open hash has the multithread
> > capability to access items before any drama, before the load is
> > complete. During the load you can start work because some keys will
> > already exist, and the purpose of computers is to work us all to death,
> > isn't it, in paralell.
>
> I agree with you in general, but in this particular case I
> will have to read all data before starting to work with it.
> "work" in my case means: find duplicates.

I think this will embed a drawback, seriously. Basically, you seem to
be giving duplicates free room and board. With a hash based load you do
not.

>
> [...excursion about the advantages of parallel programming omitted...]

I do run on, don't I? But this was a good faith excursus in real world
terms.
>
> > Open hashing creates the final and useful product, lacking only closure
> > in the form of all entries, from the beginning and that is why it MAY
> > be useful here. It stores the minimal amount of crap (eg., zero crap)
> > per entry.
> >
> > One negative remains the case that there is no requirement for deletion
> > apart from not storing duplicate keys. This means that you do have a
> > strong case for brain death here.
> >
> > But my prejudice against "batch" and the step by step production of
> > Useless Things remains precisely because I recall, with less than
> > fondness, eternal vigils in mainframe computer rooms, glazedly watching
> > der blinkenlights.
> >
> > Of course, real users can switch over to porn sites today or even do
> > useful work while a "brain dead" approach pipes garbage, only at the
> > end accomplishing the alchemical step.
> >
> > But the argument for brain death as simplicity itself has to give way
> > to the realization that processes also need to be "as simple as
> > possible, but no simpler".
>
> I agree with you here. As a computer scientist I appreciate
> an elegant (even complex) solution to a problem. But
> the other, pragmatic half of me knows that a simple solution
> can be more (cost or time) efficient.
>
> I'm trying to solve a real-world problem here, and therefore
> considerations like the time needed to implement it, or the ease
> of maintenance come into play. I consider a solution to be
> "good" for our purpose when it enables us to process the large
> amounts of data in time.
>
> The longer the discussion goes, the more I get the feeling that
> taking the redundancy out of our input data is too expensive
> in terms of development time.
>
The open hash solution is simplicity itself, but it APPEARS to be new
terrain for you. One thing we must do, whether as technicians or
scientists, is perform without passion the self-reflexive task of
deciding where to invest our time; through a combination of vanity and
humility, we often aren't honest with ourselves. The "complexity" is
the distance between what we know and what we don't, and thought itself
is commodified.

I am myself trying to bail out of .Net in favor of Open Source, which is
a "simpler" terrain but makes demands upon me.

> But just for the fun of it, did anybody consider solving this
> problem by interpreting the number as a 128-bit unsigned
> integer, and representing the set of numbers seen using
> a sparse bit array?
>
I sure did not. But such an array is based, isn't it, either on hash or
else a list of occupied zones searched in O(n/2) time. 
> Heiner

0
Reply spinoza1111 (3247) 2/4/2005 3:05:54 AM

Heiner Steven <heiner.steven@nexgo.de> writes:

> Hello,
> 
> I'm searching for an memory-efficient (and hopefully fast)
> algorithm for detecting duplicate 56-bit numbers.
> 
> The input consists of millions of lines like this:
> 
>      0X01234567890abcdef01234567890abcd
> 
> Each line is a textual representation of a 56-bit hex number.
> 
> The file should be read into memory and stored there in a
> memory-efficient way.
> 
> Afterwards new numbers are read from somewhere else, and
> the program needs to know if the number exists already
> or if it is a new, unique one.
> 
> What would be a good data-structure for this purpose?
> 
> "gzip" shows that there's a lot of redundancy in the text file
> (factor >20). However, keeping the compressed data in
> memory is no option, because searching it would be too slow.

Emended:  128 bit numbers.

Amended:  at least 50,000,000 numbers

There's still a lot that isn't said.  How often do "new" numbers come
in?  Is it important to do these immediately they arrive, or can some
quantity of new numbers be "batched" for later testing?  What kind of
throughput is needed?  How much memory is available?  Is there some
upper bound on the size of the "old number" set?  How often do "new"
numbers need to get added to the set of "old" numbers?  Is anything
known about the nature of the redundancy of the number sets (beyond
what the gzip compression factor is)?  If so what is it?

So, let me state some assumptions, and see if we can come up with
something.

There's a gigabyte of (physical) RAM available.  A reasonable upper
bound on the size of the "old number" set is 200,000,000 numbers.  New
numbers need to be tested right away.  The throughput needed is
testing 100 new numbers per second;  sometimes they come in faster
than that, but the average rate is a fair bit lower, so processing 100
new numbers per second will run at a duty cycle of less than 50%.  New
numbers need to be added to the old number set once per day (until
they are added, a small separate "new number" set is maintained using
more conventional means).

So here are some ideas (more accurately, some variations on
basically a single idea):

1. Split the (old) numbers into groups based on their high order
   sixteen bits.  Use gzip on each group.  When testing a new
   number, use the high order sixteen bits to choose the group,
   gunzip that group, and search in that group (using, eg, a binary
   search).  (The array holding addresses for start and end of
   each group takes negligible memory).

1a. Like (1), except the numbers stored in each group store only
    the low order fourteen bytes before gzip'ping.

2. Sort the (old) numbers, then split into 65,536 equal sized
   groups.  Remember the smallest number in each group in an array.
   As with 1, use gzip on each group.  When testing a new number,
   do a binary search on the "smallest number in each group array"
   to find the group, gunzip that group and search in it.

2a. Like (2), except store the offset from the base of the group
    before gzip'ping.

2b. Like (2), except store the delta from the previous number
    (in some sort of variable length encoding scheme) before
    gzip'ping.  Use gunzip followed by expanding the variable 
    length encoding to a fixed length encoding so binary search 
    can be done.

3. Use a 16 bit hash function to assign numbers to one of 65,536
   groups.  Would need to look for a hash function that does a fair
   job of distributing numbers evenly into the 65,536 "buckets".
   Use gzip on each group.  When testing a new number, compute the
   hash function to choose the group, then gunzip that group and
   search in it.

Assuming the groups are not hugely different in size (within a
factor of two, say), and the grouping scheme doesn't mess up the
gzip results too badly, some simple measurements here on an
Athlon XP2100 suggest these schemes can easily test at least
100 numbers per second.  If the gzip gets a compression of a
factor of 4 (much less than the factor of 20 for the entire
set), then 200,000,000 * 16 / 4 = 800,000,000 bytes is still
comfortably less than 1 gigabyte.

If that's not good enough, I recommend buying more memory. :)
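
The variable-length delta coding of scheme 2b might look roughly like the
sketch below. It assumes a compiler that provides unsigned __int128 (a
GCC/Clang extension) purely to keep the 128-bit arithmetic readable, and uses
the common base-128 "varint" format; as noted above, the deltas have to be
expanded back to fixed-size keys before a binary search can be run on a group:

#include <stdint.h>
#include <stddef.h>

typedef unsigned __int128 u128;        /* GCC/Clang extension */

/* write delta as 7-bit groups, low group first, top bit = "more follows";
   returns the number of bytes written (at most 19 for a 128-bit value) */
static size_t put_varint(u128 delta, unsigned char *out)
{
    size_t n = 0;
    do {
        unsigned char byte = (unsigned char)(delta & 0x7F);
        delta >>= 7;
        if (delta != 0)
            byte |= 0x80;
        out[n++] = byte;
    } while (delta != 0);
    return n;
}

/* read one varint and add it to *prev, recovering the next key in the
   sorted group; returns the number of bytes consumed */
static size_t get_varint(const unsigned char *in, u128 *prev)
{
    u128 delta = 0;
    unsigned shift = 0;
    size_t n = 0;
    unsigned char byte;

    do {
        byte = in[n++];
        delta |= (u128)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);

    *prev += delta;
    return n;
}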
0
Reply txr (1104) 2/4/2005 8:52:36 AM
comp.programming 10672 articles. 18 followers. Post

57 Replies
154 Views

Similar Articles

[PageSpeed] 54


  • Permalink
  • submit to reddit
  • Email
  • Follow


Reply:

Similar Artilces:

Finding the highest bit set in a 32-bit number
Hello All, I have coded the following function but I am not happy with the way it terminates the loop. If you can suggest some improvements or have a better version of your own, please let me know. uint FindHighestBit(uint32 n) { uint bits = 16; uint nToShift = 8; uint32 mask = 0xFFFF0000; while (1) { if ( n&mask ) { bits += nToShift; mask <<= nToShift; } else { bits -= nToShift; mask >>= nToShift; } if (nToShift > 1) n...

A algorithm for finding the number
Hi members, Maybe this is not a special question to ruby. But since my ruby app has meet this algorithm problem, so please give the suggestion if you'd like. I have these datas in a mysql table: mysql> select startNum,endNum from ip_data limit 10; +----------+----------+ | startNum | endNum | +----------+----------+ | 16777216 | 16777471 | | 16843008 | 16843263 | | 16909056 | 16909311 | | 17367040 | 17498111 | | 17498112 | 17563647 | | 17563648 | 17825791 | | 17825792 | 18153471 | | 18153472 | 18219007 | | 18219008 | 18350079 | | 18350080 | 18874367 | +-------...

Finding number of bits of integer
Hi, I do need to implement something similar to C++'s std::bitset in C; for this, I use an array of int's to get together any desired number of bits, possibly larger than 32/64 or anything like this. So of course it does not matter how many bits a single int has, but I do need a reliable way to find it out. I think of something like CHAR_BITS*sizeof(int) will do the trick, am I right here? I'm just confused that it is *CHAR*_BITS; in reference to the usual example, there are some machines where char's have 9 bits--but is in this case int required to have some multip...

Need help in finding an efficient algorithm
Need help in finding an efficient algorithm for the following problem. There are tasks which have predecessor tasks which needs to be completed before taking up the new task. Our aim is to find the required order of the tasks, so that the pre-requisite tasks are always completed before the task. For example TaskA needs taskJ and taskK TaskB needs none TaskC needs TaskA, taskK taskD needs taskB taskJ needs taskB taskK needs none There is no cyclical dependency and all the tasks take same amount of time A valid answer can be taskB taskK taskJ taskD taskC taskA <robertday5@gmail.com> ...

where to find all the 512 bits prime numbers?
My current project happens to need me to find all the 512 bits prime numbers.Where could I get them? Thanks very much. Fox Autumn-Fox wrote: > My current project happens to need me to find all the 512 bits prime > numbers.Where could I get them? Has someone also asked you for the stripy paint? I guess there's around 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 of them. Give or take. (Actually I can't remember the formulas for estimating the number of prim...

how to efficiently find line number k
I have a big text file with millions of lines, and given any number k as input, I want to output line k. What is the most efficient way to do this, other than checking end-of-line k times? Thanks b83503104 wrote: > I have a big text file with millions of lines, and given any number k > as input, I want to output line k. What is the most efficient way to > do this, other than checking end-of-line k times? > Thanks No matter how you do it you will have to read every char before it. BTW this Q is OT b83503104@yahoo.com (b83503104) wrote: > I have a big text file with mi...

Find duplicate entries in STL algorithm
Hi, Is there a STL algorithm which helps me in finding duplicate entries from a list? and it allows me to plug in a binary function as the match criteria? Thank you. silverburgh.meryl@gmail.com wrote: > Is there a STL algorithm which helps me in finding duplicate entries > from a list? > and it allows me to plug in a binary function as the match criteria? What book on Standard Library are you reading? Create a copy of your list, sort it, then enumerate the elements and count the adjacent equal elements using 'adjacent_find'. Example of that (or something similar, anyway...

Algorithm for finding position of most significant bit of integer
I dunno if this is new or not. I've only found one other algorithm that does a binary search, and it doesn't do it quite like this. Assuming 32-bit unsigned integers, the pseudocode is: msbpos(val) { if ( val == 0 ) return( -1 ) lsb = 0 msb = 32 while ( lsb+1 != msb ) { mid = (lsb + msb) / 2 if ( val & (0xFFFFFFFF << mid) ) lsb = mid else msb = mid } return( msb ) } This gives the result as 1..32; subtract one if 0..31 is better. I stole the form of the search from Bentley's "Progra...

Efficient way to count the number of bits which are set in a value
Hi, What is the most efficient way in C language to count the number of bits which are set in a value ? Thx in advance, Karthik Balaguru "karthikbg" <karthik.balaguru@lntinfotech.com> writes: > What is the most efficient way in C language to count the number of > bits which are set in a value ? The most efficient way is to use google: bit count -- __Pascal Bourguignon__ http://www.informatimago.com/ "You cannot really appreciate Dilbert unless you read it in the original Klingon" Pascal Bourguignon wrote: > "karthikbg" ...

Is there any efficient algorithm to find all paths between a pair of nodes in a DAG?
Hi, I am looking for an algorithm to find all paths between 2 nodes in a DAG, which does not contain any directed cycle. The graph has a special property that it only have a single begin node, which is the only node without any incoming edges, and a single sink node, which is the only node without any outgoing edges. All paths begin from the [begin] node will finally reach the [sink] node. I want to find all paths between the begin node and the sink node. I only know an DFS-like algorithm as the following: dfs_all (node, path) { path = path || node if (node == sink) { pr...

Finding number of records in a database table from SAS efficiently
Hello, I am trying to find the efficient way to find the number of rows in a database table (DB2) efficiently using SAS. I tried ATTRN function, not working when I assigned a library to the schema. I tried END option in datastep, taking too long time (CPU Time:3.4 minutes, REAL time: 17 min for about 132 million records). I assume other ways i know of would take same amount of time. Please suggest the best and efficient way to find out the number of observations. Thanks On 07/09/2012 16:18, sas.consultant.ad@gmail.com wrote: > Hello, > > I am trying to find...

How to find the number of record of a database table from SAS efficiently
Hello, I am trying to find the efficient way to find the number of rows in a database table (DB2) efficiently using SAS. I tried ATTRN function, not working when I assigned a library to the schema. I tried END option in datastep, taking too long time (CPU Time:3.4 minutes, REAL time: 17 min). I assume other ways i know of would take same amount of time. Please suggest the best and efficient way to find out the number of observations. Thanks Hello again, I forgot to mention the number of records for the CPU and real times with END option; it is about 132 million. So I...

Efficient way to count the number of bits which are set in a value
Hi, What is the most efficient way in C language to count the number of bits which are set in a value ? Thx in advance, Karthik Balaguru "karthikbg" <karthik.balaguru@lntinfotech.com> writes: > What is the most efficient way in C language to count the number of > bits which are set in a value ? The most efficient way is to use google: bit count -- __Pascal Bourguignon__ http://www.informatimago.com/ "You cannot really appreciate Dilbert unless you read it in the original Klingon" Pascal Bourguignon wrote: > "karthikbg" ...

Efficient way of finding most-significant non-zero bit of integer
Hi all -- In writing some logic code for an octree, I've come across a situation where I have to find the most-significant non-zero bit of an integer. I can see how this might be done via a brute-force scan through the bits: integer :: mynumber,i do i = BIT_SIZE(i),1,-1 if(BTEST(mynumber,i-1)) exit enddo But can anyone suggest a more-efficient way of doing this? Essentially, I'm looking for an integer log_2() function. cheers, Rich Rich Townsend wrote: > Hi all -- > > In writing some logic code for an octree, I've come across a situation > where I have to...
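Besides the binary search shown earlier in this listing, another common technique (not from this thread) is to smear the top bit rightwards and then count the set bits, which gives an integer log2 without a per-bit scan. The sketch below is editorial and assumes 32-bit values.

    #include <stdio.h>
    #include <stdint.h>

    static int bitcount32(uint32_t x)
    {
        int n = 0;
        while (x) { x &= x - 1; n++; }
        return n;
    }

    static int ilog2(uint32_t x)   /* returns 0..31, or -1 for x == 0 */
    {
        if (x == 0)
            return -1;
        x |= x >> 1;               /* after these shifts, every bit below */
        x |= x >> 2;               /* the MSB is set as well              */
        x |= x >> 4;
        x |= x >> 8;
        x |= x >> 16;
        return bitcount32(x) - 1;
    }

    int main(void)
    {
        printf("%d %d %d\n", ilog2(1), ilog2(0x00400000u), ilog2(0));
        /* expected output: 0 22 -1 */
        return 0;
    }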

32-bit Number * 32-bit Number = 64-bit result
I need to do multiplication that mirrors how C handles integer multiplication. That is, one 32-bit number times another 32-bit number equals the least significant 32-bits of the 64-bit result. In theory, I could mimic that behavior like so: (x * x) & 0xFFFFFFFF But this doesn't work quite right because the initial multiplication can produce such a large result that JavaScript looses some precision. Any thoughts? Ideas? "Jeff.M" <Mott.Jeff@gmail.com> writes: > I need to do multiplication that mirrors how C handles integer > multiplication. That is, one 32-bit...
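A hedged sketch in C (the question itself is about JavaScript): the low 32 bits of a 32x32-bit product can be obtained either by letting unsigned 32-bit arithmetic wrap, or by splitting the operands into 16-bit halves so that no intermediate product needs more than 32 bits. The half-splitting decomposition is the idea that also carries over to JavaScript's doubles; the names and test values below are mine.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    static uint32_t mul_lo32(uint32_t a, uint32_t b)
    {
        uint32_t ah = a >> 16, al = a & 0xFFFFu;
        uint32_t bh = b >> 16, bl = b & 0xFFFFu;
        /* a*b = (ah*bh)*2^32 + (ah*bl + al*bh)*2^16 + al*bl;
           the first term vanishes mod 2^32 */
        return ((ah * bl + al * bh) << 16) + al * bl;
    }

    int main(void)
    {
        uint32_t a = 0xDEADBEEFu, b = 0x12345678u;
        uint64_t full = (uint64_t)a * b;    /* exact 64-bit product */

        printf("full   = 0x%016" PRIX64 "\n", full);
        printf("low 32 = 0x%08"  PRIX32 "\n", mul_lo32(a, b));
        printf("match  : %s\n", mul_lo32(a, b) == (uint32_t)full ? "yes" : "no");
        return 0;
    }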

find digit length or the number of numbers in number ?
Hi, I am trying to find out how many digits there are in a given number. The macro listed below works fine when applied to an INT, however when doing Doubles with numbers > then a billion ?? It stops working. Anyone any idea's ? Thankx ! Steven. #include <stdio.h> #include <math.h> #define DIGLEN(x) (x ? (int)(log10((double)(abs(x)))) + 1 : 1) int main(int argc, char *argv[]) { int i = 123; double j = 807319385.29; double k = 12258983401.75; printf("%3d: %d\n", DIGLEN(i), i); printf("%3d: %.2f\n", DIGLEN(j), j); printf("%3d: %.2f\n&q...
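The macro quoted above breaks for large doubles mainly because abs() takes an int, which overflows past roughly two billion. A hedged editorial sketch of a function-based alternative using fabs() and floor() (names and test values are mine):

    #include <stdio.h>
    #include <math.h>

    static int diglen(double x)   /* digits before the decimal point */
    {
        x = fabs(x);
        return (x < 1.0) ? 1 : (int)floor(log10(x)) + 1;
    }

    int main(void)
    {
        printf("%d\n", diglen(123));              /* 3  */
        printf("%d\n", diglen(807319385.29));     /* 9  */
        printf("%d\n", diglen(12258983401.75));   /* 11 */
        return 0;
    }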

How to combine two 8 bit numbers to make a 16 bit number??
Hi, my electronics ADC outputs the result of the conversion over two bytes because it is a 10 bit number. This is then fed into Labview via the serial port. I have the two bytes stored in the various elements of an unbundle block diagram. My current vi has the two bytes stored separately. How may I combine the two together? So far I have: 1st byte sent: (MSB)bit7/bit6/bit5/bit4/bit3/bit2/bit1/bit0(LSB), 2nd byte sent: (MSB)X/X/X/X/X/X/bit1/bit0(LSB). I can send the two bytes fr...
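The underlying bit manipulation is a shift plus an OR, whatever the environment. A hedged sketch in C (the thread itself is about LabVIEW); for a 10-bit ADC only the low two bits of the second byte matter, and the function name and sample values are mine.

    #include <stdio.h>
    #include <stdint.h>

    static uint16_t combine10(uint8_t low_byte, uint8_t high_byte)
    {
        return (uint16_t)(((high_byte & 0x03u) << 8) | low_byte);
    }

    int main(void)
    {
        /* example: high byte 0x02, low byte 0xA5  ->  0x2A5 = 677 */
        printf("0x%03X\n", combine10(0xA5, 0x02));
        return 0;
    }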

please any one help me in dividing 1024 bit number with 1024 bit another number
hi , Can any one send me some process of dividing 1024 bit number with another 1024 bit number there is one algorithm developed by Knuth iam unable to follow that algorithm can any one help me please by providing a understanding document on it please thank you -- comp.lang.c.moderated - moderation address: clcm@plethora.net On 30 Jun 2003 23:15:21 GMT, siddamshetty <siddamshetty@yahoo.com> wrote: > hi , > > Can any one send me some process of dividing 1024 bit number with > another 1024 bit number there is one algorithm developed by Knuth iam &...
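What follows is not Knuth's Algorithm D but the much simpler (and much slower) shift-and-subtract long division, shown only to illustrate the idea of dividing multi-word numbers. It is an editorial sketch; the "big" type, word count, and function names are all made up, with 1024-bit values stored as 32 little-endian 32-bit words.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define WORDS 32                                 /* 32 * 32 bits = 1024 bits */

    typedef struct { uint32_t w[WORDS]; } big;       /* w[0] = least significant */

    static int  big_getbit(const big *a, int i) { return (a->w[i / 32] >> (i % 32)) & 1u; }
    static void big_setbit(big *a, int i)       { a->w[i / 32] |= 1u << (i % 32); }

    static void big_shl1(big *a)                     /* a <<= 1 */
    {
        uint32_t carry = 0;
        for (int i = 0; i < WORDS; i++) {
            uint32_t next = a->w[i] >> 31;
            a->w[i] = (a->w[i] << 1) | carry;
            carry = next;
        }
    }

    static int big_cmp(const big *a, const big *b)
    {
        for (int i = WORDS - 1; i >= 0; i--)
            if (a->w[i] != b->w[i])
                return a->w[i] > b->w[i] ? 1 : -1;
        return 0;
    }

    static void big_sub(big *a, const big *b)        /* a -= b, assumes a >= b */
    {
        uint64_t borrow = 0;
        for (int i = 0; i < WORDS; i++) {
            uint64_t d = (uint64_t)a->w[i] - b->w[i] - borrow;
            a->w[i] = (uint32_t)d;
            borrow = (d >> 63) & 1u;                 /* 1 if the subtraction wrapped */
        }
    }

    /* q = n / d, r = n % d (d must be non-zero) */
    static void big_divmod(const big *n, const big *d, big *q, big *r)
    {
        memset(q, 0, sizeof *q);
        memset(r, 0, sizeof *r);
        for (int i = WORDS * 32 - 1; i >= 0; i--) {
            big_shl1(r);                             /* bring down the next bit of n */
            if (big_getbit(n, i))
                r->w[0] |= 1u;
            if (big_cmp(r, d) >= 0) {                /* divisor fits: subtract, set quotient bit */
                big_sub(r, d);
                big_setbit(q, i);
            }
        }
    }

    int main(void)
    {
        big n = {{0}}, d = {{0}}, q, r;
        n.w[0] = 1000003u;  d.w[0] = 97u;            /* small test values */
        big_divmod(&n, &d, &q, &r);
        printf("%u %u\n", q.w[0], r.w[0]);           /* expected: 10309 30 */
        return 0;
    }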

Finding all numbers which add up to another number
Hi everyone, I'm looking for a way to pass a number into a subroutine, and have it give back a list of all the ways that number can been added to get to that number. For example, if the function were to take 3 as a parameter it would give me back: 1,1,1 2,1 3 for 5 it would give back: 1,1,1,1,1 2,2,1 3,2 3,1,1 4,1 5 2,1,1,1 I haven't been able to come up with anything that does this the way I want: So far I did: &Sum(3) sub Sum { my $sum = shift; for (my $i=0; $i<=$sum; $i++) { for (my $j=0; $j<=$sum; $j++) { for (my $k=0; $k<=$sum; $k++) { ...
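One common way to generate these sums without duplicates is to emit the parts in non-increasing order. A hedged editorial sketch in C (the thread itself is Perl); the names are mine and the recursion depth is bounded only by the small array used here.

    #include <stdio.h>

    static int parts[64];

    static void partitions(int remaining, int max_part, int depth)
    {
        if (remaining == 0) {
            for (int i = 0; i < depth; i++)
                printf("%d%s", parts[i], i == depth - 1 ? "\n" : ",");
            return;
        }
        for (int p = (remaining < max_part ? remaining : max_part); p >= 1; p--) {
            parts[depth] = p;
            partitions(remaining - p, p, depth + 1);
        }
    }

    int main(void)
    {
        partitions(3, 3, 0);   /* prints: 3 / 2,1 / 1,1,1 */
        return 0;
    }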

Finding the lowest number for a number of oloca
Hope someone can help - I'm stuck with a query! I have a table with distances from all the uk post codes to about 150 locations - making for a table containing about 500k records. The table has the columns:- NDistance (as integer) NName NLocation NAddress NPostCode NBranchID NBranchCode What I need to do is find the lowest number from the NDistance column for each location, and display this only once for each location. The query could be sorted on the NBranchID or NBranchCode column. Hope this makes sense . . . . . TIA -- Buzby "There's nothing more dangerous than a re...

Count number of bits set in a number
Hi all, I want to write a program to count number of bits set in a number. The condition is we should not loop through each bit to find whether its set or not. Thanks in advance, -Mukesh "Mack" <Mack.Chauhan@gmail.com> schrieb im Newsbeitrag news:1190879341.110090.142030@d55g2000hsg.googlegroups.com... > Hi all, > I want to write a program to count number of bits set in a number. > The condition is we should not loop through each bit to find whether > its set or not. You'd better do your homework yourself. Bye, Jojo Mack wrote: > Hi all, > ...
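The classic answer that never tests bits one at a time is the "parallel" (SWAR) popcount, sketched below for 32-bit values. This is an editorial example using the usual mask constants, not code quoted from the thread.

    #include <stdio.h>
    #include <stdint.h>

    static int popcount32(uint32_t x)
    {
        x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
        return (int)((x * 0x01010101u) >> 24);            /* add the four bytes */
    }

    int main(void)
    {
        printf("%d %d %d\n", popcount32(0), popcount32(0xFFu), popcount32(0xF0F0F0F0u));
        /* expected output: 0 8 16 */
        return 0;
    }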

Algorithm Numbering Conflict with algorithmic and algorithm2e
Does anyone know how to change the following example so that the second algorithm will be numbered with a 2? i.e. "Algorithm 2" instead of "Algorithm 1". Thank you, Alan \documentclass{article} % % For algorithms (1) % % http://gicl.cs.drexel.edu/people/tjkopena/wiki/index.php?n=SWAT.Algorithm2eTooManyBrackets \makeatletter \newif\if@restonecol \makeatother \let\algorithm\relax \let\endalgorithm\relax \usepackage[algo2e,ruled,vlined]{algorithm2e} % need for algorithm environment (name is algorithm2e because of algo2e option) % % For algorithms (2) \usepackage{algorithm} \...

Auto-Number is producing duplicate numbers
I have a medium sized access database. I use the auto-number as the primary key in all my tables. Recently, one of my tables has been producing duplicate primary key numbers when I try to add a record, which results in an error message telling me that the record can't be saved due to duplicate number, etc. Any comments - I have the most recent Jet 4 installed and am using Access 2000 and running Windows XP. Thx ...

Reading 32-bit complex numbers in IDL (16-bit real / 16-bit imaginary)
Hi, I couldn't find a discussion of this specific problem on the group, so I am posting it here in hope of a solution. I have complex data in a binary file, with each value comprising of 4 bytes (32 bits), such that the 32-bit complex number consists of 16- bit real and 16-bit imaginary. I could use define a complex array in IDL and then read the data using the readu command, however the "complex" format in IDL is a 2*32-bit definition, i.e. 32-bit real and 32-bit imaginary. How can I read the 16-bit real - 16-bit imaginary complex number in IDL? Thanks, Waqas. On 08/07/2011...
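A hedged sketch in C (the thread itself is about IDL): read 4-byte samples holding a 16-bit real part followed by a 16-bit imaginary part, and widen them to floats. The file name is hypothetical, and this assumes the file's byte order matches the host's.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        FILE *fp = fopen("samples.bin", "rb");   /* hypothetical file name */
        if (fp == NULL)
            return 1;

        int16_t pair[2];                         /* pair[0] = real, pair[1] = imaginary */
        while (fread(pair, sizeof pair[0], 2, fp) == 2) {
            float re = (float)pair[0];
            float im = (float)pair[1];
            printf("%g %+gi\n", re, im);
        }
        fclose(fp);
        return 0;
    }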