Hi,
I have two 30GB+ files that I need to join. This operation occurs every
two weeks or so. The problem is that these files are sorted in the
**descending** order of the join keys when they are created. Also, I
need the joined output to be sorted in the **descending** order as
well. Thus, I cannot directly use the "join" command ... it only gives
me the joined first row viz. the one with the max join key. I need to
have some easy way to do this operation.
Info:
OS: AIX 5.1 on IBM pSeries
No of CPUs: 12 Power PC RS64 - IV CPUs
Shell: KSH
RAM: 32GB
Join File characteristics:
Join Key Field: 15 numeric digits [0-9]
Join-connections: one-to-one viz. for every row in the main file,
there exists one and only one row with the same join key in the second
file
File 1 Number of Columns: Join Key + 111 columns
File 2 Number of Columns: Join Key + 1 column
So far, I have been able to come up with the following approaches:
=============================
Approach 1: Grunt-and-bear-it
=============================
1. Sort the input files in ascending order using sort -n
2. Join the sorted files
3. Sort the joined file in descending order
Notes:
a. This approach is basically an exponential grunt approach
b. It will also require a large amount of intermediate/temp disk
space
===================================================
Approach 2: Change the LANG numeric collation order
===================================================
1. Change the LANG/collation sequence to a temporary LANG file that
has the numbers in descending collation order. viz. the collation order
of the digit 9 is lower than the collation order of the digit 8 and so
on. Do this change only for the execution of the "join" command
2. Perform the join on the input files. This will automatically
create the output file in the order desired without any intermediate
grunt-work.
Notes:
a. I am not sure if this approach is feasible, and if it so, how to
change the collation sequence of the digits only for the join command
b. I am also not sure if this will have any other effects on other
users using the AIX box.
===============================================================================
Approach 3: Use awk to buffer the smaller file and append it to the
larger file
===============================================================================
1. Store the first file (viz. 112 cols) as "file1.txt" and the second
file (viz. 2 cols) as "file2.txt"
2. awk '{ if( FILENAME ~ /^file2/ )
{ # We are reading the second file, buffer the input
file2_buffer[FNR] = $2;
}
else
{ # We are reading the first file ... write its output
along with the buffered second file's input
print $0 file2_buffer[FNR];
}
}' file2.txt file1.txt > file3.txt
O gurus-of-UNIX-dom ... I need your advice as to which of the above
strategies will be the optimal in terms of time needed to execute.
Also, any other strategies/approaches are extremely welcome, since this
operation is gonna b a pain-in-my-butt almost every two weeks, and I
need a fast method to be quickly done with it
Thanx
Neil
|
|
0
|
|
|
|
Reply
|
zeGenius (1)
|
8/15/2005 9:22:53 PM |
|
zeGenius wrote:
> Hi,
>
> I have two 30GB+ files that I need to join. This operation occurs every
> two weeks or so. The problem is that these files are sorted in the
> **descending** order of the join keys when they are created. Also, I
> need the joined output to be sorted in the **descending** order as
> well. [...]
>
> Join File characteristics:
> Join Key Field: 15 numeric digits [0-9]
> Join-connections: one-to-one viz. for every row in the main file,
> there exists one and only one row with the same join key in the second
> file
If you're really sure there's a one-to-one matching and
you're really sure both files are sorted the same way, it
follows that the first pair of input lines have matching keys
and produce one output line, the second input pair also has
matching keys and produces one output line, and so on. That
is, the keys are pretty much irrelevant: all you need is a
program that sucks in a pair of lines, outputs one line, and
repeats. Or have I missed something here?
> File 1 Number of Columns: Join Key + 111 columns
> File 2 Number of Columns: Join Key + 1 column
Okay, what do you mean by "two 30GB+ files?" Since both
hold the same number of lines (one-to-one matching) and each
line of file 1 is 7.5 times the length of a file 2 line, we've
either got
- File 1 ~= 30GB (~236 million lines), file 2 ~4 GB, or
- File 2 ~= 30GB (~1.76 billion lines), file 1 ~224 GB.
Since one of your plans is to hold the entirety of file 2 in
awk arrays, I imagine we're looking at the first case -- but
even so, that's probably a lot of data for an awk array! A
32-bit awk would probably not be able to absorb the data.
Hmmm... I'm curious about that one-column field in file 2.
What is it? Status indicator? Flag to govern further processing?
What are you going to do with it in the next step, and can we
perhaps come up with a way to collapse the two steps into one?
(Equivalently: Where did it come from, and is the previous step
the candidate for collapsing?) The principle at work here is
that since the CPU is far faster than the I/O devices, it's
usually better to spend a lot of CPU on a complicated process
if it avoids just one round-trip to the disk. Pipelines can
help, since the I/O will only be "real" at the "ends."
> So far, I have been able to come up with the following approaches:
>
> =============================
> Approach 1: Grunt-and-bear-it
> =============================
>
> 1. Sort the input files in ascending order using sort -n
> 2. Join the sorted files
> 3. Sort the joined file in descending order
>
> Notes:
> a. This approach is basically an exponential grunt approach
> b. It will also require a large amount of intermediate/temp disk
> space
I'm not sure where you get "exponential" from, but it does
seem an expensive undertaking. Sorting a quarter-billion records
(three times!) is not to be undertaken lightly, and it'll be
about twenty times worse if you actually have seven times that
many.
> ===================================================
> Approach 2: Change the LANG numeric collation order
> ===================================================
>
> 1. Change the LANG/collation sequence to a temporary LANG file that
> has the numbers in descending collation order. viz. the collation order
> of the digit 9 is lower than the collation order of the digit 8 and so
> on. Do this change only for the execution of the "join" command
>
> 2. Perform the join on the input files. This will automatically
> create the output file in the order desired without any intermediate
> grunt-work.
>
> Notes:
> a. I am not sure if this approach is feasible, and if it so, how to
> change the collation sequence of the digits only for the join command
> b. I am also not sure if this will have any other effects on other
> users using the AIX box.
This is clever, and might even work. But do you mean LANG,
or LC_COLLATE? Either way, you'll need to hunt up your system's
way of defining locale-dependent goodies. On Solaris you'd build
a localedef database; I don't know where AIX keeps the equivalent
information.
An environment variable in one process won't bother other
processes on the box.
To avoid fiddling with locales, you could write yourself a
simple filter program that translates 9->0, 8->1, ..., 0->9 only
on the key fields; we might dub this a "substitution sort."
You'd run file2 through the filter and store it in a temporary
file, then pipe the large file through the filter and into join
along with the temporary file, piping the output through the
filter again to reconstitute the original key fields. You could
even avoid the temporary file (which would probably be a good idea
if it's "30GB+") by convincing join to read from a pair of pipes
instead of from a pipe and a disk file -- I don't know how to do
this directly from a shell, but named pipes ought to do the trick.
> ===============================================================================
> Approach 3: Use awk to buffer the smaller file and append it to the
> larger file
> ===============================================================================
I suspect you've got too much data for this to work.
> O gurus-of-UNIX-dom ... I need your advice as to which of the above
> strategies will be the optimal in terms of time needed to execute.
> Also, any other strategies/approaches are extremely welcome, since this
> operation is gonna b a pain-in-my-butt almost every two weeks, and I
> need a fast method to be quickly done with it
Personally, I think I'd grab the join source and hack it to
use descending order. Given one-to-one matching, join is a more
complicated program than you need -- but unless you're Really Sure
about the matching, key comparison provides a safety valve in case
of mismatches.
--
Eric.Sosman@sun.com
|
|
0
|
|
|
|
Reply
|
Eric
|
8/15/2005 10:27:06 PM
|
|
On Mon, 15 Aug 2005 14:22:53 -0700, zeGenius wrote:
> Hi,
>
> I have two 30GB+ files that I need to join. This operation occurs every
> two weeks or so. The problem is that these files are sorted in the
> **descending** order of the join keys when they are created. Also, I
> need the joined output to be sorted in the **descending** order as
> well. Thus, I cannot directly use the "join" command ... it only gives
> me the joined first row viz. the one with the max join key. I need to
> have some easy way to do this operation.
[snip]
> O gurus-of-UNIX-dom ... I need your advice as to which of the above
> strategies will be the optimal in terms of time needed to execute.
> Also, any other strategies/approaches are extremely welcome, since this
> operation is gonna b a pain-in-my-butt almost every two weeks, and I
> need a fast method to be quickly done with it
If both files are of the order of 30 gig, I would write a C program to do
this. One of your solutions (using awk) implied that the two files might
differ in size. I would definitly look at using 'sfio' rather than stdio
for this as well. However any way you look at it reading and writing 60gig
of data is going to take some time. A single IDE drive probably has a
sustained write speed of 30Mb/s, and a read speed of maybe 50Mb/s. So it
will take of the order of an hour to read and write the data alone to a
single drive. Of course having multiple drives will help.
Joinn is a fairly simple program, start off by reading the first line of
both files, compare them, if they are the same then output them and start
on the next pair. Otherwise discard the 'smaller' one, and refresh it. If
you get to EOF on either file, then exit. Grab the source for an existing
implementation, and edit it to reverse the comparision routine, and you
are done.
Others may suggest porting 'tac'. If this was a one time only job I would
suggest it myself. However this sounds as if the extra time you spend on
getting a fast C program that does what you want will pay dividends.
|
|
0
|
|
|
|
Reply
|
Icarus
|
8/16/2005 5:41:03 AM
|
|
zeGenius wrote:
> Hi,
>
> I have two 30GB+ files that I need to join. This operation occurs every
> two weeks or so. The problem is that these files are sorted in the
> **descending** order of the join keys when they are created. Also, I
> need the joined output to be sorted in the **descending** order as
> well. Thus, I cannot directly use the "join" command ... it only gives
> me the joined first row viz. the one with the max join key. I need to
> have some easy way to do this operation.
>
> Info:
>
> OS: AIX 5.1 on IBM pSeries
> No of CPUs: 12 Power PC RS64 - IV CPUs
> Shell: KSH
> RAM: 32GB
A couple of other suggestions to avoid rewriting the data to disk...
a) Does your AIX have the newer awk implementation called nawk (or gawk
for that matter) that allows you to do getlines from multiple input
streams? If so you could write your own join with such an awk.
b) Use awk to transform the descending order key into an ascending
order (e.g. maxpossiblevalue-key) through pipes and then back again...
mkfifo /tmp/foo
awk -f desctoasc.awk file1.txt > /tmp/foo &
awk -f desctoasc.awk file2.txt | join /tmp/foo - | awk -f
asctodesc.awk > joined.txt
HTH,
-- ced
>
>
> Join File characteristics:
> Join Key Field: 15 numeric digits [0-9]
> Join-connections: one-to-one viz. for every row in the main file,
> there exists one and only one row with the same join key in the second
> file
> File 1 Number of Columns: Join Key + 111 columns
> File 2 Number of Columns: Join Key + 1 column
>
>
> So far, I have been able to come up with the following approaches:
>
>
> =============================
> Approach 1: Grunt-and-bear-it
> =============================
>
> 1. Sort the input files in ascending order using sort -n
> 2. Join the sorted files
> 3. Sort the joined file in descending order
>
> Notes:
> a. This approach is basically an exponential grunt approach
> b. It will also require a large amount of intermediate/temp disk
> space
>
>
>
> ===================================================
> Approach 2: Change the LANG numeric collation order
> ===================================================
>
> 1. Change the LANG/collation sequence to a temporary LANG file that
> has the numbers in descending collation order. viz. the collation order
> of the digit 9 is lower than the collation order of the digit 8 and so
> on. Do this change only for the execution of the "join" command
>
> 2. Perform the join on the input files. This will automatically
> create the output file in the order desired without any intermediate
> grunt-work.
>
> Notes:
> a. I am not sure if this approach is feasible, and if it so, how to
> change the collation sequence of the digits only for the join command
> b. I am also not sure if this will have any other effects on other
> users using the AIX box.
>
>
>
>
> ===============================================================================
> Approach 3: Use awk to buffer the smaller file and append it to the
> larger file
> ===============================================================================
>
> 1. Store the first file (viz. 112 cols) as "file1.txt" and the second
> file (viz. 2 cols) as "file2.txt"
>
> 2. awk '{ if( FILENAME ~ /^file2/ )
> { # We are reading the second file, buffer the input
> file2_buffer[FNR] = $2;
> }
> else
> { # We are reading the first file ... write its output
> along with the buffered second file's input
> print $0 file2_buffer[FNR];
> }
> }' file2.txt file1.txt > file3.txt
>
>
> O gurus-of-UNIX-dom ... I need your advice as to which of the above
> strategies will be the optimal in terms of time needed to execute.
> Also, any other strategies/approaches are extremely welcome, since this
> operation is gonna b a pain-in-my-butt almost every two weeks, and I
> need a fast method to be quickly done with it
>
> Thanx
>
> Neil
>
--
Chuck Dillon
Senior Software Engineer
NimbleGen Systems Inc.
|
|
0
|
|
|
|
Reply
|
Chuck
|
8/16/2005 1:34:48 PM
|
|
zeGenius wrote:
> Hi,
>
> I have two 30GB+ files that I need to join. This operation occurs every
> two weeks or so. The problem is that these files are sorted in the
> **descending** order of the join keys when they are created. Also, I
> need the joined output to be sorted in the **descending** order as
> well. Thus, I cannot directly use the "join" command ... it only gives
> me the joined first row viz. the one with the max join key. I need to
> have some easy way to do this operation.
>
> Info:
>
> OS: AIX 5.1 on IBM pSeries
> No of CPUs: 12 Power PC RS64 - IV CPUs
> Shell: KSH
> RAM: 32GB
>
>
> Join File characteristics:
> Join Key Field: 15 numeric digits [0-9]
> Join-connections: one-to-one viz. for every row in the main file,
> there exists one and only one row with the same join key in the second
> file
> File 1 Number of Columns: Join Key + 111 columns
> File 2 Number of Columns: Join Key + 1 column
>
>
> So far, I have been able to come up with the following approaches:
>
>
> =============================
> Approach 1: Grunt-and-bear-it
> =============================
>
> 1. Sort the input files in ascending order using sort -n
> 2. Join the sorted files
> 3. Sort the joined file in descending order
>
> Notes:
> a. This approach is basically an exponential grunt approach
> b. It will also require a large amount of intermediate/temp disk
> space
>
slow and takes much temporay disk space.
>
>
> ===================================================
> Approach 2: Change the LANG numeric collation order
> ===================================================
>
> 1. Change the LANG/collation sequence to a temporary LANG file that
> has the numbers in descending collation order. viz. the collation order
> of the digit 9 is lower than the collation order of the digit 8 and so
> on. Do this change only for the execution of the "join" command
>
> 2. Perform the join on the input files. This will automatically
> create the output file in the order desired without any intermediate
> grunt-work.
>
> Notes:
> a. I am not sure if this approach is feasible, and if it so, how to
> change the collation sequence of the digits only for the join command
> b. I am also not sure if this will have any other effects on other
> users using the AIX box.
>
not possible, as all available language have 0..9, not 9..0.
>
>
>
> ===============================================================================
> Approach 3: Use awk to buffer the smaller file and append it to the
> larger file
> ===============================================================================
>
> 1. Store the first file (viz. 112 cols) as "file1.txt" and the second
> file (viz. 2 cols) as "file2.txt"
>
> 2. awk '{ if( FILENAME ~ /^file2/ )
> { # We are reading the second file, buffer the input
> file2_buffer[FNR] = $2;
> }
> else
> { # We are reading the first file ... write its output
> along with the buffered second file's input
> print $0 file2_buffer[FNR];
> }
> }' file2.txt file1.txt > file3.txt
this takes a lot of memory.
>
>
> O gurus-of-UNIX-dom ... I need your advice as to which of the above
> strategies will be the optimal in terms of time needed to execute.
> Also, any other strategies/approaches are extremely welcome, since this
> operation is gonna b a pain-in-my-butt almost every two weeks, and I
> need a fast method to be quickly done with it
>
> Thanx
>
> Neil
>
You need to paste the columns, then delete the second occurence of the
key field:
paste file1.txt file2.txt | awk '{$113="";print}'
If you mind the two spaces at #113:
paste file1.txt file2.txt |
awk '{for (i=1;i<=NF;++i)if(i!=113)print $i;printf "\n"}' ORS=" "
--
Michael Tosch @ hp : com
|
|
0
|
|
|
|
Reply
|
Michael
|
8/16/2005 5:59:43 PM
|
|
Sorry no answer here except suggesting you try options 1 and 3 using
the "time" command to see how long they take, and also use vmstat to
monitor the performance during that time.
I am also curious if not confidential: What data would be needed that
is 30GB in size and still being used in a flat ASCII file as its
storage method?
btw - if you are using JFS with large-file-enabled for that 30GB file,
I believe the max size is 64GB before you reach a O/S limit...If not,
you may want to switch to JFS2 in AIX 5 if you have not already.
Sorry I couldn't be more help on your main topic...
|
|
0
|
|
|
|
Reply
|
steven_nospam
|
8/17/2005 4:03:11 PM
|
|
On Mon, 15 Aug 2005 14:22:53 -0700, zeGenius wrote:
> I have two 30GB+ files that I need to join. This operation occurs every
> two weeks or so. The problem is that these files are sorted in the
> **descending** order of the join keys when they are created. Also, I
> need the joined output to be sorted in the **descending** order as
> well. Thus, I cannot directly use the "join" command ... it only gives
> me the joined first row viz. the one with the max join key. I need to
> have some easy way to do this operation.
Use FIFOs and pipes. In Linux bash, how about:
join [options] <(tac filea) <(tac fileb) | tac >filec
If I understand the problem correctly, this will involve 'flipping' the
big file only twice. The tac utility if fairly fast. On my old machine
with IDE disks, it takes about 1 minute per gigabyte.
--
They laughed at Einstein. They laughed at the
Wright Brothers. But they also laughed at Bozo
the Clown.
-- Carl Sagan
|
|
0
|
|
|
|
Reply
|
Joe
|
8/24/2005 3:37:32 AM
|
|
|
6 Replies
644 Views
(page loaded in 0.122 seconds)
|