Making an existing process efficient

Problem
--------
I have a file containing millions of records. I need to read those
records and write each one to an output file chosen by the values of
certain fields in the record. Many records end up in the same file
because they share those field values.

Current scenario
----------------
Our process runs on a multiprocessor machine. Currently it does things
in the simplest possible way: it opens output streams as and when
required and writes records to them one by one, appending to a stream
that is already open.

I am _trying_ to do things efficiently here. It's important to
preserve the ordering of records in the generated
files i.e. if record A1 is prior to record A2 in the input file, then
A1 should still be prior to A2 in the result
file (if they land up in the same file).

I have thought of some ways -

a) Write N (a sufficiently large number of) records to an internal
buffer (B1). Fork a child process and open a pipe to it. The child
would inherit the buffer (B1) from the parent. The child would sort
the records by the stream they are destined for and then write them
to the destination files. After each successful write it would send
down the pipe the index of the last record it wrote, and it would
exit after writing the last record. This would give the advantage of
writing records belonging to the same file in one go, which would
reduce the number of seeks that random access to the files would
otherwise cause.

The parent would continue writing the next set of N records to
another internal buffer (B2). It would wait for the child process to
complete before forking again. When the child has finished, the
parent would treat B2 as the new B1 and fork another process. If the
child crashes abruptly, the parent would itself write the records
from buffer B1 that were not written successfully to the output
files.
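
In rough code the skeleton I have in mind looks something like this
(untested sketch; fd_for() and rewrite_missed() stand in for the real
fd lookup and recovery logic, and the overlap with filling the other
buffer is not shown):

/* Untested sketch of approach (a).  fd_for() and rewrite_missed() are
 * placeholders for the real fd lookup and recovery logic. */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <unistd.h>

struct rec {
    int         file_id;  /* which output file this record belongs to */
    int         idx;      /* position within the batch, 0 .. n-1      */
    const char *data;
    size_t      len;
};

int  fd_for(int file_id);                                     /* placeholder */
void rewrite_missed(struct rec *b, int n, const char *done);  /* placeholder */

static int by_file_then_idx(const void *a, const void *b)
{
    const struct rec *x = a, *y = b;
    if (x->file_id != y->file_id)
        return x->file_id < y->file_id ? -1 : 1;
    return x->idx < y->idx ? -1 : 1;    /* keep input order within a file */
}

/* Child: write one batch grouped by destination file, reporting the
 * index of every record it finishes so the parent knows how far it got. */
static void child_write(struct rec *batch, int n, int progress_fd)
{
    int i;

    qsort(batch, n, sizeof *batch, by_file_then_idx);
    for (i = 0; i < n; i++) {
        write(fd_for(batch[i].file_id), batch[i].data, batch[i].len);
        write(progress_fd, &batch[i].idx, sizeof batch[i].idx);
    }
    _exit(0);
}

/* Parent: hand a full batch to a child, then reap it. */
void flush_batch(struct rec *batch, int n)
{
    int pfd[2], idx, status;
    char *done = calloc(n, 1);
    pid_t pid;

    pipe(pfd);
    pid = fork();
    if (pid == 0) {                      /* child gets its own COW copy  */
        close(pfd[0]);
        child_write(batch, n, pfd[1]);   /* never returns                */
    }
    close(pfd[1]);
    while (read(pfd[0], &idx, sizeof idx) == sizeof idx)
        done[idx] = 1;                   /* child finished record idx    */
    close(pfd[0]);
    waitpid(pid, &status, 0);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        rewrite_missed(batch, n, done);  /* parent writes what is left   */
    free(done);
}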

b) (Complex approach) Let the parent write N records to an internal
buffer (B1) and fork a child process C1. It would continue writing N
more records to buffer B2 and fork another child process, C2. It
would open a pipe to each of the child processes.

The child process C1 would sort the internal buffer B1 by output
stream. It would then lock all the output files before writing and
communicate to the parent that it has locked all the files. It would
then write, relinquishing each lock as soon as it is done with that
file.

The child process C2 would sort the internal buffer B2 by output
stream. It would wait for the parent to communicate that C1 has
locked all the files. C2 would then attempt to grab the locks on the
files and would write to each one as soon as it acquires its lock.

I think approach (b) is unduly complex (and that is without even
considering crash recovery of the child) and may or may not bring any
additional performance benefit. Only measurements will tell me how
much performance I gain with either approach. (I suspect that I/O is
the bottleneck.)

Is there any other approach that could help me here?

Thanks..
P.S. - I am on FreeBSD 4.10

On Wed, 05 Dec 2007 11:38:31 -0800, Kelvin Moss wrote:

> Problem
> --------
> I have a file containing millions of records. I need to read those
> records and write each one to an output file chosen by the values of
> certain fields in the record. Many records end up in the same file
> because they share those field values.
> 
> Current scenario
> ----------------
> Our process runs on a multiprocessor machine. Currently it does things
> in the simplest possible way: it opens output streams as and when
> required and writes records to them one by one, appending to a stream
> that is already open.
> 
> I am _trying_ to do things efficiently here. It's important to preserve
> the ordering of records in the generated files i.e. if record A1 is
> prior to record A2 in the input file, then A1 should still be prior to
> A2 in the result file (if they land up in the same file).

. Create X children
. Read up to Y records, until EOF
.  Find the next free child[1]
.  Give the records just read to that child, along with their start/end record numbers
.   Child writes the data into child-specific files, along with the record numbers.
. Mergesort the child-specific files into the real ones, via record number.

 Then you just need to find your optimal values of X and Y.

 If the merge sort is a problem as a separate step, with a bit more work
you can have each child feed a separate process per output file and do
the merge sort in real time[2].

 For bonus points you can try creating children on other machines.


[1] Simple loop should work if the processing per record is roughly the
same.

[2] Need some way for the child to say "have nothing for you, processing
record Y now ..." so it doesn't deadlock if one child doesn't have any
records of type Z.
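
 Rough sketch of the final merge for one real output file, assuming each
child wrote lines of the form "<record-number> <payload>" (untested):

/* Untested: merge the per-child temp files for ONE real output file
 * back into record order.  Assumes each child wrote lines of the form
 * "<record-number> <payload>".  A heap beats this linear scan once you
 * have lots of children, but for a handful it hardly matters. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 4096

void merge_children(FILE **in, int nchild, FILE *out)
{
    char (*line)[MAXLINE] = malloc(nchild * sizeof *line);
    int  *live            = malloc(nchild * sizeof *live);
    int   i;

    for (i = 0; i < nchild; i++)                 /* prime one line per input */
        live[i] = fgets(line[i], MAXLINE, in[i]) != NULL;

    for (;;) {
        int  best = -1;
        long bestno = 0;

        for (i = 0; i < nchild; i++) {           /* pick smallest record no. */
            if (live[i]) {
                long no = strtol(line[i], NULL, 10);
                if (best < 0 || no < bestno) {
                    best = i;
                    bestno = no;
                }
            }
        }
        if (best < 0)
            break;                               /* every input exhausted    */

        char *payload = strchr(line[best], ' '); /* drop the number prefix   */
        fputs(payload ? payload + 1 : line[best], out);
        live[best] = fgets(line[best], MAXLINE, in[best]) != NULL;
    }
    free(line);
    free(live);
}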

-- 
James Antill -- james@and.org
C String APIs use too much memory? ustr: length, ref count, size and
read-only/fixed. Ave. 44% overhead over strdup(), for 0-20B strings
http://www.and.org/ustr/
Kelvin Moss wrote:
> Problem
> --------
> I have a file containing millions of records. I need to read those
> records and write each one to an output file chosen by the values of
> certain fields in the record. Many records end up in the same file
> because they share those field values.
> 
> Current scenario
> ----------------
> Our process runs on a multiprocessor machine. Currently it does things
> in the simplest possible way: it opens output streams as and when
> required and writes records to them one by one, appending to a stream
> that is already open.

     Unless the decision of which output file (files?)
should receive a record is incredibly complicated, the
task will be completely I/O-bound.  Implication: Fancy
schemes to use the CPU more efficiently or to apply the
power of additional CPU's/cores will help very little.

     That said, for a really large amount of data it may
be worth while avoiding the need to copy everything from
kernel space out to user space and back to kernel space
again, as in a simple read()/write() scheme.  Consider
using mmap() instead, with madvise() to tell the virtual
memory system that your accesses will be sequential.  At
the very least, use mmap() for the input file even if
you continue to use write() for the outputs.
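
     For the input side that amounts to little more than this (an
untested sketch, error checks omitted, and assuming the whole file
fits in your address space):

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Map the input read-only and tell the VM system we'll walk it
 * sequentially.  Error checks omitted. */
void scan_input(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;

    fstat(fd, &st);
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    madvise(base, st.st_size, MADV_SEQUENTIAL);

    /* ... walk the records between base and base + st.st_size,
     * routing each one to its output file ... */

    munmap(base, st.st_size);
    close(fd);
}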

> I have thought of some ways -
> 
> a) Write N (a sufficiently large number of) records to an internal
> buffer (B1). Fork a child process and open a pipe to it. The child
> would inherit the buffer (B1) from the parent. The child would sort
> the records by the stream they are destined for and then write them
> to the destination files. After each successful write it would send
> down the pipe the index of the last record it wrote, and it would
> exit after writing the last record. This would give the advantage of
> writing records belonging to the same file in one go, which would
> reduce the number of seeks that random access to the files would
> otherwise cause.
> 
> The parent would continue writing the next set of N records to
> another internal buffer (B2). It would wait for the child process to
> complete before forking again. When the child has finished, the
> parent would treat B2 as the new B1 and fork another process. If the
> child crashes abruptly, the parent would itself write the records
> from buffer B1 that were not written successfully to the output
> files.

     "It is a capital mistake to theorize before one has
data," and I lack data.  Still, it seems to me that this
is much too involved for any benefit it might bring --
indeed, the overhead of all those fork() calls might more
than offset any gain.

     If you're using mmap() for output, there's no need
for any of this.  Just mmap() the buffers to their output
files, copy records into them, and let the virtual memory
system take care of flushing the pages to disk.  As I
mentioned before, madvise() may help the VM system make
good choices.

     If you're using write() for output, the only delay
you incur in the write() call itself is the time to copy
the data from user space to kernel space; unless you've
explicitly asked for data synchronization, the file system
will perform the physical writes at some later time.  If
delays on the output side really are a concern, you could
use multiple writer threads to gain parallelism and keep
the input side running more or less unimpeded.  This is
likely to be a good deal more efficient than using
multiple processes, especially a process-per-write()!
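
     Schematically, something like this (an untested sketch; I pin
each output fd to one writer so records bound for the same file still
go out in input order, and I've left out the shutdown path and all
error handling):

/* Each output fd is pinned to one writer (fd % NWRITERS) so records
 * bound for the same file still go out in input order.  Untested;
 * shutdown and error handling trimmed. */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NWRITERS 4

struct job { int fd; char *buf; size_t len; struct job *next; };

static struct {
    struct job     *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  wait;
} q[NWRITERS];

void enqueue(int fd, char *buf, size_t len)       /* called by the reader */
{
    int i = fd % NWRITERS;
    struct job *j = malloc(sizeof *j);

    j->fd = fd; j->buf = buf; j->len = len; j->next = NULL;
    pthread_mutex_lock(&q[i].lock);
    if (q[i].tail) q[i].tail->next = j; else q[i].head = j;
    q[i].tail = j;
    pthread_cond_signal(&q[i].wait);
    pthread_mutex_unlock(&q[i].lock);
}

static void *writer(void *arg)                    /* one thread per queue */
{
    int i = (int)(long)arg;

    for (;;) {
        pthread_mutex_lock(&q[i].lock);
        while (q[i].head == NULL)
            pthread_cond_wait(&q[i].wait, &q[i].lock);
        struct job *j = q[i].head;
        q[i].head = j->next;
        if (q[i].head == NULL) q[i].tail = NULL;
        pthread_mutex_unlock(&q[i].lock);

        write(j->fd, j->buf, j->len);             /* the only blocking I/O */
        free(j->buf);
        free(j);
    }
}

void start_writers(pthread_t *tid)
{
    long i;
    for (i = 0; i < NWRITERS; i++) {
        pthread_mutex_init(&q[i].lock, NULL);
        pthread_cond_init(&q[i].wait, NULL);
        pthread_create(&tid[i], NULL, writer, (void *)i);
    }
}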

     If you're using write(), changing to writev() could
save some time.  With write() you'd copy data from the
input buffer to an output buffer and thence to the kernel,
while with writev() you could copy a whole bunch of
discontiguous records straight from the input buffer to
the kernel, thus saving one copy step.  For "millions of
records," the savings might be worth while.  Of course,
it requires that you not re-fill the input buffer until
all the output streams are no longer interested in any
of its content, which may complicate things a little if
you're using multiple writer threads.  "It stands to
reason" (that is, "trust, but verify") that the performance
of mmap()/writev() should be better than mmap()/copy/write(),
but probably not quite as good as mmap()/mmap()/copy.

     The big take-aways:

     - You're almost certainly I/O bound, so schemes to
       optimize the CPU are optimizing the wrong thing.

     - And yet, with "millions of records" the copy steps
       may be worth a look.  Plain read()/copy/write()
       moves each byte three times; using mmap() on the
       input eliminates one move; using mmap() or writev()
       on the output eliminates another.

     - If more parallelism is needed, use multiple threads
       rather than multiple processes.  The synchronization
       issues are the same, but the overheads are lower.

     - Along with mmap(), use madvise() to inform the
       system of the access patterns for your buffers.

     - And, as always: Measure, measure, measure!

-- 
Eric.Sosman@sun.com
In article <pan.2007.12.05.20.06.41@and.org>,
 James Antill <james-netnews@and.org> wrote:

> On Wed, 05 Dec 2007 11:38:31 -0800, Kelvin Moss wrote:
> 
> > Problem
> > --------
> > I have a file containing millions of records. I need to read those
> > records and write each one to an output file chosen by the values of
> > certain fields in the record. Many records end up in the same file
> > because they share those field values.
> > 
> > Current scenario
> > ----------------
> > Our process runs on a multiprocessor machine. Currently it does things
> > in the simplest possible way: it opens output streams as and when
> > required and writes records to them one by one, appending to a stream
> > that is already open.
> > 
> > I am _trying_ to do things efficiently here. It's important to preserve
> > the ordering of records in the generated files i.e. if record A1 is
> > prior to record A2 in the input file, then A1 should still be prior to
> > A2 in the result file (if they land up in the same file).
> 
> . Create X children
> . Read up to Y records, until EOF
> .  Find the next free child[1]
> .  Give the records just read to that child, along with their start/end record numbers
> .   Child writes the data into child-specific files, along with the record numbers.
> . Mergesort the child-specific files into the real ones, via record number.

Sort?  The data is already in order; it seems unlikely that using a 
sorting algorithm could possibly be more efficient than what he's 
already doing.

-- 
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
On Dec 5, 12:32 pm, Eric Sosman <Eric.Sos...@sun.com> wrote:
> Kelvin Moss wrote:
>      The big take-aways:
>
>      - You're almost certainly I/O bound, so schemes to
>        optimize the CPU are optimizing the wrong thing.
>
>      - And yet, with "millions of records" the copy steps
>        may be worth a look.  Plain read()/copy/write()
>        moves each byte three times; using mmap() on the
>        input eliminates one move; using mmap() or writev()
>        on the output eliminates another.
>
>      - If more parallelism is needed, use multiple threads
>        rather than multiple processes.  The synchronization
>        issues are the same, but the overheads are lower.
>
>      - Along with mmap(), use madvise() to inform the
>        system of the access patterns for your buffers.
>
>      - And, as always: Measure, measure, measure!

Thanks Eric. This all makes a lot of sense. Here are a few questions -

1. I had thought about mmaping the input file. Unfortunately the file
in question is very big (> 4 GB). I believe it won't help much to mmap
the file on a 32 bit system.
2. I don't use writev, but as I understand it, writev probably makes
writing more efficient if the records to be written are scattered. If
I sort my data then I won't have the problem of scattered inputs.
Would writev still bring me some advantages?
3. I am thinking that probably opening too many files in one single
process is also slowing down the I/O. So how about this approach -
Let the parent hash the output file descriptors into M buckets and
accordingly write to M buffers. Once a buffer is full, it would fork a
child process that would sort the records based on fd and write to
files. It would effectively cut down the number of open file
descriptors per process and disk seeks too (by sorting on fd).

Your comments would definitely help me.

Thanks ..

Kelvin Moss wrote:
> On Dec 5, 12:32 pm, Eric Sosman <Eric.Sos...@sun.com> wrote:
>> Kelvin Moss wrote:
>>      The big take-aways:
>>
>>      - You're almost certainly I/O bound, so schemes to
>>        optimize the CPU are optimizing the wrong thing.
>>
>>      - And yet, with "millions of records" the copy steps
>>        may be worth a look.  Plain read()/copy/write()
>>        moves each byte three times; using mmap() on the
>>        input eliminates one move; using mmap() or writev()
>>        on the output eliminates another.
>>
>>      - If more parallelism is needed, use multiple threads
>>        rather than multiple processes.  The synchronization
>>        issues are the same, but the overheads are lower.
>>
>>      - Along with mmap(), use madvise() to inform the
>>        system of the access patterns for your buffers.
>>
>>      - And, as always: Measure, measure, measure!
> 
> Thanks Eric. This all makes a lot of sense. Here are a few questions -
> 
> 1. I had thought about mmaping the input file. Unfortunately the file
> in question is very big (> 4 GB). I believe it won't help much to mmap
> the file on a 32 bit system.

     Map the first chunk of the file, process it, unmap it,
map the next chunk, process it, unmap it, lather, rinse,
repeat.  Choice of chunk size is up to you.
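
     In outline, something like this (untested sketch; process_chunk()
is a placeholder that parses only whole records and returns how many
bytes it consumed, so a record that straddles the window simply shows
up again at the start of the next one):

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define WINDOW (64UL * 1024 * 1024)          /* 64 MB per mapping */

size_t process_chunk(const char *p, size_t len);   /* placeholder */

void process_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    off_t off = 0;
    long pg = getpagesize();

    fstat(fd, &st);
    while (off < st.st_size) {
        off_t  aligned = off - (off % pg);   /* mmap offset must be page-aligned */
        size_t skew    = (size_t)(off - aligned);
        size_t len     = WINDOW + skew;

        if ((off_t)len > st.st_size - aligned)
            len = (size_t)(st.st_size - aligned);

        char *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, aligned);
        madvise(base, len, MADV_SEQUENTIAL);

        size_t used = process_chunk(base + skew, len - skew);
        munmap(base, len);

        if (used == 0)
            break;                           /* a record bigger than the window */
        off += used;
    }
    close(fd);
}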

> 2. I don't use writev, but as I understand it, writev probably makes
> writing more efficient if the records to be written are scattered. If
> I sort my data then I won't have the problem of scattered inputs.
> Would writev still bring me some advantages?

     You keep talking about sorting your data, but in the
original post you said

 >>> It's important to
 >>> preserve the ordering of records in the generated
 >>> files i.e. if record A1 is prior to record A2 in the
 >>> input file, then A1 should still be prior to A2 in the
 >>> result file (if they land up in the same file).

... so I don't understand why any sorting is necessary or
desirable.  Unless I've misunderstood, you're taking in one
long stream of records and splitting them between multiple
output files, preserving their original order.  Maybe you
have an input file listing every United States citizen,
sorted by name, and you want to produce fifty-something
output files also sorted by name, one for each State or
territory.  If that's not close to what you're attempting,
describe your intentions more fully.

     Anyhow, the important thing about this framework is
that you don't need to rearrange the records, just route
each to the proper output (or outputs).  You could do this
by visiting each input record in turn, still in the input
buffer, and doing a write() to the chosen file (or files).
But with a large number of records that will be a lot of
trips in and out of the kernel, and it's probably a good
idea to try to write more than one record per trip.

     First idea: Set aside a buffer area for each output
file, copy each record to its proper buffer(s) as you
encounter it, and do a write() whenever a buffer fills up.
Advantages: Fewer round-trips to the kernel, simple to code.
Drawbacks: Every record gets copied at least twice, once
to "marshal" into its buffer, and a second time to send
the data to the kernel's file system cache.
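
     The first idea is little more than this in code (untested sketch):

#include <string.h>
#include <unistd.h>

#define OUTBUF (256 * 1024)

struct outfile {
    int    fd;
    size_t used;
    char   buf[OUTBUF];
};

void put_record(struct outfile *o, const char *rec, size_t len)
{
    if (o->used + len > sizeof o->buf) {   /* flush before it overflows */
        write(o->fd, o->buf, o->used);
        o->used = 0;
    }
    if (len > sizeof o->buf) {             /* oversized record: write it directly */
        write(o->fd, rec, len);
        return;
    }
    memcpy(o->buf + o->used, rec, len);
    o->used += len;
}

void flush_outfile(struct outfile *o)      /* call once at end of input */
{
    if (o->used) {
        write(o->fd, o->buf, o->used);
        o->used = 0;
    }
}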

     Second idea: Use writev() instead of write(), so you
can send all the Alaska records to the Alaska output
directly from the input buffer, without copying.  Advantage:
Fewer round-trips to the kernel, simple to code, only one
copy.  Drawbacks: Must remember to *do* the pending writev()
calls before changing the mapping of (or reading more data
into) the input buffer.
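
     A sketch of the second idea (untested; short writes and error
checks are ignored):

#include <sys/types.h>
#include <sys/uio.h>

#define BATCH 64                        /* keep this <= IOV_MAX */

struct gather {
    int          fd;
    int          count;
    struct iovec iov[BATCH];
};

void add_record(struct gather *g, const void *rec, size_t len)
{
    if (g->count == BATCH) {
        writev(g->fd, g->iov, g->count);
        g->count = 0;
    }
    g->iov[g->count].iov_base = (void *)rec;   /* points into the input buffer */
    g->iov[g->count].iov_len  = len;
    g->count++;
}

void flush_gather(struct gather *g)     /* call before touching the input buffer */
{
    if (g->count) {
        writev(g->fd, g->iov, g->count);
        g->count = 0;
    }
}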

     Third idea: Like the first, but with the output buffers
mmap()'ed to their output files so all you need to do is
copy the record to the buffer and let the virtual memory
system do its magic; no write() or writev().  Advantages:
Fewest round-trips to the kernel, only one copy.  Drawbacks:
Using mmap() while a file changes size is a little more work.
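
     And a sketch of the third idea (untested; it grows the file in
fixed extents with ftruncate() and assumes no single record comes
anywhere near an extent in size):

#include <sys/types.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define EXTENT (16UL * 1024 * 1024)     /* grow the file 16 MB at a time */

struct mout {
    int    fd;        /* opened O_RDWR | O_CREAT; rest zero-initialized */
    char  *map;       /* current mapped extent                          */
    size_t used;      /* bytes filled in this extent                    */
    off_t  written;   /* total bytes in the file so far                 */
};

static long pagesize;

static void new_extent(struct mout *m)
{
    if (pagesize == 0)
        pagesize = getpagesize();
    if (m->map)
        munmap(m->map, EXTENT);

    off_t aligned = m->written - (m->written % pagesize);  /* page-align the offset */
    ftruncate(m->fd, aligned + EXTENT);                     /* grow the file first   */
    m->map  = mmap(NULL, EXTENT, PROT_READ | PROT_WRITE,
                   MAP_SHARED, m->fd, aligned);
    m->used = (size_t)(m->written - aligned);               /* skip what's already written */
}

void mout_put(struct mout *m, const char *rec, size_t len)
{
    if (m->map == NULL || m->used + len > EXTENT)
        new_extent(m);
    memcpy(m->map + m->used, rec, len);     /* the VM system does the actual I/O */
    m->used    += len;
    m->written += len;
}

void mout_close(struct mout *m)
{
    if (m->map)
        munmap(m->map, EXTENT);
    ftruncate(m->fd, m->written);    /* trim the unused tail of the last extent */
    close(m->fd);
}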

> 3. I am thinking that probably opening too many files in one single
> process is also slowing down the I/O.

     How many is "too many?"  I'm not familiar with the limits
of FreeBSD, but Solaris and Linux can handle tens of thousands
of open files simultaneously without trouble.  How many do
you need?

> So how about this approach -
> Let the parent hash the output file descriptors into M buckets and
> accordingly write to M buffers. Once a buffer is full, it would fork a
> child process that would sort the records based on fd and write to
> files. It would effectively cut down the number of open file
> descriptors per process and disk seeks too (by sorting on fd).

     This idea of forking seems to fascinate you in a way I
can only think of as unhealthy.  Besides, it now sounds like
you need to copy every record at least twice: Once to dump
it into one of the M buffers, and again when the child "sorts"
(that word, again) the records.

     As for avoiding disk seeks -- Well, I've been assuming
that the input and output are files, not raw devices.  If
so, the file system takes care of managing the physical I/O,
combining multiple writes to adjacent areas into single
physical writes, optimizing the seek patterns, and so on.
You can exercise some control over the FS behavior from the
application level, but it's a weakish sort of control, kind
of like pushing on a rope.

> Your comments would definitely help me.

     More background on what you're trying to do would help me.
What are these millions of records, how big are they, how
many output files are there, why all this talk about sorting?
(And why do you think fork() is free?)

-- 
Eric.Sosman@sun.com
On Dec 7, 3:14 pm, Eric Sosman <Eric.Sos...@sun.com> wrote:
> > 1. I had thought about mmaping the input file. Unfortunately the file
> > in question is very big (> 4 GB). I believe it won't help much to mmap
> > the file on a 32 bit system.
>
>      Map the first chunk of the file, process it, unmap it,
> map the next chunk, process it, unmap it, lather, rinse,
> repeat.  Choice of chunk size is up to you.

OK.. so I would need to keep track myself of how many bytes I have
consumed through the mapped region, since I would not get any
EOF-like indication, right?


> > 2. I don't use writev, but as I understand it, writev probably makes
> > writing more efficient if the records to be written are scattered. If
> > I sort my data then I won't have the problem of scattered inputs.
> > Would writev still bring me some advantages?

>      You keep talking about sorting your data, but in the
> original post you said
>
>  >>> It's important to
>  >>> preserve the ordering of records in the generated
>  >>> files i.e. if record A1 is prior to record A2 in the
>  >>> input file, then A1 should still be prior to A2 in the
>  >>> result file (if they land up in the same file).
>
> ... so I don't understand why any sorting is necessary or
> desirable.  Unless I've misunderstood, you're taking in one
> long stream of records and splitting them between multiple
> output files, preserving their original order.  Maybe you
> have an input file listing every United States citizen,
> sorted by name, and you want to produce fifty-something
> output files also sorted by name, one for each State or
> territory.  If that's not close to what you're attempting,
> describe your intentions more fully.

That's more or less correct. Let me give a simplified example (using
the first two fields as the criterion for the output file) - say the
input file contains records like

A1 B1 C1 D1
A1 B2 C1 D2
A1 B1 C3 D4
A1 B3 C2 D4
A1 B3 C6 C7

So at the end, we would have output files like -
A1_B1.xtn
---------
A1 B1 C1 D1
A1 B1 C3 D4

A1_B2.xtn
---------
A1 B2 C1 D2

A1_B3.xtn
---------
A1 B3 C2 D4
A1 B3 C6 C7
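
For reference, the logic of the current simple version boils down to
something like this (illustrative C only; the real code uses C++
iostreams and doesn't have these fixed limits):

/* Illustrative only: key = first two fields, append the record to the
 * file named after the key, keeping files open across records so the
 * input order within each file is preserved. */
#include <stdio.h>
#include <string.h>

#define MAXSTREAMS 5000        /* roughly the number of distinct keys expected */

static struct { char key[64]; FILE *fp; } streams[MAXSTREAMS];
static int nstreams;

static FILE *stream_for(const char *key)
{
    char name[96];
    int i;

    for (i = 0; i < nstreams; i++)          /* linear scan; a hash would be better */
        if (strcmp(streams[i].key, key) == 0)
            return streams[i].fp;

    snprintf(name, sizeof name, "%s.xtn", key);
    strncpy(streams[nstreams].key, key, sizeof streams[nstreams].key - 1);
    return streams[nstreams++].fp = fopen(name, "a");   /* no overflow check */
}

int main(void)
{
    char line[4096], f1[32], f2[32], key[70];

    while (fgets(line, sizeof line, stdin)) {
        if (sscanf(line, "%31s %31s", f1, f2) != 2)
            continue;                        /* skip malformed lines           */
        snprintf(key, sizeof key, "%s_%s", f1, f2);
        fputs(line, stream_for(key));        /* input order per file preserved */
    }
    return 0;
}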



>      Anyhow, the important thing about this framework is
> that you don't need to rearrange the records, just route
> each to the proper output (or outputs).  You could do this
> by visiting each input record in turn, still in the input
> buffer, and doing a write() to the chosen file (or files).
> But with a large number of records that will be a lot of
> trips in and out of the kernel, and it's probably a good
> idea to try to write more than one record per trip.

True

>      First idea: Set aside a buffer area for each output
> file, copy each record to its proper buffer(s) as you
> encounter it, and do a write() whenever a buffer fills up.
> Advantages: Fewer round-trips to the kernel, simple to code.
> Drawbacks: Every record gets copied at least twice, once
> to "marshal" into its buffer, and a second time to send
> the data to the kernel's file system cache.

I do not know in advance how many files are going to get opened.
Hence that would mean resorting to malloc/new, which I would like to
avoid if I can. The overhead of dynamic memory management for so many
buffers would probably outweigh the benefits of this approach.

>      Second idea: Use writev() instead of write(), so you
> can send all the Alaska records to the Alaska output
> directly from the input buffer, without copying.  Advantage:
> Fewer round-trips to the kernel, simple to code, only one
> copy.  Drawbacks: Must remember to *do* the pending writev()
> calls before changing the mapping of (or reading more data
> into) the input buffer.

Seems like a good approach to me. Another issue I have is that the
legacy code uses C++ iostreams for its I/O. Would there be any issue
in mixing writev with the legacy code in such a case?


>      Third idea: Like the first, but with the output buffers
> mmap()'ed to their output files so all you need to do is
> copy the record to the buffer and let the virtual memory
> system do its magic; no write() or writev().  Advantages:
> Fewest round-trips to the kernel, only one copy.  Drawbacks:
> Using mmap() while a file changes size is a little more work.

Good approach, but it again requires resorting to dynamic memory
management (aggravated when the file size changes). I guess it's hard
to get the best of both worlds :-)


> > 3. I am thinking that probably opening too many files in one single
> > process is also slowing down the I/O.
>
>      How many is "too many?"  I'm not familiar with the limits
> of FreeBSD, but Solaris and Linux can handle tens of thousands
> of open files simultaneously without trouble.  How many do
> you need?

I am expecting some 5000 file handles. I am not sure of the BSD
limits, but going by what you say it should not become a bottleneck.


> > So how about this approach -
> > Let the parent hash the output file descriptors into M buckets and
> > accordingly write to M buffers. Once a buffer is full, it would fork a
> > child process that would sort the records based on fd and write to
> > files. It would effectively cut down the number of open file
> > descriptors per process and disk seeks too (by sorting on fd).
>
>      This idea of forking seems to fascinate you in a way I
> can only think of as unhealthy.  Besides, it now sounds like
> you need to copy every record at least twice: Once to dump
> it into one of the M buffers, and again when the child "sorts"
> (that word, again) the records.
>      As for avoiding disk seeks -- Well, I've been assuming
> that the input and output are files, not raw devices.  If

That's correct.

> so, the file system takes care of managing the physical I/O,
> combining multiple writes to adjacent areas into single
> physical writes, optimizing the seek patterns, and so on.

Any idea whether older Unices (like FreeBSD 4.x) also employ such
optimizations?

> You can exercise some control over the FS behavior from the
> application level, but it's a weakish sort of control, kind
> of like pushing on a rope.

OK

> > Your comments would definitely help me.
>
>      More background on what you're trying to do would help me.
> What are these millions of records, how big are they, how
> many output files are there, why all this talk about sorting?
> (And why do you think fork() is free?)

I have tried to give some background on the problem. The records are
essentially user data records, each of variable size (unlike the
example I gave above). I agree with you that it's essentially an
I/O-bound process. But we don't seem to make any use of the multiple
processors even though they are available. Indeed fork isn't free,
but it may pay off with multiple processors around.

Thanks again, your comments would definitely help.

In article 
<1b3b4120-20dd-4a14-acdf-00f45a09ca21@a35g2000prf.googlegroups.com>,
 Kelvin Moss <km_jr_usenet@yahoo.com> wrote:

> On Dec 7, 3:14 pm, Eric Sosman <Eric.Sos...@sun.com> wrote:
> > so, the file system takes care of managing the physical I/O,
> > combining multiple writes to adjacent areas into single
> > physical writes, optimizing the seek patterns, and so on.
> 
> Any idea whether older Unices (like FreeBSD 4.x) also employ such
> optimizations?

I think most OSes in the past 15-20 years are similar.

> > You can exercise some control over the FS behavior from the
> > application level, but it's a weakish sort of control, kind
> > of like pushing on a rope.
> 
> OK
> 
> > > Your comments would definitely help me.
> >
> >      More background on what you're trying to do would help me.
> > What are these millions of records, how big are they, how
> > many output files are there, why all this talk about sorting?
> > (And why do you think fork() is free?)
> 
> I have tried to give some background on the problem. The records are
> essentially user data records, each of variable size (unlike the
> example I gave above). I agree with you that it's essentially an
> I/O-bound process. But we don't seem to make any use of the multiple
> processors even though they are available. Indeed fork isn't free,
> but it may pay off with multiple processors around.

Your application is totally sequential, so there's not really any way to 
benefit from multiple processors at this level.

Behind the scenes the kernel will make use of multiprocessors when 
actually performing the disk I/O.  When your process calls write(), the 
data will simply be copied to a kernel buffer, and any available CPU 
will eventually perform the actual write to the disk.  There may also be 
prefetching, so while your CPU is processing one record, another CPU may 
start the disk read for the next record, and it will already be in the 
kernel buffer by the time your application looks for it.

-- 
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***