


random write access to a file in Python

I have a 250 Gbyte file (occupies the whole hard drive space) and want 
to change only eight bytes in this file at a given offset of appr. 200 
Gbyte (all other data in that file should remain unchanged).

How can I do that in Python?

Claudio Grondi
8/25/2006 2:04:02 PM
comp.lang.python

[Claudio Grondi]
> I have a 250 Gbyte file (occupies the whole hard drive space)

Then where is Python stored ;-)?

> and want to change only eight bytes in this file at a given offset of appr. 200
> Gbyte (all other data in that file should remain unchanged).
>
> How can I do that in Python?

Same way you'd do it in C, really:

f = open(PATH_TO_FILE, "r+b")
f.seek(appr. 200 Gbyte)
f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
f.close()

This depends on your operating system and file system and platform C
library supporting seeks to very large offsets.  For example, Windows
using NTFS does.  Try it.  Python should complain (raise an exception
during the seek() attempt) if your box refuses to cooperate.  Use as
recent a released version of Python as you can.
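
For concreteness, a minimal runnable sketch of the above (the path,
offset and byte string are placeholders to adapt; "r+b" is the
important part, since it updates in place without truncating):

PATH_TO_FILE = "huge.bin"                 # hypothetical path
OFFSET = 200 * 1024 ** 3                  # appr. 200 Gbyte into the file
NEW_BYTES = b"\x00\x01\x02\x03\x04\x05\x06\x07"   # exactly 8 bytes

f = open(PATH_TO_FILE, "r+b")   # update mode: existing data is preserved
f.seek(OFFSET)                  # needs large-file support from OS/filesystem
f.write(NEW_BYTES)              # overwrites 8 bytes, leaves the rest alone
f.close()
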
tim.peters
8/25/2006 2:15:39 PM
Tim Peters wrote:
> [Claudio Grondi]
> 
>> I have a 250 Gbyte file (occupies the whole hard drive space)
> 
> 
> Then where is Python stored ;-)?
> 
>> and want to change only eight bytes in this file at a given offset of 
>> appr. 200
>> Gbyte (all other data in that file should remain unchanged).
>>
>> How can I do that in Python?
> 
> 
> Same way you'd do it in C, really:
> 
> f = open(PATH_TO_FILE, "r+b")
> f.seek(appr. 200 Gbyte)
> f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
> f.close()
> 
> This depends on your operating system and file system and platform C
> library supporting seeks to very large offsets.  For example, Windows
> using NTFS does.  Try it.  Python should complain (raise an exception
> during the seek() attempt) if your box refuses to cooperate.  Use as
> recent a released version of Python as you can.

Thank you much for the quick response.

The core of my problem was ... trying to use 'wb' or 'w+b' ... (stupid 
me ...)

Claudio Grondi
8/25/2006 2:39:14 PM
On Fri, 25 Aug 2006 16:39:14 +0200, Claudio Grondi
<claudio.grondi@freenet.de> declaimed the following in comp.lang.python:

> The core of my problem was ... trying to use 'wb' or 'w+b' ... (stupid 
> me ...)

	Ouch... How many times did you have to restore that massive file
from backup?

-- 
	Wulfraed	Dennis Lee Bieber		KD6MOG
	wlfraed@ix.netcom.com		wulfraed@bestiaria.com
		HTTP://wlfraed.home.netcom.com/
	(Bestiaria Support Staff:		web-asst@bestiaria.com)
		HTTP://www.bestiaria.com/
wlfraed
8/25/2006 4:31:05 PM
Dennis Lee Bieber wrote:
> On Fri, 25 Aug 2006 16:39:14 +0200, Claudio Grondi
> <claudio.grondi@freenet.de> declaimed the following in comp.lang.python:
> 
> 
>>The core of my problem was ... trying to use 'wb' or 'w+b' ... (stupid 
>>me ...)
> 
> 
> 	Ouch... How many times did you have to restore that massive file
> from backup?
> 
I was smart enough to try it first on a very small file wondering what 
was happening. Python documentation and even Google search after 'random 
file access in Python' were not helpful as there was no example and no 
hint available.

The only hint about random file access in Python I found with Google was 
the table of contents of the "Python Cookbook" from O'Reilly:
   http://www.oreilly.com/catalog/pythoncook2/toc.html
and some hints about random read access.

I was stupid enough to forget about 'r+' (used it many times before in 
C/C++ a decade ago, but not yet in Python) thinking just too much the 
Pythonic way:

  ===============================================================
  if I want to write, I don't open for reading (plus or not plus)
  ===============================================================

Actually my file was 'only' 42 GByte, but I wanted to pose the question in 
a way that rules out suggesting an intermediate file.

In the meantime I have chosen a completely new approach, because random 
writing to the hard disk seems to actually move the disk head on every 
seek; apparently no cache is used to reorder the pieces before they are 
written, so if there are many of them the only option is to collect them 
in memory first and then write them to the file. This makes the 
straightforward, intuitive way of programming it a bit complicated, 
because working on large files means working in chunks and throwing away 
some processing results when they don't fill the gaps. I suppose I am 
still not on the right path, so by the way:

Is there a ready to use (free, best Open Source) tool able to sort lines 
(each line appr. 20 bytes long) of a XXX GByte large text file (i.e. in 
place) taking full advantage of available memory to speed up the process 
as much as possible?

Claudio Grondi
8/25/2006 8:55:59 PM
>  ===============================================================
>  if I want to write, I don't open for reading (plus or not plus)
>  ===============================================================

I was thinking the same thing.  I know python is only following
C, which is good, but the flags are confusing.  I'll bet that
when fopen() was first written, they had only 'r', and 'w'.
Then, when that was working they thought, "but what if we
want to read and write on the same handle".  So they tacked
on the '+' to mean, "Let me do the other thing too".

-- 
Posted via a free Usenet account from http://www.teranews.com

st8492
8/25/2006 9:08:02 PM
> Is there a ready to use (free, best Open Source) tool able to sort lines 
> (each line appr. 20 bytes long) of a XXX GByte large text file (i.e. in 
> place) taking full advantage of available memory to speed up the process 
> as much as possible?

Sounds like an occasion to use a merge-sort.  The pseudo-code 
would be:

break up the file into bite-sized chunks (maybe a couple megs 
each).
	Sort each of them linewise.
	Write them out to intermediate files

Once you have these pieces, open each file

read the first line of each one.

[here] Find the "earliest" of each of those lines according to 
your sort-order.
	write it to your output file
	read the next line from that particular file
	return to [here]

There are some optimizations that can be had on this as 
well...you can find the "earliest" *and* the "next earliest" of 
those lines/files, and just read from the "earliest" file until 
the current line of it passes "next earliest"...lather, rinse, 
repeat shifting "next earliest" to be the "earliest" and then 
find the new "next earliest".

I don't know if the GNU "sort" utility does this, but I've thrown 
some rather large files at it and haven't choked it yet.
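
For reference, a rough Python sketch of the chunk-then-merge scheme
described above, using only the standard library (heapq.merge does the
"earliest line" selection); the chunk size and file handling are
arbitrary placeholders, not a tuned implementation:

import heapq
import itertools
import os
import tempfile

CHUNK_LINES = 1000000        # assumed chunk size; pick what fits in memory

def external_sort(in_path, out_path):
    chunk_paths = []
    # Phase 1: slice the input into sorted chunk files.
    with open(in_path) as infile:
        while True:
            chunk = list(itertools.islice(infile, CHUNK_LINES))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)
    # Phase 2: merge the chunks, always emitting the "earliest" line.
    readers = [open(p) for p in chunk_paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*readers))
    for r in readers:
        r.close()
    for p in chunk_paths:
        os.remove(p)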

-tkc



python.list
8/25/2006 9:54:55 PM
Claudio Grondi wrote:

> I was smart enough to try it first on a very small file wondering what 
> was happening. Python documentation and even Google search after 'random 
> file access in Python' were not helpful as there was no example and no 
> hint available.

one would think that the repeated use of the word "truncate" in the 
documentation for open/file would have provided the necessary hints:

     "'w' for writing (truncating an existing file)"

     "Modes 'r+', 'w+' and 'a+' open the file for updating
     (note that 'w+' truncates the file)"
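
A quick way to see the difference on a small throwaway file (the file
name here is made up; just a sketch to run somewhere harmless):

open("demo.bin", "wb").write(b"0123456789")    # create a scratch file

f = open("demo.bin", "r+b")                    # update: old data is kept
f.seek(4)
f.write(b"XX")
f.close()
print(open("demo.bin", "rb").read())           # b'0123XX6789'

f = open("demo.bin", "w+b")                    # 'w+' truncates on open
f.close()
print(open("demo.bin", "rb").read())           # b''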

</F>

fredrik2101
8/25/2006 10:04:48 PM
On Fri, 25 Aug 2006 22:55:59 +0200, Claudio Grondi
<claudio.grondi@freenet.de> declaimed the following in comp.lang.python:

> Is there a ready to use (free, best Open Source) tool able to sort lines 
> (each line appr. 20 bytes long) of a XXX GByte large text file (i.e. in 
> place) taking full advantage of available memory to speed up the process 
> as much as possible?
>
	Well... For "in place" my first requirement would have to be:
fixed-length records since you are trying to exchange chunks of data and
that requires source and destinations to be the same size.

	Unfortunately, "in place" and "advantage of available memory" tend
to be at odds.

	Almost any strict sort algorithm (ie, one that would apply to
memory) can be applied to "in place" (bubble, quick, etc.) but would be
working on a record by record basis -- no in-memory chunking.

	Taking advantage of memory would involve a variation of the
classical sort-merge algorithm (classically illustrated using four
tape-drives <G>)...

Drive	1		2		3		4
		IN		Out1	Out2	spare

Read from IN as many records as can be handled by the in-memory sorter
(or, for simplicity, a known fixed number -- say 50). Sort the 50, write
chunk to Out1.

Read from IN another 50, sort, write to Out2.

Repeat, alternating Out1 and Out2 until EOF(IN).

Merge:
Drive	1		2		3		4
		Out1	In1		In2		Out2

Read one record from In1 and In2, compare, write "earlier" to Out1 and
read next record from whichever input you wrote from. Repeat until the
chunks are complete (50 in from each input, 100 records out).

Repeat using Out2 this time... And alternate for each chunk.

Continue:
Drive	1		2		3		4
		In1		Out1	Out2	In2

Same operation, but with 100 records each input chunk (200 output).

Repeat merging until there is only one chunk in each of In1 and In2,
meaning there will only be one chunk of twice the length in Out1 after
the final merge.


	Now -- what you might be able to do is "tag sort". If the keys are
much shorter than the data lines (though given your ~20byte records,
this is unlikely to be effective)...

	Read only the keys into memory, along with a record number (if fixed
length) or record offset (if variable length). Sort the keys with tag.
Then, in key order, copy the records from the input file to an output
file. (If fixed length records, you could swap in place if you ensure
you keep updating the in-memory tags for the swaps)
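
A rough sketch of that tag-sort idea for fixed-length records (the
record and key lengths below are assumptions, and the file names are
placeholders; as noted, with ~20-byte records the tags are not much
smaller than the records themselves):

RECLEN = 20                  # assumed fixed record length
KEYLEN = 8                   # assumed: sort key is the first 8 bytes

def tag_sort(in_path, out_path, reclen=RECLEN, keylen=KEYLEN):
    # Pass 1: read only (key, record number) tags into memory.
    tags = []
    with open(in_path, "rb") as f:
        recno = 0
        while True:
            rec = f.read(reclen)
            if len(rec) < reclen:
                break
            tags.append((rec[:keylen], recno))
            recno += 1
    tags.sort()              # in-memory sort of the tags only
    # Pass 2: copy records to the output file in key order.
    with open(in_path, "rb") as f, open(out_path, "wb") as out:
        for _key, recno in tags:
            f.seek(recno * reclen)
            out.write(f.read(reclen))
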
-- 
	Wulfraed	Dennis Lee Bieber		KD6MOG
	wlfraed@ix.netcom.com		wulfraed@bestiaria.com
		HTTP://wlfraed.home.netcom.com/
	(Bestiaria Support Staff:		web-asst@bestiaria.com)
		HTTP://www.bestiaria.com/
wlfraed
8/26/2006 6:54:48 PM
Claudio Grondi <claudio.grondi@freenet.de> writes:
> Is there a ready to use (free, best Open Source) tool able to sort
> lines (each line appr. 20 bytes long) of a XXX GByte large text file
> (i.e. in place) taking full advantage of available memory to speed up
> the process as much as possible?

Try the standard Unix/Linux sort utility.  Use the --buffer-size=SIZE
to tell it how much memory to use.
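
If you want to drive it from a Python script, something along these
lines should work (the file names, buffer size and temp directory are
placeholders; --buffer-size, -T and -o are GNU sort options):

import subprocess

subprocess.check_call([
    "sort",
    "--buffer-size=1G",      # how much RAM sort may use
    "-T", "/scratch",        # temp directory with enough free space
    "-o", "sorted.txt",      # output file
    "unsorted.txt",
])
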
phr.cx
8/26/2006 7:35:16 PM
Paul Rubin wrote:
> Claudio Grondi <claudio.grondi@freenet.de> writes:
> 
>>Is there a ready to use (free, best Open Source) tool able to sort
>>lines (each line appr. 20 bytes long) of a XXX GByte large text file
>>(i.e. in place) taking full advantage of available memory to speed up
>>the process as much as possible?
> 
> 
> Try the standard Unix/Linux sort utility.  Use the --buffer-size=SIZE
> to tell it how much memory to use.
I am on Windows and it seems, that Windows XP SP2 'sort' can work with 
the file, but not without a temporary file and space for the resulting 
file,  so triple of space of the file to sort must be provided.
Windows XP 'sort' constantly uses appr. 300 MByte of memory and can't 
use 100% of the CPU all the time, probably due to I/O via USB (25 
MByte/s is the top transfer speed I have seen).
I can't tell yet whether it succeeded, as the sorting of the appr. 80 GByte 
file with fixed-length records of 20 bytes is still in progress (eleven 
CPU hours / 18 wall-clock hours so far).
I am not sure whether writing my own program would be much faster than 
the system's own sort (I haven't yet tried setting the memory size option 
to e.g. 1.5 GByte, as the sort help says it is better not to specify it). 
My machine is a Pentium 4, 2.8 GHz with 2.0 GByte RAM.
I would be glad to hear whether the sorting time I am currently seeing is 
as expected for this kind of task, or whether there is still much room 
for improvement.

Claudio Grondi
8/26/2006 10:19:14 PM
Claudio Grondi <claudio.grondi@freenet.de> writes:
> > Try the standard Unix/Linux sort utility.  Use the --buffer-size=SIZE
> > to tell it how much memory to use.
> I am on Windows and it seems, that Windows XP SP2 'sort' can work with
> the file, but not without a temporary file and space for the resulting
> file,  so triple of space of the file to sort must be provided.

Oh, sorry, I didn't see the earlier parts of the thread.  Anyway,
depending on the application, it's probably not worth the hassle of
coding something yourself, instead of just throwing more disk space at
the Unix utility.  But especially if the fields are fixed size, you
could just mmap the file and then do quicksort on disk.  Simplest
would be to just let the OS paging system take care of caching stuff;
if you wanted to get fancy, you could sort in memory once the sorting
regions got below a certain size.
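
A rough sketch of the mmap idea for fixed-length records (the record
length and file name are assumptions; the OS page cache does all the
buffering, so this is illustrative rather than fast):

import mmap

RECLEN = 20                                # assumed fixed record length

def sort_records_in_place(path, reclen=RECLEN):
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)      # map the whole file
        nrec = len(mm) // reclen

        def rec(i):
            return mm[i * reclen:(i + 1) * reclen]

        def swap(i, j):
            ri, rj = rec(i), rec(j)
            mm[i * reclen:(i + 1) * reclen] = rj
            mm[j * reclen:(j + 1) * reclen] = ri

        def quicksort(lo, hi):             # sort records lo..hi inclusive
            while lo < hi:
                pivot = rec((lo + hi) // 2)
                i, j = lo, hi
                while i <= j:
                    while rec(i) < pivot:
                        i += 1
                    while rec(j) > pivot:
                        j -= 1
                    if i <= j:
                        swap(i, j)
                        i += 1
                        j -= 1
                # recurse into the smaller half to bound stack depth
                if j - lo < hi - i:
                    quicksort(lo, j)
                    lo = i
                else:
                    quicksort(i, hi)
                    hi = j

        quicksort(0, nrec - 1)
        mm.flush()
        mm.close()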

A huge amount of stuff has been written (e.g. about half of Knuth vol
3) about how to sort.  Remember too, that traditionally large-scale
sorting was done on serial media like tape drives, so random access
isn't that vital.
phr.cx
8/26/2006 10:54:40 PM
Paul Rubin wrote:
> Claudio Grondi <claudio.grondi@freenet.de> writes:
> 
>>>Try the standard Unix/Linux sort utility.  Use the --buffer-size=SIZE
>>>to tell it how much memory to use.
>>
>>I am on Windows and it seems, that Windows XP SP2 'sort' can work with
>>the file, but not without a temporary file and space for the resulting
>>file,  so triple of space of the file to sort must be provided.
> 
> 
> Oh, sorry, I didn't see the earlier parts of the thread.  Anyway,
> depending on the application, it's probably not worth the hassle of
> coding something yourself, instead of just throwing more disk space at
> the Unix utility.  But especially if the fields are fixed size, you
> could just mmap the file and then do quicksort on disk.  Simplest
> would be to just let the OS paging system take care of caching stuff;
> if you wanted to get fancy, you could sort in memory once the sorting
> regions got below a certain size.
> 
> A huge amount of stuff has been written (e.g. about half of Knuth vol
> 3) about how to sort.  Remember too, that traditionally large-scale
> sorting was done on serial media like tape drives, so random access
> isn't that vital.

Does it mean, that in case of very large files:
   the size of available memory for the sorting operation (making it 
possible to work on larger chunks of data in memory) has less impact on 
the actual sorting speed than
   the speed of the data transfer from/to storage device(s)
?
So, that the most effective measure towards shortening the time required 
for sorting very large files were to use faster hard drives (e.g. 10.000 
rpm instead of 7.600 rpm) and faster interfaces for the data transfer 
(e.g. E-IDE or S-ATA instead of USB), right?

Claudio Grondi
8/26/2006 11:55:25 PM
Claudio Grondi <claudio.grondi@freenet.de> writes:
> Does it mean, that in case of very large files:
>    the size of available memory for the sorting operation (making it
> possible to work on larger chunks of data in memory) has less impact
> on the actual sorting speed than
>    the speed of the data transfer from/to storage device(s)

Transfer speed and parallelism matters most, if the cpu can keep up.
Parallelism means you want multiple drives that you can do i/o to
simultaneously without having to seek.  Random access helps simulate
this, but only somewhat.  Large database installations always want
lots of parallel disks for similar reasons to this.

The basic method of sorting large files has traditionally been:

1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
big as will fit in memory.  Sort each piece with your favorite
in-memory sort, and write out sorted pieces that we'll designate
r(1,1),...r(1,n).  The r(a,b)'s are called "runs".  Use multiple
output devices (tape or disk drives) to write out the runs, i.e. each
output tape will contain its own bunch of runs.

2) Read back a bunch of the runs in parallel, i.e. say you have
written r(1,1),r(1,2),...,r(1,5) to five separate tape drives.  Read
them back simultaneously and merge them (requires very little external
memory) into a new "second level" run called r(2,1).  Similarly merge
r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.

3) Merge the second level runs into third level runs, and so forth
recursively until there's only one giant run.  That is the sorted
output.

The amount of memory determines the size of the initial "first level"
runs, which determines how many recursive merges are needed, so more
memory helps.

The number of levels of recursion also depends on the "merge order",
i.e. the number of runs merged at each merge step.  You really want
each merge to read from M physical input devices that are separate
from the output device(s).  

There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?  

Is this some kind of production deployment problem you're working on,
or do you just have a big pile of data you need to sort once?  If you
need to deploy across multiple machines (i.e. you can't just buy a
giant machine if you need to do this sorting frequently), then I'd
suggest reading up on the subject, starting with Knuth vol 3.  
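
To make the merge-order point concrete, a small sketch of the
multi-level merge on top of already-sorted run files (heapq.merge
plays the role of the M-way comparison; the run file names and the
order of 8 are made up):

import heapq
import os

def merge_runs(run_paths, out_path, order=8):
    level = 0
    while len(run_paths) > 1:
        next_runs = []
        # Merge the runs "order" at a time into the next level of runs.
        for i in range(0, len(run_paths), order):
            group = run_paths[i:i + order]
            out_name = "run_L%d_%d.tmp" % (level + 1, i // order)
            readers = [open(p) for p in group]
            with open(out_name, "w") as out:
                out.writelines(heapq.merge(*readers))
            for r in readers:
                r.close()
            for p in group:
                os.remove(p)
            next_runs.append(out_name)
        run_paths = next_runs
        level += 1
    os.rename(run_paths[0], out_path)
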
phr.cx
8/27/2006 12:34:29 AM
Paul Rubin wrote:
> Claudio Grondi <claudio.grondi@freenet.de> writes:
> 
>>Does it mean, that in case of very large files:
>>   the size of available memory for the sorting operation (making it
>>possible to work on larger chunks of data in memory) has less impact
>>on the actual sorting speed than
>>   the speed of the data transfer from/to storage device(s)
> 
> 
> Transfer speed and parallelism matters most, if the cpu can keep up.
> Parallelism means you want multiple drives that you can do i/o to
> simultaneously without having to seek.  Random access helps simulate
> this, but only somewhat.  Large database installations always want
> lots of parallel disks for similar reasons to this.
> 
> The basic method of sorting large files has traditionally been:
> 
> 1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
> big as will fit in memory.  Sort each piece with your favorite
> in-memory sort, and write out sorted pieces that we'll designate
> r(1,1),...r(1,n).  The r(a,b)'s are called "runs".  Use multiple
> output devices (tape or disk drives) to write out the runs, i.e. each
> output tape will contain its own bunch of runs.
> 
> 2) Read back a bunch of the runs in parallel, i.e. say you have
> written r(1,1),r(1,2),...,r(1,5) to five separate tape drives.  Read
> them back simultaneously and merge them (requires very little external
> memory) into a new "second level" run called r(2,1).  Similarly merge
> r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.
> 
> 3) Merge the second level runs into third level runs, and so forth
> recursively until there's only one giant run.  That is the sorted
> output.
> 
> The amount of memory determines the size of the initial "first level"
> runs, which determines how many recursive merges are needed, so more
> memory helps.
> 
> The number of levels of recursion also depends on the "merge order",
> i.e. the number of runs merged at each merge step.  You really want
> each merge to read from M physical input devices that are separate
> from the output device(s).  
> 
> There are obviously a lot of optimizations that the above description
> is missing, and there's a lot of strategy about how to organize the
> merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
> output, or 5 in and 3 out, or what?  
The Windows XP SP 2 '/> sort' (sorting of four Gigs of 20 byte records 
took 12 CPU and 18 usual hours) has, from what I could observe on the 
task manager, done the job in only two runs of 'copying' :
   1. first to a temporary file,
   2. then to the final file.
In the second run the size of the chunks processed between reading and 
writing was on the order of tens of Megabytes, whereas in the first run it 
was on the order of hundreds of Megabytes. I suppose that the procedure 
behind it was as follows:
1. decision about the size of chunks to split the file into and choosing 
the size of required memory
2. processing the chunks with in-memory sorting them and writing to the 
temporary file
3. decision about the size of buffers for merge sorting the chunks into 
the final file, so that they all fit into the 300 MByte of used memory
4. opening as many 'pipes' as there were chunks filling all of the pipe 
buffers up when one of them runs out of data
5. continuously switching between reading and writing to the hard disk, 
writing the results of the merge sorting to the final file always when 
one of the buffers run out of data and then filling up all of the 
buffers for the next cycle (concluded from observed scattered reading, 
smooth writing)

> 
> Is this some kind of production deployment problem you're working on,
> or do you just have a big pile of data you need to sort once?  
The latter.
One of the intermediate reasons behind doing it, is an attempt to get 
more and better intuitive understanding what are the hardware and 
software limits of brute force based approaches to solving problems in 
the area of AI, language processing and data compression.

> If you
> need to deploy across multiple machines (i.e. you can't just buy a
> giant machine if you need to do this sorting frequently), then I'd
> suggest reading up on the subject, starting with Knuth vol 3.  

Thank you much for your detailed response.

Claudio Grondi
8/27/2006 7:44:30 AM
Claudio Grondi <claudio.grondi@freenet.de> writes:
> The Windows XP SP 2 '/> sort' (sorting of four Gigs of 20 byte records
> took 12 CPU and 18 usual hours) has, from what I could observe on the
> task manager, done the job in only two runs of 'copying' :

That is terrible; on a reasonably fast machine these days it should
take just a few minutes.

> > Is this some kind of production deployment problem you're working on,
> > or do you just have a big pile of data you need to sort once?
> The latter.
> One of the intermediate reasons behind doing it, is an attempt to get
> more and better intuitive understanding what are the hardware and
> software limits of brute force based approaches to solving problems in
> the area of AI, language processing and data compression.

This makes no sense; the methods you want are from the ancient days of
"data processing", nothing to do with AI.  Remember that the mainframe
computers of the 1960's that this stuff was developed on, had less
computing power and memory than a modern mobile phone.  But phone
companies, tax agencies, and so forth, still managed to run these big
sorting tasks on those mainframes in order to send people their phone
bills, identify the tax cheaters, etc.  There is rich and wonderful
history to it.
phr.cx
8/27/2006 5:57:47 PM
On Sun, 27 Aug 2006 01:55:25 +0200, Claudio Grondi
<claudio.grondi@freenet.de> declaimed the following in comp.lang.python:

> So, that the most effective measure towards shortening the time required 
> for sorting very large files were to use faster hard drives (e.g. 10.000 
> rpm instead of 7.600 rpm) and faster interfaces for the data transfer 
> (e.g. E-IDE or S-ATA instead of USB), right?
>
	Definitely not USB <G> (nor FireWire).

	Of course, if you wanted the real feel of classical sorting, you'd
put up a bank of 3-4 9-track tape drives <G>
-- 
	Wulfraed	Dennis Lee Bieber		KD6MOG
	wlfraed@ix.netcom.com		wulfraed@bestiaria.com
		HTTP://wlfraed.home.netcom.com/
	(Bestiaria Support Staff:		web-asst@bestiaria.com)
		HTTP://www.bestiaria.com/
wlfraed
8/27/2006 5:58:01 PM
On Sun, 27 Aug 2006 09:44:30 +0200, Claudio Grondi
<claudio.grondi@freenet.de> declaimed the following in comp.lang.python:

> The Windows XP SP 2 '/> sort' (sorting of four Gigs of 20 byte records 
> took 12 CPU and 18 usual hours) has, from what I could observe on the 
> task manager, done the job in only two runs of 'copying' :
>    1. first to a temporary file,
>    2. then to the final file.
> In the second run the size of the chunks processed between reading and 
> writing was on the order of tens of Megabytes, whereas in the first run it 
> was on the order of hundreds of Megabytes. I suppose that the procedure 
> behind it was as follows:
> 1. decision about the size of chunks to split the file into and choosing 
> the size of required memory

	It probably needed to allow for stack space if using a recursive
algorithm (quicksort), which, for short records, could mean more
recursion...

> 2. processing the chunks with in-memory sorting them and writing to the 
> temporary file

	Assuming the 300MB you mention later is the data size per chunk, a
4GB file would take around 14 "chunks"...

> 3. decision about the size of buffers for merge sorting the chunks into 
> the final file, so that they all fit into the 300 MByte of used memory

	The merge phase for this, using lots of head seeks, only needs room
to hold 14 records -- one per chunk. It is possible they have some other
optimizations; like loading all records from a chunk that are
lower/earlier than the first record of all other chunks before writing
any data. This could perhaps help if the input had been nearly sorted to
begin with, as each chunk would be nearly "done".

-- 
	Wulfraed	Dennis Lee Bieber		KD6MOG
	wlfraed@ix.netcom.com		wulfraed@bestiaria.com
		HTTP://wlfraed.home.netcom.com/
	(Bestiaria Support Staff:		web-asst@bestiaria.com)
		HTTP://www.bestiaria.com/
wlfraed
8/27/2006 5:58:01 PM
On Sun, 27 Aug 2006 00:19:14 +0200, Claudio Grondi
<claudio.grondi@freenet.de> declaimed the following in comp.lang.python:

> I am on Windows and it seems, that Windows XP SP2 'sort' can work with 
> the file, but not without a temporary file and space for the resulting 

	Lovely... My desktop's been running WinXP SP2 for a year and I never
even knew there was a "sort" verb on it <G>

	Considering how many times I've been opening small text files in
Excel just to sort records...
-- 
	Wulfraed	Dennis Lee Bieber		KD6MOG
	wlfraed@ix.netcom.com		wulfraed@bestiaria.com
		HTTP://wlfraed.home.netcom.com/
	(Bestiaria Support Staff:		web-asst@bestiaria.com)
		HTTP://www.bestiaria.com/
wlfraed
8/27/2006 5:58:01 PM
Paul Rubin wrote:
> Claudio Grondi <claudio.grondi@freenet.de> writes:
> 
>>The Windows XP SP 2 '/> sort' (sorting of four Gigs of 20 byte records
>>took 12 CPU and 18 usual hours) has, from what I could observe on the
>>task manager, done the job in only two runs of 'copying' :
> 
> 
> That is terrible; on a reasonably fast machine these days it should
> take just a few minutes.
Ok, I see - the misunderstanding is, that there were 4.294.967.296 
records each 20 bytes long, what makes the actual file 85.899.345.920 
bytes large (I just used 'Gigs' for telling the number of records, not 
the size of the file).
Still not acceptable sorting time?

Claudio Grondi
8/27/2006 8:32:21 PM
Claudio Grondi <claudio.grondi@freenet.de> writes:
> >>The Windows XP SP 2 '/> sort' (sorting of four Gigs of 20 byte records
> >>took 12 CPU and 18 usual hours)....
> Ok, I see - the misunderstanding is, that there were 4.294.967.296
> records each 20 bytes long, what makes the actual file 85.899.345.920
> bytes large (I just used 'Gigs' for telling the number of records, not
> the size of the file).
> Still not acceptable sorting time?

I think that's not so bad, though probably still not optimal.  85 GB
divided by 18 hours is 1.3 MB/sec, which means if the program is
reading the file 8 times, it's getting 10 MB/sec through the Windows
file system, which is fairly reasonable throughput.

If you know something about the distribution of the data (e.g. the
records are random 20-byte hashes) you might be able to sort it in
essentially linear time (radix sorting).  But even with a general
purpose algorithm, if you have a few hundred MB of ram and some
scratch disk space to work with, you should be able to sort that much
data in much less than 18 hours.
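
A hedged sketch of that idea for records that really are uniformly
random, e.g. hashes (record length and file names are made up):
scatter by the first byte into 256 bucket files, so each bucket is
roughly 1/256 of the data and small enough to sort in memory.

import os

RECLEN = 20                              # assumed fixed record length

def bucket_sort(in_path, out_path, reclen=RECLEN):
    # Pass 1: scatter records into 256 buckets keyed on their first byte.
    buckets = [open("bucket_%03d.tmp" % b, "wb") for b in range(256)]
    with open(in_path, "rb") as f:
        while True:
            rec = f.read(reclen)
            if len(rec) < reclen:
                break
            buckets[rec[0]].write(rec)
    for b in buckets:
        b.close()
    # Pass 2: sort each bucket in memory and append it to the output.
    with open(out_path, "wb") as out:
        for i in range(256):
            path = "bucket_%03d.tmp" % i
            data = open(path, "rb").read()
            recs = [data[j:j + reclen] for j in range(0, len(data), reclen)]
            recs.sort()
            out.writelines(recs)
            os.remove(path)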

But if you only had to do it once and it's finished now, why do you
still care how long it took?
phr.cx
8/27/2006 10:06:07 PM