(1) I have a relatively large file (9.4GB) that containing a
rectangular matrix (columns separated by tabs, rows separated by
newlines). I want to generate a file that contains the transpose
of this matrix, and do so without slurping the entire matrix into
memory all at once.
Is there a utility that would be helpful for this task?
(2) The only approach I can think of is to write temporary files
containing the transposes of the submatrices corresponding to
"strips" of n consecutive rows, and then using /usr/bin/paste to
glue all these submatrices into a single file.
Still, even this strategy requires transposing the n rows. I can
do this easily with a Python or Perl script, but I was wondering
if there is some Unix utility to do it?
Any suggestions for accomplishing (1) or (2) from the command line
using Unix utilities would be appreciated.
(FWIW, I use zsh.)
TIA!
~K
|
|
0
|
|
|
|
Reply
|
kj
|
6/3/2010 6:19:41 PM |
|
kj wrote:
> (1) I have a relatively large file (9.4GB) that containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> memory all at once.
^^^^^^^^^^^^^^^^^^^
The only way to do this that I can see is to cut out the i-th column,
convert newlines to tabs, print newline, and repeat that for every column
with increasing `i'. IOW, you would be trading memory efficiency against
runtime efficiency, since you would have to read the entire file as often as
you have columns. Utilities that would come in handy then are cut(1) or
awk(1), and tr(1). Shell redirection can optionally write the result into a
new file.
HTH
PointedEars
|
|
0
|
|
|
|
Reply
|
Thomas
|
6/3/2010 9:35:48 PM
|
|
kj wrote:
> (1) I have a relatively large file (9.4GB) that containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
>
> Is there a utility that would be helpful for this task?
>
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.
>
> Still, even this strategy requires transposing the n rows. I can
> do this easily with a Python or Perl script, but I was wondering
> if there is some Unix utility to do it?
>
> Any suggestions for accomplishing (1) or (2) from the command line
> using Unix utilities would be appreciated.
AFAICT, you can't even write a single complete line of the transposition
without having read up to the last line of the original matrix.
While on 64-bit machines with enough RAM a process could probably keep 9.4GB
of data in memory, I think the approach of reading the file line-by-line,
and append each item to its own "column" file, and then just cat all these
files together to get the transposed matrix.
For example, with a small matrix using awk:
$ cat matrix
a b c d
e f g h
i j k l
m n o p
q r s t
$ awk 'NR==1{
for(i=1;i<=NF;i++)names[i]=sprintf("column%03d", i)
}
{
for(i=1;i<=NF;i++){
printf "%s%s", s[i], $i > names[i]
s[i]=FS
}
}
END{
# add terminating newlines
for(i=1;i<=NF;i++)
print "" > names[i]
}' matrix
(adapt to your actual number of columns, separator, etc.)
To avoid the overhead of calling sprintf() every time, you could probably
save the names in an array at the beginning, eg
and then redirect output to names[i]. When that has run,
$ cat column001
a e i m q
$ cat column002
b f j n r
$ cat column* > transposed
$ cat transposed
a e i m q
b f j n r
c g k o s
d h l p t
|
|
-1
|
|
|
|
Reply
|
pk
|
6/3/2010 9:48:31 PM
|
|
On 2010-06-03, kj <no.email@please.post> wrote:
> (1) I have a relatively large file (9.4GB) that containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
> Is there a utility that would be helpful for this task?
There are utilities that would be helpful, but I don't think it
can be done without temporary files.
Quite simply: So far as I can tell, before you can finish the first
line of your output, you have to have read the last line, so either
you're jumping around a lot, which will be excruciatingly slow, or
you have it all in memory, or...
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.
Well, pragmatically, the shortest path is probably to do something
much like this.
A thought: Do you have any expectations about the contents of the
fields? Say, are they of fixed length? Could they be padded to a
fixed length without undue hardship? Do you know in advance the number
of rows and columns? It wouldn't be exceptionally hard to write a
new file containing a fixed-size grid of fixed-size fields, then
write a tiny little C program to go through populating fields
appropriately.
Finally, last but not least: Use sqlite, shove everything into a
table, extract from the table. It will use a ton of disk space and
CPU time, but it will work within available memory and be surprisingly
zippy, I'd guess.
-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
|
|
0
|
|
|
|
Reply
|
Seebs
|
6/4/2010 2:08:43 AM
|
|
kj wrote:
> (1) I have a relatively large file (9.4GB) that containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
>
> Is there a utility that would be helpful for this task?
>
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.
>
> Still, even this strategy requires transposing the n rows. I can
> do this easily with a Python or Perl script, but I was wondering
> if there is some Unix utility to do it?
>
> Any suggestions for accomplishing (1) or (2) from the command line
> using Unix utilities would be appreciated.
You're already aware of the problem that you have in principle with
this type of task. One more thought; are your matrices fully filled
or sparsely populated? In the latter case you might be able to use
the all-in-memory approach anyway, because you'd need just allocate
the actual values and leave the zero-values away. (In any language,
like awk, that supports associative arrays this would require just
a few lines of code.) If your matrices not sparsely populated then
follow the way of using temporary files if your memory is limited.
Janis
>
> (FWIW, I use zsh.)
>
> TIA!
>
> ~K
|
|
0
|
|
|
|
Reply
|
Janis
|
6/4/2010 5:05:53 AM
|
|
"Janis Papanagnou" <janis_papanagnou@hotmail.com> wrote in message
news:hua1jh$4j3$1@news.m-online.net...
> You're already aware of the problem that you have in principle with
> this type of task. One more thought; are your matrices fully filled
> or sparsely populated?
Excellemt point.
> In the latter case you might be able to use
> the all-in-memory approach anyway, because you'd need just allocate
> the actual values and leave the zero-values away.
Are you assuming that zero values comstitute "sparse" elements of the array?
The OP made no such indication, only that "... columns [are] separated by
tabs, rows [are] separated by newlines."
Null values e.g. ...\t\t... are far more reasonable for the sparse entries,
but of course the OP can define the data structure that is of question.
> (In any language, like awk, that supports associative arrays this
> would require just a few lines of code.)
Please educate us in such simplicity ... please.
Even such well-known routines as the R/S/SPlus http://cran.r-project.org
tramspose function, or the Fortran transpose routines from http://ornl.gov
(U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such
arrays, are more than "a few lines of code."
If you can provide the "few lines" of awk necessary then you too can be
famous rather than merely flippant.
|
|
0
|
|
|
|
Reply
|
Greg
|
6/4/2010 8:12:02 AM
|
|
Greg Russell wrote:
> "Janis Papanagnou" <janis_papanagnou@hotmail.com> wrote in message
> news:hua1jh$4j3$1@news.m-online.net...
>
>> You're already aware of the problem that you have in principle with
>> this type of task. One more thought; are your matrices fully filled
>> or sparsely populated?
>
> Excellemt point.
>
>> In the latter case you might be able to use
>> the all-in-memory approach anyway, because you'd need just allocate
>> the actual values and leave the zero-values away.
>
> Are you assuming that zero values comstitute "sparse" elements of the array?
No, that's an arbitrary value, generally it's problem dependent.
> The OP made no such indication, only that "... columns [are] separated by
> tabs, rows [are] separated by newlines."
>
> Null values e.g. ...\t\t... are far more reasonable for the sparse entries,
> but of course the OP can define the data structure that is of question.
>
>> (In any language, like awk, that supports associative arrays this
>> would require just a few lines of code.)
>
> Please educate us in such simplicity ... please.
Sure. Here's one way to implement it...
awk ' BEGIN { FS = "\t" }
{ for (col=1; col<=NF; col++) if ($col) matrix[NR,col] = $col }
END { for (col=1; col<=NF; col++)
for (row=1; row<=NR; row++)
printf("%s%s",
((row,col) in matrix) ? matrix[row,col] : 0,
(row<NR) ? FS : RS)
}
'
Janis
>
> Even such well-known routines as the R/S/SPlus http://cran.r-project.org
> tramspose function, or the Fortran transpose routines from http://ornl.gov
> (U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such
> arrays, are more than "a few lines of code."
>
> If you can provide the "few lines" of awk necessary then you too can be
> famous rather than merely flippant.
>
>
>
|
|
-1
|
|
|
|
Reply
|
Janis
|
6/4/2010 8:29:27 AM
|
|
> Greg Russell wrote:
>>
>> If you can provide the "few lines" of awk necessary then you too can be
>> famous rather than merely flippant.
BTW, neither one of your choices is what I am aiming at.
Janis
|
|
0
|
|
|
|
Reply
|
Janis
|
6/4/2010 8:30:54 AM
|
|
Janis Papanagnou wrote:
> [...] (In any language, like awk, that supports associative arrays this
> would require just a few lines of code.) If your matrices not sparsely
> populated then follow the way of using temporary files if your memory is
> limited.
Temporary files and arrays are unnecessary. You can read and print the
columns as rows one by one.
PointedEars
|
|
0
|
|
|
|
Reply
|
Thomas
|
6/4/2010 9:26:56 AM
|
|
Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
> Janis Papanagnou wrote:
>
>
> Temporary files and arrays are unnecessary. You can read and print the
> columns as rows one by one.
Yes, by re-reading the input file each time you want a new row.
|
|
0
|
|
|
|
Reply
|
Maxwell
|
6/4/2010 11:56:07 AM
|
|
On 6/4/2010 3:12 AM, Greg Russell wrote:
> "Janis Papanagnou"<janis_papanagnou@hotmail.com> wrote in message
> news:hua1jh$4j3$1@news.m-online.net...
>
>> You're already aware of the problem that you have in principle with
>> this type of task. One more thought; are your matrices fully filled
>> or sparsely populated?
>
> Excellemt point.
>
>> In the latter case you might be able to use
>> the all-in-memory approach anyway, because you'd need just allocate
>> the actual values and leave the zero-values away.
>
> Are you assuming that zero values comstitute "sparse" elements of the array?
> The OP made no such indication, only that "... columns [are] separated by
> tabs, rows [are] separated by newlines."
>
> Null values e.g. ...\t\t... are far more reasonable for the sparse entries,
> but of course the OP can define the data structure that is of question.
>
>> (In any language, like awk, that supports associative arrays this
>> would require just a few lines of code.)
>
> Please educate us in such simplicity ... please.
>
> Even such well-known routines as the R/S/SPlus http://cran.r-project.org
> tramspose function, or the Fortran transpose routines from http://ornl.gov
> (U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such
> arrays, are more than "a few lines of code."
>
> If you can provide the "few lines" of awk necessary then you too can be
> famous rather than merely flippant.
>
I don't think Janis was being flippant, it's just that, as he showed in a
followup, it's a trivial solution as long as we can identify a "null" value for
input fields (e.g. either zero or a null string).
I didn't read the references - if they make it sound difficult then they must be
dealing with a different problem or coded in a language that doesn't support
associative arrays or addressing input that has no specific null value (?) or.....
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
6/4/2010 1:15:35 PM
|
|
On 6/3/2010 1:19 PM, kj wrote:
> (1) I have a relatively large file (9.4GB) that containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
>
> Is there a utility that would be helpful for this task?
>
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.
>
> Still, even this strategy requires transposing the n rows. I can
> do this easily with a Python or Perl script, but I was wondering
> if there is some Unix utility to do it?
>
> Any suggestions for accomplishing (1) or (2) from the command line
> using Unix utilities would be appreciated.
You could do something like this:
$ cat file
a b c
d e f
$ awk 'BEGIN{ getline; nf=NF; for (i=1;i<=nf;i++) { while (getline<FILENAME)
printf "%s ",$i; print ""; close(FILENAME) } }' file
a d
b e
c f
It might be slow, but it will avoid temp files and the need to slurp the whole
file into memory, which were your stated goals.
Note that using getline is rarely the right approach but it's probably fine for
this very specific job. Read and understand http://awk.info/?tip/getline before
considering using it in other scripts.
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
6/4/2010 1:44:04 PM
|
|
Maxwell Lol wrote:
> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>
>> Janis Papanagnou wrote:
[To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>
>>
>> Temporary files and arrays are unnecessary. You can read and print the
>> columns as rows one by one.
I didn't say they are necessary; rather I suggested to use one of the two
approaches, depending on the actual problem domain. Your suggestion seems
to me to be much inferiors WRT performance; O(N^2) instead of O(N), which
is inacceptable if you're processing Gigabytes of data as the OP actually
does.
>
> Yes, by re-reading the input file each time you want a new row.
Indeed.
It would be interesting if there'd be a tool to directly access entities
in a file by seek'ing to the respective position to avoid such performance
degradation; say, as intelligent tail(1) program may operate on character
entities. But that seems impossible without having a "structured file"; a
database, maybe, or using file concepts of higher abstraction levels which
I haven't heard to be in use in practise.
Janis
|
|
0
|
|
|
|
Reply
|
Janis
|
6/4/2010 1:59:34 PM
|
|
Janis Papanagnou wrote:
> It would be interesting if there'd be a tool to directly access entities
> in a file by seek'ing to the respective position to avoid such performance
> degradation; say, as intelligent tail(1) program may operate on character
> entities. But that seems impossible without having a "structured file"; a
> database, maybe, or using file concepts of higher abstraction levels which
> I haven't heard to be in use in practise.
Well, to nitpick a bit, tail can certainly be told to count bytes rather
than lines (using -c), and if all the "cells" are the same width, it may be
possible to calculate the offsets for a certain [row,column] cell. In the
same way, dd can seek to arbitrary positions in the file, so (just for fun!)
one could do, say
# assuming each element is 2 characters wide, 7 rows, 5 columns
$ cat matrix
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
# matrix[row,column] is at offset row * 15 + col * 3
# in the file, if row and col are 0-based
# bash
for col in {0..5}; do
line=""
sep=""
for row in {0..6}; do
offset=$(( row * 15 + col * 3 ))
elem=$(dd if=matrix skip=$offset bs=1 count=2)
line="$line$sep$elem"
sep=" "
done
printf "%s\n" "$line"
done > transposed
$ cat transposed
Again, it's just for fun. With a large file that ends up calling dd possibly
millions of times. And although seeks may be fast, I would not bet that in
terms of pure runtime it would be faster than the pure O^2 approach (but I
haven't tried).
|
|
0
|
|
|
|
Reply
|
pk
|
6/4/2010 2:25:22 PM
|
|
pk wrote:
> Again, it's just for fun. With a large file that ends up calling dd
> possibly millions of times. And although seeks may be fast, I would not
> bet that in terms of pure runtime it would be faster than the pure O^2
> approach (but I haven't tried).
Just to follow up a bit on that, still assuming fixed widths, the same
algorithm could also be implemented using some programming language like
perl pr python, which would probably be much more efficient as it won't have
all the overhead of spawning processes. Here's an example in Perl:
open(M,"+<",$ARGV[0]);
for $col (0..4) {
$line=$sep="";
for $row (0..6) {
seek(M, $row * 15 + $col * 3, 0);
read(M, $elem, 2);
print $sep, $elem;
$sep = " ";
}
print "\n";
}
close(M);
(all error checking omitted)
|
|
0
|
|
|
|
Reply
|
pk
|
6/4/2010 2:34:52 PM
|
|
Janis Papanagnou wrote:
> Maxwell Lol wrote:
>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>> Janis Papanagnou wrote:
> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
Isn't it a bit pointless to include that attribution line then?
>>> Temporary files and arrays are unnecessary. You can read and print the
>>> columns as rows one by one.
>
> I didn't say they are necessary; rather I suggested to use one of the two
> approaches, depending on the actual problem domain. Your suggestion seems
> to me to be much inferiors WRT performance; O(N^2) instead of O(N), which
> is inacceptable if you're processing Gigabytes of data as the OP actually
> does.
AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
asked for a solution that would save memory, hence my suggestion.
As to whether that iterative approach would be inacceptable, that depends on
the machine speed, and the other processes it is supposed to be running.
ISTM nowadays processing a large text file multiple times is to be preferred
over using a multiple of the amount of memory required to do the former in
order to store the user data of the text file.
Using an implementation of my suggestion, transposing a 1000×1000 matrix in
a 6.6 MiB text file and storing the result in another text file took only 84
seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB) that already
had an estimated combined load of 50-75% primarily due to a multimedia
application running in the background. The shell process that did it and
its child processes combined never required more than 4 MiB of memory.
PointedEars
|
|
0
|
|
|
|
Reply
|
Thomas
|
6/4/2010 2:41:24 PM
|
|
pk wrote:
> Janis Papanagnou wrote:
>
>> It would be interesting if there'd be a tool to directly access entities
>> in a file by seek'ing to the respective position to avoid such performance
>> degradation; say, as intelligent tail(1) program may operate on character
>> entities. But that seems impossible without having a "structured file"; a
>> database, maybe, or using file concepts of higher abstraction levels which
>> I haven't heard to be in use in practise.
>
> Well, to nitpick a bit, tail can certainly be told to count bytes rather
> than lines (using -c), and if all the "cells" are the same width, it may be
> possible to calculate the offsets for a certain [row,column] cell.
I was referring to something else. I've heard (not checked the tail(1)
source code) that modern implementations would make some guessed seek
close to the point where the start might be. I was specifically not
speaking of equal-width elements which would certainly make the task
quite trivial.
Janis
> In the
> same way, dd can seek to arbitrary positions in the file, so (just for fun!)
> one could do, say
>
> # assuming each element is 2 characters wide, 7 rows, 5 columns
>
> $ cat matrix
> 01 02 03 04 05
> 06 07 08 09 10
> 11 12 13 14 15
> 16 17 18 19 20
> 21 22 23 24 25
> 26 27 28 29 30
> 31 32 33 34 35
>
> # matrix[row,column] is at offset row * 15 + col * 3
> # in the file, if row and col are 0-based
>
> # bash
> for col in {0..5}; do
> line=""
> sep=""
> for row in {0..6}; do
> offset=$(( row * 15 + col * 3 ))
> elem=$(dd if=matrix skip=$offset bs=1 count=2)
> line="$line$sep$elem"
> sep=" "
> done
> printf "%s\n" "$line"
> done > transposed
>
> $ cat transposed
>
>
> Again, it's just for fun. With a large file that ends up calling dd possibly
> millions of times. And although seeks may be fast, I would not bet that in
> terms of pure runtime it would be faster than the pure O^2 approach (but I
> haven't tried).
|
|
0
|
|
|
|
Reply
|
Janis
|
6/4/2010 3:22:32 PM
|
|
Thomas 'PointedEars' Lahn wrote:
> Janis Papanagnou wrote:
>
>> Maxwell Lol wrote:
>>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>>> Janis Papanagnou wrote:
>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>
> Isn't it a bit pointless to include that attribution line then?
Tell that Maxwell if you're happy playing the control freak. (Stripping
context is fine if done right and if it's not misleading.) It certainly
is *not* pointless to make clear that I haven't wrote anything that YOU,
specifically, wrote. No big issue, but clarification should be in place.
>
>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>> columns as rows one by one.
>> I didn't say they are necessary; rather I suggested to use one of the two
>> approaches, depending on the actual problem domain. Your suggestion seems
>> to me to be much inferiors WRT performance; O(N^2) instead of O(N), which
>> is inacceptable if you're processing Gigabytes of data as the OP actually
>> does.
>
> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
> asked for a solution that would save memory, hence my suggestion.
Suggestions, especially sensible ones, are always welcome.
>
> As to whether that iterative approach would be inacceptable, that depends on
> the machine speed, and the other processes it is supposed to be running.
> ISTM nowadays processing a large text file multiple times is to be preferred
> over using a multiple of the amount of memory required to do the former in
> order to store the user data of the text file.
You seem to not know or haven't thought much about the computational
complexity, about the _practical relevance_ between O(N^2) and O(N),
I suppose?
>
> Using an implementation of my suggestion, transposing a 1000×1000 matrix in
> a 6.6 MiB text file and storing the result in another text file took only 84
> seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB) that already
> had an estimated combined load of 50-75% primarily due to a multimedia
> application running in the background. The shell process that did it and
> its child processes combined never required more than 4 MiB of memory.
You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
going to compare an O(N^2) algorithm to an O(N) algorithm. If you've not
yet learned about computational complexity I suggest to try a linear
O(N) algorithm and compare the timing with your O(N^2) algorithm for a
certain data set of size 10^6 Byte, then increase that to 4*10^6 Byte,
and finally increase it to the data set that is topic here; which is
approx. 1358 times larger than your toy data, and that factor enters the
equation squared (1358^2). You won't see any results soon[*] with your
approach or with any other approach of complexity O(N^2).
Another point you may recognize is; those "Core2"/"GHz"/... terminology
that you are celebrating above is good if you want to sell computers,
not if you're searching a good algorithm that performs well and scales
well with real world application data. Don't expect commercial system
development/evolution to solve real world performance issues of a
significant magnitude. If you want to compare algorithms of *different*
complexity classes it's not that helpful. OTOH, your data *is* helpful
if you make the effort to calculate - or try it out - the difference to
the data size relevant in this thread; you've omitted this crucial step.
This just BTW.
Janis
[*] And while you wait for the results you can do the math; it's surely
an eye opener, I promise.
>
>
> PointedEars
|
|
0
|
|
|
|
Reply
|
Janis
|
6/4/2010 6:36:30 PM
|
|
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
> It would be interesting if there'd be a tool to directly access entities
> in a file by seek'ing to the respective position to avoid such performance
> degradation; say, as intelligent tail(1) program may operate on character
> entities. But that seems impossible without having a "structured file"; a
> database, maybe, or using file concepts of higher abstraction levels which
> I haven't heard to be in use in practise.
There are databases that are designed to do this. I was reading an
article in the July 2010 Linux Journal that talk about the NoSQL
databases (BigTable, HBase, Cassandra, Redis, MongoDB, Voldemort,
CouchDB, DYnomite, HyperTable, etc.)
The just us e a friggin large database, and a simple key to value
lookup.
As a simple form, use awk's associative array:
a[row "," column]=value;
In perl (which might perform better), it's
$a{"$row $column"}=$value;
This keeps the entire data in memory, but with virual memory, it
should work. You just get pages swapped in and out. I have no idea of
the performance. But if you have a 4GB file, and 4GB of memory...
Well, you can also get a machine with BIG ram. It might be cheaper
than a MATLAB license.
|
|
0
|
|
|
|
Reply
|
Maxwell
|
6/4/2010 8:21:44 PM
|
|
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
> Tell that Maxwell if you're happy playing the control freak.
Oops. If I screwed up, I apologize.
|
|
0
|
|
|
|
Reply
|
Maxwell
|
6/4/2010 8:30:02 PM
|
|
Janis Papanagnou wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Janis Papanagnou wrote:
>>> Maxwell Lol wrote:
>>>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>>>> Janis Papanagnou wrote:
>>> [To make sure; I didn't write the subsequent two lines; that was Mr
>>> [Lahn.]
>>
>> Isn't it a bit pointless to include that attribution line then?
>
> Tell that Maxwell if you're happy playing the control freak.
I am telling it *you* because I am referring to *your* reply. This has
nothing to do with me being a control freak. Point, kettle, black.
> (Stripping context is fine if done right and if it's not misleading.) It
> certainly is *not* pointless to make clear that I haven't wrote anything
> that YOU, specifically, wrote. No big issue, but clarification should be
> in place.
Why, you could have simple removed that attribution line and spared yourself
the explanation. Since computational complexity appears to be the most
important to you, you could have scored there.
>>>>> Temporary files and arrays are unnecessary. You can read and print
>>>>> the columns as rows one by one.
>>> I didn't say they are necessary; rather I suggested to use one of the
>>> two approaches, depending on the actual problem domain. Your suggestion
>>> seems to me to be much inferiors WRT performance; O(N^2) instead of
>>> O(N), which is inacceptable if you're processing Gigabytes of data as
>>> the OP actually does.
>>
>> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
>> asked for a solution that would save memory, hence my suggestion.
>
> Suggestions, especially sensible ones, are always welcome.
You should face the simple fact that what you consider sensible might not be
considered sensible by others, depending on the circumstances. In this
case, one might consider it sensible to do what I suggested because one does
not have the choice.
>> As to whether that iterative approach would be inacceptable, that depends
>> on the machine speed, and the other processes it is supposed to be
>> running. ISTM nowadays processing a large text file multiple times is to
>> be preferred over using a multiple of the amount of memory required to do
>> the former in order to store the user data of the text file.
>
> You seem to not know or haven't thought much about the computational
> complexity, about the _practical relevance_ between O(N^2) and O(N),
> I suppose?
No, you are mistaken. As you could have noticed, complexity is exactly what
I have in mind in this case. You don't seem to understand that reducing
computionational complexity is not always the ultimate goal. It certainly
is not in this particular case.
>> Using an implementation of my suggestion, transposing a 1000×1000 matrix
>> in a 6.6 MiB text file and storing the result in another text file took
>> only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB)
>> that already had an estimated combined load of 50-75% primarily due to a
>> multimedia application running in the background. The shell process that
>> did it and its child processes combined never required more than 4 MiB of
>> memory.
>
> You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
> going to compare an O(N^2) algorithm to an O(N) algorithm.
I am well aware that it is only a fraction of the data that the OP wants to
process. You could multiply the time required to get an idea of how low it
would probably take with 9.4 GiB on the same machine (ca. 17 to 34 standard
hours); however, you should also note that the matrix was described to be
rectangular, not quadratic. So it might have a lot of columns and
measurably fewer rows, which would reduce the number of rows that cut(1) or
awk(1) would have to process considerably. Compare this against the
computational complexity of storing the entire matrix in memory.
In any case, the point that you don't seem to get is that the OP does not
want to store 9.4 GiB of data in memory (or a database), maybe because they
just cannot do that. So they could not care less how long it takes (i.e.,
the computational complexity of the algorithm) as long as it can be done.
> [snip more pointless babbling]
PointedEars
|
|
0
|
|
|
|
Reply
|
Thomas
|
6/4/2010 9:01:53 PM
|
|
Janis Papanagnou wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Janis Papanagnou wrote:
>>> Maxwell Lol wrote:
>>>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>>>> Janis Papanagnou wrote:
>>> [To make sure; I didn't write the subsequent two lines; that was Mr
>>> [Lahn.]
>>
>> Isn't it a bit pointless to include that attribution line then?
>
> Tell that Maxwell if you're happy playing the control freak.
I am telling it *you* because I am referring to *your* reply. This has
nothing to do with me being a control freak. Point, kettle, black.
> (Stripping context is fine if done right and if it's not misleading.) It
> certainly is *not* pointless to make clear that I haven't wrote anything
> that YOU, specifically, wrote. No big issue, but clarification should be
> in place.
Why, you could have simple removed that attribution line and spared yourself
the explanation. Since computational complexity appears to be the most
important to you, you could have scored there.
>>>>> Temporary files and arrays are unnecessary. You can read and print
>>>>> the columns as rows one by one.
>>> I didn't say they are necessary; rather I suggested to use one of the
>>> two approaches, depending on the actual problem domain. Your suggestion
>>> seems to me to be much inferiors WRT performance; O(N^2) instead of
>>> O(N), which is inacceptable if you're processing Gigabytes of data as
>>> the OP actually does.
>>
>> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
>> asked for a solution that would save memory, hence my suggestion.
>
> Suggestions, especially sensible ones, are always welcome.
You should face the simple fact that what you consider sensible might not be
considered sensible by others, depending on the circumstances. In this
case, one might consider it sensible to do what I suggested because one does
not have the choice.
>> As to whether that iterative approach would be inacceptable, that depends
>> on the machine speed, and the other processes it is supposed to be
>> running. ISTM nowadays processing a large text file multiple times is to
>> be preferred over using a multiple of the amount of memory required to do
>> the former in order to store the user data of the text file.
>
> You seem to not know or haven't thought much about the computational
> complexity, about the _practical relevance_ between O(N^2) and O(N),
> I suppose?
No, you are mistaken. As you could have noticed, complexity is exactly what
I have in mind in this case. You don't seem to understand that reducing
computational complexity is not always the ultimate goal. It certainly
is not in this particular case.
>> Using an implementation of my suggestion, transposing a 1000×1000 matrix
>> in a 6.6 MiB text file and storing the result in another text file took
>> only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB)
>> that already had an estimated combined load of 50-75% primarily due to a
>> multimedia application running in the background. The shell process that
>> did it and its child processes combined never required more than 4 MiB of
>> memory.
>
> You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
> going to compare an O(N^2) algorithm to an O(N) algorithm.
I am well aware that it is only a fraction of the data that the OP wants to
process. You could multiply the time required to get an idea of how low it
would probably take with 9.4 GiB on the same machine (ca. 17 to 34 standard
hours); however, you should also note that the matrix was described to be
rectangular, not quadratic. So it might have a lot of columns and
measurably fewer rows, which would reduce the number of rows that cut(1) or
awk(1) would have to process considerably. Compare this against the
computational complexity of storing the entire matrix in memory.
In any case, the point that you don't seem to get is that the OP does not
want to store 9.4 GiB of data in memory (or a database), maybe because they
just cannot do that. So they could not care less how long it takes (i.e.,
the computational complexity of the algorithm) as long as it can be done.
> [snip more pointless babbling]
PointedEars
|
|
0
|
|
|
|
Reply
|
Thomas
|
6/4/2010 9:03:26 PM
|
|
I use the approach with intermediate files, using Unix "split", "tr",
then "paste", admittedly on much smaller data sets. This is faster
than "cut" because cutting each column iteratively means reading
n lines, each till the first delimiter, then
n lines till the second delimiter
....
n lines to the end of line.
Thus, it requires reading O(n + 2n + ... + mn) = O(n m^2) amount of
data to produce m rows, then O(mn) amount to read those and write them
as a single file.
Using "split" requires 1 pass O(mn) to produce n rows, 1 pass to read
each row, transpose it and write to a column file, then 1 pass for
"paste" to read one line from each column file and concatenate it to
the final output file.
On datasets of several MB, split was also 5-6 times faster than a
"while read" loop to achieve the same thing. So here is the code I use
(more or less):
#!/bin/sh
d1=$(mktemp -d)
d2=$(mktemp -d)
split -l 1 -a 20 "$1" $d1/ # Assuming m < 26^20
for f in $d1/*; do
tr '\t' '\n' < $f > $d2/$f
done
paste $d2/* > "$1".t
|
|
0
|
|
|
|
Reply
|
pg
|
6/5/2010 12:27:24 AM
|
|
Thomas 'PointedEars' Lahn wrote:
> Janis Papanagnou wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>> Janis Papanagnou wrote:
>>>> Maxwell Lol wrote:
>>>>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>>>>> Janis Papanagnou wrote:
>>>> [To make sure; I didn't write the subsequent two lines; that was Mr
>>>> [Lahn.]
>>> Isn't it a bit pointless to include that attribution line then?
>> Tell that Maxwell if you're happy playing the control freak.
>
> I am telling it *you* because I am referring to *your* reply. This has
> nothing to do with me being a control freak. Point, kettle, black.
You can tell that whomever you like; the point was that I don't want to be
associated with what YOU have wrote.
>
>> (Stripping context is fine if done right and if it's not misleading.) It
>> certainly is *not* pointless to make clear that I haven't wrote anything
>> that YOU, specifically, wrote. No big issue, but clarification should be
>> in place.
>
> Why, you could have simple removed that attribution line and spared yourself
> the explanation.
I thought I've explained that already twice. Here again, just for you, more
explicit; there was a posting with misleading attribution, and not reacting
might leave a wrong impression, and leave it documented in the archives. By
clarifying that in a reply this is documented as well.
> Since computational complexity appears to be the most
> important to you, you could have scored there.
What I consider important is my own business and certainly you're incapable
of knowing that. Simply put; speak for yourself, and don't pretent to know
anything about other's agenda.
>
>>>>>> Temporary files and arrays are unnecessary. You can read and print
>>>>>> the columns as rows one by one.
>>>> I didn't say they are necessary; rather I suggested to use one of the
>>>> two approaches, depending on the actual problem domain. Your suggestion
>>>> seems to me to be much inferiors WRT performance; O(N^2) instead of
>>>> O(N), which is inacceptable if you're processing Gigabytes of data as
>>>> the OP actually does.
>>> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
>>> asked for a solution that would save memory, hence my suggestion.
>> Suggestions, especially sensible ones, are always welcome.
>
> You should face the simple fact that what you consider sensible might not be
> considered sensible by others, depending on the circumstances. In this
> case, one might consider it sensible to do what I suggested because one does
> not have the choice.
And this triviality should be of any value in response to that friendly
line of comment that I gave about the welcomness sensible suggestions?
>
>>> As to whether that iterative approach would be inacceptable, that depends
>>> on the machine speed, and the other processes it is supposed to be
>>> running. ISTM nowadays processing a large text file multiple times is to
>>> be preferred over using a multiple of the amount of memory required to do
>>> the former in order to store the user data of the text file.
>> You seem to not know or haven't thought much about the computational
>> complexity, about the _practical relevance_ between O(N^2) and O(N),
>> I suppose?
>
> No, you are mistaken. As you could have noticed, complexity is exactly what
> I have in mind in this case. You don't seem to understand that reducing
> computational complexity is not always the ultimate goal. It certainly
> is not in this particular case.
Well, if that's your conclusion I really cannot help you.
>
>>> Using an implementation of my suggestion, transposing a 1000�1000 matrix
>>> in a 6.6 MiB text file and storing the result in another text file took
>>> only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB)
>>> that already had an estimated combined load of 50-75% primarily due to a
>>> multimedia application running in the background. The shell process that
>>> did it and its child processes combined never required more than 4 MiB of
>>> memory.
>> You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
>> going to compare an O(N^2) algorithm to an O(N) algorithm.
>
> I am well aware that it is only a fraction of the data that the OP wants to
> process. You could multiply the time required to get an idea of how low it
> would probably take with 9.4 GiB on the same machine (ca. 17 to 34 standard
> hours);
Then you've done your math wrong. The factor linear in data size enters
the equation in an O(N^2) algorithm in quadratic magnitude; not 1358, as
you've guessed. But you removed that quote from the reply, and preferred
to despise it as "pointless babbling", instead of thinking about it. It's
of course completely up to you how you behave and how you communicate.
> however, you should also note that the matrix was described to be
> rectangular, not quadratic. So it might have a lot of columns and
> measurably fewer rows, which would reduce the number of rows that cut(1) or
> awk(1) would have to process considerably.
Whether the matrix as a few colums and many lines or vice versa, we can
speculate all day long - *that's* pointless, if anything.
I will agree that we can make up corner cases, like a matrix of only two
lines of data, each line 4 GB large, or vice versa, two columns with many
(US-)billions or (US-)trillions of lines. But in a more realistic view it
is preferable to not assume corner cases if the primary problem of any
proposal is it's inappropriate algorithmic/computational base. YMMV.
It would be easier to just set up that test case on your computer and try
it, as I suggested upthread; then you would see where that naive thinking
will lead to in real world applications.
> Compare this against the
> computational complexity of storing the entire matrix in memory.
I don't think that storing a many GB large data set in memory; rather I
gave a proposal for a sparcely pupulated matrix, where that could be an
option. OTOH, it has been pointed out, not only by me, that you'll get
a complexity issue if you try an O(N^2) algorithm on gigabytes of data.
If you don't understand that, don't want to try that out, belittle the
consequences, well.., then I haven't more to add.
>
> In any case, the point that you don't seem to get is that the OP does not
> want to store 9.4 GiB of data in memory (or a database), maybe because they
> just cannot do that.
If you would have been a bit less aggressive and more observant you
might have seen that I made a suggestion for sparce matrices, and not
suggested to keep all those gigabytes in memory. And I think that other
approaches are also feasible, like spreading the contents across many
files - _as long as the algorithm used is O(N)_, not O(N^2)!
But even if the matrix is not sparce - we still don't know -, I would
not refrain from pointing out that a O(N^2) algorithm on that huge data
set will just not work in practise.
If you are still convinced that your approach works, provide the math
(if you think what I wrote is wrong), or provide test cases that prove
me wrong, but with appropriate data. Toy data are okay as well, if you
extrapolate with the correct math; but having both wrong does not lead
to results that are of any worth.
> So they could not care less how long it takes (i.e.,
> the computational complexity of the algorithm) as long as it can be done.
Your algorithm requires a couple years on those gigabytes on your box;
unlikely that this will be considered as "can be done" for a practical
application that more likely will need results in hours.
Janis
>
>> [snip more pointless babbling]
>
>
> PointedEars
WRT further replies a personal note to you:
I've made this last effort to respond to you. Frankly, I don't think it
will be of any help to you, judging from what I could observe from your
behaviour since you recently joined here. I would have really preferred
if you've had put me in your personal killfile, as you seem to have
publicly announced a couple days ago; that would have saved me some more
time to explain what should have already been obvious enough. Since your
words are, both topical and meta, apparently unrealiable you'll make it
now into my killfile, as my first and only member; you're unique here in
c.u.s. - Good luck!
|
|
0
|
|
|
|
Reply
|
Janis
|
6/6/2010 3:36:55 PM
|
|
In article <86rqqmF14dU1@mid.individual.net>,
Greg Russell <grussell@example.con> wrote:
>"Janis Papanagnou" <janis_papanagnou@hotmail.com> wrote in message
>news:hua1jh$4j3$1@news.m-online.net...
>
>> You're already aware of the problem that you have in principle with
>> this type of task. One more thought; are your matrices fully filled
>> or sparsely populated?
>
>Excellemt point.
>
>> In the latter case you might be able to use
>> the all-in-memory approach anyway, because you'd need just allocate
>> the actual values and leave the zero-values away.
>
>Are you assuming that zero values comstitute "sparse" elements of the array?
>The OP made no such indication, only that "... columns [are] separated by
>tabs, rows [are] separated by newlines."
>
>Null values e.g. ...\t\t... are far more reasonable for the sparse entries,
>but of course the OP can define the data structure that is of question.
>
>> (In any language, like awk, that supports associative arrays this
>> would require just a few lines of code.)
>
>Please educate us in such simplicity ... please.
>
>Even such well-known routines as the R/S/SPlus http://cran.r-project.org
>tramspose function, or the Fortran transpose routines from http://ornl.gov
>(U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such
>arrays, are more than "a few lines of code."
Yup. they don't handle arrays the way AWK does, either.
Awk has *BUILT*IN* support for sparse arrays,
>If you can provide the "few lines" of awk necessary then you too can be
>famous rather than merely flippant.
>
In awk it *IS* trivial. you pull the input value into a simple variable.
Then you compare the value of that variable against the 'missing data'
value. if *mismatch* assign to array[row,col] Awk arrays are _dynamic_
in size, when you reference a previously un-referenced element, the array
grows by the size of *one* element. regardless of whether it is contiguous
to another element or not.
|
|
0
|
|
|
|
Reply
|
bonomi
|
6/24/2010 6:08:04 PM
|
|
In article <hub0s6$fkp$1@news.m-online.net>,
Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>Maxwell Lol wrote:
>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>
>>> Janis Papanagnou wrote:
>[To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>
>>>
>>> Temporary files and arrays are unnecessary. You can read and print the
>>> columns as rows one by one.
>
>I didn't say they are necessary; rather I suggested to use one of the two
>approaches, depending on the actual problem domain. Your suggestion seems
>to me to be much inferiors WRT performance; O(N^2) instead of O(N), which
>is inacceptable if you're processing Gigabytes of data as the OP actually
>does.
>
>>
>> Yes, by re-reading the input file each time you want a new row.
>
>Indeed.
>
>It would be interesting if there'd be a tool to directly access entities
>in a file by seek'ing to the respective position to avoid such performance
>degradation; say, as intelligent tail(1) program may operate on character
>entities. But that seems impossible without having a "structured file"; a
>database, maybe, or using file concepts of higher abstraction levels which
>I haven't heard to be in use in practise.
Actually, 'dd' will do that. IF the fields are fixed-width (which they
probably are *NOT*, in the OP's case) one could do
"dd bs={fieldwidth} iskip={fieldindex} sourcefile"
to extract a single field. I suspect that this would be _seriously_ slower
than the 'read a row, write to per-column tempfiles. Would eliminate the
need for additional 9+ gigs of disk-space for the tempfile, though.
|
|
0
|
|
|
|
Reply
|
bonomi
|
6/24/2010 6:20:04 PM
|
|
On 24/06/10 20:20, Robert Bonomi wrote:
> In article <hub0s6$fkp$1@news.m-online.net>,
> Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>> Maxwell Lol wrote:
>>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>>
>>>> Janis Papanagnou wrote:
>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>>
>>>>
>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>> columns as rows one by one.
>>
>> I didn't say they are necessary; rather I suggested to use one of the two
>> approaches, depending on the actual problem domain. Your suggestion seems
>> to me to be much inferiors WRT performance; O(N^2) instead of O(N), which
>> is inacceptable if you're processing Gigabytes of data as the OP actually
>> does.
>>
>>>
>>> Yes, by re-reading the input file each time you want a new row.
>>
>> Indeed.
>>
>> It would be interesting if there'd be a tool to directly access entities
>> in a file by seek'ing to the respective position to avoid such performance
>> degradation; say, as intelligent tail(1) program may operate on character
>> entities. But that seems impossible without having a "structured file"; a
>> database, maybe, or using file concepts of higher abstraction levels which
>> I haven't heard to be in use in practise.
>
> Actually, 'dd' will do that. IF the fields are fixed-width
That's exactly the crux why it doesn't apply in the general case we have.
Janis
> (which they
> probably are *NOT*, in the OP's case) one could do
> "dd bs={fieldwidth} iskip={fieldindex} sourcefile"
> to extract a single field. I suspect that this would be _seriously_ slower
> than the 'read a row, write to per-column tempfiles. Would eliminate the
> need for additional 9+ gigs of disk-space for the tempfile, though.
>
|
|
0
|
|
|
|
Reply
|
Janis
|
6/24/2010 7:18:52 PM
|
|
In article <i00b2s$cof$1@news.m-online.net>,
Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>On 24/06/10 20:20, Robert Bonomi wrote:
>> In article <hub0s6$fkp$1@news.m-online.net>,
>> Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>>> Maxwell Lol wrote:
>>>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>>>
>>>>> Janis Papanagnou wrote:
>>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>>>
>>>>>
>>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>>> columns as rows one by one.
>>>
>>> I didn't say they are necessary; rather I suggested to use one of the two
>>> approaches, depending on the actual problem domain. Your suggestion seems
>>> to me to be much inferiors WRT performance; O(N^2) instead of O(N), which
>>> is inacceptable if you're processing Gigabytes of data as the OP actually
>>> does.
>>>
>>>>
>>>> Yes, by re-reading the input file each time you want a new row.
>>>
>>> Indeed.
>>>
>>> It would be interesting if there'd be a tool to directly access entities
>>> in a file by seek'ing to the respective position to avoid such performance
>>> degradation; say, as intelligent tail(1) program may operate on character
>>> entities. But that seems impossible without having a "structured file"; a
>>> database, maybe, or using file concepts of higher abstraction levels which
>>> I haven't heard to be in use in practise.
>>
>> Actually, 'dd' will do that. IF the fields are fixed-width
>
>That's exactly the crux why it doesn't apply in the general case we have.
just playing devil's advocate -- it would be "somewhat" (may be a record for
understatement!) painful, but one *could* still use dd, if you had another
tool that made a pass through the file, and counted the start-position and
length of each value.
Seriously, your best bet is going to be to take a 'strip' of rows that will
fit into memory, transpose, and output to a column file. repeat for as
many additional strips as it takes to process the file, then concatenate th
column files.
Or, if you know the maximum width of a field, you can write a little program
that reads sequential values from the source file, and plugs them into a
calculated position in a transposition file, being careful to include an
appropriate delimiter with each field written. Then use another hand-crafted
program that reads that transposition file, and 'squeezes out' all the 'null'
values (this converts back to the variable-width fields of the original).
|
|
0
|
|
|
|
Reply
|
bonomi
|
6/24/2010 7:38:16 PM
|
|
kj <no.email@please.post> wrote:
> (1) I have a relatively large file (9.4GB) that containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
I wouldn't do that with a scripting language. Write a program in C or
so.
For many operations you don't need the inverse. Multiplying with a
matrix transpose is just as easy as with the matrix. They have the same
eigenvalues. Et cetera.
Is this conversion going to happen more than once? Can you change the
program that generated the original matrix to write out the transpose?
Victor.
--
Victor Eijkhout -- eijkhout at tacc utexas edu
|
|
0
|
|
|
|
Reply
|
see
|
6/25/2010 5:29:29 PM
|
|
|
28 Replies
2178 Views
(page loaded in 0.472 seconds)
|