Hi
I have a text file with fields in the format:
blah blah blah Ratio1 blah Ratio2 blah Ratio3 blah
1 2 5 7 2 3 4 9 0
4 2 7 7 8 3 9 2 3
5 3 2 0 9 8 7 5 2
The top line is the name of each of the fields.
The rest is info in each record associated with each field
I want to just pull out the fields that have the Ratio heading:
Ratio1 Ratio2 Ratio3
7 3 9
7 3 2
0 8 5
here I could do awk '{print $4, $6, $8}'
but I actually have hundreds if not thousands of fields.
And in some files the Ratio fields could be every 2nd field or every
3rd or 4th field, so no consistency. So the best way to do it is to
just say "does this column have a Ratio* heading? If so, print it, if
not, skip it"
Any help would be appreciated.
Thanx
spacemancw | 10/20/2003 1:28:39 AM
In article <fbb1f69b.0310191728.74f35a98@posting.google.com>,
spacemancw <spacemancw@yahoo.com> wrote:
>And in some files the Ratio fields could be every 2nd field or every
>3rd or 4th field, so no consistency. So the best way to do it is to
>just say "does this column have a Ratio* heading? If so, print it, if
>not, skip it"
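FNR == 1 {
    # Header line: remember the position of every column whose
    # name starts with "Ratio".
    fields = 0
    for(i = 1; i <= NF; i++)
        if($i ~ /^Ratio/)
            ratio[++fields] = i
}

{
    # Print the selected columns, right-aligned in six-character slots.
    for(i = 1; i < fields; i++)
        printf "%6s ", $(ratio[i])
    printf "%6s\n", $(ratio[fields])
}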
-- don
don | 10/20/2003 1:47:36 AM
Don Thanx
this is a start ... but instead of outputting:
Ratio1 Ratio2 Ratio3
7 3 9
7 3 2
0 8 5
it outputs:
Ratio Ratio Ratio
<- blank line
<- blank line
<- blank line
(note .... the Ratio headings are now Ratio instead of Ratio1, Ratio2
etc)
and then all the lines below are blank, so the values for each field
haven't been extracted. But I will try to mess with it.
At the moment I am using GNU cut (cos the file is so big) to cut out
each column, grep the result for "Ratio", if positive then write the
result to its own file, leaving me with a bunch of files that I then
paste together. It works but it's not the best way to do this. For a
file that has 3000 columns and 90 rows it takes 5 minutes.
Thanx again
spacemancw | 10/20/2003 2:45:06 PM
On 10/20/2003 9:45 AM, spacemancw wrote:
> Don Thanx
>
> this is a start ... but instead of outputting:
>
> Ratio1 Ratio2 Ratio3
> 7 3 9
> 7 3 2
> 0 8 5
>
> it outputs:
>
> Ratio Ratio Ratio
> <- blank line
> <- blank line
> <- blank line
<snip>
Try this:
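NR == 1 {
    # Record the positions of the Ratio columns in left-to-right order.
    # (for (i in array) visits elements in an unspecified order, so an
    # explicit counter is used to keep the columns in sequence.)
    for(i = 1; i <= NF; i++)
        if($i ~ /^Ratio/)
            fields[++n] = i
}
{
    sep=""
    for (k = 1; k <= n; k++) {
        printf("%s%s",sep,$(fields[k]))
        sep = "\t"
    }
    printf("\n")
}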
Regards,
Ed.
Ed | 10/20/2003 3:39:55 PM
The suggested program might work with your awk if you replace
    for(i = 1; i < fields; i++)
        printf "%6s ", $(ratio[i])
with
    for(i = 1; i < fields; i++) {
        j = ratio[i]
        printf "%6s ", $j
    }
Other suggestions: derive from the first line of the table the
required arguments for another awk program, or for a single cut
command.
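A rough sketch of the "another awk program" idea, offered as one possible reading rather than a tested recipe: a first awk pass reads only the header and writes out a print statement naming the Ratio columns, and a second awk runs that generated program over the file. The function name ratio_cols is made up for illustration, and it assumes whitespace-separated fields with no spaces inside any heading.

```shell
# Hypothetical helper: print only the columns whose heading starts with "Ratio".
ratio_cols() {
    # Pass 1: read just the header line and emit an awk program such as
    #   { print $4, $6, $8 }
    prog=$(awk 'NR == 1 {
        list = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^Ratio/)
                list = list (list == "" ? "" : ", ") "$" i
        print "{ print " list " }"
        exit
    }' "$1")
    # Pass 2: run the generated program over the whole file (header included).
    awk "$prog" "$1"
}
```

Called as ratio_cols t.t on the sample data, this should print the three Ratio columns separated by single spaces. If no heading matches, the generated program is a bare print, which echoes every line unchanged.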
Laurent
Laurent | 10/20/2003 4:57:42 PM
"spacemancw" <spacemancw@yahoo.com> wrote in message news:fbb1f69b.0310200645.16b8aefe@posting.google.com...
> Don Thanx
>
> this is a start ... but instead of outputting:
>
> Ratio1 Ratio2 Ratio3
> 7 3 9
> 7 3 2
> 0 8 5
>
> it outputs:
>
> Ratio Ratio Ratio
> <- blank line
> <- blank line
> <- blank line
>
> (note .... the Ratio headings are now Ratio instead of Ratio1, Ratio2
> etc)
Are you sure the data that produces the bad output is
the same as you posted earlier? Can you post a sample?
One possible explanation is that the headings row is not
the first line of the file (as Don's program assumes),
and another is that the headings are of the form
"Ratio 1" with a space. But really this is just guessing.
John.
John | 10/20/2003 6:00:31 PM
In article <fbb1f69b.0310200645.16b8aefe@posting.google.com>,
spacemancw <spacemancw@yahoo.com> wrote:
>Don Thanx
>
>this is a start ... but instead of outputting:
>
>Ratio1 Ratio2 Ratio3
>7 3 9
>7 3 2
>0 8 5
>
>it outputs:
>
>Ratio Ratio Ratio
> <- blank line
> <- blank line
> <- blank line
Well, it worked for me, copying and pasting both the data in your
message and my posted code. Are you sure you copied the code correctly?
toybsd% cat >x.awk
FNR == 1 {
    fields = 0
    for(i = 1; i <= NF; i++)
        if($i ~ /^Ratio/)
            ratio[++fields] = i
}

{
    for(i = 1; i < fields; i++)
        printf "%6s ", $(ratio[i])
    printf "%6s\n", $(ratio[fields])
}
toybsd% cat >t.t
blah blah blah Ratio1 blah Ratio2 blah Ratio3 blah
1 2 5 7 2 3 4 9 0
4 2 7 7 8 3 9 2 3
5 3 2 0 9 8 7 5 2
toybsd% awk -f x.awk t.t
Ratio1 Ratio2 Ratio3
     7      3      9
     7      3      2
     0      8      5
toybsd%
-- don
don | 10/20/2003 10:48:31 PM
Apologies .. Don ur script is correct ..... my headings do have
spaces, Ratio (1) etc .. so I just vi the spaces away. And it works on
my sample.
But I am having trouble getting it to work on my real data.
It pulls out the Ratio headings but it is getting values on the lines
below from other columns, they don't match up. There must be something
in the data format that's causing this. But separating the columns
works with my GNU cut script. I dunno how to show u a real sample of
the data other than to email u a file. The first few hundred columns
are junk. Each line wraps the screen numerous times, so I can't really
paste it in here. So any takers for a 500k sample file, email me and
I'll send it to u.
Again Don, thanx, urs works fine, just my files must have some
formatting problem.
Thanx to all the others too
spacemancw | 10/21/2003 3:22:31 AM
spacemancw wrote:
> But I am having trouble getting it to work on my real data.
> It pulls out the Ratio headings but it is getting values on the lines
> below from other columns, they don't match up. There must be something
> in the data format that's causing this. But separating the columns
> works with my GNU cut script.
What was your cut script? Maybe it is easiest to just let awk generate
that cut script from the first line.
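Something along these lines might do it. This is only a sketch: the name ratio_cut is invented for the example, and it assumes the file is tab-separated (cut's default delimiter), which may or may not match the real data.

```shell
# Hypothetical helper: let awk build the field list, then hand it to cut.
# Assumes a tab-separated file, which is cut's default delimiter.
ratio_cut() {
    list=$(awk -F'\t' 'NR == 1 {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^Ratio/)
                list = list (list == "" ? "" : ",") i
        print list
        exit
    }' "$1")
    # list now holds something like 4,6,8; skip cut if nothing matched
    [ -n "$list" ] && cut -f "$list" "$1"
}
```

Because cut is started once with the full list, the file is read twice in total (once by awk for the header, once by cut) instead of once per column.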
Laurent
Laurent | 10/21/2003 5:41:46 AM
In article <fbb1f69b.0310201922.e9a1114@posting.google.com>,
spacemancw <spacemancw@yahoo.com> wrote:
>Apologies .. Don ur script is correct ..... my headings do have
>spaces, Ratio (1) etc .. so I just vi the spaces away. And it works on
>my sample.
>But I am having trouble getting it to work on my real data.
>It pulls out the Ratio headings but it is getting values on the lines
>below from other columns, they don't match up. There must be something
>in the data format that's causing this. But separating the columns
Well, you'll need to make sure no other fields have spaces in them, e.g.
Field 1 Ratio2 Blah
1 2 3
is going to treat "Field" and "1" as two fields, so it'll print 3 under
the heading "Ratio2" instead of 2.
Also, you'll need to make sure no fields are missing; if a data field is
blank rather than 0, for example, every later field on that line shifts
left and the columns no longer match their headings.
Finally, note that the formatting in the script I gave assumes every
field and header fits in a six column field:
{
    for(i = 1; i < fields; i++)
        printf "%6s ", $(ratio[i])
    printf "%6s\n", $(ratio[fields])
}
If any fields are bigger than that, change the 6s in the printf format
to an appropriate field size.
-- don
don | 10/21/2003 6:17:30 AM
"spacemancw" <spacemancw@yahoo.com> wrote in message news:fbb1f69b.0310201922.e9a1114@posting.google.com...
> Apologies .. Don ur script is correct ..... my headings do have
> spaces, Ratio (1) etc .. so I just vi the spaces away. And it works on
> my sample.
> But I am having trouble getting it to work on my real data.
> It pulls out the Ratio headings but it is getting values on the lines
> below from other columns, they don't match up. There must be something
> in the data format that's causing this. But separating the columns
> works with my GNU cut script.
Just a thought, but Don's awk script is field based.
It assumes that your columns are always filled and
are separated by white space.
Is it?
Does your cut script cut fields (cut -f) or characters (cut -c)?
John.
John | 10/21/2003 9:03:41 AM
John L wrote:
> Just a thought, but Don's awk script is field based.
> It assumes that your columns are always filled and
> are separated by white space.
>
> Is it?
>
> Does your cut script cut fields (cut -f) or characters (cut -c)?
>
> John.
>
Also, cut works on a single character (e.g. tab OR space) whereas awk
will use general white-space (e.g. tab AND/OR space) as the field
separator.
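To make awk split the way cut -f does, the separator can be set explicitly. A sketch, assuming the data really is tab-separated; ratio_tabs and the col array are names made up here, and the logic is otherwise Don's:

```shell
# Hypothetical helper: Don's column filter, but splitting on tabs only,
# the way cut -f does by default.
ratio_tabs() {
    awk -F'\t' -v OFS='\t' '
    FNR == 1 {
        # Header line: note which columns start with "Ratio".
        n = 0
        for (i = 1; i <= NF; i++)
            if ($i ~ /^Ratio/)
                col[++n] = i
    }
    {
        # Join the selected fields with tabs and print them.
        line = ""
        for (k = 1; k <= n; k++)
            line = line (k > 1 ? OFS : "") $(col[k])
        print line
    }' "$@"
}
```

With a single-character separator, a heading containing a space stays one field and an empty data field still counts as a field, so the columns keep lining up.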
Ed.
Ed | 10/21/2003 1:30:41 PM
Don
again apologies ...... ur script is perfect. My data isn't.
Well the whole point of wanting this script is to avoid sifting thru
the junk and just eliminate it.
Out of 4000 columns only 330 are needed. It turns out that there are
two columns in there with the names:
'Num Bkgnd' and 'Den Bkgnd'
both with spaces that just threw everything off. I deleted the spaces
and the script worked perfectly.
Thanx to all who helped .... I appreciate it.
Roger
spacemancw | 10/21/2003 2:46:56 PM
In article <fbb1f69b.0310201922.e9a1114@posting.google.com>,
spacemancw <spacemancw@yahoo.com> wrote:
% in the data format that's causing this. But separating the columns
% works with my GNU cut script. I dunno how to show u a real sample of
It sounds like you have fixed-width columns in your data. gawk has
a well-thought-out and quite useful extension which handles this
kind of data, so if you use gawk, the problem might reduce to figuring
out the offsets and widths of the columns and using them to set the
FIELDWIDTHS variable.
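For illustration only, a minimal gawk sketch. The widths "6 6 6" and the function name fixed_cols are invented for the example; real data would need its own widths, and this will not work in awks that lack the FIELDWIDTHS extension.

```shell
# Hypothetical example: split fixed-width records with gawk's FIELDWIDTHS.
# The widths "6 6 6" are made up; real data needs its own values.
fixed_cols() {
    gawk 'BEGIN { FIELDWIDTHS = "6 6 6" }
          { print $1 "|" $2 "|" $3 }' "$@"
}
```

Each field keeps its padding, so a heading with an embedded space (or a blank data cell) still occupies exactly one field.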
On the other hand, you can always use substr(), which might be the right
choice in this case since the records seem to be new-line delimited:
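BEGIN {
    i = 0
    heading[++i] = "Column heading"
    heading[++i] = "Other column heading"
    cols = i
    # Read the header line, then find where each wanted column starts
    # and how far it runs before the next heading begins.
    getline
    for (i = 1; i <= cols; i++) {
        offset[i] = index($0, heading[i])
        len = length(heading[i])
        # gap is the position of the next non-blank character after this
        # heading, or 0 if this is the last heading on the line
        gap = match(substr($0, offset[i] + len), /[^ ]/)
        # stop just before the next heading; for the last column use a
        # width long enough to reach the end of the header line
        width[i] = gap ? len + gap - 1 : length($0)
    }
}
{
    for (i = 1; i <= cols; i++)
        f[i] = substr($0, offset[i], width[i])
}
# then you can do whatever you like with the data, much as if
# it were in $1, $2, $3 rather than f[1], f[2], f[3]
f[7] ~ /some pattern/ { print f[1],f[2],f[3] }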
--
Patrick TJ McPhee
East York Canada
ptjm@interlog.com
ptjm | 10/21/2003 3:51:21 PM
"spacemancw" <spacemancw@yahoo.com> wrote in message news:fbb1f69b.0310210646.493cc3f7@posting.google.com...
> Don
>
> again apologies ...... ur script is perfect. My data isn't.
> Well the whole point of wanting this script is to avoid sifting thru
> the junk and just eliminate it.
> Out of 4000 columns only 330 are needed. It turns out that there are
> two columns in there with the names:
>
> 'Num Bkgnd' and 'Den Bkgnd'
>
> both with spaces that just threw everything off. I deleted the spaces
> and the script worked perfectly.
>
You could have found this by asking awk to count the fields:
$ awk '{print NF}' data
and then piping the output through sort -u (or sort | uniq -c) to check
that each row has the same number of fields. You can then use awk to
identify and print the "bad" rows like this:
$ awk 'NF != 78' data
(where 78 is the "correct" number found above), and then add an action
to that command to "fix" or delete those rows to make the data
suitable for Don's program.
This sort of approach is what awk was made for, and is more robust
than manually editing the data.
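The two steps can also be folded into one pass that takes the header's field count as the reference. A sketch, with check_fields and the message wording being my own invention:

```shell
# Hypothetical helper: report every line whose field count differs
# from the header's.
check_fields() {
    awk 'FNR == 1 { want = NF; next }
         NF != want { printf "line %d: %d fields, expected %d\n", FNR, NF, want }' "$@"
}
```

If a heading contains a stray space, the header count comes out too high and every data line gets reported, which is itself a strong hint that the header is the culprit.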
Good luck in future.
John.
John | 10/21/2003 4:52:22 PM
In article <bn3kln$a24$3@news.eusc.inter.net>,
Patrick TJ McPhee <ptjm@interlog.com> wrote:
>In article <fbb1f69b.0310201922.e9a1114@posting.google.com>,
>spacemancw <spacemancw@yahoo.com> wrote:
>
>% in the data format that's causing this. But separating the columns
>% works with my GNU cut script. I dunno how to show u a real sample of
>
>It sounds like you have fixed-width columns in your data. gawk has
>a well-thought-out and quite useful extension which handles this
>kind of data, so if you use gawk, the problem might reduce to figuring
>out the offsets and widths of the columns and using them to set the
>FIELDWIDTHS variable.
An awk function that analyzes a header to generate an appropriate FIELDWIDTHS
string is available at: ftp://ftp.armory.com/pub/lib/awk/SetFields
John
--
John DuBois spcecdt@armory.com KC6QKZ/AE http://www.armory.com/~spcecdt/
spcecdt | 10/22/2003 12:22:08 AM