Matrix transposition problem

  • Follow


Hi All,

I have 1000s of data files with 1000s of lines each. Each line has a
tab-separated data series corresponding to 1 month. The field logic of
the lines is as follows:
<startday>	<value1>	<flag1>	<value2>	<flag2>	(...)	<value31>	<flag31>

There are always 31 values and 31 flags in a line, even for months with
less than 31 days. Real data lines look like this:

2000-01-01	0.2	1	0.7	0	(...)	1.0	-2
2000-02-01	0.1	-1	0.9	1	(...)	2.0	-1
2000-03-01	0.3	1	1.7	0	(...)	3.0	0


I would like to transpose the data matrix in order to have separate
lines per day, i.e. each input line is transposed into 31 output lines,
like this:

<startday>	<value1>	<flag1>
<startday+1>	<value2>	<flag2>
....
<startday+30>	<value31>	<flag31>

I've been looking at the typical AWK matrix transposition examples from
the AWK user guides, etc., but they are not really what I want. Can
someone help me?

Thanks in advance, Hermann

0
Reply peifer (24) 11/24/2006 5:18:59 PM

Hermann wrote:
> Hi All,
> 
> I have 1000s of data files with 1000s of lines each. Each line has a
> tab-separated data series corresponding to 1 month. The field logic of
> the lines is as follows:
> <startday>	<value1>	<flag1>	<value2>	<flag2>	(...)	<value31>	<flag31>

So just a single date on each line? The startday with syntax YYYY-mm-dd?

> 
> There are always 31 values and 31 flags in a line, even for months with
> less than 31 days. Real data lines look like this:
> 
> 2000-01-01	0.2	1	0.7	0	(...)	1.0	-2
> 2000-02-01	0.1	-1	0.9	1	(...)	2.0	-1
> 2000-03-01	0.3	1	1.7	0	(...)	3.0	0

Shouldn't the dates be 2000-01-01, 2000-01-02, 2000-01-03, ...?
(i.e. ISO dates)

> 
> I would like to transpose the data matrix in order to have separate
> lines per day, i.e. each input line is transposed into 31 output lines,
> like this:

By "transposed" you mean the lines to be just "folded", and the days
for subsequent lines generated?

> 
> <startday>	<value1>	<flag1>
> <startday+1>	<value2>	<flag2>
> ...
> <startday+30>	<value31>	<flag31>
> 
> I've been looking at the typical AWK matrix transposition examples from

Transposition is changing columns with rows, and v.v. That's not exactly
what you're describing above.

> the AWK user guides, etc., but they are not really what I want. Can
> someone help me?
> 
> Thanks in advance, Hermann
> 

This is an outline how you can do it...

BEGIN{ORS="\t"}
{ date = substr($1,1,8)
   for (i=1;i<=31;i++)
     print date i, $(i*2), $(i*2+1)
}

What must be still done is to give the day number a leading zero.

Janis
0
Reply Janis 11/24/2006 7:23:49 PM


Janis Papanagnou wrote:
> Hermann wrote:
> 
>> Hi All,
>>
>> I have 1000s of data files with 1000s of lines each. Each line has a
>> tab-separated data series corresponding to 1 month. The field logic of
>> the lines is as follows:
>> <startday>    <value1>    <flag1>    <value2>    <flag2>    (...)    
>> <value31>    <flag31>
> 
> 
> So just a single date on each line? The startday with syntax YYYY-mm-dd?
> 
>>
>> There are always 31 values and 31 flags in a line, even for months with
>> less than 31 days. Real data lines look like this:
>>
>> 2000-01-01    0.2    1    0.7    0    (...)    1.0    -2
>> 2000-02-01    0.1    -1    0.9    1    (...)    2.0    -1
>> 2000-03-01    0.3    1    1.7    0    (...)    3.0    0
> 
> 
> Shouldn't the dates be 2000-01-01, 2000-01-02, 2000-01-03, ...?
> (i.e. ISO dates)
> 
>>
>> I would like to transpose the data matrix in order to have separate
>> lines per day, i.e. each input line is transposed into 31 output lines,
>> like this:
> 
> 
> By "transposed" you mean the lines to be just "folded", and the days
> for subsequent lines generated?
> 
>>
>> <startday>    <value1>    <flag1>
>> <startday+1>    <value2>    <flag2>
>> ...
>> <startday+30>    <value31>    <flag31>
>>
>> I've been looking at the typical AWK matrix transposition examples from
> 
> 
> Transposition is changing columns with rows, and v.v. That's not exactly
> what you're describing above.
> 
>> the AWK user guides, etc., but they are not really what I want. Can
>> someone help me?
>>
>> Thanks in advance, Hermann
>>
> 
> This is an outline how you can do it...
> 
> BEGIN{ORS="\t"}
> { date = substr($1,1,8)
>   for (i=1;i<=31;i++)
>     print date i, $(i*2), $(i*2+1)
> }
> 
> What must be still done is to give the day number a leading zero.

There was a typo, sorry; must be OFS (instead of ORS).

BEGIN{OFS="\t"}
{ d = substr($1,1,8)
   for (i=1;i<=31;i++)
     print sprintf("%s%02d", d, i), $(i*2), $(i*2+1)
}

I also added to print the leading zero, BTW.

> 
> Janis
0
Reply Janis 11/24/2006 7:28:16 PM

Janis Papanagnou wrote:
> Janis Papanagnou wrote:
>> So just a single date on each line? The startday with syntax YYYY-mm-dd?

The answer to both questions is: Yes.

>> Shouldn't the dates be 2000-01-01, 2000-01-02, 2000-01-03, ...?
>> (i.e. ISO dates)

No. There is only 1 line per month and the dates are as given in the 
sample lines.

>> By "transposed" you mean the lines to be just "folded", and the days
>> for subsequent lines generated?
(...)
>> Transposition is changing columns with rows, and v.v. That's not exactly
>> what you're describing above.

You are right. "Line folding" might be a better description of what I am 
looking for. Sorry for any potential misunderstandings.

 >> This is an outline how you can do it...

Thanks. I will try out and let you know.

Hermann
0
Reply Hermann 11/24/2006 8:20:21 PM

Hermann Peifer wrote:
> Janis Papanagnou wrote:
> 
>  >> This is an outline how you can do it...
> 
> Thanks. I will try out and let you know.

You're welcome.

And I noticed that my suggestion could be simplified further...

{ d = substr($1,1,8)
   for (i=1;i<=31;i++)
     printf("%s%02d\t%s\t%s\n", d, i, $(i*2), $(i*2+1))
}


Janis
0
Reply Janis 11/24/2006 9:09:12 PM

Janis Papanagnou wrote:
> And I noticed that my suggestion could be simplified further...

Thanks again. Your previous code worked fine. I am sure this one will 
also do.

Hermann
0
Reply Hermann 11/24/2006 9:35:47 PM

Hermann wrote:
> I have 1000s of data files with 100s of lines each. Each line has a
> tab-separated data series corresponding to 1 month. The field logic of
> the lines is as follows:
> <startday>	<value1>	<flag1>	<value2>	<flag2>	(...)	<value31>	<flag31>
> 
> There are always 31 values and 31 flags in a line, even for months with
> less than 31 days. Real data lines look like this:
> 
> 2000-01-01	0.2	1	0.7	0	(...)	1.0	-2
> 2000-02-01	0.1	-1	0.9	1	(...)	2.0	-1
> 2000-03-01	0.3	1	1.7	0	(...)	3.0	0
> 
> 
> I would like to transpose the data matrix in order to have separate
> lines per day, i.e. each input line is transposed into 31 output lines,
> like this:
> 
> <startday>	<value1>	<flag1>
> <startday+1>	<value2>	<flag2>
> ...
> <startday+30>	<value31>	<flag31> 

My AWK script is ready and seems to do what I want. In addition to the 
original problem from above, I also have to cut the FILENAME into pieces 
and use the snippets in the output lines. Then I have 2 types of input 
lines (daily and hourly data) which I want to convert into into a 
standard output format. A 3rd type of lines (weekly, monthly of yearly 
data) has to be ignored.

Thanks again to Janis for helping me.

Before applying the script to some 46000+ files (with a total of ~35 
million input lines resulting into ~900 million output lines):

Could someone perhaps have a look and see if the script looks reasonable 
(for a beginner) or if there are some (potential) problems?

--snip--
BEGIN {FS = OFS= "\t"}
{
  if (FNR == 1) {
   split(FILENAME,f,".")                    # split filename
   sc = substr(FILENAME,1,7)                # get station code
   cp = substr(FILENAME,8,5)                # get component number
   ms = substr(FILENAME,13,5)               # get measurement number
   dt = substr(FILENAME,18,length(f[1])-17) # get data type
  }
  if (NF == 63) {                           # line with daily data
   d = substr($1,1,8)                       # get YYYY-MM- from $1
    for (i=1;i<=31;i++)                     # unfold to 31 lines
     print sc,cp,ms,dt,sprintf("%s%02d",d,i),"00:00",$(i*2),$(i*2+1)
  }
  else if (NF == 49) {                      # line with hourly data
   for (i=1;i<=24;i++)                      # unfold to 24 lines
    print sc,cp,ms,dt,$1,sprintf("%02d",i)":00",$(i*2),$(i*2+1)
  }
  else {next}                               # do nothing (other data)
}
--snip--

In case someone wants to try it out: sample files are available at:
http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw

Thanks in advance, Hermann
0
Reply Hermann 11/26/2006 9:28:52 AM

=CF/=C7 Hermann Peifer =DD=E3=F1=E1=F8=E5:
> Hermann wrote:
> > I have 1000s of data files with 100s of lines each. Each line has a
> > tab-separated data series corresponding to 1 month. The field logic of
> > the lines is as follows:
> > <startday>	<value1>	<flag1>	<value2>	<flag2>	(...)	<value31>	<flag31>
> >
> > There are always 31 values and 31 flags in a line, even for months with
> > less than 31 days. Real data lines look like this:
> >
> > 2000-01-01	0.2	1	0.7	0	(...)	1.0	-2
> > 2000-02-01	0.1	-1	0.9	1	(...)	2.0	-1
> > 2000-03-01	0.3	1	1.7	0	(...)	3.0	0
> >
> >
> > I would like to transpose the data matrix in order to have separate
> > lines per day, i.e. each input line is transposed into 31 output lines,
> > like this:
> >
> > <startday>	<value1>	<flag1>
> > <startday+1>	<value2>	<flag2>
> > ...
> > <startday+30>	<value31>	<flag31>
>
> My AWK script is ready and seems to do what I want. In addition to the
> original problem from above, I also have to cut the FILENAME into pieces
> and use the snippets in the output lines. Then I have 2 types of input
> lines (daily and hourly data) which I want to convert into into a
> standard output format. A 3rd type of lines (weekly, monthly of yearly
> data) has to be ignored.
>
> Thanks again to Janis for helping me.
>
> Before applying the script to some 46000+ files (with a total of ~35
> million input lines resulting into ~900 million output lines):
>
> Could someone perhaps have a look and see if the script looks reasonable
> (for a beginner) or if there are some (potential) problems?
>
> --snip--
> BEGIN {FS =3D OFS=3D "\t"}
> {
>   if (FNR =3D=3D 1) {
>    split(FILENAME,f,".")                    # split filename
>    sc =3D substr(FILENAME,1,7)                # get station code
>    cp =3D substr(FILENAME,8,5)                # get component number
>    ms =3D substr(FILENAME,13,5)               # get measurement number
>    dt =3D substr(FILENAME,18,length(f[1])-17) # get data type
>   }
>   if (NF =3D=3D 63) {                           # line with daily data
>    d =3D substr($1,1,8)                       # get YYYY-MM- from $1
>     for (i=3D1;i<=3D31;i++)                     # unfold to 31 lines
>      print sc,cp,ms,dt,sprintf("%s%02d",d,i),"00:00",$(i*2),$(i*2+1)
>   }
>   else if (NF =3D=3D 49) {                      # line with hourly data
>    for (i=3D1;i<=3D24;i++)                      # unfold to 24 lines
>     print sc,cp,ms,dt,$1,sprintf("%02d",i)":00",$(i*2),$(i*2+1)
>   }
>   else {next}                               # do nothing (other data)
> }
> --snip--
>
> In case someone wants to try it out: sample files are available at:
> http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw
>
> Thanks in advance, Hermann

The natural way (in awk) is to write:

BEGIN { FS =3D OFS =3D "\t" }
FNR =3D=3D 1 { ... }
NF =3D=3D 63 { ... }
..=2E.

Awk is particulary good in verification too.
Why don't you try something like in the lines of this:

BEGIN { FS =3D "\t" }
$1 !=3D "AT0001A" { print FNR, NF, "error\t" $0 > "/dev/stderr" }
$2 !=3D "00001" { ... }
$3 !=3D "00100" { ... }
$4 !~ "(day|hour)" { ... }
$5 !~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ { ... }

Of course, you know better what to expect in each field :)

0
Reply Vassilis 11/26/2006 10:52:57 AM

Vassilis wrote:
> �/� Hermann Peifer ������:
>> Could someone perhaps have a look and see if the script looks reasonable
>> (for a beginner) or if there are some (potential) problems?
>>
>> --snip--
>> BEGIN {FS = OFS= "\t"}
>> {
>>   if (FNR == 1) {
>>    split(FILENAME,f,".")                    # split filename
>>    sc = substr(FILENAME,1,7)                # get station code
>>    cp = substr(FILENAME,8,5)                # get component number
>>    ms = substr(FILENAME,13,5)               # get measurement number
>>    dt = substr(FILENAME,18,length(f[1])-17) # get data type
>>   }
>>   if (NF == 63) {                           # line with daily data
>>    d = substr($1,1,8)                       # get YYYY-MM- from $1
>>     for (i=1;i<=31;i++)                     # unfold to 31 lines
>>      print sc,cp,ms,dt,sprintf("%s%02d",d,i),"00:00",$(i*2),$(i*2+1)
>>   }
>>   else if (NF == 49) {                      # line with hourly data
>>    for (i=1;i<=24;i++)                      # unfold to 24 lines
>>     print sc,cp,ms,dt,$1,sprintf("%02d",i)":00",$(i*2),$(i*2+1)
>>   }
>>   else {next}                               # do nothing (other data)
>> }
> 
> The natural way (in awk) is to write:
> 
> BEGIN { FS = OFS = "\t" }
> FNR == 1 { ... }
> NF == 63 { ... }
> ...

Thanks for the hint. I was able to save some lines  w/o loosing any 
logic. See below. I also added a section to find out how many days a 
given month has.

--snip--
BEGIN {FS = OFS= "\t"}

FNR == 1 {
  split(FILENAME,f,".")                    # split filename
  sc = substr(FILENAME,1,7)                # get station code
  cp = substr(FILENAME,8,5)                # get component number
  ms = substr(FILENAME,13,5)               # get measurement number
  dt = substr(FILENAME,18,length(f[1])-17) # get data type
}

NF == 63 {                                # line with daily data
  d = substr($1,1,8)                       # get YYYY-MM- from $1
  y = substr($1,1,4)                       # get YYYY from $1
  m = substr($1,6,2)                       # get MM from $1
  n = 31                                   # default number of days
   if (m=="04"||m=="06"||m=="09"||m==11) n=30
   if (m=="02" && !(y%4)) n=29             # Leap year logic OK for
   if (m=="02" && y%4)    n=28             # year range: 1901<=y<=2099
  for (i=1;i<=n;i++)                       # unfold to <n> lines
   print sc,cp,ms,dt,sprintf("%s%02d",d,i),"00:00",$(i*2),$(i*2+1)
}

NF == 49 {                                # line with hourly data
  for (i=1;i<=24;i++)                      # unfold to 24 lines
   print sc,cp,ms,dt,$1,sprintf("%02d",i)":00",$(i*2),$(i*2+1)
}
--snip--

Thanks again, Hermann

> 
> Awk is particulary good in verification too.
> Why don't you try something like in the lines of this:
> 
> BEGIN { FS = "\t" }
> $1 != "AT0001A" { print FNR, NF, "error\t" $0 > "/dev/stderr" }
> $2 != "00001" { ... }
> $3 != "00100" { ... }
> $4 !~ "(day|hour)" { ... }
> $5 !~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ { ... }
> 
> Of course, you know better what to expect in each field :)
> 
0
Reply Hermann 11/26/2006 12:23:40 PM

Hermann Peifer wrote:

>  n = 31                                   # default number of days
>   if (m=="04"||m=="06"||m=="09"||m==11) n=30
>   if (m=="02" && !(y%4)) n=29             # Leap year logic OK for
>   if (m=="02" && y%4)    n=28             # year range: 1901<=y<=2099
>  for (i=1;i<=n;i++)                       # unfold to <n> lines
>   print sc,cp,ms,dt,sprintf("%s%02d",d,i),"00:00",$(i*2),$(i*2+1)
> }

You can speed up this part of the script.

  split("31 28 31 30 31 30 31 31 30 31 30 31", days_per_month)

But speeding up only makes sense if you really
care about processing speed.
 
>> Awk is particulary good in verification too.

Before running the script over the whole set of files,
you could also write some auxiliary script for detecting
invalid data. For example, lines with invalid length,
missing data, invalid format. I thinks that's what
Vassilis meant when he mentioned verification.
0
Reply UTF 11/26/2006 5:52:39 PM

Hermann <peifer@gmx.net> wrote:
> Hi All,
> 
> I have 1000s of data files with 1000s of lines each. Each line has a
> tab-separated data series corresponding to 1 month. The field logic of
> the lines is as follows:
> <startday>      <value1>        <flag1> <value2>        <flag2> (...)   <value31>       <flag31>
> 
> There are always 31 values and 31 flags in a line, even for months with
> less than 31 days. Real data lines look like this:
> 
> 2000-01-01      0.2     1       0.7     0       (...)   1.0     -2
> 2000-02-01      0.1     -1      0.9     1       (...)   2.0     -1
> 2000-03-01      0.3     1       1.7     0       (...)   3.0     0
> 
> 
> I would like to transpose the data matrix in order to have separate
> lines per day, i.e. each input line is transposed into 31 output lines,
> like this:
> 
> <startday>      <value1>        <flag1>
> <startday+1>    <value2>        <flag2>
> ...
> <startday+30>   <value31>       <flag31>
> 
> I've been looking at the typical AWK matrix transposition examples from
> the AWK user guides, etc., but they are not really what I want. Can
> someone help me?
> 
> Thanks in advance, Hermann

You already got Awk pointers.  Now, let's try shell solution:

    while read startday rest; do
	IFS=- read yyyy mm dd <<< $startday
	set -- $rest		# value1 flag1 value2 flag2 ...
	i=0
	while [ $# -gt 0 ]; do
	    echo $yyyy-$mm-$((dd + i)) $1 $2
	    i=$((i + 1))
	    shift 2
	done
    done < file.in

-- 
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
	   http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
	  http://freshmeat.net/projects/bashdiff/
0
Reply William 11/26/2006 6:43:48 PM

Jürgen Kahrs wrote:
> Hermann Peifer wrote:
> 
>>  n = 31                                   # default number of days
>>   if (m=="04"||m=="06"||m=="09"||m==11) n=30
>>   if (m=="02" && !(y%4)) n=29             # Leap year logic OK for
>>   if (m=="02" && y%4)    n=28             # year range: 1901<=y<=2099
>>  for (i=1;i<=n;i++)                       # unfold to <n> lines
>>   print sc,cp,ms,dt,sprintf("%s%02d",d,i),"00:00",$(i*2),$(i*2+1)
>> }
> 
> You can speed up this part of the script.
> 
>   split("31 28 31 30 31 30 31 31 30 31 30 31", days_per_month)
> 

I have simplified my "How many days has a month?" logic to:

n = 31                                   # set default number of days
  if (m=="04"||m=="06"||m=="09"||m==11) n=30
  if (m=="02") (y%4?n=28:n=29)            # Simple leap year logic

I can't see how your solution handles leap years?

> Before running the script over the whole set of files,
> you could also write some auxiliary script for detecting
> invalid data. For example, lines with invalid length,
> missing data, invalid format. I thinks that's what
> Vassilis meant when he mentioned verification.

I have indeed already written 2 auxiliary scripts to check consistency 
of the input data files.

Regards, Hermann
0
Reply Hermann 11/26/2006 7:07:35 PM

Hermann Peifer wrote:


> I can't see how your solution handles leap years?

It doesnt. This should be obvious to do.
I just wanted to say that an array-lookup
is much faster than a chain of if/else.
0
Reply UTF 11/26/2006 8:26:32 PM

William Park wrote:

> Hermann <peifer@gmx.net> wrote:
> 
>>Hi All,
<snip>
>>I've been looking at the typical AWK matrix transposition examples from
>>the AWK user guides, etc., but they are not really what I want. Can
>>someone help me?
>>
>>Thanks in advance, Hermann
> 
> 
> You already got Awk pointers.  Now, let's try shell solution:

Why? This is comp.lang.awk and there's no indication I can see that the 
OP is using UNIX. I'm not normally a topicality cop and I have been 
known to post a few shell suggestions when the OP indicates that's what 
they use and/or there's clearly more appropriate shell tools, but this 
posting doesn't appear to offer any better solution than the awk ones 
and seems about as appropriate as if someone posted a "let's try 
DOS/perl/ruby/... solution" out of the blue. Did I miss something?

	Ed.
0
Reply Ed 11/27/2006 2:29:06 PM

13 Replies
172 Views

(page loaded in 0.109 seconds)

Similiar Articles:


















7/9/2012 7:53:05 AM


Reply: