combine two data sets into one file with some tweaks to the lines (x-values)...

Dear all!


I want to plot two data files using Gnuplot; in the Gnuplot newsgroup 
comp.graphics.apps.gnuplot, I have been advised to take my problem 
to the awk newsgroup, since it seems to be beyond the scope of 
Gnuplot. I am new to awk, and thus don't know if awk would be capable 
of solving my problem. Thus, I would be very grateful if one of You 
specialists here could give me a clear hint. Thanks in advance!!! This 
is the problem:

I have two data files that I'd need to plot - one is experimental data, 
one reference data. I think it is not important for You what type of 
data it is, but just briefly: the data are mass spectrometry data.

The problem: the files have slightly different formatting. My original 
experimental data, for example, look something like this (beginning of 
a file):
45.000000   11093
46.100000   670
47.100000   747
48.100000   670
49.000000   5200
50.000000   16501
51.100000   26043
As You'll see: the x-values start at 45, and are non-integer (some 
values are a bit off from the ".0"-reading) - but the x-spacing between 
the data points is approx. one.

My reference data, however, start like this:
15   150
26   80
27   380
29   160
31   60
37   60
38   190
You see: the reference only states the values that are non-zero; 
additionally, all x-values are integer, and start at an arbitrary value 
(the first non-zero point).

Of course, also the y-values are different (but that is just an 
arbitrary scaling factor which isn't important).

Now, what I'd like to have is ONE file that contains both data; I would 
only need the x-values once, and I can live with integer values for the 
x-axis; in fact, the few non-integer values in the experimental data 
are just odd, and for our purpose it would be okay to average them to 
integers. So, what I'd need would look something like this:

1      0            0
2      0            0
3      0            0
....
15    0            150
16    0            0
....
26    0            80
....
45    11093   90

Okay? So, I'd need to pad the missing lines into both files (it would 
be okay to start at "1", since this would be the first meaningful 
x-value in my case - although the intensity (y-value) might be zero). 
Additionally, I'd need to average the x-values from the experimental 
data file to integers, and use only one of the x-data for the file; 
then, I'd need to copy the respective y-values from both files into the 
two other columns...


Actually, it is quite a simple task - but I have no clue if awk would 
be able to do this - and how to start... unfortunately, I am not a 
specialist in those sorts of things, and I don't know much about 
regular expressions and such things.
So, as said: could one of You give me a hint a) as to whether or 
not this is possible, and b) how I might start?

Thank You very much for considering my question! And in advance already 
for any answer!
With kind regards,
Björn

Reply iso 3/8/2010 10:11:12 PM


Yes, awk can do it. Whatever it is. I think what you're trying to say is that 
given input like this:

file1:
3.00 5
5.10 9
7.10 2

file2:
4 8
5 2

you'd want output like this:

1 0 0
2 0 0
3 5 0
4 0 8
5 9 2
6 0 0
7 2 0

So, the first field of output is the first field rounded down, starting at 1 and ending at 
the highest number seen; the 2nd field of output is the 2nd field from file1, 
while the 3rd field of output is the 2nd field from file2. That'd be something 
like (untested):

awk '{ $1 = int($1); max = ($1 > max ? $1 : max); out[$1,NR==FNR] = $2 }
      END { for (i=1;i<=max;i++) printf "%d %d %d\n",i,out[i,1],out[i,0] }

If that's not what you want, provide some small, simple sample input and 
expected output as I showed above.

      Ed.
Reply Ed 3/8/2010 10:46:20 PM
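
One caveat on the one-liner above: int() truncates, so an x-value such as 46.9 would land on 46 rather than 47. If rounding to the nearest integer is closer to what is wanted, a sketch along the following lines might do. The NF >= 2 guard (to skip blank lines), the sample values, and the file names expt_demo.txt, ref_demo.txt and rounded.txt are assumptions made for illustration; they are not part of Ed's answer.

```shell
# Hypothetical sample data: one blank line and one x-value that is well off an integer.
printf '45.000000 11093\n\n45.900000 670\n47.100000 747\n' > expt_demo.txt
printf '15 150\n46 80\n' > ref_demo.txt

awk 'NF >= 2 {
         x = int($1 + 0.5)              # round a positive x to the nearest integer
         if (x > max) max = x           # remember the largest x seen so far
         out[x, NR==FNR] = $2           # 1 = first file, 0 = second file
     }
     END { for (i = 1; i <= max; i++) printf "%d %d %d\n", i, out[i,1], out[i,0] }' expt_demo.txt ref_demo.txt > rounded.txt

# Show the last few merged rows.
tail -n 3 rounded.txt
```

With this variant, 45.9 is counted as 46 and so shares a row with the reference value at 46. Adding 0.5 before truncating only rounds correctly for positive x, which is all that occurs in this data.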


Hey, Ed!

GREAT! Thank You very much!!! Great help! :-D I am really happy!!!
Your suggestion really does the trick for me. I had to tweak around a
bit to get it working (as said, I have never used awk - so I had to
find out how to call it and name the input and output files...), but
that's nothing. It was a real motivation to know it would work! (In
fact, I was a bit surprised that Gnuplot couldn't handle this task...
and thus, I was a bit sceptical what tool might be able to help me).
Just for reference (for other awk-newbies such as myself): if You want
Ed's example to work from command line, You have to add the input and
output file information; this is what I used in the end as a call from
the terminal:
awk '{ $1 = int($1); max = ($1 > max ? $1 : max); out[$1,NR==FNR] = $2 }
END { for (i=1;i<=max;i++) printf "%d %d %d\n",i,out[i,1],out[i,0] }' expt.txt ref.txt > out.txt


You may recognize that it is Ed's example, but I added the apostrophe
(or single quote sign) at the end to close the command definition for
awk, and then added the two input files (expt.txt and ref.txt), and -
using the >-sign - the output file out.txt. Thus, my output ends up in
out.txt.
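
To try this end to end without real data, the following is a minimal sketch that recreates Ed's small sample files and runs the same command on them. The file names file1 and file2 come from Ed's example; merged.txt is a made-up output name used here only for illustration.

```shell
# Recreate Ed's two small sample files.
printf '3.00 5\n5.10 9\n7.10 2\n' > file1
printf '4 8\n5 2\n' > file2

# NR==FNR is 1 while awk reads the first file and 0 afterwards, so
# out[x,1] holds the values from file1 and out[x,0] those from file2.
awk '{ $1 = int($1); max = ($1 > max ? $1 : max); out[$1,NR==FNR] = $2 }
     END { for (i=1;i<=max;i++) printf "%d %d %d\n",i,out[i,1],out[i,0] }' file1 file2 > merged.txt

# Print the merged table.
cat merged.txt
```

The order of the two file names matters: whichever file is named first ends up in the second output column.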

Thanks again, Ed, for taking the time to answer my (probably simple)
question! You saved me a lot of time looking around erratically!
Thanks!

Kind regards,
Björn




Reply bbcda 3/9/2010 6:05:40 AM


You're welcome. FYI, gnuplot is a tool for plotting graphs, not manipulating text 
files, while awk is a tool for manipulating text files, not for plotting graphs. 
Since a lot of what typically needs to be plotted is the result of manipulating 
text files, awk is gnuplot's best friend. So if you're going to be using gnuplot a 
lot, you might want to learn awk from the book Effective Awk Programming, Third 
Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3/) and lurk 
around here a bit.

	Ed.


Reply Ed 3/9/2010 12:42:45 PM
