Getting continuation lines

I'm using awk to do some manipulation on raw email/news headers.  I've
ended up using getline to fetch the 'continuation' References: headers,
which may be a bad sign!

I'm using a fairly basic script to input each line of the file and
assign them to an array indexed by keyword until I get to a blank line
indicating the end of the headers, then calling a routine emitheaders()
to output a manipulated version of the array.


(continuetoend==1)   {print; next}
#once we have found end of headers, simply output everything else,
#skipping rest of program
($1=="References:")  {slurprefs($1)}
($1=="Date:")        {sub(/BST/,"+0100")}#correct for StrongED

                     {header[$1]=$0}

                     {if (NF==0)
                      {emitheaders()
                       continuetoend=1}
            # set marker to indicate that we have finished scanning headers
                     }

The 'continuation lines' of References of course have no keywords, being
preceded only by a blank space, e.g.

> References: <931a3e96-343a-4104-aa87-9392b00f05a4@googlegroups.com>
>  <jvvaeq$um3$1@dont-email.me>
>  <ac9ada58-842a-4baa-90f3-eb6d27b47dd3@googlegroups.com>

The function slurprefs() was intended to input and append any lines
following the detection of a References: header and starting with a
blank space, and then pass the subsequent state of $0 (i.e. a set of
fields starting with some other header) on to the following rules in the
main routine.

However, despite endless fiddling with 'while' versus 'do while' and
matching versus not-matching, I can't seem to come up with a version
that doesn't involve doing a check within the loop then duplicating this
with another check for the terminating condition.   I'm sure there ought
to be a more elegant way of doing it.

Or would it be better to do this using some kind of range check in the
main routine?


function slurprefs(key)
# get rest of References lines
{
 header[key]=$0
 do{
 getline
  if($0~"^ ") header[key]=sprintf("%s\n%s",header[key],$0)
  #concatenate Refs header lines
 }  while($0~"^ ") #check for leading space


 #at exit from function fields are all set up for following line -
 #so this function must be called first in program or else we might lose
 #the data in the line after References:
}
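
One way to get down to a single test might be to put the getline call and the leading-space check into the same while condition. This is only a sketch, not a drop-in replacement: slurp_demo is a made-up shell wrapper, the END rule exists purely to show the result, and it assumes (as in the script above) that the References: rule comes before any rule that needs $0.

```shell
# slurp_demo is a hypothetical wrapper used only for this illustration.
slurp_demo() {
    awk '($1 == "References:") { slurprefs($1) }
         { header[$1] = $0 }
         END { print header["References:"]; print header["Date:"] }

         # Append continuation lines; stop at the first line that does not
         # start with a space, or when getline reports end of input.
         function slurprefs(key) {
             header[key] = $0
             while ((getline) > 0 && /^ /)
                 header[key] = header[key] "\n" $0
         }'
}

printf 'References: <1@x>\n <2@y>\nDate: today\n' | slurp_demo
```

One caveat with this sketch: if the input ends straight after a continuation line, $0 is left holding that continuation line and the later rules will still see it.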


-- 
Harriet Bazley                     ==  Loyaulte me lie ==

The saddest words in the English language are 'Too' and 'late'
Reply harriet.bazley (13) 8/11/2012 2:59:21 PM

In article <a3729ebd52.harriet@blueyonder.co.uk>,
Harriet Bazley  <harriet.bazley@blueyonder.co.uk> wrote:
>I'm using awk to do some manipulation on raw email/news headers.  I've
>ended up using getline to fetch the 'continuation' References: headers,
>which may be a bad sign!

(Didn't look at your specific problem or code - just want to make a couple
of comments)

1) A common idiom for this is to set RS="" and FS="\n".
    This makes each "paragraph" a record and each line within the paragraph
    a field.  So, you can access each line as $1, $2, etc.
    I think this idiom works well for parsing "mailbox" files.

2) As you've noted, "getline" is to be avoided, if at all possible.  I use
it as rarely as possible.  Good story: There is one particular script that
I wrote quite a while ago, that uses "getline" - and was in production for a
long time until one day I had to add a new feature.  I could not figure out
how it worked for a long time - a long time of hair-pulling and debugging.
Finally figured out that there was a getline embedded in the code - that
was, of course, changing $0 on the fly.  Funny thing is - you might expect
that I'd have re-written it on the spot to not use getline, but, on
analysis, I concluded that getline was still the right way to do it, despite
my experience.  Such cases *are* exceedingly rare, however...

P.S.  Ed has a document somewhere detailing the evils of "getline". Good read.
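
A minimal sketch of the idiom in point 1, assuming the header block is the first paragraph of the input. The wrapper name first_paragraph_lines and the sample message are invented for the illustration.

```shell
# first_paragraph_lines is a hypothetical name; it prints each line of the
# first blank-line-delimited record (the headers) with its field number.
first_paragraph_lines() {
    awk 'BEGIN { RS = ""; FS = "\n" }    # paragraph mode: one line per field
         NR == 1 {
             for (i = 1; i <= NF; i++)
                 printf "%d: %s\n", i, $i
             exit
         }'
}

printf 'From: a@example.com\nSubject: test\n\nbody text\n' | first_paragraph_lines
```

Continuation lines still arrive as separate fields here, so they would need to be joined to the preceding field in a second step.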

-- 
Given Bush and his insanely expensive wars (*), that we will be paying for
for generations to come, the only possible response a sensible person need
ever give, when a GOPer/TeaBagger says anything about "deficits", is a
polite snicker.

(*) Obvious money transfers between the taxpayers and Bush's moneyed
interests.  Someday, we'll actually figure out a way to have a war where the
money just gets moved around and nobody (on either side) gets injured or
killed.  That will be an accomplishment of which we will be justly proud.

Reply gazelle3 (1608) 8/11/2012 3:21:13 PM


In article <a3729ebd52.harriet@blueyonder.co.uk>, Harriet Bazley wrote:
> I'm using awk to do some manipulation on raw email/news headers.  I've
> ended up using getline to fetch the 'continuation' References: headers,
> which may be a bad sign!
> 
> I'm using a fairly basic script to input each line of the file and
> assign them to an array indexed by keyword until I get to a blank line
> indicating the end of the headers, then calling a routine emitheaders()
> to output a manipulated version of the array.
> 
[..snip..]

awk '$0 ~ /^[^ ]/ { a[h=$1] = $0; next }   # new header: remember its key in h
     { a[h] = a[h] $0 }                    # continuation: append to that header
     END { for (i in a) print a[i] }' datafile

The END section is just an example, adjust as necessary.

John.

  


Reply j.eh (18) 8/11/2012 4:06:36 PM

On 8/11/2012 9:59 AM, Harriet Bazley wrote:
> I'm using awk to do some manipulation on raw email/news headers.

You might find your scripting easier if you process your raw email with 
"formail" first. See also "procmail".

     Ed.
Reply mortonspam (827) 8/13/2012 7:48:37 AM

On 8/11/2012 9:59 AM, Harriet Bazley wrote:
> I'm using awk to do some manipulation on raw email/news headers.  I've
> ended up using getline to fetch the 'continuation' References: headers,
> which may be a bad sign!
>
> I'm using a fairly basic script to input each line of the file and
> assign them to an array indexed by keyword until I get to a blank line
> indicating the end of the headers, then calling a routine emitheaders()
> to output a manipulated version of the array.
>
>
> (continuetoend==1)   {print; next}
> #once we have found end of headers, simply output everything else,
> #skipping rest of program
> ($1=="References:")  {slurprefs($1)}
> ($1=="Date:")        {sub(/BST/,"+0100")}#correct for StrongED
>
>                       {header[$1]=$0}
>
>                       {if (NF==0)
>                        {emitheaders()
>                         continuetoend=1}
>              # set marker to indicate that we have finished scanning headers
>                       }
>
> The 'continuation lines' of References of course have no keywords, being
> preceded only by a blank space, e.g.
>
>> References: <931a3e96-343a-4104-aa87-9392b00f05a4@googlegroups.com>
>>   <jvvaeq$um3$1@dont-email.me>
>>   <ac9ada58-842a-4baa-90f3-eb6d27b47dd3@googlegroups.com>
>
> The function slurprefs() was intended to input and append any lines
> following the detection of a References: header and starting with a
> blank space, and then pass the subsequent state of $0 (i.e. a set of
> fields starting with some other header) on to the following rules in the
> main routine.
>
> However, despite endless fiddling with 'while' versus 'do while' and
> matching versus not-matching, I can't seem to come up with a version
> that doesn't involve doing a check within the loop then duplicating this
> with another check for the terminating condition.   I'm sure there ought
> to be a more elegant way of doing it.
>
> Or would it be better to do this using some kind of range check in the
> main routine?
>
>
> function slurprefs(key)
> # get rest of References lines
> {
>   header[key]=$0
>   do{
>   getline
>    if($0~"^ ") header[key]=sprintf("%s\n%s",header[key],$0)
>    #concatenate Refs header lines
>   }  while($0~"^ ") #check for leading space
>
>
>   #at exit from function fields are all set up for following line -
>   #so this function must be called first in program or else we might lose
>   #the data in the line after References:
> }
>
>

You didn't post any sample input but from looking at what your script is doing I 
THINK what you'd want is something like this:

inBody { print; next }

# headers section
NF {
    if (/^[[:alnum:]]+:/) {
       key = $1
       if (key == "Date:") {
          sub(/BST/,"+0100") #correct for StrongED
       }
    }
    header[key] = header[key] sep[key] $0
    sep[key] = RS
    next
}

{ emitheaders(); inBody = 1 }

Note that variables with names like "continuetoend" are a very bad idea in any 
language as they introduce inversion of control and tightly couple the section 
of code that identifies a specific condition with the section of code that acts 
on that condition. Just name your variable based on the condition that it 
represents (e.g. "found the start of the email body") so that in future if you 
want to do something different under that condition you can without having to 
rename your variable or make other unnecessary changes.

Regards,

    Ed.
Reply mortonspam (827) 8/13/2012 8:29:26 AM

On 13 Aug 2012 as I do recall,
          Ed Morton  wrote:

> On 8/11/2012 9:59 AM, Harriet Bazley wrote:
> > I'm using awk to do some manipulation on raw email/news headers.

> You might find your scripting easier if you process your raw email with
> "formail" first. See also "procmail".
>
Not an option unfortunately - I'm calling this script directly from
within my text editor under RISC OS (if formail or procmail exists for
this OS, I don't know about it).

> > I've ended up using getline to fetch the 'continuation' References:
> > headers, which may be a bad sign!
> >
> > I'm using a fairly basic script to input each line of the file and
> > assign them to an array indexed by keyword until I get to a blank line
> > indicating the end of the headers, then calling a routine emitheaders()
> > to output a manipulated version of the array.
> >
> >
> > (continuetoend==1)   {print; next}
> > #once we have found end of headers, simply output everything else,
> > #skipping rest of program
> > ($1=="References:")  {slurprefs($1)}
> > ($1=="Date:")        {sub(/BST/,"+0100")}#correct for StrongED
> >
> >                       {header[$1]=$0}
> >
> >                       {if (NF==0)
> >                        {emitheaders()
> >                         continuetoend=1}
> >              # set marker to indicate that we have finished scanning headers
> >                       }
> >

[snip]

> >
> > function slurprefs(key)
> > # get rest of References lines
> > {
> >   header[key]=$0
> >   do{
> >   getline
> >    if($0~"^ ") header[key]=sprintf("%s\n%s",header[key],$0)
> >    #concatenate Refs header lines
> >   }  while($0~"^ ") #check for leading space
> >
> >
> >   #at exit from function fields are all set up for following line -
> >   #so this function must be called first in program or else we might lose
> >   #the data in the line after References:
> > }
> >
> >
>
> You didn't post any sample input

Sample input would be a random set of headers, e.g.

  From: Harriet Bazley <harriet.bazley@blueyonder.co.uk>
  To: Ed Morton  <mortonspam@gmail.com>
  Subject: RE: trousers
  Message-ID: <d4f90cb752.harriet@blueyonder.co.uk>
  References: <BLU132-W174EF2182B111187AB3DB4C2F50@phx.gbl> <d88b49a252.harriet@blueyonder.co.uk> <BLU132-W18AFC5B8A93061EEAC739C2FF0@phx.gbl>
   <f0e985a252.harriet@blueyonder.co.uk> <BLU132-W29643C6E498985D5F78B10C2FE0@phx.gbl> <7cc61ea952.harriet@blueyonder.co.uk>
   <BLU132-W19082E2AEC6F3A62DFF305C2E90@phx.gbl> <f0a492aa52.harriet@blueyonder.co.uk> <BLU132-W15502FD688DA22B085C3C0C2ED0@phx.gbl>
   <e5e7dfb152.harriet@blueyonder.co.uk> <8ce2ffb152.harriet@blueyonder.co.uk> <SNT145-W95B28FB0BCB3C6E39E201AC2D80@phx.gbl>
  In-Reply-To: <k0abgm$1c3$1@dont-email.me>
  User-Agent: Messenger-Pro/6.06 (MsgServe/6.04) (RISC-OS/5.19)
  MIME-Version: 1.0
  Content-Type: text/plain; charset=us-ascii


>  but from looking at what your script is doing I
> THINK what you'd want is something like this:
>
> inBody { print; next }
>
> # headers section
> NF {
>     if (/^[[:alnum:]]+:/) {
>        key = $1
>        if (key == "Date:") {
>           sub(/BST/,"+0100") #correct for StrongED
>        }
>     }
>     header[key] = header[key] sep[key] $0
>     sep[key] = RS
>     next
> }
>
> { emitheaders(); inBody = 1 }

Right, let me try to parse this....

The first rule is as before - triggered only once an arbitrary control
variable has been set, and short-circuiting any of the others from
executing once it has been set.

The second rule is executed for every non-blank line (until inBody is
set).   If a 'header keyword' is found at the start of the input, then
the rule acts as before, recording the keyword for use in indexing the
header array (with a special case for "Date:").   We now have an extra
array "sep" making an appearance:  sep[key] will presumably be blank on
the first occurrence, causing header[key] to be set to $0 as before.
sep[key] is then set, to the linefeed character (or whatever separator
is being used).

If a keyword is *not* found on the current line, then the rule will use
whatever keyword was last found to index its arrays with, which means
that the existing value of header[key] will have the current line
appended to it using the separator.   (I think it's the use of the sep
array to provide an automatic blank the first time round and a separator
in the case of the repeat that I was having difficulty getting my mind
round: I'd instinctively have tried to use a single variable and tried
to 'blank' it manually.)

This has the advantage that it works for *all* continued headers, not
just References:  To/CC are the other likely candidates....
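
To check my reading of the sep array, here is a small sketch of the same idea in isolation. The wrapper name collect_headers, the simplified keyword test /^[^ :]+:/ and the "|" separator (used instead of RS so the join is visible on one line) are assumptions for the demo, not part of the posted script.

```shell
# collect_headers is a hypothetical wrapper for this illustration only.
collect_headers() {
    awk 'NF {
             if (/^[^ :]+:/) key = $1     # a new header starts on this line
             # sep[key] is empty the first time a key is seen, so nothing is
             # prepended; on later lines it holds the separator.
             header[key] = header[key] sep[key] $0
             sep[key] = "|"
         }
         END { print header["To:"]; print header["References:"] }'
}

printf 'To: a@example.com,\n b@example.com\nReferences: <1@x>\n <2@y>\n' | collect_headers
```

Both the To: and the References: continuation lines end up appended to their own header, without any per-header special case.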


The third rule is presumably only executed if both the others have
failed to match, as they both end in 'next':  i.e. it acts once only
when the first blank line in the file is met, representing the end of
the header material, when it takes appropriate action.

(I think I'd write this with a condition (NF==0) to contrast with
(NF>0), omitting the "next" at the end of the latter just to increase
clarity.)
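
Roughly what I have in mind, as a sketch only: the two header rules carry explicit, mutually exclusive conditions, so the "next" is no longer needed between them. The wrapper name split_message is invented, and counting the header lines merely stands in for the real header handling and emitheaders().

```shell
# split_message is a hypothetical wrapper; the counter replaces the real
# header processing so that only the rule structure is on show.
split_message() {
    awk 'inBody  { print; next }          # after the headers: pass through
         NF > 0  { nheaders++ }           # any non-blank header line
         NF == 0 { print nheaders " header lines"; inBody = 1 }'
}

printf 'Subject: hi\n there\n\nbody\n' | split_message
```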


It isn't having the desired effect, but I think I now understand why...
<fiddles around> Yes, it turns out that all lines where the header
keyword contains a hyphen are being treated as continuation lines, thus
the problem is simply with the expression /^[[:alnum:]]+:/

/^[[:alpha:]\-]+:/ seems to do it (I don't think numeric digits are
valid in this context?)
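
A quick sketch of the difference, with the patterns passed in as strings. The wrapper name classify is made up, and A-Za-z0-9 is written out in place of [[:alnum:]] on the assumption that some awks lack the POSIX classes. As far as I know, RFC 5322 allows any printable ASCII character other than the colon in a field name, digits included, so keeping the digits seems the safer choice.

```shell
# classify is a hypothetical helper: it reports whether a line is treated
# as the start of a header or as a continuation under the given pattern.
classify() {
    awk -v re="$1" '{
        kind = ($0 ~ re) ? "header" : "continuation"
        print kind ": " $1
    }'
}

printf 'Message-ID: <1@x>\n' | classify '^[A-Za-z0-9]+:'     # no hyphen allowed
printf 'Message-ID: <1@x>\n' | classify '^[A-Za-z0-9-]+:'    # hyphen allowed
```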

[snip]

> name your variable based on the condition that it represents (e.g.
> "found the start of the email body") so that in future if you want to
> do something different under that condition you can without having to
> rename your variable or make other unnecessary changes.

Noted!

-- 
Harriet Bazley                     ==  Loyaulte me lie ==

It is easier to fight for one's principles than to live up to them.
Reply harriet.bazley (13) 8/13/2012 11:30:32 PM

On 11 Aug 2012 as I do recall,
          j.eh@mchsi.com wrote:

> In article <a3729ebd52.harriet@blueyonder.co.uk>, Harriet Bazley wrote:
> > I'm using awk to do some manipulation on raw email/news headers.  I've
> > ended up using getline to fetch the 'continuation' References: headers,
> > which may be a bad sign!
> >
> > I'm using a fairly basic script to input each line of the file and
> > assign them to an array indexed by keyword until I get to a blank line
> > indicating the end of the headers, then calling a routine emitheaders()
> > to output a manipulated version of the array.
> >
> [..snip..]
>
> $0 ~ /^[^ ]/{ a[h=$1]=$0;next }


a[h=$1]=$0
is an idiom I haven't come across before...
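
As I understand it, an assignment in awk is itself an expression whose value is the value assigned, so a[h=$1]=$0 should behave like h=$1 followed by a[h]=$0. A small sketch to check that reading; the wrapper name unfold_headers and the sample lines are invented, and the END rule just prints one entry.

```shell
# unfold_headers is a hypothetical wrapper for this illustration only.
unfold_headers() {
    awk '/^[^ ]/ { a[h = $1] = $0; next }   # same as: h = $1; a[h] = $0
         { a[h] = a[h] $0 }                 # continuation: append to a[h]
         END { print a["References:"] }'
}

printf 'References: <a@x>\n <b@y>\nDate: today\n' | unfold_headers
```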

-- 
Harriet Bazley                     ==  Loyaulte me lie ==

Those who cannot remember the past are condemned to repeat it.
Reply harriet.bazley (13) 8/13/2012 11:35:53 PM

On 8/13/2012 6:30 PM, Harriet Bazley wrote:
<snip>
> /^[[:alpha:]\-]+:/ seems to do it (I don't think numeric digits are
> valid in this context?)

Don't escape the "-" as that's unnecessary at best.

    Ed.
Reply mortonspam (827) 8/14/2012 5:38:34 AM

On 14 Aug 2012 as I do recall,
          Ed Morton  wrote:

> On 8/13/2012 6:30 PM, Harriet Bazley wrote:
> <snip>
> > /^[[:alpha:]\-]+:/ seems to do it (I don't think numeric digits are
> > valid in this context?)
>
> Don't escape the "-" as that's unnecessary at best.
>

So it is!   How does that work?

"Effective AWK Programming" instructs "To include one of the characters
'\',']','-' or '^' in a character list, put a '\' in front of it", and
I've always assumed that was essential to avoid syntax errors:   how
ever does awk know that I'm not trying to write a range of characters
here with the end missing?

-- 
Harriet Bazley                     ==  Loyaulte me lie ==

Reality is for people who can't face science fiction.
Reply harriet.bazley (13) 8/14/2012 8:38:02 AM

On 8/14/2012 3:38 AM, Harriet Bazley wrote:
> On 14 Aug 2012 as I do recall,
>            Ed Morton  wrote:
>
>> On 8/13/2012 6:30 PM, Harriet Bazley wrote:
>> <snip>
>>> /^[[:alpha:]\-]+:/ seems to do it (I don't think numeric digits are
>>> valid in this context?)
>>
>> Don't escape the "-" as that's unnecessary at best.
>>
>
> So it is!   How does that work?
>
> "Effective AWK Programming" instructs "To include one of the characters
> '\',']','-' or '~' in a character list, put a '\' in front of it", and
> I've always assumed that was essential to avoid syntax errors:   how
> ever does awk know that I'm not trying to write a range of characters
> here with the end missing?
>

Awk just assumes you meant what you wrote and that the "-" isn't part of a range 
because you have it as the last (or first) character in the bracket expression.

Here's the reference from the POSIX standard 
(http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_05):

-----
A range expression shall be expressed as the starting point and the ending point 
separated by a hyphen ( '-' ).
....
The hyphen character shall be treated as itself if it occurs first (after an 
initial '^', if any) or last in the list, or as an ending range point in a range 
expression.
-----
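
A short sketch of that rule in action: the hyphen comes last in the bracket expression, after ordinary ranges, and is matched literally. The wrapper name header_keys is made up for the demo, and A-Za-z0-9 is spelled out instead of [[:alnum:]] only on the assumption that it runs on more awks.

```shell
# header_keys is a hypothetical wrapper: it prints the keyword of every
# line that starts a header, including keywords containing hyphens.
header_keys() {
    awk '/^[A-Za-z0-9-]+:/ { print $1 }'   # trailing "-" is a literal hyphen
}

printf 'Message-ID: <1@x>\nIn-Reply-To: <2@y>\n <3@z>\nDate: today\n' | header_keys
```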

Regards,

     Ed.


Reply mortonspam (827) 8/14/2012 1:21:21 PM
