running gawk under win32 (haven't tested under *nix);
source file is a typical email text file w/ several messages in it.
(assume there is only 1 "^(From)" per message header - though "From"
may also be in the middle of a line)
I want to use the RE "^(From)" as my RS.
IF I use:
{
if ($0 ~ /^(From)/ )
print "\n---\n"$0"\n===\n";
}
as my gawk pattern, it DOES correctly print only the lines where From
is at the beginning of the line.
HOWEVER - if I use:
BEGIN { RS="^(From)"; } (tried adding FS="\n"; )
{
print "\n---\n"$0"\n===\n";
}
I EXPECT that each iteration of $0 SHOULD be a full message.
in fact, I only get 2 iterations; the 1st one is blank, the 2nd one
contains the entire rest of the file. (I'm working with a test file of
only 4 msgs - I'm sure this would die on large files)
I've done regex testing with egrep successfully, but can't seem to
replicate what I want using gawk.
I'm obviously missing a critical piece of logic here - I'm hoping that
not only could some kind soul show me the err of my ways, but fix the
"crick" in my logic/understanding of regex's.... (with regard to gawk)
tia - Bob
Posted by Bob, 3/3/2004 5:54:37 PM
In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
Bob <nospam_nsh@starnetwx.net> wrote:
>running gawk under win32 (haven't tested under *nix);
>
>source file is a typical email text file w/ several messages in it.
>(assume there is only 1 "^(From)" per message header - though "From"
>may also be in the middle of a line)
>
>I want to use the RE "^(From)" as my RS.
I would suggest that you not do this, at least until you fully understand
the implications. That is, the implications of changing the standard
variables in general, and this one in particular.
I've always found it easier to leave the defaults as is (except FS, and
that only when the input data is very regular, such as the Unix password
file), and write my own routines to handle things.
Here's how I would do the email parse:
/^(From)/ { p() }
{ s = s $0 "\n" }
END { p() }
function p() {
    if (!s) return
    ... do whatever with s ...
    s = ""
}
I've used this idiom consistently over the years, and it is a good one.
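
For illustration, here is a minimal runnable sketch of the accumulate-and-flush idiom above. The sample mailbox, the name emit() and the choice to print a numbered banner before each message are assumptions made for the example, and the pattern is narrowed to "From " with a trailing space so that a "From:" header would not count as a separator.

```sh
# Sample two-message mailbox (made-up data, for illustration only).
mbox='From alice Wed Mar  3 17:54:37 2004
Subject: one

body one
From bob Wed Mar  3 18:53:20 2004
Subject: two

body two'

out=$(printf '%s\n' "$mbox" | awk '
/^From / { emit() }              # a new message starts: emit the previous one
         { msg = msg $0 "\n" }   # append every line, including the From line
END      { emit() }              # emit the last message

function emit() {
    if (msg == "") return        # nothing accumulated yet (very first From line)
    n++
    printf "--- message %d ---\n%s", n, msg
    msg = ""
}')

printf '%s\n' "$out"
```

RS is left at its default here, so "^" keeps its usual per-line meaning and each message is rebuilt one line at a time.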
Posted by gazelle, 3/3/2004 6:53:20 PM
On Wed, 03 Mar 2004 18:53:20 GMT, gazelle@yin.interaccess.com (Kenny
McCormack) wrote:
>In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
>Bob <nospam_nsh@starnetwx.net> wrote:
>>
>>source file is a typical email text file w/ several messages in it.
>>(assume there is only 1 "^(From)" per message header - though "From"
>>may also be in the middle of a line)
>>
>>I want to use the RE "^(From)" as my RS.
>
>I would suggest that you not do this, at least until you fully understand
>the implications. That is, the implications of changing the standard
>variables in general, and this one in particular.
>
>I've always found it easier to leave the defaults as is (except FS, and
>that only when the input data is very regular, such as the Unix password
>file), and write my own routines to handle things.
>
>Here's how I would do the email parse:
>
>/^(From)/ { p() }
>{ s = s $0 "\n" }
>END { p() }
>
>function p() {
> if (!s) return
> ... do whatever with s ...
> s = ""
> }
>
>I've used this idiom consistently over the years, and it is a good one.
Kenny;
tx very much for your reply -
your way is certainly worth pursuing, and since I can get it to work
a lot faster than my way.....
but - I'm still curious to know why my way isn't working...
in addition - even if I do use your way;
when I insert the "^" into the RE, things fall apart. I only get 1
"record" [string] printed, and it contains the whole file. apparently
the ^(From) only matches ONE match - I need to match every instance.
here's what I'm doing: (in this example - ^(stop) marks the end of
the headers)
BEGIN { c=1; } # ('record' counter)
{
print "** "c" **";
if ($0 ~ /^(From)(.*[ :space:]*)*^(stop)/ ) grep();
# I've also tried substituting [ \t\f\n\r\v] for [:space:]
# and many other permutations for that matter.
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
What I REALLY want to do is "highlight" or extract all headers between
"^(From) and ^(stop)" - more to the point I really want everything
EXCEPT the headers - but I can easily NOT the RE after it works.
here's another attempt that gets a little better:
BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(rule)(.*)|[:space:]*incoming/ ) grep();
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....
Posted by Bob, 3/3/2004 8:24:41 PM
Sorry - I copied the wrong file on the last awk pattern.
On Wed, 03 Mar 2004 18:53:20 GMT, gazelle@yin.interaccess.com (Kenny
McCormack) wrote:
>In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
>Bob <nospam_nsh@starnetwx.net> wrote:
>>
>>source file is a typical email text file w/ several messages in it.
>>(assume there is only 1 "^(From)" per message header - though "From"
>>may also be in the middle of a line)
>>
>>I want to use the RE "^(From)" as my RS.
>
>I would suggest that you not do this, at least until you fully understand
>the implications. That is, the implications of changing the standard
>variables in general, and this one in particular.
>
>I've always found it easier to leave the defaults as is (except FS, and
>that only when the input data is very regular, such as the Unix password
>file), and write my own routines to handle things.
>
>Here's how I would do the email parse:
>
>/^(From)/ { p() }
>{ s = s $0 "\n" }
>END { p() }
>
>function p() {
> if (!s) return
> ... do whatever with s ...
> s = ""
> }
>
>I've used this idiom consistently over the years, and it is a good one.
Kenny;
tx very much for your reply -
your way is certainly worth pursuing, and since I can get it to work
a lot faster than my way.....
but - I'm still curious to know why my way isn't working...
in addition - even if I do use your way;
when I insert the "^" into the RE, things fall apart. I only get 1
"record" [string] printed, and it contains the whole file. apparently
the ^(From) only matches ONE match - I need to match every instance.
here's what I'm doing: (in this example - ^(stop) marks the end of
the headers)
BEGIN { c=1; } # ('record' counter)
{
print "** "c" **";
if ($0 ~ /^(From)(.*[ :space:]*)*^(stop)/ ) grep();
# I've also tried substituting [ \t\f\n\r\v] for [:space:]
# and many other permutations for that matter.
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
What I REALLY want to do is "highlight" or extract all headers between
"^(From) and ^(stop)" - more to the point I really want everything
EXCEPT the headers - but I can easily NOT the RE after it works.
here's another attempt that gets a little better: (but it's NOT using
"^")
BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(From)(.*)|[:space:]*stop/ ) grep();
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....
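
A note on why the attempts above cannot match: with the default RS, $0 is a single line, so one regex can never begin at a From line and end at a later stop line. Also, [:space:] is only recognized inside a bracket expression, written [[:space:]]. Below is a minimal sketch of one way to get "everything EXCEPT the headers" by tracking state across lines instead. The sample text and the name in_header are assumptions made for the example, and it relies on the stated premise that ^From occurs once per message.

```sh
# Made-up sample: "stop" marks the end of each header block, as in the post.
text='From alice
Subject: one
stop
body one
From bob
Subject: two
stop
body two'

out=$(printf '%s\n' "$text" | awk '
/^From/    { in_header = 1 }   # a header block starts on this line
!in_header { print }           # print only lines outside a header block
/^stop/    { in_header = 0 }   # the header block ends after this line
')

printf '%s\n' "$out"
```

The order of the three rules matters: the stop line is tested for printing while in_header is still set, so both marker lines are dropped along with the headers between them.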
Posted by Bob, 3/3/2004 8:27:52 PM
On Wed, 03 Mar 2004 14:24:41 -0600, Bob <nospam_nsh@starnetwx.net>
wrote:
>but - I'm still curious to know why my way isn't working...
>in addition - even if I do use your way;
>when I insert the "^" into the RE, things fall apart. I only get 1
>"record" [string] printed, and it contains the whole file. apparently
>the ^(From) only matches ONE match - I need to match every instance.
It is working correctly. A regex applies to the string it is
presented - RS is presented with the *entire* file so ^From matches
*only* the first occurrence in the file. Think about it: if RS is
FROM, then the concept of lines (records) in the usual sense simply
doesn't exist.
I can almost visualize how it works: I would expect it to match only
From at the very beginning of the file - I suspect that since you got
a blank and everything else, the file begins with From.
I think you want a pattern consisting of \nFrom: wherever it occurs in
the file, but you may have to be explicit about \n - you may have to
encode it as octal escape sequences for both bytes in the Windows \n.
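
A rough sketch of that idea, for what it is worth. It assumes the awk in use treats a multi-character RS as a regular expression (gawk does; some other awks use only the first character), it uses "\nFrom " with a trailing space rather than a colon, and the three-message mailbox is made up for the example.

```sh
# Made-up mailbox; the first body contains "From " in the middle of a line.
mbox='From alice Wed Mar  3 17:54:37 2004
Subject: one

I heard From is a keyword.
From bob Wed Mar  3 18:53:20 2004
Subject: two

body two
From carol Wed Mar  3 20:24:41 2004
Subject: three

body three'

out=$(printf '%s\n' "$mbox" | awk '
BEGIN  { RS = "\nFrom " }        # a newline followed by "From " ends a record
NR > 1 { $0 = "From " $0 }       # the RS text is consumed, so put the prefix back
       { split($0, line, "\n")   # line[1] is the first line of this message
         print NR ": " line[1] }')

printf '%s\n' "$out"
```

The first message keeps its own "From " because nothing precedes it in the file, which is why only later records need the prefix restored. If the file reaches awk with CRLF line endings intact, something like "\r?\nFrom " may be needed instead; that depends on how the Windows build reads its input.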
T.E.D. (tdavis@gearbox.maem.umr.edu)
SPAM filter: Messages to this address *must* contain "T.E.D."
somewhere in the body or they will be automatically rejected.
Posted by Ted, 3/3/2004 9:44:18 PM
On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
<tdavis@gearbox.maem.umr.edu> wrote:
>
> I think you want a pattern consisting of \nFrom: wherever it occurs in
> the file, but you may have to be explicit about \n - you may have to
> encode it as octal escape sequences for both bytes in the Windows \n.
>
To match the beginning of messages in an mbox file, you want "\nFrom ".
--
Incrsease your earoning poswer and gaerner profwessional resspect.
Get the Un1iversity Dewgree you have already earned.
[from the prestigious, non-accredited University of Spam!]
Posted by Bill, 3/7/2004 3:33:45 AM
In article <pphqh1-l83.ln1@don.localnet>,
Bill Marcum <bmarcum@iglou.com.urgent> wrote:
% On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
% <tdavis@gearbox.maem.umr.edu> wrote:
% >
% > I think you want a pattern consisting of \nFrom: wherever it occurs in
% > the file, but you may have to be explicit about \n - you may have to
% > encode it as octal escape sequences for both bytes in the Windows \n.
% >
%
% To match the beginning of messages in an mbox file, you want "\nFrom ".
More specifically, you want
\nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n
although some applications screw up the date formatting -- I think the
best thing is to treat such mailboxes as corrupt, fix the format and
stop using such broken applications (and don't write any -- this is the
reason for my reply). Note that the first instance will not have a
preceding \n -- I would anchor this with ^ and $, but I don't see how
that can work in an RS.
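
One way around the anchoring problem is to leave RS alone and test each line against the pattern, where ^ and $ do mean start and end of line. The following is only a sketch under that assumption: the sample lines and addresses are invented, and the {2} and {4} intervals are written out longhand because not every awk supports interval expressions.

```sh
# Made-up input: two real envelope lines and two look-alikes.
lines='From alice@example.com Wed Mar  3 17:54:37 2004
From the start, this body line only looks like a separator.
From: alice@example.com
From bob@example.com Sun Mar  7 20:51:00 2004'

out=$(printf '%s\n' "$lines" | awk '
# Envelope line: "From ", sender, weekday, month, day, hh:mm:ss, year.
# Intervals such as {2} are spelled out so that awks without them still work.
/^From [^ ]+ [A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9]$/ {
    print NR                     # report the line number of each separator
}')

printf '%s\n' "$out"
```

A rule like this could replace a looser /^From / test in a line-by-line splitter, so that body lines starting with "From " are not mistaken for message boundaries.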
--
Patrick TJ McPhee
East York Canada
ptjm@interlog.com
Posted by ptjm, 3/7/2004 8:51:00 PM
Patrick TJ McPhee <ptjm@interlog.com> wrote:
> In article <pphqh1-l83.ln1@don.localnet>,
> Bill Marcum <bmarcum@iglou.com.urgent> wrote:
> % On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
> % <tdavis@gearbox.maem.umr.edu> wrote:
> % >
> % > I think you want a pattern consisting of \nFrom: wherever it occurs in
> % > the file, but you may have to be explicit about \n - you may have to
> % > encode it as octal escape sequences for both bytes in the Windows \n.
> % >
> %
> % To match the beginning of messages in an mbox file, you want "\nFrom ".
>
> More specifically, you want
>
> \nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n
>
> although some applications screw up the date formatting -- I think the
> best thing is to treat such mailboxes as corrupt, fix the format and
> stop using such broken applications (and don't write any -- this is the
> reason for my reply). Note that the first instance will not have a
> preceding \n -- I would anchor this with ^ and $, but I don't see how
> that can work in an RS.
Or, use 'formail' :-)
--
William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
Linux solution for data processing and document management.
Posted by William, 3/7/2004 9:59:57 PM