Trying to use a regular expression as the record separator
but cannot seem to nail it down... Is it possible in gawk?
Currently using:
RS="\n---\n"
But wanted instead:
RS=/^---$/
or maybe:
RS="^---$"
--
later on,
Mike
http://topcat.hypermart.net/
[mss, 1/7/2010 8:29:27 PM]

mss wrote:
> Trying to use a regular expression for record separators
> but can not seem to nail it down... Is it possible in gawk?
>
> Currently using:
>
> RS="\n---\n"
>
> But wanted instead:
>
> RS=/^---$/
>
> or maybe:
>
> RS="^---$"
You have to be careful when using some characters in double quotes (like
"$"), because the shell might interpolate them before awk sees them. Better
use single quotes.
That said, I *think* you might be looking for this:
RS='(\n|^)---(\n|$)'
but beware that the \n's will be swallowed by RS and will not be part of the
records. If RS='---' is enough, I suggest you use that form instead.
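To make the quoting point concrete, here is a small sketch (the function names are invented for the demonstration). Inside a shell function, $1 is the function's first argument, so the double-quoted program is rewritten by the shell before awk ever sees it:

```shell
# Single quotes: awk receives $1 untouched and prints the first field.
with_single() { printf 'a b\n' | awk '{ print $1 }'; }

# Double quotes: the shell substitutes the function argument for $1,
# so calling with_double NF makes awk run { print NF } instead.
with_double() { printf 'a b\n' | awk "{ print $1 }"; }

with_single NF    # prints: a
with_double NF    # prints: 2
```

In the RS values from this thread the $ happens to be followed by ) or by the closing quote, which common POSIX shells leave alone, but single quotes avoid having to think about it.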
[pk, 1/7/2010 8:31:18 PM]

pk wrote:
> RS='(\n|^)---(\n|$)'
>
> but beware that the \n's will be swallowed by RS and will not be part of
> the records.
Ah, and note that an RS more than one character long is not standard
(POSIX leaves the result unspecified), although many awks seem to support it.
[pk, 1/7/2010 8:35:04 PM]

pk wrote:
> You have to be careful when using some characters in double quotes (like
> "$"), because the shell might interpolate them before awk sees them. Better
> use single quotes.
>
> That said, I *think* you might be looking for this:
>
> RS='(\n|^)---(\n|$)'
Hey pk.
Well, in this case RS really is 3 dashes, on a line by itself,
with no whitespace.
This works: RS="(\n|^)---(\n|$)"
but isn't that essentially the same as: RS="\n---\n"?
> but beware that the \n's will be swallowed by RS and will not be part of the
> records. If RS='---' is enough, I suggest you use that form instead.
Err well, I have FS="\n" too...
Here's my definition of a record for this data set (comments most welcome):
record 1
---
record 2
[mss, 1/7/2010 8:51:50 PM]

mss wrote:
> pk wrote:
>
>> You have to be careful when using some characters in double quotes (like
>> "$"), because the shell might interpolate them before awk sees them.
>> Better use single quotes.
>>
>> That said, I *think* you might be looking for this:
>>
>> RS='(\n|^)---(\n|$)'
>
> Hey pk.
>
> Well, in this case RS really is 3 dashes, on a line by itself,
> with no whitespace.
>
> This works: RS="(\n|^)---(\n|$)"
>
> but isnt that essentially the same as: RS="\n---\n"?
Almost, except when your input file starts with "---" or ends with "---"
(with no final \n, although this shouldn't happen in a proper text file).
>> but beware that the \n's will be swallowed by RS and will not be part of
>> the records. If RS='---' is enough, I suggest you use that form instead.
>
> Err well, I have FS="\n" too...
>
> Here's my definition of a record for this data set (comments most
> welcome):
>
> record 1
> ---
> record 2
If the record itself does not contain "---", I'd say that just RS='---'
would work, or perhaps RS='---\n?' (or, as I said, RS='---\n') if you don't
want the records to start with \n. It also depends on how you want to
process the records.
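For illustration, a minimal sketch of the RS='---\n' form combined with FS='\n', assuming an awk that accepts a multi-character RS (gawk does; POSIX awk need not). The function name and the sample data are made up for this example. Each record keeps the newline that preceded its separator, so the sketch strips it before looking at the fields:

```shell
# Hypothetical helper: split stdin on "---" followed by a newline
# and report each record.
# Assumes gawk (or another awk that supports a multi-character RS).
show_records() {
    awk -v RS='---\n' -v FS='\n' '{
        sub(/\n$/, "")    # drop the trailing newline left in the record
        print NR ": " NF " field(s), first=" $1
    }'
}

printf 'r1f1\nr1f2\n---\nr2f1\n---\n' | show_records
```

With gawk this should report two records, the first with two fields and the second with one. Without the sub() call, every record would show one extra, empty field.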
[pk, 1/7/2010 8:53:13 PM]

On 1/7/2010 2:31 PM, pk wrote:
> mss wrote:
>
>> Trying to use a regular expression for record separators
>> but can not seem to nail it down... Is it possible in gawk?
>>
>> Currently using:
>>
>> RS="\n---\n"
>>
>> But wanted instead:
>>
>> RS=/^---$/
>>
>> or maybe:
>>
>> RS="^---$"
>
> You have to be careful when using some characters in double quotes (like
> "$"), because the shell might interpolate them before awk sees them. Better
> use single quotes.
>
> That said, I *think* you might be looking for this:
>
> RS='(\n|^)---(\n|$)'
>
> but beware that the \n's will be swallowed by RS and will not be part of the
> records. If RS='---' is enough, I suggest you use that form instead.
From the gawk manual:
-------
NOTE: Remember that in awk, the '^' and '$' anchor metacharacters match the
beginning and end of a string, and not the beginning and end of a line. As a
result, something like 'RS = "^[[:upper:]]"' can only match at the beginning of
a file. This is because gawk views the input file as one long string that
happens to contain newline characters in it. It is thus best to avoid anchor
characters in the value of RS.
------
So, for the OP, just be aware that when you're setting RS to "^---" that's only
to help with the case where "---" appears at the start of the file before any
actual record. Similarly for the "$" at the end only helping if the file doesn't
end in a newline.
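
The anchor behaviour quoted from the manual can be checked with match() on a plain string, with no RS involved. A small sketch, using a made-up sample string and function name: the OP's /^---$/ never matches inside a multi-line string, while the newline-delimited form does.

```shell
# Sketch: ^ and $ anchor to the whole string, not to individual lines.
anchor_demo() {
    awk 'BEGIN {
        s = "record 1\n---\nrecord 2"
        anchored  = match(s, /^---$/)     # 0: "---" is not the entire string
        delimited = match(s, /\n---\n/)   # position of the newline before "---"
        print anchored, delimited
    }'
}

anchor_demo    # prints: 0 9
```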
Including the newline both before and after the dashes would introduce a problem
with a couple of cases, though.
1) If there was no record between 2 separators, the first RS detection would
suck up the \n before the second RS, so that second RS wouldn't match the RE and
so would be treated as a record.
2) If there was no RS at the end of the file, then the final record would have a
trailing newline while the others wouldn't which would be undesirable and
probably unexpected.
Look:
$ cat file
record 1 field 1
record 1 field 2
---
---
record 3 field 1
record 3 field 2

$ awk -v FS='\n' -v RS='(^|\n)---(\n|$)' '{printf "$0=<%s>\n\t$1=<%s>,$2=<%s>\n",$0,$1,$2}' file
$0=<record 1 field 1
record 1 field 2>
	$1=<record 1 field 1>,$2=<record 1 field 2>
$0=<---
record 3 field 1
record 3 field 2
>
	$1=<--->,$2=<record 3 field 1>

Notice that one of the separator lines became part of record 3 and that record 3
has a trailing newline (see where the ">" ended up when printing $0).
Unfortunately, I don't think there is a robust solution just using an RE as the
RS since (ignoring files that don't end in a newline):
1) RS='(^|\n)---\n' will produce undesirable results if you have blank records
between separators or if there's no RS at the end of the file.
2) RS='(^|\n)---' alone will produce undesirable results if you have lines that
start with 3 dashes followed by additional text or if the file DOES end in a
newline, as it should.
3) RS='---\n' alone will produce undesirable results if you have lines that end
with 3 dashes preceded by additional text.
Choose your poison, but if you can guarantee not to have input lines that end in
"---" other than the RS lines, I'd go with "3" and just deal with the fact that
every record will have a trailing newline. If you can't, then I'd forget about
using awk's record splitting to identify the records and instead do something like:
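awk '
# Process one accumulated record, then reset the buffer.
function procRec(    nfPlus1, fnr, flds) {
    # Rec ends with a trailing RS, so split() returns one more
    # element than there are fields; the last element is empty.
    nfPlus1 = split(Rec, flds, RS)
    print Rec
    for (fnr = 1; fnr < nfPlus1; fnr++) {
        print fnr, flds[fnr]
    }
    Rec = ""
}
/^---$/ { procRec(); next }    # separator line: emit the record
        { Rec = Rec $0 RS }    # any other line: append it to the buffer
END     { procRec() }          # emit the final record
' file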
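
A variation on the same line-by-line idea, as a sketch rather than a drop-in replacement: it joins lines with newlines without leaving a trailing one, and reports how many fields each record has, including an empty record between two adjacent separators. The function name, the variable names, and the output format are assumptions for illustration.

```shell
# Hypothetical helper: read lines, treat a line that is exactly "---"
# as the record separator, and report the field count of each record.
split_records() {
    awk '
    function emit(    n, flds) {
        n = split(rec, flds, "\n")    # 0 for an empty record
        printf "record %d: %d field(s)\n", ++count, n
        rec = ""; have = 0
    }
    /^---$/ { emit(); next }
            { rec = (have ? rec "\n" : "") $0; have = 1 }
    END     { if (have) emit() }
    ' "$@"
}

printf 'record 1 field 1\nrecord 1 field 2\n---\n---\nrecord 3 field 1\nrecord 3 field 2\n' | split_records
# record 1: 2 field(s)
# record 2: 0 field(s)
# record 3: 2 field(s)
```

Because this relies only on the default line-at-a-time reading, it does not depend on regexp or multi-character RS support.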
Regards,
Ed.
[Ed, 1/8/2010 1:19:17 AM]

On Jan 7, 8:19 pm, Ed Morton <mortons...@gmail.com> wrote:
> [snip]
That's a great explanation.
[r, 1/8/2010 1:49:24 AM]

Ed Morton wrote:
> From the gawk manual:
>
> -------
> NOTE: Remember that in awk, the '^' and '$' anchor metacharacters match the
> beginning and end of a string, and not the beginning and end of a line. As a
> result, something like 'RS = "^[[:upper:]]"' can only match at the beginning of
> a file. This is because gawk views the input file as one long string that
> happens to contain newline characters in it. It is thus best to avoid anchor
> characters in the value of RS.
> ------
Just read this in the manual as well...
Will study the rest of your post (thanks).
[mss, 1/8/2010 2:08:07 AM]