Can regexp be used in RS?

Trying to use a regular expression as the record separator
but cannot seem to nail it down... Is it possible in gawk?

Currently using:

RS="\n---\n"

But wanted instead:

RS=/^---$/

or maybe:

RS="^---$"

-- 
later on,
Mike

http://topcat.hypermart.net/
Reply mss 1/7/2010 8:29:27 PM

mss wrote:

> Trying to use a regular expression for record separators
> but can not seem to nail it down... Is it possible in gawk?
> 
> Currently using:
> 
> RS="\n---\n"
> 
> But wanted instead:
> 
> RS=/^---$/
> 
> or maybe:
> 
> RS="^---$"

You have to be careful when using some characters in double quotes (like 
"$"), because the shell might interpolate them before awk sees them. Better 
use single quotes.

That said, I *think* you might be looking for this:

RS='(\n|^)---(\n|$)'

but beware that the \n's will be swallowed by RS and will not be part of the 
records. If RS='---' is enough, I suggest you use that form instead.
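
A minimal sketch of the point about swallowed newlines, assuming an awk that accepts a multi-character RS (gawk and mawk do). The sample data and the name show_records are made up for illustration:

```shell
#!/bin/sh
# Print every record between <...> so that embedded and trailing newlines
# are visible. RS is passed with -v so that awk itself expands \n.
show_records() {
    awk -v RS='\n---\n' '{ printf "record %d=<%s>\n", NR, $0 }'
}

# Invented sample input: two records separated by a "---" line.
printf 'a\nb\n---\nc\nd\n' | show_records
```

The newlines on both sides of the dashes belong to the separator, so the first record should end directly after "b", while the last record keeps the file's final newline because no separator follows it.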
Reply pk 1/7/2010 8:31:18 PM


pk wrote:

> RS='(\n|^)---(\n|$)'
> 
> but beware that the \n's will be swallowed by RS and will not be part of
> the records.

Ah, and that using an RS more than one character long is not standard, 
although many awks seem to support it.
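
Since support varies, a script could probe the awk it finds before relying on this. A rough sketch (the function name rs_multichar_supported is hypothetical, and the probe only tells a whole-string RS apart from a first-character-only RS):

```shell
#!/bin/sh
# Probe whether the awk on PATH treats a multi-character RS as one separator.
# If "XX" is honored as a whole, "aXXb" is 2 records; an awk that only uses
# the first character ("X") sees 3 records (a, empty, b).
rs_multichar_supported() {
    n=$(printf 'aXXb' | awk -v RS='XX' 'END { print NR }')
    [ "$n" -eq 2 ]
}

if rs_multichar_supported; then
    echo "multi-character RS: supported"
else
    echo "multi-character RS: not supported, split line by line instead"
fi
```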
Reply pk 1/7/2010 8:35:04 PM

pk wrote:

> You have to be careful when using some characters in double quotes (like 
> "$"), because the shell might interpolate them before awk sees them. Better 
> use single quotes.
>
> That said, I *think* you might be looking for this:
>
> RS='(\n|^)---(\n|$)'

Hey pk.

Well, in this case RS really is 3 dashes, on a line by itself,
with no whitespace.

This works: RS="(\n|^)---(\n|$)"

but isn't that essentially the same as: RS="\n---\n"?

> but beware that the \n's will be swallowed by RS and will not be part of the 
> records. If RS='---' is enough, I suggest you use that form instead.

Err well, I have FS="\n" too...

Here's my definition of a record for this data set (comments most welcome):

record 1
---
record 2

-- 
later on,
Mike

http://topcat.hypermart.net/
Reply mss 1/7/2010 8:51:50 PM

mss wrote:

> pk wrote:
> 
>> You have to be careful when using some characters in double quotes (like
>> "$"), because the shell might interpolate them before awk sees them.
>> Better use single quotes.
>>
>> That said, I *think* you might be looking for this:
>>
>> RS='(\n|^)---(\n|$)'
> 
> Hey pk.
> 
> Well, in this case RS really is 3 dashes, on a line by itself,
> with no whitespace.
> 
> This works: RS="(\n|^)---(\n|$)"
> 
> but isnt that essentially the same as: RS="\n---\n"?

Almost, except when your input file starts with "---" or ends with "---" 
(with no final \n, although this shouldn't happen in a proper text file).
 
>> but beware that the \n's will be swallowed by RS and will not be part of
>> the records. If RS='---' is enough, I suggest you use that form instead.
> 
> Err well, I have FS="\n" too...
> 
> Here's my definition of a record for this data set (comments most
> welcome):
> 
> record 1
> ---
> record 2

If the record itself does not contain "---", I'd say that just RS='---' 
would work, or perhaps RS='---\n?' (or, as I said, RS='---\n') if you don't 
want the records to start with \n. It also depends on how you want to 
process the records.

Reply pk 1/7/2010 8:53:13 PM

On 1/7/2010 2:31 PM, pk wrote:
> mss wrote:
>
>> Trying to use a regular expression for record separators
>> but can not seem to nail it down... Is it possible in gawk?
>>
>> Currently using:
>>
>> RS="\n---\n"
>>
>> But wanted instead:
>>
>> RS=/^---$/
>>
>> or maybe:
>>
>> RS="^---$"
>
> You have to be careful when using some characters in double quotes (like
> "$"), because the shell might interpolate them before awk sees them. Better
> use single quotes.
>
> That said, I *think* you might be looking for this:
>
> RS='(\n|^)---(\n|$)'
>
> but beware that the \n's will be swallowed by RS and will not be part of the
> records. If RS='---' is enough, I suggest you use that form instead.

 From the gawk manual:

-------
NOTE: Remember that in awk, the ‘^’ and ‘$’ anchor metacharacters match the 
beginning and end of a string, and not the beginning and end of a line. As a 
result, something like ‘RS = "^[[:upper:]]"’ can only match at the beginning of 
a file. This is because gawk views the input file as one long string that 
happens to contain newline characters in it. It is thus best to avoid anchor 
characters in the value of RS.
------

So, for the OP, just be aware that when you're setting RS to "^---" that's only 
to help with the case where "---" appears at the start of the file before any 
actual record. Similarly for the "$" at the end only helping if the file doesn't 
end in a newline.
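
To see the anchor behavior in isolation, here is a small sketch that counts records with an anchored RS and, for comparison, counts separator lines with an ordinary per-line match. The function names and the sample input are invented, and the first command assumes an awk with regex RS support:

```shell
#!/bin/sh
# Anchored RS: "^" refers to the start of the whole input, not of each line.
count_anchored() { awk -v RS='^---\n' 'END { print NR }'; }

# Per-line match: with the default RS, "^" and "$" apply to each line.
count_by_line() { awk '/^---$/ { n++ } END { print n + 0 }'; }

printf 'a\n---\nb\n' | count_anchored   # number of records seen
printf 'a\n---\nb\n' | count_by_line    # number of separator lines seen
```

Under gawk, going by the manual note above, the anchored RS should never match here because the input does not start with the dashes, so the whole input would come back as a single record. The per-line pattern does find the separator.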

Including the newline both before and after the dashes would introduce a problem 
with a couple of cases, though.

1) If there was no record between 2 separators, the first RS detection would 
suck up the \n before the second RS, so that second RS wouldn't match the RE and 
so would be treated as a record.

2) If there was no RS at the end of the file, then the final record would have a 
trailing newline while the others wouldn't which would be undesirable and 
probably unexpected.

Look:

$ cat file
record 1 field 1
record 1 field 2
---
---
record 3 field 1
record 3 field 2

$ awk -v FS='\n' -v RS='(^|\n)---(\n|$)' '{printf "$0=<%s>\n\t$1=<%s>,$2=<%s>\n",$0,$1,$2}' file
$0=<record 1 field 1
record 1 field 2>
         $1=<record 1 field 1>,$2=<record 1 field 2>
$0=<---
record 3 field 1
record 3 field 2
>
         $1=<--->,$2=<record 3 field 1>

Notice that one of the separator lines became part of record 3 and that record 3 
has a trailing newline (see where the ">" ended up when printing $0).

Unfortunately, I don't think there is a robust solution just using an RE as the 
RS since (ignoring files that don't end in a newline):

1) RS='(^|\n)---\n' will produce undesirable results if you have blank records 
between separators or if there's no RS at the end of the file.

2) RS='(^|\n)---' alone will produce undesirable results if you have lines that 
start with 3 dashes followed by additional text or if the file DOES end in a 
newline, as it should.

3) RS='---\n' alone will produce undesirable results if you have lines that end 
with 3 dashes preceded by additional text.
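
For option 3, the trailing newline can be trimmed inside the program. A sketch under the same assumptions (an awk with multi-character RS support, and no data lines ending in "---"); trim_records and the sample input are made up:

```shell
#!/bin/sh
# Split on "---\n", then drop the newline that each record still carries.
trim_records() {
    awk -v RS='---\n' -v FS='\n' '
    {
        sub(/\n$/, "")          # changing $0 makes awk re-split the fields
        printf "record %d: NF=%d $1=<%s>\n", NR, NF, $1
    }'
}

printf 'r1f1\nr1f2\n---\nr2f1\n' | trim_records
```

Without the sub() call, FS='\n' would see an extra empty field at the end of every record.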

Choose your poison, but if you can guarantee not to have input lines that end in 
"---" other than the RS lines, I'd go with option 3 and just deal with the fact that 
every record will have a trailing newline. If you can't, then I'd forget about 
using awk's record splitting to identify the records and instead do something like:

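awk '
# Print the accumulated record, then each of its lines as a numbered field.
function procRec(	nfPlus1,fnr,flds) {
	# Rec ends with RS, so split() returns one trailing empty element.
	nfPlus1 = split(Rec,flds,RS)
	print Rec
	for (fnr=1;fnr<nfPlus1;fnr++) {
		print fnr,flds[fnr]
	}
	Rec = ""
}
/^---$/ { procRec(); next }	# separator line: flush the current record
{ Rec = Rec $0 RS }		# any other line: append it to the current record
END { procRec() }		# flush whatever follows the last separator
' file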
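
The same line-by-line idea can be tried on the sample file from earlier in this post. This is only a sketch of a variant: split_records, emit, rec and nrec are invented names, and unlike the script above it skips an empty record at end of input:

```shell
#!/bin/sh
# Split stdin into records on lines that are exactly "---", using only
# the default line-oriented RS so that any POSIX awk should do.
split_records() {
    awk '
    function emit(    n, i, flds) {
        n = split(rec, flds, "\n")
        if (n > 0) n--          # rec ends in "\n", so the last element is empty
        printf "record %d: %d field(s)\n", ++nrec, n
        for (i = 1; i <= n; i++) printf "  %d=<%s>\n", i, flds[i]
        rec = ""
    }
    /^---$/ { emit(); next }
    { rec = rec $0 "\n" }
    END { if (rec != "") emit() }
    '
}

printf '%s\n' 'record 1 field 1' 'record 1 field 2' '---' '---' \
    'record 3 field 1' 'record 3 field 2' | split_records
```

Here the empty record between the two consecutive separators is reported as record 2 with zero fields, and no record carries a stray trailing newline.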

Regards,

	Ed.




Reply Ed 1/8/2010 1:19:17 AM

On Jan 7, 8:19 pm, Ed Morton <mortons...@gmail.com> wrote:

That's a great explanation.
Reply r 1/8/2010 1:49:24 AM

Ed Morton wrote:

>  From the gawk manual:
>
> -------
> NOTE: Remember that in awk, the ‘^’ and ‘$’ anchor metacharacters match the 
> beginning and end of a string, and not the beginning and end of a line. As a 
> result, something like ‘RS = "^[[:upper:]]"’ can only match at the beginning of 
> a file. This is because gawk views the input file as one long string that 
> happens to contain newline characters in it. It is thus best to avoid anchor 
> characters in the value of RS.
> ------

Just read this in the manual as well...

Will study the rest of your post (thanks).

-- 
later on,
Mike

http://topcat.hypermart.net/
Reply mss 1/8/2010 2:08:07 AM
