How to have word boundaries in regular expressions in mawk?

Hi,

I'm trying to write a script that first reads a list of words, and
then, for each file in the argument list, checks for all occurrences
of the words in all lines and writes a file with the words that were
present.

Pretty simple, eh?

Actually not, if some words can be substrings of others, and you're
using mawk. I wrote a script that I thought should work -- and it does
work on my test case in gawk -- but mawk seems not to recognize "\<",
"\>", or "\y".

A simplified version of my script, "findvars", looks like:
-------------
#!/usr/bin/awk -f
#
# findvars
#
BEGIN { IGNORECASE = 1;  namesfile = ""; }
NR==1 {
  if (length(namesfile) == 0)  exit # actually with an error message

  while ((getline nraw < namesfile) > 0)
  { varre[nraw] = "\\<" nraw "\\>"; ++nn }
}
#
NF==0  { next }
/^[Cc]/  { next } # guess what kind of files I'm processing
/^ *[!]/  { next }
#
{
  sub(/[!].*$/, "")
  for (nraw in varre)
     if ($0 ~ varre[nraw]) { used[nraw] = 1; ++nm }
}
#
END {
  if (nm)
  {  usesfile = "used-from-" namesfile;
      for (nraw in used)
      { if (np++)
            printf ",%s", nraw >> usesfile;
         else
            printf ", only: %s", nraw > usesfile;
      }
      printf "\n" >> usesfile;
  }
}
-------------

A typical file of names would be  "barfmodvars"  looking like this:
--------------
xbarf
honk
honker
splat
xbarf1
xbarf2
barf
--------------
(these would be variables declared in a Fortran-90 module)

A typical file to process would be "sloppycode.f" with lines like this:
--------------
      subroutine sloppycode(in1,in2,outs)
      use barfmod
      implicit real(a-h,o-z)

      splat(:) = in1*xbarf1(:)+in2*xbarf2(:)
      outs = sum(splat)/honker
      return
      end
--------------

Then we apply the script with a command, e.g.
~> ./findvars namesfile=barfmodvars sloppycode.f

...and hope to get a file, "used-from-barfmodvars", with a
", only: ..." clause that we can paste into the source file.

This works fine with gawk (at home), but the computers at work have
mawk. On mawk, the word boundary metacharacters apparently don't work,
and the result is no matches for any names.

The man page for mawk says it uses "extended regular expressions as
with egrep (1)." However, the local egrep recognizes word boundary
metacharacters just fine.

Am I missing something here, or is there a work-around for this?
Or should I just download and install gawk?

Regards,
Fred
Reply fred 5/8/2009 1:17:52 AM

Mawk doesn't support the word matching operators. I think it doesn't support
IGNORECASE either.  If you need those features, just download and install gawk.

Arnold

-- 
Aharon (Arnold) Robbins 				arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
Reply arnold 5/8/2009 8:14:05 AM
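
For anyone who has to stay on mawk, one possible workaround is to
emulate the word boundaries with ordinary bracket expressions and to
replace IGNORECASE with tolower(). The sketch below is not from the
thread; it makes several assumptions that may not hold for your data:
a "word" character is a letter, digit, or underscore (as in Fortran
identifiers), the names contain no regex metacharacters, and the
function name find_used is made up for illustration. It prints the
matching names to standard output instead of writing the
"used-from-..." file.

```shell
# find_used NAMESFILE SOURCEFILE...   (hypothetical helper name)
# Prints each name from NAMESFILE that occurs as a whole word in the sources.
# Uses only POSIX awk features, so it should not depend on gawk extensions.
find_used() {
    names=$1; shift
    awk -v namesfile="$names" '
    BEGIN {
        # Build one regex per name: the name with a non-word character
        # on each side. tolower() stands in for IGNORECASE.
        while ((getline name < namesfile) > 0)
            if (name != "")
                re[name] = "[^a-z0-9_]" tolower(name) "[^a-z0-9_]"
        close(namesfile)
    }
    NF == 0   { next }           # blank line
    /^[Cc]/   { next }           # fixed-form comment line
    /^ *[!]/  { next }           # full-line "!" comment
    {
        sub(/[!].*$/, "")        # strip a trailing comment
        # Pad with blanks so a name at the start or end of the line
        # still has a non-word character on both sides.
        line = " " tolower($0) " "
        for (name in re)
            if (!(name in used) && line ~ re[name]) used[name] = 1
    }
    END { for (name in used) print name }
    ' "$@" | sort
}
```

With the sample files from the original post, "find_used barfmodvars
sloppycode.f" should report honker, splat, xbarf1 and xbarf2, and
should not report honk, xbarf or barf, because in each of those cases
the neighbouring character is itself a word character. Padding the
line with blanks avoids the need for "^" and "$" inside an
alternation, which keeps the pattern as plain as possible.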

