question about gsub

here is a snippet from a large program that behaves the same way in a 
standalone "file":
{gsub(" the | of "," ")}1
echo hair of the dog | gawk -f file
hair the dog
was expecting: hair  dog

same output if {gsub(" of | the "," ")}1 is in "file"

my question is why? Am I misunderstanding the "|" operator in gsub or is 
there something magic about " the |" or "| the"? same for:
Gnu Awk (gawk) 3.0, patchlevel 0
and
GNU Awk 3.1.6

windows XP pro sp2.
-- 
(^\pop/^)
I'm lost... I've gone to look for myself.
If I should return before I get back, keep me here.
--
Reply pop 5/25/2009 12:54:33 PM

pop wrote:
> here is a snippet from a large program that behaves the same way in a 
> standalone "file":
> {gsub(" the | of "," ")}1
> echo hair of the dog | gawk -f file
> hair the dog
> was expecting: hair  dog
> 
> same output if {gsub(" of | the "," ")}1 is in "file"
> 
> my question is why?

See this simpler example that might make it more apparent...

   echo abc | awk '{gsub(/ab|bc/,"X")}1'

In your case; why do you think that the blank between "hair of"
and "the dog" should be considered in two substitutions?

Now try...

   echo abbc | awk '{gsub(/ab|bc/,"X")}1'

or (with two inner spaces)...

   echo "hair of  the dog" | awk '{gsub(/ the | of /," ")}1'

to see what happens. Makes sense, don't you think?
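
For anyone who cannot run them right now, here is roughly what those three commands print. This is only a sketch: it assumes a POSIX awk on the PATH, and try_gsub is a made-up helper name, not part of the original program.

```shell
# try_gsub LINE: run the simplified gsub over LINE and print the result.
try_gsub() {
    printf '%s\n' "$1" | awk '{ gsub(/ab|bc/, "X") } 1'
}

try_gsub abc     # prints Xc  ("ab" is consumed first, so "bc" can no longer match)
try_gsub abbc    # prints XX  ("ab" and "bc" do not share a character here)

# Two inner spaces: each keyword has its own leading and trailing blank,
# so both alternatives match in a single pass.
echo "hair of  the dog" | awk '{ gsub(/ the | of /, " ") } 1'   # prints "hair  dog"
```

The only difference between the one-space and two-space inputs is whether the two matches would have to share a character.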

Janis

> Am I misunderstanding the "|" operator in gsub or is 
> there something magic about " the |" or "| the"? same for:
> Gnu Awk (gawk) 3.0, patchlevel 0
> and
> GNU Awk 3.1.6
> 
> windows XP pro sp2.
Reply Janis 5/25/2009 1:07:24 PM


On Monday 25 May 2009 14:54, pop wrote:

> here is a snippet from a large program that behaves the same way in a
> standalone "file":
> {gsub(" the | of "," ")}1
> echo hair of the dog | gawk -f file
> hair the dog
> was expecting: hair  dog

The first match encountered in the string is " of ", which is replaced
with " ", leaving

hair the dog

gsub then continues to scan the string starting from the "t" in "the" (which
is the character after the end of the previous match), and no further match
is found. So the end result you see is correct.
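
One way to check this, sketched under the assumption of a POSIX awk, is to print gsub's return value, which is the number of replacements it made. The helper name count_subs is hypothetical.

```shell
# count_subs LINE: print how many replacements gsub made, then the result.
count_subs() {
    printf '%s\n' "$1" | awk '{ n = gsub(/ the | of /, " "); print n ": " $0 }'
}

count_subs "hair of the dog"    # 1: hair the dog
count_subs "hair of  the dog"   # 2: hair  dog
```

With a single blank between "of" and "the", that blank belongs to the first match, so only one replacement is counted.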

Reply pk 5/25/2009 1:19:35 PM

pk said the following on 5/25/2009 8:19 AM:
> On Monday 25 May 2009 14:54, pop wrote:
> 
>> here is a snippet from a large program that behaves the same way in a
>> standalone "file":
>> {gsub(" the | of "," ")}1
>> echo hair of the dog | gawk -f file
>> hair the dog
>> was expecting: hair  dog
> 
> The first match encountered in the string is " of ", which is replaced
> with " ", leaving
> 
> hair the dog
> 
> gsub then continues to scan the string starting from the "t" in "the" (which
> is the character after the end of the previous match), and no further match
> is found. So the end result you see is correct.
> 
OK - thanks; that makes sense now but it was sure unexpected. Solved by 
using separate gsubs. Sure do encounter unexpected situations in the 
programming world :)

Reply pop 5/25/2009 1:25:30 PM

Janis Papanagnou said the following on 5/25/2009 8:07 AM:
> pop wrote:
>> here is a snippet from a large program that behaves the same way in a 
>> standalone "file":
>> {gsub(" the | of "," ")}1
>> echo hair of the dog | gawk -f file
>> hair the dog
>> was expecting: hair  dog
>>
>> same output if {gsub(" of | the "," ")}1 is in "file"
>>
>> my question is why?
> 
> See this simpler example that might make it more apparent...
> 
>   echo abc | awk '{gsub(/ab|bc/,"X")}1'
> 
> In your case; why do you think that the blank between "hair of"
> and "the dog" should be considered in two substitutions?
> 
> Now try...
> 
>   echo abbc | awk '{gsub(/ab|bc/,"X")}1'
> 
> or (with two inner spaces)...
> 
>   echo "hair of  the dog" | awk '{gsub(/ the | of /," ")}1'
> 
> to see what happens. Makes sense, don't you think?
> 
> Janis
> 
>> Am I misunderstanding the "|" operator in gsub or is there something 
>> magic about " the |" or "| the"? same for:
>> Gnu Awk (gawk) 3.0, patchlevel 0
>> and
>> GNU Awk 3.1.6
>>
>> windows XP pro sp2.
OK - thanks; that makes sense now but it was sure unexpected. Solved by 
using separate gsubs. Sure do encounter unexpected situations in the 
programming world :)

Reply pop 5/25/2009 1:26:00 PM

On May 25, 8:25 am, pop <p_...@hotmail.com> wrote:
> pk said the following on 5/25/2009 8:19 AM:
>
>
>
> > On Monday 25 May 2009 14:54, pop wrote:
>
> >> here is a snippet from a large program that behaves the same way in a
> >> standalone "file":
> >> {gsub(" the | of "," ")}1
> >> echo hair of the dog | gawk -f file
> >> hair the dog
> >> was expecting: hair  dog
>
> > The first match encountered in the string is " of ", which is replaced
> > with " ", leaving
>
> > hair the dog
>
> > gsub then continues to scan the string starting from the "t" in "the" (which
> > is the character after the end of the previous match), and no further match
> > is found. So the end result you see is correct.
>
> OK - thanks; that makes sense now but it was sure unexpected. Solved by
> using separate gsubs. Sure do encounter unexpected situations in the
> programming world :)

There's almost certainly no need for coding separate calls to gsub().
If you tell us IN WORDS what it is you're trying to do, we could
probably tell you how to do it by writing one call to gsub(). For
example, if you don't really want to add spurious white-space between
words, then one of these might be what you really want:

$ echo hair of the dog | awk '{while (gsub(/(the|of) /,""));}1'
hair dog

$ echo hair of the dog | awk '{gsub(/ (the|of)\>/,"")}1'
hair dog

That last one is, I believe, gawk specific.
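
If gawk's \> is not available, something along these lines may work in other awks. It is a sketch, not a drop-in answer: it assumes words are separated by blanks, and strip_words is just an illustrative name.

```shell
# strip_words LINE: remove the whole words "the" and "of" without \< or \>.
strip_words() {
    printf '%s\n' "$1" | awk '{
        # Pad the line so the first and last words have a blank on both sides.
        $0 = " " $0 " "
        # Repeat, because two adjacent keywords share the blank between them.
        while (gsub(/ (the|of) /, " ")) ;
        # Drop the padding again.
        sub(/^ /, ""); sub(/ $/, "")
        print
    }'
}

strip_words "hair of the dog"   # hair dog
strip_words "the end of"        # end
```

The padding is what lets the same pattern catch a keyword at the very start or end of the line.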

Regards,

    Ed.
Reply Ed 5/25/2009 2:55:32 PM

In article <gve50c$k5q$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>On Monday 25 May 2009 14:54, pop wrote:
>
>> here is a snippet from a large program that behaves the same way in a
>> standalone "file":
>> {gsub(" the | of "," ")}1
>> echo hair of the dog | gawk -f file
>> hair the dog
>> was expecting: hair  dog
>
>The first match encountered in the string is " of ", which is replaced
>with " ", leaving
>
>hair the dog
>
>gsub then continues to scan the string starting from the "t" in "the" (which
>is the character after the end of the previous match), and no further match
>is found. So the end result you see is correct.

Note that, sometimes, what you want is:

    while (sub(...));

instead of gsub().  I use this from time to time; it just keeps doing
the sub until it fails.
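
A minimal sketch of that idiom, assuming a POSIX awk. The pattern is borrowed from the question purely as a stand-in (the real arguments depend on your data), and sub_until_done is a made-up name.

```shell
# sub_until_done LINE: apply one sub() at a time until nothing matches.
sub_until_done() {
    printf '%s\n' "$1" | awk '{
        # Each sub() rescans from the start of the line, so a blank left
        # behind by one replacement can take part in the next match.
        while (sub(/ (the|of) /, " ")) ;
        print
    }'
}

sub_until_done "hair of the dog"   # hair dog
```

The loop only terminates if the replacement cannot keep producing new matches. Here every substitution shortens the line, so it must stop.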

Reply gazelle 5/25/2009 3:07:22 PM

Ed Morton said the following on 5/25/2009 9:55 AM:
> On May 25, 8:25 am, pop <p_...@hotmail.com> wrote:
>> pk said the following on 5/25/2009 8:19 AM:
>>
>>
>>
<snip>
>> OK - thanks; that makes sense now but it was sure unexpected. Solved by
>> using separate gsubs. Sure do encounter unexpected situations in the
>> programming world :)
> 
> There's almost certainly no need for coding separate calls to gsub().
> If you tell us IN WORDS what it is you're trying to do, we could
> probably tell you how to do it by writing one call to gsub(). For
> example, if you don't really want to add spurious white-space between
> words, then one of these might be what you really want:
> 
> $ echo hair of the dog | awk '{while (gsub(/(the|of) /,""));}1'
> hair dog
> 
> $ echo hair of the dog | awk '{gsub(/ (the|of)\>/,"")}1'
> hair dog
> 
> That last one is, I believe, gawk specific.
> 
> Regards,
> 
>     Ed.
Thanks - actually I used:
echo hair of the dog|awk '{while(gsub(" (the|of) "," "));}1'
hair dog
in the final program to accomplish what I needed. As for an
explanation: I was removing all "connectors/articles, etc." such as
"the,a,an,and,or,in,if,..." from titles of books, movies, plays, etc., just
leaving the significant keywords for a personal project I am working on.

Reply pop 5/25/2009 3:44:55 PM

On May 25, 10:44 am, pop <p_...@hotmail.com> wrote:
> Ed Morton said the following on 5/25/2009 9:55 AM:
>
>
>
> > On May 25, 8:25 am, pop <p_...@hotmail.com> wrote:
> >> pk said the following on 5/25/2009 8:19 AM:
>
> <snip>
> >> OK - thanks; that makes sense now but it was sure unexpected. Solved by
> >> using separate gsubs. Sure do encounter unexpected situations in the
> >> programming world :)
>
> > There's almost certainly no need for coding separate calls to gsub().
> > If you tell us IN WORDS what it is you're trying to do, we could
> > probably tell you how to do it by writing one call to gsub(). For
> > example, if you don't really want to add spurious white-space between
> > words, then one of these might be what you really want:
>
> > $ echo hair of the dog | awk '{while (gsub(/(the|of) /,""));}1'
> > hair dog
>
> > $ echo hair of the dog | awk '{gsub(/ (the|of)\>/,"")}1'
> > hair dog
>
> > That last one is, I believe, gawk specific.
>
> > Regards,
>
> >     Ed.
>
> Thanks - actually I used:
> echo hair of the dog|awk '{while(gsub(" (the|of) "," "));}1'
> hair dog
> in the final program to accomplish what I needed. As for an
> explanation: I was removing all "connectors/articles, etc." such as
> "the,a,an,and,or,in,if,..." from titles of books, movies, plays, etc., just
> leaving the significant keywords for a personal project I am working on.
>

The way you're approaching it won't work, then, as it'll leave in
words suffixed with punctuation marks, words that contain
capitalisation, etc. You could do this instead with GNU awk:

$ echo "The ghost walked in!" | awk 'BEGIN{IGNORECASE=1}
{gsub(/\<(the|a|an|and|or|in|if)\>/,"")}1'
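
Without gawk's IGNORECASE and word-boundary operators, a field-by-field filter is one possible alternative. The sketch below assumes whitespace-separated words, an illustrative stop list, and a hypothetical helper name drop_stopwords. Unlike the gsub above, it drops a matching word together with any punctuation attached to it, and it collapses runs of blanks.

```shell
# drop_stopwords LINE: print LINE without common connector words.
drop_stopwords() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        # Hypothetical stop list; extend as needed.
        n = split("the a an and or in if of", w, " ")
        for (i = 1; i <= n; i++) stop[w[i]] = 1
    }
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            key = tolower($i)
            # Compare letters only, ignoring case and punctuation.
            gsub(/[^a-z]/, "", key)
            if (!(key in stop)) out = out (out == "" ? "" : " ") $i
        }
        print out
    }'
}

drop_stopwords 'The ghost walked in!'   # ghost walked
```

Keeping the words in a table means the list can grow without the regular expression becoming unreadable.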

Regards,

    Ed.

Reply Ed 5/25/2009 4:53:26 PM

Ed Morton said the following on 5/25/2009 11:53 AM:
> On May 25, 10:44 am, pop <p_...@hotmail.com> wrote:
>> Ed Morton said the following on 5/25/2009 9:55 AM:
>>
>>
>>
>>> On May 25, 8:25 am, pop <p_...@hotmail.com> wrote:
>>>> pk said the following on 5/25/2009 8:19 AM:
>> <snip>
>>>> OK - thanks; that makes sense now but it was sure unexpected. Solved by
>>>> using separate gsubs. Sure do encounter unexpected situations in the
>>>> programming world :)
>>> There's almost certainly no need for coding separate calls to gsub().
>>> If you tell us IN WORDS what it is you're trying to do, we could
>>> probably tell you how to do it by writing one call to gsub(). For
>>> example, if you don't really want to add spurious white-space between
>>> words, then one of these might be what you really want:
>>> $ echo hair of the dog | awk '{while (gsub(/(the|of) /,""));}1'
>>> hair dog
>>> $ echo hair of the dog | awk '{gsub(/ (the|of)\>/,"")}1'
>>> hair dog
>>> That last one is, I believe, gawk specific.
>>> Regards,
>>>     Ed.
>> Thanks - actually I used:
>> echo hair of the dog|awk '{while(gsub(" (the|of) "," "));}1'
>> hair dog
>> in the final program to accomplish what I needed. As for an
>> explanation: I was removing all "connectors/articles, etc." such as
>> "the,a,an,and,or,in,if,..." from titles of books, movies, plays, etc., just
>> leaving the significant keywords for a personal project I am working on.
>>
> 
> The way you're approaching it won't work, then, as it'll leave in
> words suffixed with punctuation marks, words that contain
> capitalisation, etc. You could do this instead with GNU awk:
> 
> $ echo "The ghost walked in!" | awk 'BEGIN{IGNORECASE=1}
> {gsub(/\<(the|a|an|and|or|in|if)\>/,"")}1'
> 
> Regards,
> 
>     Ed.
> 
Good idea! Thanks, I hadn't thought of that... I'll use it.

Reply pop 5/25/2009 5:26:50 PM
