gensub() question



How does one delete duplicate adjacent words
(leaving only one occurrence of the doubled word)
using regex in gawk?


before: "the the"

after: "the"


These fail:

echo "the the" | gawk '{print gensub(/\<([a-z]+) +\\1\>/, "\\1", "")}'

echo "the the" | gawk '{print gensub(/\<([a-z]+) +\1\>/, "\\1", "")}'


-- 
later on,
Mike

http://www.topcat.hypermart.net/index.html
Reply mss 1/13/2011 6:06:07 PM

On 13.01.2011 19:06, mss wrote:
> How does one delete duplicate adjacent words
> (leaving only one occurrence of the doubled word)
> using regex in gawk?
> 
> 
> before: "the the"
> 
> after: "the"
> 
> 
> These fail:
> 
> echo "the the" | gawk '{print gensub(/\<([a-z]+) +\\1\>/, "\\1", "")}'
> 
> echo "the the" | gawk '{print gensub(/\<([a-z]+) +\1\>/, "\\1", "")}'
> 
> 

According to the gawk manual, components of the regexp may be referenced
in the _replacement_ text, so back-references in the regexp itself do not
seem to exist in gawk.

Note, BTW, that back-references, even though they are supported by some
"regexp" parser implementations, exceed the class of regular expressions.

A workaround is to iterate over the fields and compare adjacent ones;

  { for (i=1; i<NF; i++) if ($i == $(i+1)) $i = "" ; print }

with the usual caveat that you change the white space characters that
way, so the above may need some more tweaks. The same problem arises if
you print the fields (conditionally); then the white space between fields
may be changed depending on the data.
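
One way to avoid the empty fields that `$i = ""` leaves behind is to rebuild the line from the fields that are kept. This is only a sketch: `dedupe_fields` is a made-up helper name, a "word" is assumed to be a whitespace-separated field, and the original spacing is still lost because the fields are rejoined with single spaces.

```shell
# Hypothetical helper: collapse runs of identical adjacent fields.
# Works in any POSIX awk; fields are rejoined with single spaces.
dedupe_fields() {
    awk '{
        out = ""
        prev = ""
        for (i = 1; i <= NF; i++) {
            # Keep a field unless it repeats the field just before it.
            # Appending "" forces a string comparison, so 1 and 1.0 differ.
            if (i == 1 || ($i "") != (prev ""))
                out = out (out == "" ? "" : " ") $i
            prev = $i
        }
        print out
    }'
}
```

For example, `echo "the the cat  sat sat" | dedupe_fields` prints `the cat sat`.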

Janis
Reply Janis 1/13/2011 6:33:27 PM


On Jan 13, 12:06 pm, mss <m...@dev.null> wrote:
> How does one delete duplicate adjacent words
> (leaving only one occurrence of the doubled word)
> using regex in gawk?
>
> before: "the the"
>
> after: "the"
>
> These fail:
>
> echo "the the" | gawk '{print gensub(/\<([a-z]+) +\\1\>/, "\\1", "")}'
>
> echo "the the" | gawk '{print gensub(/\<([a-z]+) +\1\>/, "\\1", "")}'
>
> --
> later on,
> Mike
>
> http://www.topcat.hypermart.net/index.html

How do you define a "word", how do you define "adjacent", and does
case matter?

For example:

a) Are "it's", "ill-mannered", "Ke$ha", each one word or 2 or
something else?
b) Are "the-the", "the, the", or "The THE" examples of adjacent
"the"s or something else?

     Ed.
Reply Ed 1/13/2011 7:31:41 PM

Janis Papanagnou wrote:

> According to the gawk manual, components of the regexp may be referenced
> in the _replacement_ text, so back-references in the regexp itself do not
> seem to exist in gawk.
>
> Note, BTW, that back-references, even though they are supported by some
> "regexp" parser implementations, exceed the class of regular expressions.
>
> A workaround is to iterate over the fields and compare adjacent ones;
>
>   { for (i=1; i<NF; i++) if ($i == $(i+1)) $i = "" ; print }
>
> with the usual caveat that you change the white space characters that
> way, so the above may need some more tweaks. The same problem arises if
> you print the fields (conditionally); then the white space between fields
> may be changed depending on the data.

Thanks for the help Janis, appreciate it.

Yes it seems the replacement parameter of
gensub() can make use of backreferences. 
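
A small sketch of what that looks like in practice, assuming gawk is installed: `\\1` and `\\2` in the replacement string refer to the parenthesized groups of the regexp. The function name `swap_first_two` and the sample input are only illustrations.

```shell
# Hypothetical helper: swap the first two space-separated words of each line.
# "\\2 \\1" is the awk string \2 \1, which gensub() expands to the text
# captured by the second and first parenthesized groups.
swap_first_two() {
    gawk '{ print gensub(/([^ ]+) +([^ ]+)/, "\\2 \\1", 1) }'
}
```

With this, `echo "hello world" | swap_first_two` prints `world hello`. The groups are only available on the replacement side; inside the regexp itself gawk does not treat `\1` as a back-reference.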

-- 
later on,
Mike

http://www.topcat.hypermart.net/index.html
Reply mss 1/13/2011 8:59:26 PM

Ed Morton wrote:

> How do you define a "word", how do you define "adjacent", and does
> case matter?
>
> For example:
>
> a) Are "it's", "ill-mannered", "Ke$ha", each one word or 2 or
> something else?
> b) Are "the-the", "the, the", or "The THE" examples of adjacent
> "the"s or something else?

Well, those are good questions Ed. I honestly haven't yet defined
what all a word might encompass...

I've been studying Jeffrey Friedl's 'Mastering Regular Expressions'
and was hoping to apply the lessons contained therein to gawk
(understanding on this end of course that some modification would
be necessary), and one of the examples tackled doubled-words like
'the the'.

It's easy enough to do in native awk by simply comparing fields
(as Janis too notes):

if ($2 == $1) {etc...}

But then the question of backreferences came to mind.

-- 
later on,
Mike

http://www.topcat.hypermart.net/index.html
Reply mss 1/13/2011 9:17:34 PM

On 1/13/2011 3:17 PM, mss wrote:
> Ed Morton wrote:
>
>> How do you define a "word", how do you define "adjacent", and does
>> case matter?
>>
>> For example:
>>
>> a) Are "it's", "ill-mannered", "Ke$ha", each one word or 2 or
>> something else?
>> b) Are "the-the", "the, the", or "The THE" examples of adjacent
>> "the"s or something else?
>
> Well, those are good questions Ed. I honestly haven't yet defined
> what all a word might encompass...
>
> I've been studying Jeffrey Friedl's 'Mastering Regular Expressions'
> and was hoping to apply the lessons contained therein to gawk
> (understanding on this end of course that some modification would
> be necessary), and one of the examples tackled doubled-words like
> 'the the'.
>
> It's easy enough to do in native awk by simply comparing fields
> (as Janis too notes):
>
> if ($2 == $1) {etc...}
>
> But then the question of backreferences came to mind.
>

Got it. Yes, unfortunately awk doesn't support backreferences in the matching RE 
so in non-gawk you'd have to do something like:

	match($0,/[a-z]+/)

then use substr() to save the matched text, truncate $0 with substr() again,
and call match() on the remainder, repeating until nothing more matches.
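
A rough sketch of that match()/substr() loop in portable awk follows. It makes some assumptions the thread leaves open: a "word" is a run of ASCII letters, two words are "adjacent" only when nothing but blanks separates them, and case matters. The name `dedupe_match` is made up for the example.

```shell
# Hypothetical helper: drop a word that repeats the previous word when
# only blanks separate the two. Other spacing and punctuation are kept.
dedupe_match() {
    awk '{
        rest = $0; out = ""; prev = ""
        while (match(rest, /[A-Za-z]+/)) {
            sep  = substr(rest, 1, RSTART - 1)     # text before this word
            word = substr(rest, RSTART, RLENGTH)   # the word itself
            rest = substr(rest, RSTART + RLENGTH)  # truncate, then match again
            if (word == prev && sep ~ /^ +$/)
                continue                           # adjacent duplicate: skip it
            out = out sep word
            prev = word
        }
        print out rest                             # rest holds any trailing text
    }'
}
```

For example, `echo "I saw the the  cat." | dedupe_match` prints `I saw the  cat.`, while `echo "the, the" | dedupe_match` leaves the line unchanged because a comma sits between the two words.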

In a newer version of gawk you could get cute with the split() function's new 
gawk-specific argument that's an array of field separators. Something like this:

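	# words[] receives the matched separators (here, the words themselves);
	# a split into c pieces yields c-1 of them.
	c = split($0,betweenWords,/[a-z]+/,words)
	for (i=1; i<c-1; i++)
	   if (words[i] == words[i+1])
	      print "found dup",words[i]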
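
The same split() idea can be pushed one step further to rebuild the line without the doubled words. This is a sketch that assumes gawk 4.0 or later (for the four-argument split()), treats a "word" as a run of ASCII letters, and only counts words as adjacent when blanks alone separate them. `dedupe_split` is a hypothetical name.

```shell
# Hypothetical helper built on gawk's four-argument split().
# between[1..n] holds the text around the words, words[1..n-1] the words.
dedupe_split() {
    gawk '{
        n = split($0, between, /[A-Za-z]+/, words)
        out = between[1]
        for (i = 1; i < n; i++) {
            # Skip a word that repeats its predecessor across a blank-only gap.
            if (i > 1 && words[i] == words[i-1] && between[i] ~ /^ +$/)
                continue
            out = out (i > 1 ? between[i] : "") words[i]
        }
        print out between[n]
    }'
}
```

For example, `echo "the the cat" | dedupe_split` prints `the cat`.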

Regards,

	Ed.

Reply Ed 1/14/2011 1:47:16 AM

>How does one delete duplicate adjacent words
>(leaving only one occurrence of the doubled word)
>using regex in gawk?

I believe that awk's regex machine always compiles to what is called a 
"finite state machine", and what you are asking for cannot be done with 
an FSM (it's almost identical to the textbook "match on 'ab', 'aabb', 
'aaabbb', 'aaaabbbb', etc." example of what an FSM cannot match against)

- Sam

-- 
#Sam Trenholme http://samiam.org -- Usenet user since September 1993#
######## My email address is at http://samiam.org/mailme.php ########
# The following script works around an annoyance in the Nano Editor #
cat | awk '{a=a $0 "\n";if($0 ~ /[a-zA-Z0-9]/){printf("%s",a);a=""}}'
Reply Sam 1/14/2011 2:01:21 AM

Both the dfa and regex code that gawk uses could support this feature
(they do for grep), but it's an extension that I chose not to add.

Technically, the point is correct, but the code can implement backrefs
for matching if one sets the right bit(s) when configuring the matchers.
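
For comparison, here is a sketch of the doubled-word match done with grep instead of gawk. It assumes GNU grep, whose -E mode accepts `\<`, `\>` and back-references as extensions; the function name `find_doubled` is made up.

```shell
# Hypothetical helper: list lines (with line numbers) that contain a
# lowercase word immediately repeated after one or more spaces.
# Assumes GNU grep; \1 refers back to the text matched by ([a-z]+).
find_doubled() {
    grep -nE '\<([a-z]+) +\1\>'
}
```

For example, `printf 'the the\nthe theory\n' | find_doubled` prints `1:the the`; the second line is not reported because `\>` requires the repeated text to end at a word boundary.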

In article <igoah2$fh2$1@Milagro.leafnode.foo>,
Sam Trenholme  <sam-readsig-4email@samiam.org> wrote:
>>How does one delete duplicate adjacent words
>>(leaving only one occurrence of the doubled word)
>>using regex in gawk?
>
>I believe that awk's regex machine always compiles to what is called a 
>"finite state machine", and what you are asking for cannot be done with 
>an FSM (it's almost identical to the textbook "match on 'ab', 'aabb', 
>'aaabbb', 'aaaabbbb', etc." example of what an FSM cannot match against)
>
>- Sam
>
>-- 
>#Sam Trenholme http://samiam.org -- Usenet user since September 1993#
>######## My email address is at http://samiam.org/mailme.php ########
># The following script works around an annoyance in the Nano Editor #
>cat | awk '{a=a $0 "\n";if($0 ~ /[a-zA-Z0-9]/){printf("%s",a);a=""}}'


-- 
Aharon (Arnold) Robbins 			arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
Reply arnold 1/14/2011 12:15:56 PM

>Both the dfa and regex code that gawk uses could support this feature
>(they do for grep), but it's an extension that I chose not to add.
>
>Technically, the point is correct, but the code can implement backrefs
>for matching if one sets the right bit(s) when configuring the matchers.

You know, I wonder how specific POSIX is about regular expressions and 
what regular expressions are OK or not (yes, I know, these days the 
slightly older POSIX is readily available for anyone on the web to read 
over at opengroup, but you have probably already gone over it).

Speaking of which, I would love to see POSIX updated.  When I last 
looked at it, it didn't have any networking beyond UUCP (which isn't 
quite dead; I actually found an ISP that still offers UUCP--but they no 
longer offer Usenet-over-UUCP).  I would love to see POSIX commands for 
modern TCP/IP networking, as well as a well defined interface for 
configuring a firewall (Linux has changed this.  Twice), connecting to a 
wireless network [1], and what not.

In terms of AWK, I would like to see POSIX expand AWK to have it have a 
built-in sort [2], a defined interface for matching [A-Z] in a non-C 
locale [3], and perhaps real multi-dimensional associative arrays if 
we're really ambitious [4].

But POSIX, alas, is moribund. [5]  I don't think we will ever see any 
changes made to the AWK language become part of the standard at this 
point.

- Sam

[1] The Linux command line interface is one way for WEP, and 
something completely different for WPA.  God help you if you want to 
configure a wireless access point from the Linux command line.

[2] The correct way to specify this interface is to use the variable 
WHINY_USERS.  Any other name would be wrong.

[3] I still contend that [A-Z] should only match on uppercase in all AWK 
implementations, or, barring that, a backwards-compatible command that 
can be put in BEGIN that says "make the locale C", such as "C_LOCALE=1"

[4] People with problems that need multi-dimensional AAs to solve should 
probably be using Perl or Python, though.

[5] The same, for better or for worse, can be said of Usenet, which is 
slowly dying--many groups I loved in the mid-1990s and early 2000s are 
now ghost towns.  Very sad.

-- 
#Sam Trenholme http://samiam.org -- Usenet user since September 1993#
######## My email address is at http://samiam.org/mailme.php ########
# The following script works around an annoyance in the Nano Editor #
cat | awk '{a=a $0 "\n";if($0 ~ /[a-zA-Z0-9]/){printf("%s",a);a=""}}'
Reply Sam 1/15/2011 2:06:07 AM

On Sat, 15 Jan 2011 02:06:07 +0000 (UTC) Sam Trenholme
<sam-readsig-4email@samiam.org> wrote:

> Speaking of which, I would love to see POSIX updated.

It was last updated in 2008.

> When I last looked at it, it didn't have any networking beyond UUCP
> (which isn't quite dead; I actually found an ISP that still offers
> UUCP--but they no longer offer Usenet-over-UUCP).  I would love to see
> POSIX commands for modern TCP/IP networking, as well as a well defined
> interface for configuring a firewall (Linux has changed this.  Twice),
> connecting to a wireless network [1], and what not.
> 
> In terms of AWK, I would like to see POSIX expand AWK to have it have a 
> built-in sort [2], a defined interface for matching [A-Z] in a non-C 
> locale [3], and perhaps real multi-dimensional associative arrays if 
> we're really ambitious [4].
> 
> But POSIX, alas, is moribund. [5]

Is it? By the looks of email volume and active bugs, it wouldn't seem so:

https://www.opengroup.org/sophocles/show_archive.tpl?CALLER=index.tpl&source=L&listname=austin-group-l

http://austingroupbugs.net/view_all_bug_page.php

> [3] I still contend that [A-Z] should only match on uppercase in all AWK 
> implementations, or, barring that, a backwards-compatible command that 
> can be put in BEGIN that says "make the locale C", such as "C_LOCALE=1"

You can just say LC_ALL=C awk etc.
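
A minimal sketch of that, assuming a POSIX-style shell: prefixing the command with `LC_ALL=C` sets the locale for that one awk invocation only. The function name `upper_only` is just for illustration.

```shell
# Hypothetical helper: print only lines made up entirely of A-Z.
# LC_ALL=C applies to this awk run alone, so [A-Z] is the plain
# ASCII uppercase range regardless of the caller's locale.
upper_only() {
    LC_ALL=C awk '/^[A-Z]+$/'
}
```

For example, `printf 'ABC\nabc\nAbC\n' | upper_only` prints `ABC`.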

Reply pk 1/15/2011 12:18:44 PM

In article <igs4bc$nus$1@speranza.aioe.org>, pk  <pk@pk.invalid> wrote:
....
>> [3] I still contend that [A-Z] should only match on uppercase in all AWK 
>> implementations, or, barring that, a backwards-compatible command that 
>> can be put in BEGIN that says "make the locale C", such as "C_LOCALE=1"
>
>You can just say LC_ALL=C awk etc.

From the C:\> prompt?

I think previous poster's point is valid - that there should be a way to
do it in the AWK language itself.  Not a big deal, of course, as there
are always workarounds - but valid nevertheless.

-- 
Just for a change of pace, this sig is *not* an obscure reference to
comp.lang.c...

Reply gazelle 1/15/2011 1:42:13 PM
