Hi,
I believe I've found a bug in the GNU-awk sub/gsub function
(w.r.t POSIX specification at least) involving backslash
interpretation in the replacement expression, example;
gsub(".","\\\\")
INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
also,
sub(".","\\\\")
INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
The replacement expression is '\\' and the standard says that a
double backslash should represent a backslash,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ POSIX1003.2-2001 Utilities - awk ]
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of
the extended regular expression ERE in string in and return the
number of substitutions. An ampersand ( '&' ) appearing in the
string repl shall be replaced by the string from in that
-> matches the ERE. An ampersand preceded with a backslash ( '\' )
-> shall be interpreted as the literal ampersand character. An
-> occurrence of two consecutive backslashes shall be interpreted
-> as just a single literal backslash character. Any other
occurrence of a backslash (for example, preceding any other
character) shall be treated as a literal backslash character.
Note that if repl is a string literal (the lexical token
STRING; see Grammar ), the handling of the ampersand character
occurs after any lexical processing, including any lexical
backslash escape sequence processing. If in is specified and it
is not an lvalue (see Expressions in awk ), the behavior is
undefined. If in is omitted, awk shall use the current record
($0) in its place.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
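The rule quoted above can be sketched as a hand-rolled awk function, so that it does not depend on any implementation's own sub/gsub. This is only a model of one reading of the 2001 text, not gawk's behaviour. The names posix_repl and MATCHED are made up for illustration.

```shell
# Hypothetical helper (not part of any awk): applies the quoted rule to a
# replacement string as sub() would see it, i.e. after lexical processing.
#   $1 = replacement text, $2 = text matched by the ERE
posix_repl() {
    printf '%s\n' "$1" | MATCHED="$2" awk '
    function expand(repl, matched,    out, i, n, c, d) {
        n = length(repl)
        out = ""
        for (i = 1; i <= n; i++) {
            c = substr(repl, i, 1)
            if (c == "\\" && i < n) {
                d = substr(repl, i + 1, 1)
                if (d == "&" || d == "\\") {
                    # backslash-ampersand and double backslash collapse
                    # to the second character
                    out = out d
                    i++
                    continue
                }
                out = out c          # any other backslash stays literal
            } else if (c == "&") {
                out = out matched    # bare ampersand: the matched text
            } else {
                out = out c
            }
        }
        return out
    }
    { print expand($0, ENVIRON["MATCHED"]) }'
}

posix_repl '\\'  x    # prints \
posix_repl '\\&' x    # prints \x
posix_repl '\\a' x    # prints \a
```

The replacement is passed on standard input and the matched text through the environment because awk -v assignments would apply their own escape processing to the backslashes.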
I have only tested this with version _3.0.5_ of GNU-awk so maybe it
has been fixed since but otherwise it is a very annoying problem!
best wishes
laura
--
echo alru_aafriehdab@ittnreen.tocm |sed 's/\(.\)\(.\)/\2\1/g'
laura_fairhead | 11/5/2003 6:09:40 PM

On 11/5/2003 12:09 PM, laura fairhead wrote:
> Hi,
>
> I believe I've found a bug in the GNU-awk sub/gsub function
> (w.r.t POSIX specification at least) involving backslash
> interpretation in the replacement expression, example;
>
> gsub(".","\\\\")
>
> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>
> also,
>
> sub(".","\\\\")
>
> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>
> The replacement expression is '\\' and the standard says that a
> double backslash should represent a backslash,
But you're replacing it with 2 double-backslashes, so I'd expect the output to
be 2 backslashes as you're seeing. What am I missing?
Ed.
Ed | 11/5/2003 6:14:45 PM

In article <slrnbqif6v.e8.laura_fairhead@bell486.bittersweet.org>,
laura fairhead wrote:
>Hi,
>I have only tested this with version _3.0.5_ of GNU-awk so maybe it
>has been fixed since but otherwise it is a very annoying problem!
Ok, I just downloaded and compiled version 3.1.2 and it still has the
same problem.
regards
laura
--
echo alru_aafriehdab@ittnreen.tocm |sed 's/\(.\)\(.\)/\2\1/g'
laura_fairhead | 11/5/2003 7:04:13 PM

In article <3FA93E15.50209@lucent.com>, Ed Morton wrote:
>
>
>On 11/5/2003 12:09 PM, laura fairhead wrote:
>> Hi,
>>
>> I believe I've found a bug in the GNU-awk sub/gsub function
>> (w.r.t POSIX specification at least) involving backslash
>> interpretation in the replacement expression, example;
>>
>> gsub(".","\\\\")
>>
>> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>>
>> also,
>>
>> sub(".","\\\\")
>>
>> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>>
>> The replacement expression is '\\' and the standard says that a
>> double backslash should represent a backslash,
>
>But you're replacing it with 2 double-backslashes, so I'd expect the output to
>be 2 backslashes as you're seeing. What am I missing?
The lexical processing of the string literal,
q="\\\\"
print q
The string literal itself represents only 2 backslashes.
All literal strings in awk are subject to the same pre-processing
rules, (the relevant section of the std. is awk/grammar/STRING)
so you can use \n \t \b etc
It just means that in cases like this the text data is actually
parsed twice - once by the string parser (logically before any command
is executed) and then for '\\', '\&' and '&' by sub/gsub ..... or at
least that _should_ be the case, gawk sub/gsub doesn't process the
'\\' properly.
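The first of the two parsing stages can be checked on its own, without sub or gsub. A small sketch (lex_len is a hypothetical wrapper name): it prints the string that survives lexical processing, then its length.

```shell
# Show what the awk string literal "\\\\" holds after lexical processing.
lex_len() {
    awk 'BEGIN { q = "\\\\"; print q; print length(q) }'
}

lex_len
# prints \\ on the first line and 2 on the second
```

So the four backslashes typed in the source reach sub/gsub as two, which is the string the second stage then has to interpret.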
It's _weird_ because it actually gets '\\&' right but not '\\a' so
gsub(".","\\\\")    INPUT: x
                    EXPECTED OUTPUT: \
                    ACTUAL OUTPUT: \\

gsub(".","\\\\&")   INPUT: x
                    EXPECTED OUTPUT: \x
                    ACTUAL OUTPUT: \x

gsub(".","\\\\a")   INPUT: x
                    EXPECTED OUTPUT: \a
                    ACTUAL OUTPUT: \\a
As if having the '&' caused the parser to actually go back and change
the rules on what was before!
I tried more tests with the following one-liner;
awk '{t=$0; $0="x"; gsub(".",t); printf "%-16s %s\n",t,$0;}' testfile
then testfile contains the pattern only ;
\\
\\&
\\a
\\\\
etc...,,
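For anyone who wants to repeat the test without creating testfile, here is a self-contained variant of the same one-liner. It is only a sketch: run_harness is a made-up name, and the second column depends on which awk and which version is installed, so no output is shown.

```shell
# Feed the four patterns listed above straight into the one-liner.
# printf with %s passes the backslashes through untouched, so each line
# reaches gsub exactly as written (no lexical processing by awk).
run_harness() {
    printf '%s\n' '\\' '\\&' '\\a' '\\\\' |
    awk '{ t = $0; $0 = "x"; gsub(".", t); printf "%-16s %s\n", t, $0 }'
}

run_harness
```

The first column is the replacement as sub/gsub sees it and the second is what the implementation under test produced from the input x.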
byefornow
laura
laura_fairhead | 11/5/2003 7:53:34 PM

laura fairhead wrote:
> In article <slrnbqif6v.e8.laura_fairhead@bell486.bittersweet.org>,
> laura fairhead wrote:
>
>>Hi,
>>I have only tested this with version _3.0.5_ of GNU-awk so maybe it
>>has been fixed since but otherwise it is a very annoying problem!
>
>
> Ok, I just downloaded and compiled version 3.1.2 and it still has the
> same problem.
>
Try to see if section 8.1.3.1 in the GNU-awk manual covers this topic.
It's about backslashes wrt to sub, gsub and gensub. It states some
inconsistency between historical implementations, gawk and POSIX.
--
Vale !
Christianus Auriocus
Christian | 11/5/2003 9:48:25 PM

In article <3FA97029.1060901@btcips73x1.cip.uni-bayreuth.de>,
Christian Gollwitzer wrote:
>laura fairhead wrote:
>> In article <slrnbqif6v.e8.laura_fairhead@bell486.bittersweet.org>,
>> laura fairhead wrote:
>>
>>>Hi,
>>>I have only tested this with version _3.0.5_ of GNU-awk so maybe it
>>>has been fixed since but otherwise it is a very annoying problem!
>>
>>
>> Ok, I just downloaded and compiled version 3.1.2 and it still has the
>> same problem.
>>
>
>
>Try to see if section 8.1.3.1 in the GNU-awk manual covers this topic.
>It's about backslashes wrt to sub, gsub and gensub. It states some
>inconsistency between historical implementations, gawk and POSIX.
o.i.c, it seems they don't agree with the standard. That is all very
well and good because I know GNU-awk is not POSIX compliant by default,
however it should be, as documented, after a "--posix", and yet the
same defective behaviour remains. Well, FSF get my vote as the next Microsoft,
they seem to have their own ideas, good luck to them
thanx,
laura
>
>
>--
>Vale !
> Christianus Auriocus
>
--
echo alru_aafriehdab@ittnreen.tocm |sed 's/\(.\)\(.\)/\2\1/g'
laura_fairhead | 11/6/2003 12:22:24 AM

On Wed, 5 Nov 2003 19:53:34 +0000, laura_fairhead@INVALID.com (laura
fairhead) wrote:
>In article <3FA93E15.50209@lucent.com>, Ed Morton wrote:
>>
>>
>>On 11/5/2003 12:09 PM, laura fairhead wrote:
>>> Hi,
>>>
>>> I believe I've found a bug in the GNU-awk sub/gsub function
>>> (w.r.t POSIX specification at least) involving backslash
>>> interpretation in the replacement expression, example;
>>>
>>> gsub(".","\\\\")
>>>
>>> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>>>
>>> also,
>>>
>>> sub(".","\\\\")
>>>
>>> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>>>
>>> The replacement expression is '\\' and the standard says that a
>>> double backslash should represent a backslash,
>>
>>But you're replacing it with 2 double-backslashes, so I'd expect the output to
>>be 2 backslashes as you're seeing. What am I missing?
>
>The lexical processing of the string literal,
>
>q="\\\\"
>print q
>
>The string literal itself represents only 2 backslashes.
>
>All literal strings in awk are subject to the same pre-processing
>rules, (the relevant section of the std. is awk/grammar/STRING)
>so you can use \n \t \b etc
>
>It just means that in cases like this the text data is actually
>parsed twice - once by the string parser (logically before any command
>is executed) and then for '\\', '\&' and '&' by sub/gsub ..... or at
>least that _should_ be the case, gawk sub/gsub doesn't process the
>'\\' properly.
>
>It's _weird_ because it actually gets '\\&' right but not '\\a' so
>
>gsub(".","\\\\") INPUT: x
> EXPECTED OUTPUT: \
> ACTUAL OUTPUT: \\
>
>gsub(".","\\\\&") INPUT: x
> EXPECTED OUTPUT: \x
> ACTUAL OUTPUT: \x
>
>gsub(".","\\\\a") INPUT: x
> EXPECTED OUTPUT: \a
> ACTUAL OUTPUT: \\a
>
>As if having the '&' caused the parser to actually go back and change
>the rules on what was before!
>
>I tried more tests with the following one-liner;
>
>awk '{t=$0; $0="x"; gsub(".",t); printf "%-16s %s\n",t,$0;}' testfile
>
>then testfile contains the pattern only ;
>\\
>\\&
>\\a
>\\\\
>etc...,,
See the help for gawk -> "String functions" -> "More About \ and &
with sub, gsub and gensub"
--------------------------------------------
You type    sub sees    sub generates
--------    --------    -------------
\&          &           the matched text
\\&         \&          a literal &
\\\&        \&          a literal &
\\\\&       \\&         a literal \&
\\\\\&      \\&         a literal \&
\\\\\\&     \\\&        a literal \\&
\\q         \q          a literal \q
--------------------------------------------
Remember you have to use a command line switch to force POSIX
compliance, otherwise you get gawk specific behavior.
T.E.D. (tdavis@gearbox.maem.umr.edu - e-mail must contain "T.E.D." or my .sig in the body)
Ted | 11/6/2003 2:53:08 AM

laura_fairhead@INVALID.com (laura fairhead) writes:
> I believe I've found a bug in the GNU-awk sub/gsub function
> (w.r.t POSIX specification at least) involving backslash
> interpretation in the replacement expression, example;
>
> gsub(".","\\\\")
>
> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>
> also,
>
> sub(".","\\\\")
>
> INPUT: x EXPECTED OUTPUT: \ ACTUAL OUTPUT: \\
>
> The replacement expression is '\\' and the standard says that a
> double backslash should represent a backslash,
Yeah. This is one place where the standard was changed in
the latest (2001) version - for the very good reason that
the old wording was seriously ambiguous, and various
implementations had interpreted it differently,
although it can be debated if the solution was ideal.
It will certainly break backwards-compatibility in some cases
(although so would've any other choice on some platform).
I guess gawk hasn't caught up with the change yet, but neither
have any other awks I've tried (not too many, I admit).
--
Tapani Tarvainen
Tapani | 11/6/2003 6:57:23 AM

laura_fairhead@INVALID.com (laura fairhead) writes:
> In article <3FA97029.1060901@btcips73x1.cip.uni-bayreuth.de>,
> Christian Gollwitzer wrote:
> >Try to see if section 8.1.3.1 in the GNU-awk manual covers this topic.
> >It's about backslashes wrt to sub, gsub and gensub. It states some
> >inconsistency between historical implementations, gawk and POSIX.
>
> o.i.c, it seems they don't agree with the standard. That is all very
> well and good because I know GNU-awk is not POSIX compliant by default,
> however it should be, as documented, after a "--posix", and yet the
> same defective behaviour remains.
Well, in this case it is somewhat understandable, given the
way standards have been changing. I believe the current gawk
behaviour is compatible with some version of POSIX, and I'm
sure they'll catch up when (or if) the standard stabilises
(although possibly requiring --posix or another, new switch
to maintain backwards compatibility).
Do you (or does anyone) know which current awk implementations
get this right?
--
Tapani Tarvainen
Tapani | 11/6/2003 7:04:43 AM

In article <slrnbqj51s.g5.laura_fairhead@bell486.bittersweet.org>,
>>> Ok, I just downloaded and compiled version 3.1.2 and it still has the
>>> same problem.
3.1.3 is what's current, although this code hasn't changed.
>>Try to see if section 8.1.3.1 in the GNU-awk manual covers this topic.
>>It's about backslashes wrt to sub, gsub and gensub. It states some
>>inconsistency between historical implementations, gawk and POSIX.
>
>o.i.c, it seems they don't agree with the standard. That is all very
>well and good because I know GNU-awk is not POSIX compliant by default,
>however it should be, as documented, after a "--posix", and yet the
>same defective behaviour remains. Well, FSF get my vote as the next Microsoft,
>they seem to have their own ideas, good luck to them
Gee, that's a real nice way to say "thank you" for the years of time I've
spent on this, without receiving a dime. Nothing is stopping you
from using either of the other two freely available source code awks,
or from spending money on Kenny's beloved tawk. Why don't you go do that?
Not to mention that this is Free Software: you have the source code,
submit a patch.
Also note that the POSIX standard has changed in this area. The 1992
standard was quite brain-damaged, so I chose a more reasonable route
with gawk. The new standard is fairly recent and I haven't gotten
around to making a change here. I still think the standard is broken,
although less so.
Sheesh.
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd. arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 530 688 5518
Nof Ayalon Cell Phone: +972 51 297-545
D.N. Shimshon 99785 ISRAEL
arnold | 11/6/2003 7:18:34 AM

Dear customer,
Our dear customer Laura wrote:
>> Well, FSF get my vote as the next Microsoft,
>> they seem to have their own ideas, good luck to them
we are committed here to bring you the freshest software ever.
Below is a patch which breaks the backward compatibility of gawk
and implements what standards dictate as the only colour for this winter.
(Please note that backward compatibility is not required by MS POSIX.)
Stepan Kasal, Microsoft gawk department
(Invoice follows, in accordance with the terms of service you clicked
through on our MS FSF page.)
In article <3faa03da$1@news.012.net.il>, Aharon Robbins wrote:
> Also note that the POSIX standard has changed in this area. The 1992
> standard was quite brain-damaged, so I chose a more reasonable route
> with gawk. The new standard is fairly recent and I haven't gotten
Thank you and Tapani for explaining this to us.
*** builtin.c.orig Thu Nov 6 14:14:27 2003
--- builtin.c Thu Nov 6 14:33:33 2003
***************
*** 2068,2089 ****
repllen--;
scan++;
}
! } else { /* (proposed) posix '96 mode */
! if (strncmp(scan, "\\\\\\&", 4) == 0) {
! /* \\\& --> \& */
! repllen -= 2;
! scan += 3;
! } else if (strncmp(scan, "\\\\&", 3) == 0) {
! /* \\& --> \<string> */
! ampersands++;
! repllen--;
! scan += 2;
! } else if (scan[1] == '&') {
! /* \& --> & */
repllen--;
scan++;
! } /* else
! leave alone, it goes into the output */
}
}
}
--- 2068,2079 ----
repllen--;
scan++;
}
! } else { /* posix mode */
! /* \& --> &, \\ --> \ */
! if (scan[1] == '&' || scan[1] == '\\') {
repllen--;
scan++;
! }
}
}
}
***************
*** 2161,2184 ****
scan++;
} else /* \q for any q --> q */
*bp++ = *++scan;
! } else { /* posix '96 mode, bleah */
! if (strncmp(scan, "\\\\\\&", 4) == 0) {
! /* \\\& --> \& */
! *bp++ = '\\';
! *bp++ = '&';
! scan += 3;
! } else if (strncmp(scan, "\\\\&", 3) == 0) {
! /* \\& --> \<string> */
! *bp++ = '\\';
! for (cp = matchstart; cp < matchend; cp++)
! *bp++ = *cp;
! scan += 2;
! } else if (scan[1] == '&') {
! /* \& --> & */
! *bp++ = '&';
scan++;
! } else
! *bp++ = *scan;
}
} else
*bp++ = *scan;
--- 2151,2161 ----
scan++;
} else /* \q for any q --> q */
*bp++ = *++scan;
! } else { /* posix mode */
! /* \& --> &, \\ --> \ */
! if (scan[1] == '&' || scan[1] == '\\')
scan++;
! *bp++ = *scan;
}
} else
*bp++ = *scan;
Stepan | 11/7/2003 5:39:25 PM

Aharon Robbins <arnold@skeeve.com> wrote on 6 Nov 2003 10:18:34 +0300:
> Gee, that's a real nice way to say "thank you" for the years of time
> I've spent on this, without receiving a dime. Nothing is stopping you
> from using either of the other two freely available source code awks,
> or from spending money on Kenny's beloved tawk. Why don't you go do
> that?
For what it's worth, Aharon, I like gawk, and I like it a _lot_. I don't
use it a great deal, don't do very sophisticated things with it (unless
you count extracting info from a leafnode .overview file sophisticated),
yet - without AWK, I'd not be able to do these simple things at all. Or
I'd have to learn **rl. :-(
Thank you indeed for making gawk what it is.
> Aharon (Arnold) Robbins
--
Alan Mackenzie (Munich, Germany)
Email: aacm@muuc.dee; to decode, wherever there is a repeated letter
(like "aa"), remove half of them (leaving, say, "a").
Alan | 11/11/2003 8:16:30 PM