Compare to '0A'x


When doing a non-strict compare of a null or blank string to hex 0A it
comes up as being equal, using Regina in Win32. Can somebody please
explain why that should be?

Here's the code snippet.
Thanks,
Steve

/* a = null or any length blank string -- results are the same */

a = '     '
say  a = '0A'x     /* returns 1 */
say  a == '0A'x    /* returns 0 */

exit

Reply steve 2/27/2005 6:51:30 AM

<steve.weisberg@gmail.com> wrote in message
news:1109487090.072681.152140@l41g2000cwc.googlegroups.com...
>
> WHen doing non-stric compare of a null or blank screen
> to hex 0A it comes up as being equal, using Regina in
> Win32. Can somebody please explain why that should be?
>
> Here's the code snipit.
> Thanks,
> Steve
>
> /* a = null or any length blank string -- results are the same */
>
> a = '     '
> say  a = '0A'x     /* returns 1 */
> say  a == '0A'x    /* returns 0 */
>

It all has to do, I believe, with two things:

* How the term 'blank' is defined
* What happens in non-strict comparison operations

On an ASCII platform the following characters make up the set that is
considered 'blank':

   '09'x -> HT [horizontal tab]
   '0A'x -> LF [line feed]
   '0B'x -> VT [vertical tab]
   '0C'x -> FF [form feed]
   '0D'x -> CR [carriage return]
   '20'x -> space

Based on this, any single one of these should, non-strictly, equal any other
in the set. A run of several such characters can also, non-strictly, equal a
run of a different length, because leading and trailing blanks are stripped
before the comparison is made. Put simply, if two strings containing nothing
but blanks are non-strictly compared, all the characters will [I believe] end
up being removed, and the comparison will actually consist of comparing two
'empty' strings.
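
To make the rule concrete, here is a minimal sketch of a non-strict '=' for
non-numeric strings, with the set of "blank" characters passed in explicitly.
The routine names looseEqual and trimSet are made up for illustration, and the
six-character set is only an assumption about what Regina strips by default;
other interpreters may strip '20'x alone.

```rexx
/* Sketch only: models a non-strict '=' for NON-numeric operands.      */
/* looseEqual and trimSet are hypothetical helpers, not built-ins.     */
blankOnly = ' '
wideSet   = ' ' || '090A0B0C0D'x   /* assumed: SP plus HT LF VT FF CR  */

say looseEqual('     ', '0A'x, blankOnly)   /* 0: the LF is kept       */
say looseEqual('     ', '0A'x, wideSet)     /* 1: both become empty    */
exit

looseEqual: procedure
  parse arg lhs, rhs, blanks
  lhs = trimSet(lhs, blanks)
  rhs = trimSet(rhs, blanks)
  /* pad the shorter operand with blanks on the right, then compare    */
  width = max(length(lhs), length(rhs))
  return left(lhs, width) == left(rhs, width)

trimSet: procedure
  parse arg s, set
  first = verify(s, set)           /* first char NOT in set, 0 if none */
  if first = 0 then return ''
  last = length(s) - verify(reverse(s), set) + 1
  return substr(s, first, last - first + 1)
```

Passing the blank set as an argument shows the two behaviours side by side
without touching any interpreter options.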

There's a pretty in-depth discussion of this topic in the newsgroup
archive - Google using:

    Blanks, REXX, and portability group:*rexx*

I hope this helps.

Anthony Borla


Reply Anthony 2/27/2005 12:26:30 PM


steve.weisberg@gmail.com wrote:

> a = '     '
> say  a = '0A'x     /* returns 1 */

That's strictly incorrect for TRL:  all leading and trailing
blanks are ignored, the shorter string is padded with blanks
on the right.  Anthony's explanation is nice, only he's not
talking about REXX and blanks, but some other language and
white space.  Or, if they changed this in ANSI REXX, then they
should be shot (blasphemy, among other crimes).

                   Bye, Frank (TRL page 26)

Reply Frank 2/28/2005 2:40:45 AM

Both return 0 with Windows Object Rexx 2.1.3 - will try it on ooRexx
3.0.0 when I get to the office on Monday.

Lee

On 26 Feb 2005 22:51:30 -0800, steve.weisberg@gmail.com wrote:

>WHen doing non-stric compare of a null or blank screen to hex 0A it
>comes up as being equal, using Regina in Win32. Can somebody please
>explain why that should be?
>
>Here's the code snipit.
>Thanks,
>Steve
>
>/* a = null or any length blank string -- results are the same */
>
>a = '     '
>say  a = '0A'x     /* returns 1 */
>say  a == '0A'x    /* returns 0 */
>
>exit

Reply Lee 2/28/2005 3:12:47 AM

"Frank Ellermann" <nobody@xyzzy.claranet.de> wrote in message
news:422284AD.6193@xyzzy.claranet.de...
> steve.weisberg@gmail.com wrote:
>
> > a = '     '
> > say  a = '0A'x     /* returns 1 */
>
> That's strictly incorrect for TRL:  all leading and trailing
> blanks are ignored, the shorter string is padded with blanks
> on the right.  Anthony's explanation is nice, only he's not
> talking about REXX and blanks, but some other language and
> white space.  Or they changed this in ANSI REXX, then they
> should be be shot (blasphemy, among other other crimes).
>
>                    Bye, Frank (TRL page 26)
>

I didn't check what the expected behaviour is according to any of the
standards documents.

The facts for my explanation I obtained from the link I supplied in my
previous post, and I verified this behaviour using Regina under both Win32
and *NIX [of course, this only proves that Regina behaves this way, and not
that all REXX implementations do :)].

To me it seems a reasonable sort of behaviour for a non-strict comparison
operation, especially given that if a more precise type of comparison is
needed, a strict comparison operation is also available for use. However, I
haven't analysed this issue in depth, and have no firm views on what is the
'best' approach to this issue.

Cheers,

Anthony Borla


Reply Anthony 2/28/2005 6:01:13 AM

In REXX I get the following

[C:\home\default]rexxtry a = '    '; say a = '0A'x
0
   ................................................ REXXTRY.CMD on OS/2

[C:\home\default]


Lee Peedin wrote:
> Both return 0 with Windows Object Rexx 2.1.3 - will try it on ooRexx
> 3.0.0 when I get to the office on Monday.
> 
> Lee
> 
> On 26 Feb 2005 22:51:30 -0800, steve.weisberg@gmail.com wrote:
> 
> 
>>WHen doing non-stric compare of a null or blank screen to hex 0A it
>>comes up as being equal, using Regina in Win32. Can somebody please
>>explain why that should be?
>>
>>Here's the code snipit.
>>Thanks,
>>Steve
>>
>>/* a = null or any length blank string -- results are the same */
>>
>>a = '     '
>>say  a = '0A'x     /* returns 1 */
>>say  a == '0A'x    /* returns 0 */
>>
>>exit
> 
> 
Reply Dennis 2/28/2005 6:55:06 AM

Hi Steve

steve.weisberg@gmail.com schrieb:
> WHen doing non-stric compare of a null or blank screen to hex 0A it
> comes up as being equal, using Regina in Win32. Can somebody please
> explain why that should be?
> 
> Here's the code snipit.
> Thanks,
> Steve
> 
> /* a = null or any length blank string -- results are the same */
> 
> a = '     '
> say  a = '0A'x     /* returns 1 */
> say  a == '0A'x    /* returns 0 */
> 
> exit
> 

This question comes up every 3 or 4 weeks.

Regina started in ancient days using a sloppy comparison where
space = linefeed (but with space \== linefeed).

ANSI says that only spaces should be ignored during comparison,
so Regina changed its behaviour during a beta phase. But many
"customers" with long-running scripts complained about this,
so we decided to keep the old behaviour as the default. You can
select the ANSI behaviour by setting an OPTIONS string.

Set one of

OPTIONS STRICT_WHITE_SPACE_COMPARISON
OPTIONS STRICT_ANSI
OPTIONS ANSI

according to the documentation. You can use the environment variable
REGINA_OPTIONS to apply option changes globally.
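
As a rough, Regina-specific sketch of what that looks like in a script: the
option name below is copied from this post as written, so treat its exact
spelling as an assumption and check it against the documentation of your
Regina release. An interpreter that does not recognise an OPTIONS word will
normally just ignore it, in which case the first comparison keeps its old
result.

```rexx
/* Regina-specific sketch; option spelling taken from the post above   */
/* and NOT verified against any particular Regina release.             */
options 'STRICT_WHITE_SPACE_COMPARISON'

a = '     '
say a = '0A'x     /* expected to show 0 once only blanks are ignored   */
say a == '0A'x    /* strict comparison: 0 with or without the option   */
```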

Cheers, Florian


NB: TRL is less interesting to Regina. ANSI is the standard for
     Regina and that's it. If there is a difference between
     ANSI and TRL, Regina will always honour ANSI style. This
     is no blasphemy.
Reply Florian 2/28/2005 7:00:31 AM

On Mon, 28 Feb 2005 08:00:31 +0100, Florian Große-Coosmann got out a
palm pilot and hastily scribbled some words to post to comp.lang.rexx:
>ANSI says that only spaces should be ignored during comparison,

It doesn't.  Here's my reasoning:

* Section 7.4.7, on the value of a comparison, contains a script which
  calls strip() on each side of the comparison.
* Section 9.3.20, the STRIP function, defaults the third argument
  to the value of #AllBlanks.
* Section 8.2.1, on program initialization, sets #AllBlanks to
  ' ' concatenated with the result of Config_OtherBlankCharacters.
* Config_OtherBlankCharacters is defined in section 5.4.2 to return
  the configuration's other_blank_characters (a possibly empty list
  of characters which the configuration chooses to be equivalent to
  blanks).
* Section 5.3.3 describes other_blank_characters and lists CR, LF,
  FF, VT and HT as [the only] possible members of this category.

Note also that words() and the other word-related functions use #AllBlanks
too.
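
Whichever text of the standard is right, a given interpreter can be probed
directly. The following is a small sketch rather than a conformance test; it
prints 1 or 0 for each question, and the answers are expected to differ
between interpreters and option settings, so no results are shown here.

```rexx
/* Probe how the running interpreter treats a TAB character.           */
/* Output is interpreter-dependent by design.                          */
tab = '09'x

say 'strip() removes TAB:  ' (strip(tab) == '')
say 'words() splits on TAB:' (words('a' || tab || 'b') = 2)
say 'TAB = empty string:   ' (tab = '')

parse value tab with first .
say 'parse drops TAB:      ' (first == '')
```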
-- 
---- Ian Collier : imc@comlab.ox.ac.uk : WWW page (including REXX section):
------ http://users.comlab.ox.ac.uk/ian.collier/imc.shtml

New to this group?  Answers to frequently-asked questions can be had from
http://rexx.hursley.ibm.com/rexx/ .
Reply imc 2/28/2005 12:22:30 PM

On Mon, 28 Feb 2005 03:40:45 +0100, Frank Ellermann got out a palm pilot
and hastily scribbled some words to post to comp.lang.rexx:
>steve.weisberg@gmail.com wrote:
>> a = '     '
>> say  a = '0A'x     /* returns 1 */

>That's strictly incorrect for TRL:  all leading and trailing
>blanks are ignored, the shorter string is padded with blanks
>on the right. 

Yes, but what is a blank?  I don't think it says anywhere in TRL
that the word "blank" refers to ASCII 32 and nothing else.

>not talking about REXX and blanks, but some other language and
>white space.  Or they changed this in ANSI REXX, then they
>should be be shot (blasphemy, among other other crimes).

They did, and I see no particular reason why they were wrong
to do so.

Reply imc 2/28/2005 12:25:42 PM

Not exactly, I'd say.
The STRIP function accepts only ONE character as its third argument.
So strip(string1,,' ')==string2 should give the same as
string1=string2 (except for numbers, e.g., 1E0 is not == 1, that is)

Oh well

Walter

"Ian Collier" <imc@comlab.ox.ac.uk> schrieb im Newsbeitrag 
news:27414-prophecy.imc@comlab.ox.ac.uk...
> On Mon, 28 Feb 2005 08:00:31 +0100, Florian Große-Coosmann got out a
> palm pilot and hastily scribbled some words to post to comp.lang.rexx:
>>ANSI says that only spaces should be ignored during comparison,
>
> It doesn't.  Here's my reasoning:
>
> * Section 7.4.7, on the value of a comparison, contains a script which
>  calls strip() on each side of the comparison.
> * Section 9.3.20, the STRIP function, defaults the third argument
>  to the value of #AllBlanks.
> * Section 8.2.1, on program initialization, sets #AllBlanks to
>  ' ' concatenated with the result of Config_OtherBlankCharacters.
> * Config_OtherBlankCharacters is defined in section 5.4.2 to return
>  the configurations other_blank_characters (a possibly empty list
>  of characters which the configuration chooses to be equivalent to
>  blanks.
> * Section 5.3.3 describes other_blank_characters and lists CR, LF,
>  FF, VT and HT as [the only] possible members of this category.
>
> Note also that words() and the other word-related functions use #AllBlanks
> too.


Reply Walter 2/28/2005 12:48:47 PM

A short while ago I became aware that Walter u. Christel Pachl had
written:
>Not exactly, I'd say.
>The Strip-function accepts only ONE Character as third argument

True, but if you look at the code - the default value for the third
argument is #AllBlanks, which can contain more than one character.
This is of course the only way in which it's possible to have a
multi-character string as the third argument of strip().

Reply imc 2/28/2005 1:36:13 PM

I beg to differ:-)

The Standard has lines like...

If the third arg is omitted then
  Unwanted=#AllBlanks
Else
  Unwanted is the third argument (one character, i.e., the blank in my example)

and later pos(left(string,1),Unwanted)>0...

strip(string,,' '||CR||LF) would definitely raise SYNTAX (untested)

Walter

"Ian Collier" <imc@comlab.ox.ac.uk> schrieb im Newsbeitrag 
news:27416-cyclone.imc@comlab.ox.ac.uk...
>A short while ago I became aware that Walter u. Christel Pachl had
> written:
>>Not exactly, I'd say.
>>The Strip-function accepts only ONE Character as third argument
>
> True, but you look at the code - the default value for the third
> argument is #AllBlanks, which can contain more than one character.
> This is of course the only way in which it's possible to have a
> multi-character string as the third argument of strip().


Reply Walter 2/28/2005 3:00:17 PM

Ian Collier schrieb:
> On Mon, 28 Feb 2005 08:00:31 +0100, Florian Große-Coosmann got out a
> palm pilot and hastily scribbled some words to post to comp.lang.rexx:
> 
>>ANSI says that only spaces should be ignored during comparison,
> 
> 
> It doesn't.  Here's my reasoning:
> 
> * Section 7.4.7, on the value of a comparison, contains a script which
>   calls strip() on each side of the comparison.

My apologies. It seems I have an old version of the ANSI standard.
My current version is a PDF file named J18PUB.PM5 created on July 6th, 1998.
It contains these lines in 7.4.7 for "compare normal", which led to our
assumption:
<snip>
if datatype(lhs)\== 'NUM' | datatype(rhs)\== 'NUM' then do
    /* Non-numeric non-strict comparison */
    lhs=strip(lhs, 'B', ' ') /* ExtraBlanks not stripped */
    rhs=strip(rhs, 'B', ' ')
    if length(lhs)>length(rhs) then rhs=left(rhs,length(lhs))
                               else lhs=left(lhs,length(rhs))
    if lhs>>rhs then #Test='G'
                else if lhs<<rhs then #Test='L'
                                 else #Test='E'
    end
else do /* Numeric comparison */
    if left(-lhs,1) == '-' & left(+rhs,1) \== '-' then #Test='G'
    else if left(-rhs,1) == '-' & left(+lhs,1) \== '-' then #Test='L'
       else do
          Difference=lhs - rhs /* Will never raise an arithmetic condition. */
          if Difference > 0 then #Test='G'
                            else if Difference < 0 then #Test='L'
                                                   else #Test='E'
          end
    end
</snip>

This version still contains ' ' as the third argument to strip(),
which led to my assumption.

Do you know where to get a cheap version of the standard? I don't
want to buy an expensive paper from ANSI.

Cheers, Florian
Reply Florian 2/28/2005 5:12:57 PM

On Mon, 28 Feb 2005 18:12:57 +0100, Florian Große-Coosmann got out a
palm pilot and hastily scribbled some words to post to comp.lang.rexx:
>Ian Collier schrieb:
>> * Section 7.4.7, on the value of a comparison, contains a script which
>>   calls strip() on each side of the comparison.

>My apologies. It seems I have an old version of the ANSI standard.

Hmph... maybe it is I who have an old version!  I have a printed-out
copy in my office which I had assumed was more or less the same as the
final version.  I do have an official copy but I am not sure of its
current whereabouts.  So I apologise for posting (possibly) wrong
information in this thread...

Reply imc 2/28/2005 10:11:46 PM

In a previous episode of comp.lang.rexx, Walter u. Christel Pachl was
heard to say:
>I beg to differ:-)

>The Standard has lines like...
>
>If the third arg is omitted then
>  Unwanted=#AllBlanks
>Else
>  Unwanted is the third argument (one character, i.e., the blank in my 
>example)
>
>and later pos(left(string,1),Unwanted)>0...
>
>strip(string,,' '||CR||LF) would definitely raise syntax (untested)

You don't seem to have disagreed with anything I said...

Reply imc 2/28/2005 10:48:10 PM

Ian Collier didn't see disagreement.

This is the sentence I disagree with (maybe nitpickingly):

* Section 9.3.20, the STRIP function, defaults the third argument
  to the value of #AllBlanks.

When an argument has a default, I should also be able to specify that
value explicitly.

The STRIP function behaves the way you describe only if the third argument
is NOT specified.

Please find your official copy!
And tell us how to strip:-)

Walter


Reply Walter 3/1/2005 5:37:16 AM

Ian Collier wrote:
 
> what is a blank?  I don't think it says anywhere in TRL
> that the word "blank" refers to ASCII 32 and nothing else.

EBCDIC 64 is also fine, if I can still interpret my 1976
"System/30 Reference summary" correctly ;-)  See TRL page 17
for the bloody details.  Numerous concepts won't work for a
charset with more than one encoding of a blank.

But I'd like the idea of a hypothetical new strip(), where
the third argument could be a string indicating a set of
characters to be stripped.  The same idea could work for
space().

>> Or they changed this in ANSI REXX, then they should be
[...]
>> shot (blasphemy, among other other crimes).
 
> They did, and I see no particular reason why they were
> wrong to do so.

Portability and backwards compatibility are very important
features.  At least I now know that I will never be eager
to find an ANSI REXX implementation.  The one thing I miss
is DO OVER, but I won't find it in ANSI REXX.

The new date(), time(), and LOSTDIGITS could be nice, but
not so nice that I'd want bogus results of comparisons.  I
have a working gawk on my box for hardcore matching tasks.

                       Bye, Frank

Reply Frank 3/1/2005 10:06:45 AM

In a previous episode of comp.lang.rexx, Walter u. Christel Pachl was
heard to say:
>This is the sentence I disagree with (maybe nitpickingly):

>* Section 9.3.20, the STRIP function, defaults the third argument
>  to the value of #AllBlanks.

>When the argument is defaulted, I could also specify it.

In this case it defaults to something that you are not allowed to
specify explicitly.  I'm not sure what word other than "default"
you want to use to describe this behaviour.

Reply imc 3/1/2005 3:37:17 PM

"Ian Collier" <imc@comlab.ox.ac.uk> schrieb im Newsbeitrag 
news:27420-gentian.imc@comlab.ox.ac.uk...
>  I'm not sure what word other than "default"
> you want to use to describe this behaviour.

None at all ---
I'd just describe the behaviour:
"If the third argument is omitted, STRIP removes all blanks and characters
considered to be white space in the source."


Reply Walter 3/2/2005 9:19:25 AM

Anthony Borla wrote:

> I verified this behaviour using Regina under both Win32
> and *NIX [of course, this only proves that Regina behaves
> this way, and not that all REXX implementations do :)].

Sure, I know this issue, with TRL REXX something like...

 /**/ do while X > '' ; parse var X Y 2 X ; nop ; end

...might not be what you want.  In one case I wanted X <> ''
instead of X > '', in another case I wanted do until X == ''

But that's "obvious", you get it wrong at most once.  Now
with your obscure ANSI REXX, does it really...

 /**/ if d2c(10) > d2c(9) then say "TRL rocks" ; else TOAST

..."toast" ?  That would be more than obscure, it would be
just wrong, and if ANSI REXX sanctions it, then it's a bug
in ANSI REXX.  It would be against Mike's design principles.

                           Bye, Frank

Reply Frank 3/2/2005 10:18:34 AM

On Wed, 02 Mar 2005 11:18:34 +0100, Frank Ellermann got out a palm pilot
and hastily scribbled some words to post to comp.lang.rexx:
>Sure, I know this issue, with TRL REXX something like...

> /**/ do while X > '' ; parse var X Y 2 X ; nop ; end

>...might be not what you want.  In one case I wanted X <> ''
>instead of X > '', in another case I wanted do until X == ''

>But that's "obvious", you get it wrong at most once.  Now
>with your obscure ANSI REXX, does it really...

> /**/ if d2c(10) > d2c(9) then say "TRL rocks" ; else TOAST

>..."toast" ?  That would be more than obscure, it would be
>just wrong, and if ANSI REXX sanctions it, then it's a bug
>in ANSI REXX.  It would be against Mike's design principles.

Since I was apparently using a bugged version of the standard,
I think ANSI does not sanction this behaviour.

However, let's take up the devil's advocate position.  Supposing
all whitespace, rather than just literal blanks, were stripped
when using non-strict comparison: why is this bad?

You have exhibited the case "d2c(10) > d2c(9)" which would
end up being false because both strings being compared contain
only whitespace.  I don't see why that's so bad, since even the
current non-strict comparison operators have various surprises.

 d2c(32) || d2c(34) < d2c(33)    -- false (leading space is stripped)
 '00'x < '0000'x                 -- false (shorter string is space-padded on the right)

 "1e2" < "2d1" & "2d1" < "2e1" & "2e1" < "1e2"
                                 -- true (last comparison is numeric)

Anyway, if I don't care about blanks in my strings then why would
I care about other random whitespace characters?  If I don't want
the blanks stripped I must use strict comparison; it's no great hardship
to do the same for the other whitespace characters.  Particularly when
it's often impossible to tell the difference between d2c(9) (ASCII TAB)
and a sequence of blanks when you look at them on a terminal.

Also note that ANSI REXX sanctions this:

 t = '09'x       /* ASCII TAB */
 say t = ''      /* false */
 parse var t x . /* parse t into whitespace-delimited tokens */
 say x == ''     /* true */


Reply imc 3/2/2005 11:11:55 AM

Ian Collier wrote:

> let's take up the devil's advocate position

That's a dangerous position when you're talking with a REXX fan
and KISS fundamentalist.  I already screamed "blasphemy".

> when using non-strict comparison: why is this bad?

There's the old principle of avoiding new features with a "high
astonishing factor" like hell.

> even the current non-strict comparison operators have
> various surprises.

Yes, but the underlying rule is "obvious", you'd get it wrong
at most once.  Okay, it was more like twice in my case, but
then I got it, it's an implicit strip()

To define strip() without 3rd argument to strip any WSP is a
hack, it's no proper engineering or whatever ANSI claims to be.

I'm often interested in ASCII minus printable minus NO-WS-CTL
as defined in RfC 2822, IIRC that's SP, HT, CR, or LF.  No VT.
IETF 1 : Regina 0.  Maybe other users are interested in SP, HT,
CR, LF, or SUB; that's a technical k.o.

There's no such thing as a common definition of WSP.  Even
hardcore *NIX and C fans might need SUB, either as the old
CP/M and DOS EOF char., or as a "non-character" after Unicode
to whatever 8-bit charset conversions.

It's trivial to use an explicit translate() when needed for
the various sets of "white space".  And it's also portable.
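
A minimal sketch of that idea, assuming the caller decides that HT, LF and CR
count as white space for this particular job (the variable names wsp and stuff
are just placeholders). With the output table omitted, translate() replaces
every character listed in the input table by the pad character, which
defaults to a blank.

```rexx
/* Map an application-chosen set of characters to plain blanks, then   */
/* let the ordinary blank-based functions do the rest.                 */
wsp   = '090A0D'x                    /* HT, LF, CR: the caller's choice */
stuff = 'alpha' || '09'x || 'beta' || '0D0A'x

stuff = translate(stuff, , wsp)      /* each listed char becomes ' '    */

say words(stuff)                     /* 2                               */
say '[' || strip(stuff) || ']'       /* [alpha beta]                    */
```

Because the set is spelled out in the program, the result does not depend on
which characters a given interpreter happens to treat as blanks.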

> '00'x < '0000'x -- false (right-hand side is space-padded)

Tough, but it's a part of the old simple strip() package.

> "1e2" < "2d1" & "2d1" < "2e1" & "2e1" < "1e2"

LOL, that's nice.  Add it to a collection of examples for "why
data types are not generally a bad idea".  We can't have both
in the same language (excl. NetRexx for the moment).

> if I don't care aboout blanks in my strings then why
> would I care about other random whitespace characters?

You care about the definition of "white space".  And you know
that the one and only "blank" SP plays a major role in REXX.
It's a default in many REXX functions, it's the delimiter for
the REXX idea "words", it can be inserted in strings
representing hex. numbers at even places, etc.

OTOH cases where you really must consider HT, VT, CR, LF, SUB,
or some versions of NBSP / ZWNJ / ... are rare.  And there's no
common definition of "white space".  BTW, don't forget u+0085
NEL for XML 1.1.
                  Bye, Frank

Reply Frank 3/2/2005 2:26:43 PM

A short while ago I became aware that Frank Ellermann had written:
>Ian Collier wrote:
>> even the current non-strict comparison operators have
>> various surprises.

>Yes, but the underlying rule is "obvious", you'd get it wrong
>at most once.  Okay, it was more like twice in my case, but
>then I got it, it's an implicit strip()

I don't think that blank-stripping is any more or less "obvious" than
whitespace-stripping, aside from the obvious question of what whitespace
actually is.  (ANSI gives a definition, which you may not agree with but
at least it does answer the question.)

>To define strip() without 3rd argument to strip any WSP is a
>hack, it's no proper engineering or whatever ANSI claims to be.

You still haven't said why it's a hack or not proper engineering.  If I
have a string of text and I call strip() on it then why would I want to
keep random whitespace characters on the ends of the string just because
they don't match ASCII 32?

>I'm often interested in ASCII minus printable minus NO-WS-CTL
>as defined in RfC 2822, IIRC that's SP, HT, CR, or LF.  No VT.
>IETF 1 : Regina 0.  Maybe other users are interested in SP, HT,
>CR, LF, or SUB. that's a technical k.o.

If you're dealing with control characters, you don't want to use
text-oriented functions such as non-strict comparison, strip() with
no third argument or word().  Use the right tools for the job and
you won't get surprises.

>There's no such thing as a common definition of WSP.  Even
>hardcore *NIX and C fans might need SUB, either as the old
>CP/M and DOS EOF char., or as a "non-character" after Unicode
>to whatever 8-bit charset conversions.

Ideally I think the definition of whitespace should be "whatever is
defined as whitespace in your locale".  However, ANSI limited it to a
specific list of characters, which was a bit short-sighted in my view.
Of course, an implementation in which ASCII 32 was the only whitespace
character would still be ANSI-conforming.

>You care about the definition of "white space".  And you know
>that the one and only "blank" SP plays a major role in REXX.
>It's a default in many REXX functions, it's the delimiter for
>the REXX idea "words", it can be inserted in strings
>representing hex. numbers at even places, etc.

Although ANSI didn't (as it turns out) change the non-strict comparison
operators, they did change strip(), parse, words() and datatype(foo,'X').

>OTOH cases where you really must consider HT, VT, CR, LF, SUB,
>or some versions of NBSP / ZWNJ / ... are rare. 

... which is why my program shouldn't fail just because someone decided
to insert a TAB between two words instead of a space.  Incidentally, it
has always been possible to indent your source code with TABs instead of
spaces - is this a double standard creeping in?

Reply imc 3/4/2005 10:32:53 AM

Ian Collier wrote:

> (ANSI gives a definition, which you may not agree with but
> at least it does answer the question.)

I don't agree with answering the question at all.  If cases
where you need any "WSP" handling are rare, and if the actual
definition of WSP varies with your application, then don't
hide it in some obscure strip() "default argument", which
cannot be used as explicit strip() argument.  No other built-in
REXX function does this.

> If you're dealing with control characters, you don't want to
> use text-oriented functions such as non-strict comparison,

I'm dealing with different concepts of WSP and REXX portability.
For TRL I'd use a simple STUFF = translate( STUFF, /**/, WSP )
and then process STUFF with the TRL idea of words and blanks.

For your concept with a hardwired set ANSIWSP that won't work,
it would still handle any character in ANSIWSP as SP in some
contexts like comparisons and the "improved" strip(), and in
other contexts like parsing it could handle d2c(12) as "word",
if d2c(12) belongs to ANSIWSP but not the wanted WSP.

If that's not the case for ASCII FF please read ASCII VT and
d2c(11).  They are both used as some "vertical white space".
BTW, what's the ANSI REXX idea of WSP for EBCDIC platforms ?

> Use the right tools for the job

Yes, that would be TRL REXX and translate().

> Ideally I think the definition of whitespace should be
> "whatever is defined as whitespace in your locale".

Is that some new POSIX feature, WSP depending on the locale ?
On my OS/2 box I don't have this, unfortunately, but I found
one important setting:  SET POSIXLY_CORRECT=.

> they did change strip(), parse, words() and datatype(foo,'X')

For this dubious idea of WSP, or are you talking about other
changes ?  What's new in datatype(foo,'X') ?  I use it, rarely.

> my program shouldn't fail just because someone decided to
> insert a TAB between two words instead of a space.

Use STUFF = translate( STUFF, /**/, d2c(9)) in this case.  I'm
almost sure that you'd be very unhappy if your program fails
because some (!) REXX interpreters handle FF and / or VT like
SP in some (!) contexts without your explicit permission.

> it has always been possible to indent your source code with
> TABs instead of spaces - is this a double standard creeping
> in?

Yes.  Is this a documented feature of TRL, or only a feature
of some interpreters ?  I tested REXXSAA 4.0 and Quercus Rexx.

                          Bye, Frank

Reply Frank 3/4/2005 5:27:13 PM

Frank Ellermann schrieb:
> Ian Collier wrote:
> 
> 
>>(ANSI gives a definition, which you may not agree with but
>>at least it does answer the question.)
> 
> 
> I don't agree with answering the question at all.  If cases
> where you need any "WSP" handling are rare, and if the actual
> definition of WSP varies with your application, then don't
> hide it in some obscure strip() "default argument", which
> cannot be used as explicit strip() argument.  No other built-in
> REXX function does this.

The *WORD* family does this, too. It uses #AllBlanks, which is
configuration-dependent. TRANSLATE uses Config_Upper(), where
nothing points to anything restricting the character set to
US-ASCII.

In fact, modern systems must support various code pages. Look
at a web server. If one is using REXX as the scripting language,
what do you expect to happen when a Turkish, a German, an
Icelandic and a French user try to get something that should be
generated on demand, where the client's browser declares the
user's language as the preferred language? Shall everything be
returned in English? NO! There must be a way to support
both a codepage (some codepages support fixed width spaces) and
a locale (e.g. where translate('�') == '�'). I don't expect this
behaviour in REXX's source, but in data processing if wanted.

A good choice may be a function, an environment variable or
a command line option to choose the desired codepage/language.
Feel free to persuade the ANSI committee to reopen the
REXX issue.

If ANSI permits #AllBlanks and other system-dependent stuff
you should accept this and ask the authors for the specific
rules of the interpreter. If you want just TRL1 or TRL2 you
should stick to an interpreter that sticks to that version
of REXX. Don't expect Regina to do so, or maybe even ooRexx.

If you need exactly Blank, HTab, FormFeed and whatever as
strip's third argument you should provide it as a string.
This is best programming style.
Keep in mind that line ends in DOS based systems like DOS,
OS/2 and Win32 prefer to use CarriageReturn/Linefeed as
end of line. How shall data processing act in unix with a
compatible codepage where a single Linefeed is end-of-line?
Is Carriage Return strippable? Always?
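
The explicit approach mentioned above can be sketched as follows. In TRL the third argument of strip() is a single character, so stripping a chosen set of characters needs a small helper. This is only a sketch under assumptions: stripset is a hypothetical name, not a built-in, and the set HT, LF, CR, SP is just an example of what an application might call "white space".

```rexx
/* stripset is a hypothetical helper, not a built-in function.     */
/* It removes any characters found in "set" from both ends of      */
/* "text", using only verify(), reverse() and substr().            */
wsp  = '09 0A 0D 20'x                 /* example set: HT LF CR SP  */
line = '0D0A'x || '  some data ' || '09'x
say '<' || stripset(line, wsp) || '>' /* prints <some data>        */
exit

stripset: procedure
  parse arg text, set
  first = verify(text, set)           /* first char not in set     */
  if first = 0 then return ''         /* empty, or only set chars  */
  last = length(text) - verify(reverse(text), set) + 1
  return substr(text, first, last - first + 1)
```

Because the set is spelled out in the script, the result should not depend on what the interpreter itself considers a blank.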

There are many questions like this. A single system with
a preselected codepage has other requirements than those
driving multiple language programs.

Cheers, Florian

BTW: Regina uses the C library's locale "C" to select
character translation tables and white space selection.
This depends on the C library's default for ANSI C. You
can change this by setting the environment variable
"REGINA_LANG=" [msg-language] ["," [locale-and-language]]
Reply ISO 3/5/2005 8:22:58 AM

Florian Große-Coosmann wrote:

 [default arguments which can't be set explicitly]
> The *WORD* family does this, too. It uses #AllBlanks, which
> is configuration-dependent.

word, words, wordpos, subword all don't have an argument,
where I could specify another delimiter.  Insert, space, and
some others have a default blank, and I could also specify it.

What does a Regina space( "a" || copies( d2c(9), 5) || "b" ),
would it return "a b", and if so, isn't that FUBAR ?

> TRANSLATE uses Config_Upper(), where nothing points to
> anything downscaling the character set to US_ASCII.

That's not exactly the same problem as for strip() and space(),
because whatever translate() implicitly does, I can also do it
explicitly, modulo EBCDIC-portability.

For ASCII platforms the defaults for one argument are A..Z and
a..z resp.  I could also say translate( STUFF, UPPER, LOWER ):

LOWER = xrange('a','i') || xrange('j','r') || xrange('s','z')
UPPER = xrange('A','I') || xrange('J','R') || xrange('S','Z')
if translate( LOWER ) == translate( LOWER, UPPER, LOWER )
   then say "who needs EBCDIC ?" ; else say abs( /* oops */ )

 [Web servers]
> Shall everything be returned in English? NO!

Please let's keep "languages" strictly apart from "charsets".

Adding an umlauted a, o, u to LOWER and an umlauted A, O, U to
UPPER is 1 - trivial, 2 - incomplete for many variants of "de"
(missing ß => SS), 3 - hopelessly incomplete for almost
all 8 or more bits charsets I know, and 4 - beside the point
for weird cases like UTF-8.

> There must be a way to support both a codepage (some
> codepages support fixed width spaces) and a locale (e.g.
> where translate('ä') == 'Ä').

IMHO that's mostly settled, we're supposed to think "Unicode".

Therefore translate whatever you have to UTF-32, or use UTF-16
and deal with its surrogates.  Normalize it, process it (e.g.
find Ä for ä).  Eventually translate it back to the
local charset, cry foul a.k.a ASCII SUB if that's impossible,
and display the result.  Where do you need a locale in this
procedure ?  Serious question, I have some "stringprep" files
on my system, but so far I haven't read them, they are long.

> I don't expect this behaviour in REXX's source, but in data
> processing if wanted.

Unicode has many obscure white space characters like the old
ZWNJ (zero width non-joiner).  SHY (soft hyphen) is also fun.
NEL (u+0085, Latin-1 133, EBCDIC 21) even made it into XML 1.1.

> Feel free to persuade the ANSI committee to reopen the REXX
> issue.

I'm happy with TRL.  And ANSI REXX has no DO OVER, that's the
one AWK feature I miss in REXX.  Of course some interpreters
have obscure ways to emulate it, and AFAIK OREXX has it (?).

> If ANSI permits #AllBlanks and other system-dependent stuff
> you should accept this

Pardon ?  Why should I take shit on the say-so by a standard ?
I also don't accept RfC 2231, it's a security risk, a disgrace
for the beautiful MIME (the best standard I know), and as far
as I'm concerned RfC 2231 boundary* constructs can go straight
to /dev/null at the MDA when found in mail.

It's always the same problem, add feature after feature, hack
as you see fit, but sooner or later there's a limit where any
further turn means that you've screwed up.  It happens to all
of us, it's not necessary to count say XHTML 1.1 Web pages or
relevant ADA applications.

> Keep in mind that line ends in DOS based systems like DOS,
> OS/2 and Win32 prefer to use CarriageReturn/Linefeed as
> end of line.

Same idea as in many Internet standards, yes.  Obviously I do
not want a magic implicitly turning all LF terminated lines in
a piece of text generated elsewhere to blanks, I want a proper
error when I try stunts like this... ;-)

REXX is for users, it's meant to be obvious.  An isolated LF
is rarely only "white space".

> How shall data processing act in unix with a compatible
> codepage where a single Linefeed is end-of-line?

Quercus REXX allows to get the same effect as with a C for DOS
(or similar) compiler, you can open files as text or "binary".

For text EOL is CrLF, for "binary" it's LF.  Or something in
this direction.  My text editor (Kedit) offers all relevant
conventions, it even offers to be smart on input.  Probably
ANSI REXX offers something similar like Quercus REXX in the
stream() interface.  Or it uses OPTIONS for this purpose.

If my text editor has EOFIN ALLOW, EOLIN CRORLF, TABSIN OFF 8,
TABSOUT OFF 8, EOLOUT CRLF, EOFOUT EOLEOF, then I know exactly
what it does.  If I see a LF or CR in a file loaded with EOLIN
CRORLF, then it's PEBKAC and not "white space".

> Is Carriage Return strippable? Always?

Of course not.  It could be a MAC EOL (?), or an explicit CR
meant to overtype the same output line, the first char. of a
CRLF, a funny u+266A "eighth note" (?), and what else, let
alone all binary usages.  As I said, "WSP" varies with the
application.  MIME is not yenc, yenc is not BOCU, BOCU is not
UTF-8, UTF-8 is not EBCDIC, and so on.

> You can change this by setting the environment variable
> "REGINA_LANG=" [msg-language] ["," [locale-and-language]]

Unfortunately OS/2 is no POSIX system.  OTOH they now declared
it to be formally dead next year, and as long as it runs on my
system I won't need Regina.  But I will be very unhappy when
some of my REXX scripts fail under Linux and Regina later.

Shamelessly stealing a sig:  "I18N is no feature, I18N is an
architecture".  US-ASCII WSP ideas are incomplete and obsolete
in this millennium, they invented more than an eighth bit in the
last decades, whether Microsoft patented it or not.  Bye, Frank

Reply Frank 3/5/2005 8:43:46 PM

Frank Ellermann wrote:
> word, words, wordpos, subword all don't have an argument,
> where I could specify another delimiter.  Insert, space, and
> some others have a default blank, and I could also specify it.

The point is that word() and friends use #AllBlanks as interword
characters. Strip(), for instance, uses #AllBlanks, too. This is
space (ASCII 32) and as far as I understand it: CR, LF, FF, HT, VT
and maybe some more.
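
Which characters a given interpreter actually counts as blanks can be probed rather than guessed. The following is a minimal sketch that assumes a single-byte character set. It uses the strict == on purpose, since a non-strict = would itself ignore leading and trailing blanks and hide the difference. No expected output is given because the result is exactly the interpreter-dependent part.

```rexx
/* List every single-byte character that strip() removes by        */
/* default on the interpreter running this script.                 */
found = ''
do i = 0 to 255
  c = x2c(d2x(i, 2))                  /* the character with code i */
  if strip(c) == '' then found = found d2x(i, 2)
end
say 'strip() treats as blank:' || found
```

An interpreter following the TRL idea of "blank" would be expected to list only 20 on an ASCII platform; one using a wider #AllBlanks set should list more.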

> What does a Regina space( "a" || copies( d2c(9), 5) || "b" ),
> would it return "a b", and if so, isn't that FUBAR ?

My newsreader showed "a" || d2c(32) || "b" for your "a b" and
copies() always copies what it gets. Therefore they will never
be same or equal. Keep in mind that "=" doesn't affect whitespaces
in the middle of an argument. By the way: What do you mean with
FUBAR?
> 
>>TRANSLATE uses Config_Upper(), where nothing points to
>>anything downscaling the character set to US_ASCII.
> 
> 
> That's not exactly the same problem as for strip() and space(),
> because whatever translate() implicitly does, I can also do it
> explicitly, modulo EBCDIC-portability.

This is true for every BIF. In particular, you get problems
if you have no way to identify the codepage and the language.
And it depends on the implementation of the REXX interpreter.

> For ASCII platforms the defaults for one argument are A..Z and
> a..z resp.  I could also say translate( STUFF, UPPER, LOWER ):
> 
> LOWER = xrange('a','i') || xrange('j','r') || xrange('s','z')
> UPPER = xrange('A','I') || xrange('J','R') || xrange('S','Z')
> if translate( LOWER ) == translate( LOWER, UPPER, LOWER )
>    then say "who needs EBCDIC ?" ; else say abs( /* oops */ )

What do you want to show with this example?

>  [Web servers]
> 
>>Shall everything be returned in English? NO!
> 
> 
> Please let's keep "languages" strictly apart from "charsets".
> 
> Adding an umlauted a, o, u to LOWER and an umlauted A, O, U to
> UPPER is 1 - trivial, 2 - incomplete for many variants of "de"
> (missing ß => SS), 3 - hopelessly incomplete for almost
> all 8 or more bits charsets I know, and 4 - beside the point
> for weird cases like UTF-8.

Except that ß->SS isn't the correct translation in all cases, what
do you want to say here? I argued in prior postings that the
assertion of the runtime system to get translations is much better
than doing it on your own.
UTF-8 isn't comparable to any 8-bit character set. I'm still
unconvinced that it is allowed to use something other than
7-bit or 8-bit character sets in ANSI at all. But that's
another discussion.

>>There must be a way to support both a codepage (some
>>codepages support fixed width spaces) and a locale (e.g.
>>where translate('ä') == 'Ä').
> 
> 
> IMHO that's mostly settled, we're supposed to think "Unicode".

Where is the REXX interpreter using Unicode? Where is the
definition for Unicode support of Rexx?

> Therefore translate whatever you have to UTF-32, or use UTF-16
> and deal with its surrogates.  Normalize it, process it (e.g.
> find Ä for ä).  Eventually translate it back to the
> local charset, cry foul a.k.a ASCII SUB if that's impossible,
> and display the result.  Where do you need a locale in this
> procedure ?  Serious question, I have some "stringprep" files
> on my system, but so far I haven't read them, they are long.

Locales are needed to distinguish between sorting rules and
used character letters. The translation of umlaut(a) isn't
useful in all languages.
Translating to UTF32/16 needs the same knowledge of the locale
and the codepage as you need to do the whole operation in
your 8 bit codepage. You don't gain anything.

>>I don't expect this behaviour in REXX's source, but in data
>>processing if wanted.
> 
> 
> Unicode has many obscure white space characters like the old
> ZWNJ (zero width non-joiner).  SHY (soft hyphen) is also fun.
> NEL (u+0085, Latin-1 133, EBCDIC 21) even made it into XML 1.1.

It is very new to me that char 133 is defined in Latin-1.
Some character positions are undefined in Latin-1.

>>Feel free to persuade the ANSI committee to reopen the REXX
>>issue.
> 
> 
> I'm happy with TRL.  And ANSI REXX has no DO OVER, that's the
> one AWK feature I miss in REXX.  Of course some interpreters
> have obscure ways to emulate it, and AFAIK OREXX has it (?).
> 
> 
>>If ANSI permits #AllBlanks and other system-dependent stuff
>>you should accept this
> 
> 
> Pardon ?  Why should I take shit on the say-so by a standard ?

Because YOU are complaining about "errors" in programs that conform
to that standard. (Or try to do so.) If you don't accept this
standard you should stop using these programs IMMEDIATELY. And
you should stop talking about it, at least in this way.

I don't see any reason to continue talking about it.

Reply ISO 3/7/2005 10:22:32 AM

Florian Große-Coosmann wrote:

> as far as I understand it: CR, LF, FF, HT, VT and maybe some
> more.

With NEL 0x85 they'd enter the territory of deep shit, same
problem without it.  Other potentially weird cases include:

| MicroSoft                  IBM
| 1A 0x001A SUBstitute       1A 0x001C File Separator
| 1C 0x001C File Separator   1C 0x007F DELete
| 7F 0x007F DELete           7F 0x001A SUBstitute

>> What does a Regina space( "a" || copies( d2c(9), 5) || "b" )
>> would it return "a b", and if so, isn't that FUBAR ?

> My newsreader showed "a" || d2c(32) || "b" for your "a b"

That's what I wanted and intended, a case of space() where the
result contains a character d2c(32) not found in the source
string argument.

> they will never be same or equal

Yes, it wasn't a convincing example, the third space() argument
is only the pad-character, not the word delimiter, not the same
case as the third strip() argument.  An error on my side, no
problem with space().
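
Whether a run of tabs counts as a word separator for space() can be checked directly on whatever interpreter is at hand. This is only a sketch; it shows the result in hex so that tabs and spaces can be told apart, and it makes no claim about what any particular interpreter prints.

```rexx
/* Does space() treat a run of tabs as a word separator here?      */
src = 'a' || copies('09'x, 5) || 'b'
res = space(src)
say 'source:' c2x(src)
say 'result:' c2x(res)
if res == src then say 'tabs were left alone (one word)'
else say 'tabs were treated as blanks'
```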

> What do you mean with FUBAR?

<gopher://dict.org:2628/?DEFINE%20*%20?FUBAR> but in that case
it was only a PEBKAC causing no harm for REXX or Regina users.

> In particular, you get problems if you have no way to
> identify the codepage and the language.

Oh yes, like the beautiful d2c(170), d2c(172), d2c(95), or
d2c(94) - let's see, my browser probably wants "¬" Alt-172.

> And it depends on the implementation of the REXX interpreter.

True, but it's relatively simple to get the TRL-idea of "blank"
right, either EBCDIC or rest of the world.  With #AllBlanks you
get a can of worms.  And REXX users normally don't want worms,
that's why they don't use perl or Java.  IMHO.

>> if translate( LOWER ) == translate( LOWER, UPPER, LOWER )
>>    then say "who needs EBCDIC ?" ; else say abs( /* oops */ )

> What do you want to show with this example?

You started to discuss I18N issues like languages, charsets,
and locales.  The example shows that it's hopeless to squeeze
these concepts into the good old translate().  It worked for
date('L') with a completely new argument 'L', maybe ANSI REXX
adopted this idea.  It's not that I'm against all changes.

If an attempt to port a REXX program to another version fails
spectacularly all is fine, the user is free to fix it.   It's
less funny if this attempt apparently works, but only as long
as the input doesn't contain stray #AllBlanks minus SP.

When I write stuff like utf8.cmd and know that it won't work
for EBCDIC I normally say so within the first ten lines of the
script.  But with #AllBlanks I'm lost, I can't test Regina on
my platform, it doesn't implement the OS/2 concept of external
function libraries.  So maybe none of my scripts work under an
#AllBlanks Regina.  And maybe popstop.cmd won't catch the mail
worm, which could be tough for the unlucky popstop.cmd user.

> It is very new to me that char 133 is defined in Latin-1.
> Some character positions are undefined in Latin-1.

Latin-1 are the first 256 Unicode points, like US-ASCII are the
first 128 Unicode points, a one-to-one correspondence.  Please
don't ask me about any "extended Latin" char.s beyond the first
256, I've no idea.  u+007F up to u+009F are identical in all
ISO Latin charsets, 0x7F .. 0x9F or 127 - 159.  Maybe 160 NBSP
is also identical.

I prefer windows-1252, but that is different for 27 of the 32
characters 0x80 - 0x9F (incl. 133).  Actually I'd prefer 858 =
pc-multilingual-850+euro, but unfortunately that codepage is
kind of unusual today in the Internet.

> YOU are complaining about "errors" in programs that conform
> to that standard. (Or try to do so.)

You read <422284AD.6193@xyzzy.claranet.de> - I didn't even know
that this issue exists but (fortunately) added the disclaimer:
"Or they changed this in ANSI REXX, then they should be shot".

It's either an error in the REXX interpreter if it claims to be
REXX language level 4, or an error in the standard.  The OP
asked about this behaviour, I only tried to explain it.  And
with luck because it was certainly not knowledge I managed to
avoid an incorrect answer.

> you should stop using these programs IMMEDIATELY.

As I said I cannot test Regina, because it doesn't work on my
system.  But that doesn't guarantee that I don't stumble over
an apparently working REXX script which doesn't check the REXX
language level.  And it obviously doesn't guarantee that I don't
publish REXX scripts where this incompatibility between TRL and
ANSI could cause major havoc for poor users who don't know these
peculiarities.
                       Bye, Frank

Reply Frank 3/8/2005 3:53:05 AM

Probably Frank Ellermann typed into a real computer:
>BTW, what's the ANSI REXX idea of WSP for EBCDIC platforms ?

ANSI lists characters by their generic names (e.g., "carriage return"
and "horizontal tabulation") so it can be translated to EBCDIC (or any
other encoding) as the implementer sees fit.

>Is that some new POSIX feature, WSP depending on the locale ?

Not particularly new, as far as I am aware.  If you have a C compiler
then see its documentation on the isspace() function.

>> they did change strip(), parse, words() and datatype(foo,'X')

>For this dubious idea of WSP, or are you talking about other
>changes ?  What's new in datatype(foo,'X') ?  I use it, rarely.

The datatype(foo,'X') function returns true if foo contains hex digits,
optionally separated into pairs (or chunks of even numbers of digits)
by blanks.  In ANSI you are allowed any whitespace character as well
as just blanks.
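
The difference described above can be tried out with two calls. This is a small sketch: the first call has an ordinary space between the hex pairs and should give 1 under TRL as well as ANSI, while the result of the second (a tab between the pairs) depends on how the interpreter defines white space.

```rexx
/* datatype(,'X') with a space versus a tab between the hex pairs. */
say datatype('0A 1B', 'X')                 /* space between pairs  */
say datatype('0A' || '09'x || '1B', 'X')   /* tab between pairs    */
```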

>> it has always been possible to indent your source code with
>> TABs instead of spaces - is this a double standard creeping
>> in?

>Yes.  Is this a documented feature of TRL, or only a feature
>of some interpreters ?  I tested REXXSAA 4.0 and Quercus Rexx.

TRL talks about "blanks" in REXX clauses.  So if you think the word
"blank" does not include tabs then all interpreters which allow tabs
in your source code are, strictly speaking, extensions of TRL.  On
the other hand, an interpreter which didn't accept tab as a synonym
for blank in your source code could be rather irritating, especially
as some text editors can place tabs in your text file without your
explicit instruction (usually when auto-indenting).
-- 
---- Ian Collier : imc@comlab.ox.ac.uk : WWW page (including REXX section):
------ http://users.comlab.ox.ac.uk/ian.collier/imc.shtml

New to this group?  Answers to frequently-asked questions can be had from
http://rexx.hursley.ibm.com/rexx/ .
Reply imc 3/8/2005 11:43:37 AM

On Friday, in article <27423-journal.imc@comlab.ox.ac.uk>
     imc@comlab.ox.ac.uk "Ian Collier" wrote:

> Ideally I think the definition of whitespace should be "whatever is
> defined as whitespace in your locale".  However, ANSI limited it to a
> specific list of characters, which was a bit short-sighted in my view.

I tend to agree with this viewpoint; if the suggestion is that current or
future standardization processes (including the ooRexx efforts) should
take account of the current locale, I'm fully in agreement.

That way, one could use the ¬ character in Latin-1 (which is codepoint
'AC'x) and have it be recognized as the NOT operator, which is currently
only recognized, at least under OS/2, if the codepoint 'AA'x is used
(which is the NOT character in CP437 and CP850).

> ... which is why my program shouldn't fail just because someone decided
> to insert a TAB between two words instead of a space.  Incidentally, it
> has always been possible to indent your source code with TABs instead of
> spaces - is this a double standard creeping in?

Perhaps a Unixism: they have long held to the viewpoint that HT is
nothing like SP (especially in some extremely critical places, like
makefiles, and config files for syslogd)?

-- 
Brian {Hamilton Kelly}                                     bhk@dsl.co.uk
         "Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu
         le loisir de la faire plus courte."
                             Blaise Pascal, /Lettres Provinciales/, 1657
Reply bhk 3/8/2005 7:58:20 PM

Ian Collier wrote:

> ANSI lists characters by their generic names (e.g., "carriage
> return" and "horizontal tabulation") so it can be translated
> to EBCDIC

If that includes "New Line" as WSP, then there is a problem
with Latin-? vs. windows-? charsets on "MIME platforms", unless
the REXX interpreter can determine the actual codepage.  Which
could be changed dynamically.

> If you have a C compiler then see its documentation on the
> isspace() function.

Still a #define in ctype.h for the EMX compiler which survived
a disk crash.  It operates on an external array _ctype[] with
its eight flags.  Initializing or dynamically modifying this
array depending on the locale might be possible.  I can't say
what Watcom 10.x did, it's on the dead disk.

> TRL talks about "blanks" in REXX clauses.  So if you think
> the word "blank" does not include tabs then all interpreters
> which allow tabs in your source code are, strictly speaking,
> extensions of TRL.

Yes.  OTOH tokenizing the source is the business of the REXX
interpreter, one of them won't accept LF as end of line and
insists on CRLF.  The other has its private ideas about "EOF"
(ASCII SUB).  My editor allows me to please them all.

> an interpreter which didn't accept tab as a synonym for blank
> in your source code could be rather irritating, especially
> as some text editors can place tabs in your text file without
> your explicit instruction (usually when auto-indenting).

It starts to get very irritating if different people edit the
same sources with their personal tools and TAB-settings using
a version control system based on diffs.  I'm a bit paranoid
about TABs, I tried to edit sendmail.cf.

                       Bye, Frank



Reply Frank 3/8/2005 9:21:49 PM

Hi,

Brian {Hamilton Kelly} wrote:
>>Ideally I think the definition of whitespace should be "whatever is
>>defined as whitespace in your locale".  However, ANSI limited it to a
>>specific list of characters, which was a bit short-sighted in my view.
> 
> 
> I tend to agree with this viewpoint; if the suggestion is that current or
> future standardization processes (including the ooRexx efforts) should
> take account of the current locale, I'm fully in agreement.
> 
> That way, one could use the ¬ character in Latin-1 (which is codepoint
> 'AC'x) and have it be recognized as the NOT operator, which is currently
> only recognized, at least under OS/2, if the codepoint 'AA'x is used
> (which is the NOT character in CP437 and CP850).

This is a big problem in systems with various and changing charsets and
locales. A script using x2c(AC) as NOT may stop working if a user other
than the author changes the codepage, for example to 850. The
correct interpretation of x2c(AC) is then no longer possible and an error
must be raised.

The correct codepage needs to be specified. This may happen either
a) in the code (first line?)
b) outside the script

a) is impossible to manage properly if the script is copied to another
    system where the codepage doesn't exist or the codepage is changed
    during copy.
b) is impossible to manage because the user would need to know a lot
    about a script that he or she may not even know exists.

The final solution may be a Unicode source of the script, where it is
clear what shall happen. But who uses a Unicode editor?

In contrast to codepage-dependent source code, I think it is a good
idea to have codepage support at execution time, e.g. so that TRANSLATE
is able to translate foreign lower case letters to upper case.

Cheers, Florian
Reply ISO 3/9/2005 9:11:02 AM

In a previous episode of comp.lang.rexx, Brian {Hamilton Kelly} was
heard to say:
>Perhaps a Unixism: they have long held to the viewpoint that HT is
>nothing like SP (especially in some extremely critical places, like
>makefiles, and config files for syslogd)?

Yes and no...

There are certainly some utilities for which fields are separated
by HT rather than by space (including the ones you mention).  On the
other hand, I think probably the majority of utilities don't make a
distinction between HT and space.  When vi or emacs auto-indents your
cursor with a mixture of tabs and spaces it's usually doing it for
visual appearance (on the assumption that leading tabs and spaces
will be ignored by the interpreter/compiler anyway) and because
using HT instead of eight spaces makes your file smaller.
-- 
---- Ian Collier : imc@comlab.ox.ac.uk : WWW page (including REXX section):
------ http://users.comlab.ox.ac.uk/ian.collier/imc.shtml

New to this group?  Answers to frequently-asked questions can be had from
http://rexx.hursley.ibm.com/rexx/ .
Reply imc 3/9/2005 9:58:31 AM
