
### Unicode: ugh!

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.
--
"There's only one thing that will make them stop hating you.
And that's being so good at what you do that they can't ignore you.
I told them you were the best.  Now you damn well better be."
--Orson Scott Card, _Ender's Game_

Reply blp (3955) 3/13/2006 5:55:51 PM

Ben Pfaff wrote:
> The Unicode standard says this in section 3.9:
>
>     "For example, a string is defined as a pointer to char in the
>     C language, and is conventionally terminated with a NULL
>     character."
>
> You'd think folks writing standards would bother to properly read
> and understand the other standards that they reference.

;-) ;-)

--ag

--
Artie Gold -- Austin, Texas
http://goldsays.blogspot.com
"You can't KISS* unless you MISS**"
[*-Keep it simple, stupid. **-Make it simple, stupid.]

Reply artiegold (849) 3/13/2006 6:04:18 PM

"Ben Pfaff" writes:

> The Unicode standard says this in section 3.9:
>
>    "For example, a string is defined as a pointer to char in the
>    C language, and is conventionally terminated with a NULL
>    character."
>
> You'd think folks writing standards would bother to properly read
> and understand the other standards that they reference.

What's your complaint?  That the ASCII null should be spelled NUL?


Reply r124c4u1022 (2303) 3/13/2006 7:27:09 PM

"osmium" <r124c4u102@comcast.net> writes:

> "Ben Pfaff" writes:
>
>> The Unicode standard says this in section 3.9:
>>
>>    "For example, a string is defined as a pointer to char in the
>>    C language, and is conventionally terminated with a NULL
>>    character."
>>
>> You'd think folks writing standards would bother to properly read
>> and understand the other standards that they reference.
>
> What's your complaint?  That the ASCII null should be spelled NUL?

Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters.  It is not "conventionally" terminated by a null
character, it is always terminated by one (otherwise it is not a
string).  In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

It's amazing how much they managed to get wrong in a single
sentence.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Reply blp (3955) 3/13/2006 7:49:27 PM

"Ben Pfaff" wrote:

> "osmium" <r124c4u102@comcast.net> writes:
>
>> "Ben Pfaff" writes:
>>
>>> The Unicode standard says this in section 3.9:
>>>
>>>    "For example, a string is defined as a pointer to char in the
>>>    C language, and is conventionally terminated with a NULL
>>>    character."
>>>
>>> You'd think folks writing standards would bother to properly read
>>> and understand the other standards that they reference.
>>
>> What's your complaint?  That the ASCII null should be spelled NUL?
>
> Here is the definition of a string:
>
>     A string is a contiguous sequence of characters terminated
>     by and including the first null character.
>
> A string is not a pointer to char: it is a sequence of
> characters.  It is not "conventionally" terminated by a null
> character, it is always terminated by one (otherwise it is not a
> string).  In C, the null terminator is not a NULL character (NULL
> is a null pointer constant); it is not the NUL character either,
> because that assumes an ASCII character set; the null terminator
> is in fact the "null character", as quoted above.

I glossed over the word "conventionally"; that is not a good basis for a
definition.  As far as the ASCII component, I figured that was justified
somewhere in the thicket of documents.  Every UTF I have seen embeds ASCII
in it.  But I don't claim to have seen all the UTFs that exist.

> It's amazing how much they managed to get wrong in a single
> sentence.

I just read it again and I now agree with you.  I thought earlier you were
nit-picking on the extra 'L'.


Reply r124c4u1022 (2303) 3/13/2006 8:26:27 PM

"osmium" <r124c4u102@comcast.net> writes:
> "Ben Pfaff" wrote:
>> "osmium" <r124c4u102@comcast.net> writes:
>>> "Ben Pfaff" writes:
>>>> The Unicode standard says this in section 3.9:
>>>>
>>>>    "For example, a string is defined as a pointer to char in the
>>>>    C language, and is conventionally terminated with a NULL
>>>>    character."
>>>>
>>>> You'd think folks writing standards would bother to properly read
>>>> and understand the other standards that they reference.
>>>
>>> What's your complaint?  That the ASCII null should be spelled NUL?
>>
>> Here is the definition of a string:
>>
>>     A string is a contiguous sequence of characters terminated
>>     by and including the first null character.
>>
>> A string is not a pointer to char: it is a sequence of
>> characters.  It is not "conventionally" terminated by a null
>> character, it is always terminated by one (otherwise it is not a
>> string).  In C, the null terminator is not a NULL character (NULL
>> is a null pointer constant); it is not the NUL character either,
>> because that assumes an ASCII character set; the null terminator
>> is in fact the "null character", as quoted above.
>
> I glossed over the word "conventionally"; that is not a good basis for a
> definition.  As far as the ASCII component, I figured that was justified
> somewhere in the thicket of documents.  Every UTF I have seen embeds ASCII
> in it.  But I don't claim to have seen all the UTFs that exist.
>
>> It's amazing how much they managed to get wrong in a single
>> sentence.
>
> I just read it again and I now agree with you.  I thought earlier you were
> nit-picking on the extra 'L'.

Even if that were the only problem, it would be enough of a basis to
criticize it.  NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
San Diego Supercomputer Center             <*>  <http://users.sdsc.edu/~kst>
We must do something.  This is something.  Therefore, we must do this.

Reply kst-u (21963) 3/13/2006 9:49:25 PM

On 2006-03-13, Keith Thompson <kst-u@mib.org> wrote:
>
> Even if that were the only problem, it would be enough of a basis to
> criticize it.  NULL has a very well-defined meaning in C, and it has
> very little to do with the '\0' character.
>

In retrospect it was a bit silly to have a NULL and a "null"
character, '\0', and then to compound it all with a "null pointer"... 2
seconds with Google shows generations of confusion and standards
abuse.


Reply rgrdev (1814) 3/13/2006 9:58:46 PM

On 2006-03-13, Keith Thompson <kst-u@mib.org> wrote:
> "osmium" <r124c4u102@comcast.net> writes:
>> "Ben Pfaff" wrote:
>>> "osmium" <r124c4u102@comcast.net> writes:
>>>> "Ben Pfaff" writes:
>>>>> The Unicode standard says this in section 3.9:
>>>>>
>>>>>    "For example, a string is defined as a pointer to char in the
>>>>>    C language, and is conventionally terminated with a NULL
>>>>>    character."
>>>>>
>>>>> You'd think folks writing standards would bother to properly read
>>>>> and understand the other standards that they reference.
>>>>
>>>> What's your complaint?  That the ASCII null should be spelled NUL?
>>>
>>> Here is the definition of a string:
>>>
>>>     A string is a contiguous sequence of characters terminated
>>>     by and including the first null character.
>>>
>>> A string is not a pointer to char: it is a sequence of
>>> characters.  It is not "conventionally" terminated by a null
>>> character, it is always terminated by one (otherwise it is not a
>>> string).  In C, the null terminator is not a NULL character (NULL
>>> is a null pointer constant); it is not the NUL character either,
>>> because that assumes an ASCII character set; the null terminator
>>> is in fact the "null character", as quoted above.
>>
>> I glossed over the word "conventionally"; that is not a good basis for a
>> definition.  As far as the ASCII component, I figured that was justified
>> somewhere in the thicket of documents.  Every UTF I have seen embeds ASCII
>> in it.  But I don't claim to have seen all the UTFs that exist.
>>
>>> It's amazing how much they managed to get wrong in a single
>>> sentence.
>>
>> I just read it again and I now agree with you.  I thought earlier you were
>> nit-picking on the extra 'L'.
>
> Even if that were the only problem, it would be enough of a basis to
> criticize it.  NULL has a very well-defined meaning in C, and it has
> very little to do with the '\0' character.

Though, '\0' is incidentally a null pointer constant... so #define NULL
'\0' would be legal.

Reply random832 (841) 3/13/2006 10:19:14 PM

Jordan Abel <random832@gmail.com> writes:
> On 2006-03-13, Keith Thompson <kst-u@mib.org> wrote:
[...]
>> Even if that were the only problem, it would be enough of a basis to
>> criticize it.  NULL has a very well-defined meaning in C, and it has
>> very little to do with the '\0' character.
>
> Though, '\0' is incidentally a null pointer constant... so #define NULL
> '\0' would be legal.

Yes, of course; that's the "very little" I was referring to.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
San Diego Supercomputer Center             <*>  <http://users.sdsc.edu/~kst>
We must do something.  This is something.  Therefore, we must do this.

Reply kst-u (21963) 3/13/2006 10:29:29 PM

In article <lnoe0am2vv.fsf@nuthaus.mib.org>,
Keith Thompson  <kst-u@mib.org> wrote:
>Even if that were the only problem, it would be enough of a basis to
>criticize it.  NULL has a very well-defined meaning in C, and it has
>very little to do with the '\0' character.

However, the text is in the Unicode standard, and there NULL means the
character with code 0.

-- Richard

Reply richard91 (3692) 3/13/2006 10:43:56 PM

Ben Pfaff <blp@cs.stanford.edu> wrote:
>
> You'd think folks writing standards would bother to properly read
> and understand the other standards that they reference.

Not after you've worked in the standards world for a while, you
wouldn't.  :-)  Of course, some committees are better than others.

-Larry Jones

Wow, how existential can you get? -- Hobbes

Reply lawrence.jones (203) 3/14/2006 5:12:16 AM

Ben Pfaff <blp@cs.stanford.edu> wrote:

> The Unicode standard says this in section 3.9:
>
>     "For example, a string is defined as a pointer to char in the
>     C language, and is conventionally terminated with a NULL
>     character."
>
> You'd think folks writing standards would bother to properly read
> and understand the other standards that they reference.

The Unicode Standard (at least, the PDFs I have) also says this in its
first character chart (chapter 16, I believe):

"0000 [NUL] <control>
= NULL"

So _within the bounds of Unicode_ that comment is correct. The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL. This choice was ill-advised,
yes, but it having been made, your quotation is wrong in a general C
context, but correct in a Unicode context.

Richard

Reply rlb (4118) 3/14/2006 4:06:39 PM

rlb@hoekstra-uitgeverij.nl (Richard Bos) writes:

> Ben Pfaff <blp@cs.stanford.edu> wrote:
>
>> The Unicode standard says this in section 3.9:
>>
>>     "For example, a string is defined as a pointer to char in the
>>     C language, and is conventionally terminated with a NULL
>>     character."
>>
>> You'd think folks writing standards would bother to properly read
>> and understand the other standards that they reference.

> [NULL is correct for Unicode.]

There's a lot more wrong with it than misspelling "null".
--
"The way I see it, an intelligent person who disagrees with me is
probably the most important person I'll interact with on any given
day."
--Billy Chambless

Reply blp (3955) 3/14/2006 4:13:34 PM

On 2006-03-14, Richard Bos <rlb@hoekstra-uitgeverij.nl> wrote:
> Ben Pfaff <blp@cs.stanford.edu> wrote:
>
>> The Unicode standard says this in section 3.9:
>>
>>     "For example, a string is defined as a pointer to char in the
>>     C language, and is conventionally terminated with a NULL
>>     character."
>>
>> You'd think folks writing standards would bother to properly read
>> and understand the other standards that they reference.
>
> The Unicode Standard (at least, the PDFs I have) also says this in its
> first character chart (chapter 16, I believe):
>
>   "0000 [NUL] <control>
>               = NULL"
>
> So _within the bounds of Unicode_ that comment is correct. The Unicode
> name for character 0, ASCII NUL, the null character, is (and has been
> for a while, don't know how long) NULL. This choice was ill-advised,
> yes, but it having been made, your quotation is wrong in a general C
> context, but correct in a Unicode context.

Well, yeah. That's the English word/phrase for which NUL is an
abbreviation, just like we have START OF TEXT for STX, and so on.

Reply random832 (841) 3/14/2006 5:50:59 PM

In article <4416e933.363393776@news.xs4all.nl> rlb@hoekstra-uitgeverij.nl (Richard Bos) writes:
....
>                                                            The Unicode
> name for character 0, ASCII NUL, the null character, is (and has been
> for a while, don't know how long) NULL.

That name was already present in Unicode 1.1.5 (July 1995) (the earliest
reference that is available online).
--
dik t. winter, cwi, kruislaan 413, 1098 sj  amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn  amsterdam, nederland; http://www.cwi.nl/~dik/

Reply Dik.Winter (1625) 3/15/2006 11:26:16 AM

rlb@hoekstra-uitgeverij.nl (Richard Bos) writes:

> Ben Pfaff <blp@cs.stanford.edu> wrote:
>
>> The Unicode standard says this in section 3.9:
>>
>>     "For example, a string is defined as a pointer to char in the
>>     C language, and is conventionally terminated with a NULL
>>     character."
>>
>> You'd think folks writing standards would bother to properly read
>> and understand the other standards that they reference.
>
> The Unicode Standard (at least, the PDFs I have) also says this in its
> first character chart (chapter 16, I believe):
>
>   "0000 [NUL] <control>
>               = NULL"
>
> So _within the bounds of Unicode_ that comment is correct. The Unicode
> name for character 0, ASCII NUL, the null character, is (and has been
> for a while, don't know how long) NULL. This choice was ill-advised,
> yes, but it having been made, your quotation is wrong in a general C
> context, but correct in a Unicode context.

I took another look at the text I quoted.  At a second look, it
is clearly *not* referring to the Unicode character called NULL,
because Unicode character names in the Unicode standard are
expressed in small capital letters.  The NULL in the paragraph
above is in full-size capital letters, so it does not refer to a
Unicode character name.
--
"I don't have C&V for that handy, but I've got Dan Pop."
--E. Gibbons

Reply blp (3955) 3/15/2006 4:17:04 PM

In article <873bhjsmwv.fsf@benpfaff.org>,
Ben Pfaff  <blp@cs.stanford.edu> wrote:

>I took another look at the text I quoted.  At a second look, it
>is clearly *not* referring to the Unicode character called NULL,
>because Unicode character names in the Unicode standard are
>expressed in small capital letters.

They are in small caps when written as (for example) "U+004B LATIN
CAPITAL LETTER K", but they also appear (without the U+XXXX) in plain
capitals (e.g. in the character tables themselves), lower case, and
italics.  So I don't think you can deduce that they are not referring
to the Unicode character.  Though they might well be using it in a
more generic sense of a null character without reference to Unicode in
particular (which would be more accurate in a sense, because as far as
I can see nothing guarantees that C's string-terminating character
maps to U+0000).

Anyway, I certainly don't think you can assume the author was
confusing it with the C macro NULL.

-- Richard

Reply richard91 (3692) 3/15/2006 5:34:38 PM

richard@cogsci.ed.ac.uk (Richard Tobin) wrote:

> In article <873bhjsmwv.fsf@benpfaff.org>,
> Ben Pfaff  <blp@cs.stanford.edu> wrote:
>
> >I took another look at the text I quoted.  At a second look, it
> >is clearly *not* referring to the Unicode character called NULL,
> >because Unicode character names in the Unicode standard are
> >expressed in small capital letters.
>
> They are in small caps when written as (for example) "U+004B LATIN
> CAPITAL LETTER K", but they also appear (without the U+XXXX) in plain
> capitals (e.g. in the character tables themselves), lower case, and
> italics.  So I don't think you can deduce that they are not referring
> to the Unicode character.  Though they might well be using it in a
> more generic sense of a null character without reference to Unicode in
> particular (which would be more accurate in a sense, because as far as
> I can see nothing guarantees that C's string-terminating character
> maps to U+0000).

The null character in C must be a character with value zero. U+0000
trivially also has value zero. If an implementation manages not to map
the one onto the other, I would say that that implementation does not
have Unicode as its character set, but at most Unicode-rearranged.

Richard

Reply rlb (4118) 3/16/2006 11:17:30 AM
