Hello,
We are porting a UTF-8-ready application from Linux to SunOS 5.9 and have
run into the following puzzling problem. After a lot of digging I was able
to reduce it to the following snippet of C code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>
main()
{
    char *asc = "a";
    char *utf = "\303\204"; /* UTF-8 encoding (2 bytes) of the German A-umlaut */
    char buf[80];

    setlocale(LC_ALL, "");  /* take the locale from the environment */

    sprintf(buf, "%-*.*s", 16, 16, asc);
    /* strlen() returns size_t, so cast it for %d */
    printf("strlen of buf with ascii char %d\n", (int)strlen(buf));
    printf("[%s]\n", buf);

    sprintf(buf, "%-*.*s", 16, 16, utf);
    printf("strlen of buf utf char %d\n", (int)strlen(buf));
    printf("[%s]\n", buf);
    exit(0);
}
If you compile and run it, you will see that in some environments the
resulting string is not 16 bytes long (as expected) but 17:
$ ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
$ LC_ALL="" ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
$ LC_ALL=de_DE.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 17 <*****
[Ä               ]
$ LC_ALL=de_DE.UTF-8 ./a.out | od -t x1
0000000 73 74 72 6c 65 6e 20 6f 66 20 62 75 66 20 77 69
0000020 74 68 20 61 73 63 69 69 20 63 68 61 72 20 31 36
0000040 0a 5b 61 20 20 20 20 20 20 20 20 20 20 20 20 20
0000060 20 20 5d 0a 73 74 72 6c 65 6e 20 6f 66 20 62 75
0000100 66 20 75 74 66 20 63 68 61 72 20 31 37 0a 5b c3
0000120 84 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
0000140 5d 0a
0000142
I.e. the problem shows up when the source buffer contains a 2-byte UTF-8
char and you
1) have LC_ALL=de_DE.UTF-8 in the environment *and*
2) pick that locale up inside the program with setlocale(LC_ALL, "").
You can also see that the output is the plain 2-byte UTF-8 code for the
German letter Ä, followed by 15 blanks, which gives 17 bytes in the case of
"Ä" (and 16 in the case of "a").
The behaviour is the same on SunOS 5.9 and SunOS 5.10, but does not occur
on FreeBSD 8.x or on Linux SLES10.
The man page of setlocale(3C) does not mention any influence of the locale
settings on sprintf(3C), only (logically) on things like strftime,
ctype, ...
What does this mean? Is this a bug? IMHO sprintf(3C) should just add bytes
to a buffer as described in its format string, count the UTF-8 "Ä" as two
bytes regardless of what the two bytes mean, and fill the rest of the
buffer with (in our case) 14 blanks.
Any idea or any pointer to an explanation?
Thanks in advance
Matthias
--
http://www.unixarea.de/
rebelde, 5/31/2010 1:11:34 PM

rebelde wrote:
> i.e. the problem shows up when the source buffer contains a 2-byte UTF-8
> char and you
> 1) have LC_ALL=de_DE.UTF-8 in the env *and*
> 2) set this back inside to LC_ALL=""
You have it in the sprintf(3c) man page:
    If the conversion specifier is s or S, a standard-conforming
    application (see standards(5)) interprets the precision as the
    maximum number of bytes to be written; an application that is not
    standard-conforming interprets the precision as the maximum number
    of columns of screen display. For an application that is not
    standard-conforming, %.5s would print only the portion of the
    string that would display in 5 screen columns. Only complete
    characters are written.
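To make the arithmetic behind the 17 concrete, here is a minimal sketch (not taken from the man page). The function names are made up for illustration, and it assumes that every character occupies exactly one screen column, which holds for "Ä" but not for wide characters. It also assumes that the padding is counted in columns as well, which is what the od dump (15 blanks after the two bytes c3 84) suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count UTF-8 characters by skipping continuation bytes (10xxxxxx). */
size_t utf8_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/*
 * Bytes in a left-justified field of `width` columns when the padding is
 * computed from columns instead of bytes.
 * Assumptions: one column per character, string fits into `width` columns.
 */
size_t field_bytes_by_columns(const char *s, size_t width)
{
    size_t cols = utf8_chars(s);
    size_t pad = cols < width ? width - cols : 0;

    return strlen(s) + pad;
}
```

For "a" this gives 1 byte + 15 blanks = 16 bytes; for "\303\204" it gives 2 bytes + 15 blanks = 17 bytes, matching the strlen values in the original post.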
With your example program on my system (Solaris 10):
{morrigan}~/trash> cc loc.c
"loc.c", line 7: warning: old-style declaration or incorrect type for: main
{morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 17
[Ä               ]
{morrigan}~/trash> c99 loc.c
"loc.c", line 7: warning: old-style declaration or incorrect type for: main
{morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
> what does this mean? is this a bug?
Take a look at standards(5) and define your compilation environment to
better suit your needs. (Invoking c99 is just the simplest way to get a
standard-conforming environment. It's not necessarily the best for your
needs.)
--
.-. .-. Yes, I am an agent of Satan, but my duties are largely
(_ \ / _) ceremonial.
|
| dave@fly.srk.fer.hr
Drazen, 5/31/2010 2:34:54 PM

Drazen Kacar wrote:
> With your example program on my system (Solaris 10):
>
> {morrigan}~/trash> cc loc.c
> "loc.c", line 7: warning: old-style declaration or incorrect type for: main
> {morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
> strlen of buf with ascii char 16
> [a               ]
> strlen of buf utf char 17
> [Ä               ]
> {morrigan}~/trash> c99 loc.c
> "loc.c", line 7: warning: old-style declaration or incorrect type for: main
> {morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
> strlen of buf with ascii char 16
> [a               ]
> strlen of buf utf char 16
> [Ä              ]
>
>> what does this mean? is this a bug?
>
> Take a look at standards(5) and define your compilation environment to
> better suit your needs. (Invoking c99 is just the simplest way to get a
> standard-conforming environment. It's not necessarily the best for your
> needs.)
>
Hello Drazen,
Thanks for your reply and hints. I had looked at standards(5) before, but it
was not really clear to me what was meant by 'columns of screen display';
now I understand what the idea is... In our case, the result of the
sprintf(3C) is to be stored in database columns and needs to be the exact
number of bytes, not something longer.
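For a database field that must be an exact number of bytes, one option that does not depend on how sprintf(3C) interprets width and precision is to do the padding by hand. The following is only a sketch under stated assumptions: pad_to_bytes is a hypothetical helper name, the input is assumed to be valid UTF-8, and the destination must have room for width + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write src into dst as a left-justified field of
 * exactly `width` bytes, padded with blanks and NUL-terminated.
 * dst must have room for width + 1 bytes; src is assumed to be valid UTF-8.
 */
void pad_to_bytes(char *dst, size_t width, const char *src)
{
    size_t len, n;

    assert(dst != NULL && src != NULL);
    len = strlen(src);
    n = len < width ? len : width;

    /* If the cut would land inside a multi-byte sequence, back up to its start. */
    while (n > 0 && n < len && ((unsigned char)src[n] & 0xC0) == 0x80)
        n--;

    memcpy(dst, src, n);              /* the (possibly truncated) text */
    memset(dst + n, ' ', width - n);  /* blanks up to the byte width   */
    dst[width] = '\0';
}
```

With a 17-byte buffer, pad_to_bytes(buf, 16, "\303\204") leaves the two UTF-8 bytes followed by 14 blanks in buf, independent of the locale.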
We're using gcc:
$ gcc --version
gcc (GCC) 3.4.6
....
which does not know the -xc99 flag:
$ gcc -xc99 str.c
gcc: language c99 not recognized
Will check what would be the best way to solve this...
Thanks again
Matthias
--
Matthias Apitz
t +49-89-61308 351 - f +49-89-61308 399 - m +49-170-4527211
e <guru@unixarea.de> - w http://www.unixarea.de/
Solidarity with the zionistic pirates of Israel? Not in my name!
¿Solidaridad con los piratas sionistas de Israel? ¡No en mi nombre!
rebelde, 6/1/2010 1:53:58 PM

On Tue, 01 Jun 2010 15:53:58 +0200, rebelde <guru@unixarea.de> wrote:
> number of bytes, rather something longer.
>
> We're using a gcc
If you want standards conformance, Sun Studio is the better choice.
> $ gcc --version
> gcc (GCC) 3.4.6
> ...
gcc -std=c99 -pedantic is the equivalent.
See you soon
Paul
--
Paul Floyd http://paulf.free.fr
Paul, 6/1/2010 6:07:08 PM

Paul Floyd wrote:
>> $ gcc --version
>> gcc (GCC) 3.4.6
>> ...
>
> gcc -std=c99 -pedantic is the equivalent.
>
But that also gives 17 bytes for %-16.16s in the case of a UTF-8 char:
$ gcc -std=c99 -pedantic str.c
str.c:8: warning: return type defaults to `int'
$ LC_ALL="" ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
$ LC_ALL=de_DE.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 17
[Ä               ]
$
matthias
rebelde, 6/2/2010 8:17:30 AM

On Wed, 02 Jun 2010 10:17:30 +0200, rebelde <guru@unixarea.de> wrote:
> Paul Floyd wrote:
>
>>> $ gcc --version
>>> gcc (GCC) 3.4.6
>>> ...
>>
>> gcc -std=c99 -pedantic is the equivalent.
>>
>
> But gives also 17 byte for %-16.16s in case of a UTF-8 char:
OK, so GCC isn't conforming to the standards.
See you soon
Paul
--
Paul Floyd http://paulf.free.fr
Paul, 6/3/2010 8:00:45 PM

rebelde wrote:
[Sun Studio in standard-compliant mode]
>> gcc -std=c99 -pedantic is the equivalent.
Does gcc -ansi work?
> But gives also 17 byte for %-16.16s in case of a UTF-8 char:
I don't have a Solaris machine at hand to check, but IIRC "standard mode" is
enabled by having a global variable, int __xpg4=1;, in your program (not
that you should do it yourself; the compiler should pass some appropriate
.o to the linker).
Marc, 6/3/2010 9:01:27 PM

Marc wrote:
> rebelde wrote:
>
> [Sun Studio in standard-compliant mode]
>>> gcc -std=c99 -pedantic is the equivalent.
>
> Does gcc -ansi work?
>
>> But gives also 17 byte for %-16.16s in case of a UTF-8 char:
>
> I don't have any solaris at hand to check, but IIRC "standard mode" is
> enabled by having a global variable: int __xpg4=1; in your program (not
> that you should do it yourself, the compiler should pass some appropriate
> .o to the linker).
With int __xpg4=1; in my source, it works:
$ fgrep xpg4 str.c
int __xpg4=1;
$ gcc -ansi str.c
$ LC_ALL=de_DE.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
$ gcc str.c
$ LC_ALL=de_DE.UTF-8 ./a.out
strlen of buf with ascii char 16
[a               ]
strlen of buf utf char 16
[Ä              ]
Using only -ansi, without the variable, does not work either.
Thanks for the trick.
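For the archive, a minimal sketch of what the workaround looks like in source form. It assumes, as described in this thread, that the Solaris libc looks at the global __xpg4 to decide between byte and column semantics; on other systems the variable is simply an unused global. format_field16 is a made-up name for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Solaris-specific assumption (from this thread): a non-zero __xpg4
 * selects the standard-conforming behaviour in libc. Sun's c99 driver
 * normally arranges this at link time; with gcc it is set by hand here.
 */
int __xpg4 = 1;

/*
 * Format s as a left-justified "%-16.16s" field and return the number of
 * bytes written. buf should be generously sized (the test program in this
 * thread uses 80 bytes).
 */
int format_field16(char *buf, const char *s)
{
    assert(buf != NULL && s != NULL);
    return sprintf(buf, "%-*.*s", 16, 16, s);
}
```

Whether defining the variable by hand or linking the object file that the Sun compiler would use is the better choice is left open here; the thread only confirms that the variable has the desired effect with gcc 3.4.6.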
Matthias
rebelde, 6/4/2010 8:13:22 AM