Different results parsing an XML file with XML::Simple (XML::SAX vs. XML::Parser)

Hello Usenet.

I'm subject to some confusion with XML and UTF-8. I'm working with
XML::Simple and I'm trying to decode some XML with German umlauts
(ISO-8859-1). The first XML line declares the encoding correctly (see code
below). But I'm getting different results using XML::Simple with the
default XML parser, XML::SAX, and a second parser, XML::Parser.
The following code tries to decode the mini XML file and prints the UTF8
flags of the resulting strings.

Can someone run this code on his machine and post the results? Thanks.
The results on my machine are these:

���äöü� (0) cmp ������� (0) = -1
������� (1) cmp ������� (0) = 0

The first line was parsed by XML::SAX and the second line by
XML::Parser. My conclusions:

1) Line 1 is wrong, line 2 is correct.
2) The output should be line 2 both times.
3) There is a bug in XML::SAX.

Your opinion?

The code (written in ISO-8859-1 on disc):

#!/usr/bin/perl -w

use strict;
use warnings;

use XML::Simple;
use Encode;

foreach (1..2)
{
   my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>�������</a>");
   my $q2 = "�������";

   printf "%s (%d) cmp %s (%d) = %d\n"
      ,   $q1, Encode::is_utf8($q1)
      ,   $q2, Encode::is_utf8($q2)
      ,   $q1 cmp $q2;
   # switch to the non-default parser for the second iteration
   $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
}

PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and 
expat-1.95.8.
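
A minimal sketch of one way to see what each parser actually returned, assuming only the core Encode module: dump the code points in hex instead of printing the raw characters, so the terminal encoding cannot confuse the picture. The helper dump_chars and the three sample strings are made up for illustration; they stand in for parser results and do not call XML::Simple at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# dump_chars() is a hypothetical helper: a string's code points in hex,
# so the output is pure ASCII whatever the terminal encoding is.
sub dump_chars {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $s;
}

my $latin1_bytes = "\xE4\xF6\xFC";                       # a-, o-, u-umlaut as ISO-8859-1 bytes
my $characters   = decode('iso-8859-1', $latin1_bytes);  # the same three characters, decoded
my $utf8_bytes   = encode('UTF-8', $characters);         # six undecoded UTF-8 bytes

# Print code points, UTF8 flag, and comparison against the Latin-1 literal.
for my $s ($latin1_bytes, $characters, $utf8_bytes) {
    printf "%-18s flag=%d cmp=%d\n",
        dump_chars($s), (is_utf8($s) ? 1 : 0), $s cmp $latin1_bytes;
}
```

In this sketch the Latin-1 bytes and the decoded string compare equal even though only the second has the UTF8 flag set, while the undecoded UTF-8 bytes compare as -1 with the flag off. That is the same pattern as line 1 of the results above, which might mean that parser handed back UTF-8 bytes without decoding them, but that is only a guess.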

-- 
So long... Fuzz
fuzz
3/2/2006 3:17:12 PM
comp.lang.perl.misc

fuzz@uni-paderborn.de (Erik Wasser) wrote in
news:nsejd3-rep.ln1@wasser-7359.user.cis.dfn.de: 

> [question, results, and test script snipped]

First off, let me say I don't know much about this stuff. I am on the US 
English version of XP. I copied and pasted the code above into Gvim, and 
then ran it. I got:


D:\Home\asu1\UseNet\clpmisc> r > results.txt

D:\Home\asu1\UseNet\clpmisc> cat results.txt
������� (1) cmp ������� (0) = 0
������� (1) cmp ������� (0) = 0

I would be inclined to look at what changed in XML-SAX between versions 
0.12 and 0.13, but then, as I said, I don't know much about encodings 
etc.

I have XML-SAX-0.12 and XML-Parser-2.34 and

D:\Home\asu1\UseNet\clpmisc> perl -v

This is perl, v5.8.7 built for MSWin32-x86-multi-thread
(with 14 registered patches, see perl -V for more detail)

Copyright 1987-2005, Larry Wall

Binary build 815 [211909] provided by ActiveState 
http://www.ActiveState.com
ActiveState is a division of Sophos.
Built Nov  2 2005 08:44:52

Sinan
-- 
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

A
3/2/2006 3:57:53 PM
On Thu, 02 Mar 2006 15:57:53 GMT, "A. Sinan Unur" <1usa@llenroc.ude.invalid> wrote:

>fuzz@uni-paderborn.de (Erik Wasser) wrote in
>news:nsejd3-rep.ln1@wasser-7359.user.cis.dfn.de: 
>
>> I'm subject to some confusion with XML and UTF-8. I'm working with
>> XML::Simple and I'm trying to decode some XML with German umlauts
>> (ISO-8859-1). The first XML line declares the encoding correctly (see
>> code below). But I'm getting different results using XML::Simple with
>> the default XML parser, XML::SAX, and a second parser, XML::Parser.
>> The following code tries to decode the mini XML file and prints the
>> UTF8 flags of the resulting strings.
>> 
>> Can someone run this code on his machine and post the results? Thanks.
>> The results on my machine are these:
>> 

You didn't try to decode in German! You might have changed the "code page"
to German to get a different character set, but that doesn't matter: I'm
looking at your characters in whatever "code page" is on my machine. UTF-8
is an encoding of Unicode. It's not discernible unless you have a
Unicode-aware renderer. You can't just cut & paste the characters on the
page and expect them to turn into Unicode. If you open or save a Unicode
document in a Unicode-aware editor, the displayed characters are not
noticeable as Unicode, so it's not something that can be cut 'n pasted into
a newsgroup as code to be tested! UTF-8, even when multi-byte, is
transparent to the user and known only to the renderer. Unicode data that
is read from a file into a parser (or into a UTF-8-aware Perl program) is
treated as Unicode in its variable representation and in its interaction
with other variables. If a regex is applied to Unicode data from a
Unicode-aware Perl parser, it works every time.
robic0
3/5/2006 1:30:09 AM
On Sat, 04 Mar 2006 17:30:09 -0800, robic0 wrote:

>You didn't try to decode in German! You might have changed the "code page"
>to German to get a different character set, but that doesn't matter: I'm
>looking at your characters in whatever "code page" is on my machine. UTF-8
>is an encoding of Unicode. It's not discernible unless you have a
>Unicode-aware renderer. You can't just cut & paste the characters on the
>page and expect them to turn into Unicode. If you open or save a Unicode
>document in a Unicode-aware editor, the displayed characters are not
>noticeable as Unicode, so it's not something that can be cut 'n pasted into
>a newsgroup as code to be tested! UTF-8, even when multi-byte, is
>transparent to the user and known only to the renderer. Unicode data that
>is read from a file into a parser (or into a UTF-8-aware Perl program) is
>treated as Unicode in its variable representation and in its interaction
>with other variables. If a regex is applied to Unicode data from a
>Unicode-aware Perl parser, it works every time.

Just a followup: I know your question was about XML, but if you want to use
Unicode outside the ASCII range (0-127) in a regex, you might want to use the
code points as in this simple example (which just lists various ranges):

# Non-ASCII code point ranges, stored as regex character-class fragments
my @UC_Nstart = (
"\\x{C0}-\\x{D6}",
"\\x{D8}-\\x{F6}",
"\\x{F8}-\\x{2FF}",
"\\x{370}-\\x{37D}",
"\\x{37F}-\\x{1FFF}",
"\\x{200C}-\\x{200D}",
"\\x{2070}-\\x{218F}",
"\\x{2C00}-\\x{2FEF}",
"\\x{3001}-\\x{D7FF}",
"\\x{F900}-\\x{FDCF}",
"\\x{FDF0}-\\x{FFFD}",
"\\x{10000}-\\x{EFFFF}",
);
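
A small sketch of how range fragments like these could be used, since the list on its own does not match anything: join them into one bracketed character class and compile that. Only the first three ranges are repeated here, and the names @ranges, $only_in_ranges and in_ranges are made up for illustration. (The full list looks like the non-ASCII part of the XML NameStartChar production, though the post does not say so.)

```perl
use strict;
use warnings;

# A few of the range fragments from the list above. Single quotes keep the
# backslashes literal, so the regex engine sees \x{C0}-\x{D6} and so on.
my @ranges = ('\x{C0}-\x{D6}', '\x{D8}-\x{F6}', '\x{F8}-\x{2FF}');

# Join the fragments into one bracketed character class and compile it.
my $class          = '[' . join('', @ranges) . ']';
my $only_in_ranges = qr/\A$class+\z/;

# in_ranges() is a hypothetical helper: 1 if every character of the string
# falls inside one of the ranges, 0 otherwise.
sub in_ranges {
    my ($s) = @_;
    return $s =~ $only_in_ranges ? 1 : 0;
}

printf "%-8s %d\n", 'E4 F6 FC', in_ranges("\xE4\xF6\xFC");   # a-, o-, u-umlaut
printf "%-8s %d\n", 'D7',       in_ranges("\xD7");           # multiplication sign
printf "%-8s %d\n", 'abc',      in_ranges("abc");            # plain ASCII
```

The umlauts fall inside the second and third ranges, while the multiplication sign (D7) sits in the gap between the first two and plain ASCII is outside all of them.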
robic0
3/5/2006 1:43:12 AM
robic0 wrote:

> Just a followup: I know your question was about XML, but if you want to use
> Unicode outside the ASCII range (0-127) in a regex, you might want to use the
> code points as in this simple example (which just lists various ranges):

My question was: why are two XML parsers giving different results? It is the
different results that confuse me, not Unicode itself.

-- 
So long... Fuzz
fuzz
3/5/2006 11:49:22 AM
Erik Wasser wrote:

[XML::Simple gives correct results with XML::Parser, but wrong results
with XML::SAX]

> My question was: why are two XML parsers giving different results?
> It is the different results that confuse me, not Unicode itself.

Looks like a bug in XML::SAX or one of the libraries it uses.
However, like Sinan, I cannot reproduce it here on a Debian Sarge
system:

perl, v5.8.4 built for i386-linux-thread-multi
XML::Simple version 2.14
XML::SAX version 0.12
XML::Parser version 2.34
libexpat1      1.95.8-3

So it may be caused by something weird in your environment.
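
Since the systems in this thread differ in their XML::SAX version, a quick way to compare environments is to print the module versions on each machine. This is only a sketch: module_version is a hypothetical helper, and the script is written to run even where the XML modules are missing. XML::SAX->parsers is, as far as I know, the documented way to list the backend parsers registered with XML::SAX; whether the choice of backend matters for this problem is an assumption.

```perl
use strict;
use warnings;

# module_version() is a hypothetical helper: the version of an installed
# module, or undef if the module cannot be loaded (or declares no version).
sub module_version {
    my ($module) = @_;
    return unless $module =~ /\A\w+(?:::\w+)*\z/;   # accept plain module names only
    return unless eval "require $module; 1";
    return $module->VERSION;
}

for my $module (qw(XML::Simple XML::SAX XML::Parser)) {
    my $version = module_version($module);
    printf "%-12s %s\n", $module, defined $version ? $version : 'not available';
}

# XML::SAX is a front end for several backend parsers; list the registered ones.
if (eval { require XML::SAX; 1 }) {
    print "SAX backend: $_->{Name}\n" for @{ XML::SAX->parsers };
}
```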

        hp

-- 
This is not a signature
Peter
3/5/2006 10:09:03 PM
On Sun, 05 Mar 2006 23:09:03 +0100, "Peter J. Holzer" <hjp-usenet2@hjp.at> wrote:

>Erik Wasser wrote:
>
>[XML::Simple gives correct results with XML::Parser, but wrong results
>with XML::SAX]
>
>> My question was: why are two XML parsers giving different results?
>> It is the different results that confuse me, not Unicode itself.
>
>Looks like a bug in XML::SAX or one of the libraries it uses.
>However, like Sinan, I cannot reproduce it here on a Debian Sarge
>system:
>
>perl, v5.8.4 built for i386-linux-thread-multi
>XML::Simple version 2.14
>XML::SAX version 0.12
>XML::Parser version 2.34
>libexpat1      1.95.8-3
>
>So it may be caused by something weird in your environment.
>
>        hp
I'm going to have to agree. Using many parsers at the same time will
cause either slowdowns or indeterminate results.
XML::SAX is not a good parser. Having "SAX" (Simple API for XML) in the
name is no bellwether of its functionality or performance.
After using XML::SAX in place of expat once, performance fell off
by 800%. If you are going to parse, capture, and expand a string
to be converted into a hash later, be careful about what you use and
how it's used. Expat and XML::Simple (with the expat directive) are a good
combination. Good cleanup is required. Keep your instantiation,
single operation, and closure scoped to a sub. If you are doing schema
checking with Xerces, keep that in a different scope, as a preliminary to
data-extraction parsing.
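
One possible reading of "sub-scoped", sketched under the assumption that XML::Simple and XML::Parser are both installed: pick the parser inside a sub with local, so the choice does not leak into the rest of the program. parse_with_expat is a hypothetical wrapper, not part of XML::Simple.

```perl
use strict;
use warnings;

# parse_with_expat() is a hypothetical wrapper, not part of XML::Simple.
sub parse_with_expat {
    my ($xml) = @_;
    require XML::Simple;
    no warnings 'once';
    # local() limits the parser choice to this call; the previous value
    # is restored automatically when the sub returns.
    local $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
    return XML::Simple::XMLin($xml);
}

# Only run the demo if both modules are installed.
if (eval { require XML::Simple; require XML::Parser; 1 }) {
    my $text = parse_with_expat(
        "<?xml version='1.0' encoding='iso-8859-1'?>\n<a>\xE4\xF6\xFC</a>");
    # Print the code points in hex so the terminal encoding does not matter.
    print join(' ', map { sprintf '%02X', ord } split //, $text), "\n";
}
```

Unlike assigning the package variable inside a loop, this leaves the default parser untouched for any other code that calls XMLin.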

Any questions, let me know.

robic0
3/9/2006 12:54:10 AM