Hi, I have been driven mad by the following, which took ages to track
down. What is going on? It appears it is invalid to use toString on the
document object.
#! /usr/local/bin/perl -w
use strict;
use warnings;
use utf8;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");
my $src= join("",<DATA>);
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );
$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );
__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xc5\x81"></plugin>
(Posted by fergus76 on 7/12/2012 4:46:54 AM)

Quoth fergus@twig-me-uk.not.here (Fergus McMenemie):
> Hi, I have been driven mad by the following, which took ages to track
> down. What is going on? It appears it is invalid to use toString on the
> document object.
>
>
> #! /usr/local/bin/perl -w
> use strict;
> use warnings;
> use utf8;
> use Encode;
> use XML::LibXML;
> binmode(STDOUT, ":utf8");
>
> my $src= join("",<DATA>);
> print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
which is internal to perl and none of your business. (The Encode
documentation is not as clear about this as it might be, because it only
became clear through experience that this is the only approach which
works.)
What are you actually trying to find out?
Ben
(Posted by ben6057 on 7/12/2012 6:29:26 AM)

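Editorial sketch, not part of the thread: the point about the SvUTF8 flag can be seen with two strings that compare equal while the flag differs. If the real question is "are these octets well-formed UTF-8?", a strict decode answers it. The helper name valid_utf8 is hypothetical; only core functions are used (utf8::upgrade, utf8::is_utf8, Encode::decode).

```perl
use strict;
use warnings;
use Encode ();

# Same content (the single character U+00E9), different internal storage.
my $plain    = "\xe9";
my $upgraded = "\xe9";
utf8::upgrade($upgraded);    # changes the internal representation only

my $same_content  = ($plain eq $upgraded)    ? 1 : 0;
my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
print "same content: $same_content, flags: $flag_plain vs $flag_upgraded\n";

# To ask "are these octets well-formed UTF-8?", decode strictly instead.
# valid_utf8 is a hypothetical helper name, not an Encode function.
sub valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

print "valid: ", valid_utf8("\xc5\x81"), "\n";    # octets from the original post
print "valid: ", valid_utf8("\xff\xfe"), "\n";    # never valid in UTF-8
```

The flag says how perl happens to store a string, not whether the string holds characters or octets, so it is a poor test of validity.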
Ben Morrow <ben@morrow.me.uk> wrote:
> Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
> which is internal to perl and none of your business. (The Encode
> documentation is not as clear about this as it might be, because it only
> became clear through experience that this is the only approach which
> works.)
Agreed, the warnings are there. However, it did appear to make the
issue clearer. This example is rather goofy and posting it to Usenet
added a few more wrinkles. My original code and the real program
contained the actual characters. However, my Usenet reader would not
let me post the real chars. Hence the octets.
My issue is that document->toString does not appear to work. Please
ignore the use of is_utf8.
> What are you actually trying to find out?
I have to pass references to DOM objects around all over the
place. I find I am having to make use of either documentElement()
or ownerDocument() depending on what I am doing. I would like to have
a consistent "pattern" for doing this. I would like to settle on
passing the document object around, but it is annoying that I can't then
use toString.
(Posted by fergus76 on 7/13/2012 3:59:03 PM)

Quoth fergus@twig-me-uk.not.here (Fergus McMenemie):
> Agreed, the warnings are there. However it did appear to make the
> issue clearer. This example is rather goofy and posting it to Usenet
> added a few more wrinkles. My original code and the real program
> contained the actual characters. However, my Usenet reader would not
> let me post the real chars. Hence the octets.
It can certainly be difficult, given that Usenet officially doesn't
support anything but ASCII. Unofficially, if you can get your newsreader
to produce it, articles in UTF-8 with 'Content-type: text/plain;
charset=UTF-8' seem to work perfectly well.
Another thing you can do is explicitly decode the data in the program
you post; possibly something like
    my $str = <DATA>;
    $str =~ s/%([0-9a-f][0-9a-f])/chr hex $1/egi;
    $str = Encode::decode "utf8", $str;
This uses URL-encoding rather than backslashes; you can pick whatever is
convenient for the data you are trying to post.
> My issue is that document->toString does not appear to work. Please
> ignore the use of is_utf8.
OK.
> > What are you actually trying to find out?
> I have to pass references to DOM objects around all over the
> place. I find I am having to make use of either documentElement()
> or ownerDocument() depending on what I am doing. I would like to have
> a consistent "pattern" for doing this. I would like to settle on
> passing the document object around, but it is annoying that I can't then
> use toString.
I'm afraid I don't understand. When I run the original program I get the
results I would have expected: the first prints the XML without the
<?xml?>, the second prints it with it. What is going wrong for you?
Ben
(Posted by ben6057 on 7/13/2012 4:51:33 PM)

Ben Morrow <ben@morrow.me.uk> wrote:
> > > What are you actually trying to find out?
> > I have to pass references to DOM objects around all over the
> > place. I find I am having to make use of either documentElement()
> > or ownerDocument() depending on what I am doing. I would like to have
> > a consistent "pattern" for doing this. I would like to settle on
> > passing the document object around, but it is annoying that I can't then
> > use toString.
>
> I'm afraid I don't understand. When I run the original program I get the
> results I would have expected: the first prints the XML without the
> <?xml?>, the second prints it with it. What is going wrong for you?
Thanks for the tip. My code now reads:-
use strict;
use warnings;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");
my $src= join("",<DATA>);
$src =~ s/\\x([0-9a-f][0-9a-f])/chr hex $1/egi;
$src = Encode::decode "utf8", $src;
print "LibXML VERSION=$XML::LibXML::VERSION\n";
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );
$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );
__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xef\xbd\xb1\xef\xbd\xb2\xef\xbd\xb3\xef\xbd\xb4\xef\xbd\xb5"></plugin>
This fails on my Mac running OS X Snow Leopard. The 'real' version is
running with perl 5.12 on CentOS and also fails there. Not sure about the
version of LibXML.
Does it work for you?
(Posted by fergus76 on 7/14/2012 12:22:16 PM)

Quoth fergus@twig-me-uk.not.here (Fergus McMenemie):
> Ben Morrow <ben@morrow.me.uk> wrote:
>
> > > > What are you actually trying to find out?
> > > I have to pass references to DOM objects around all over the
> > > place. I find I am having to make use of either documentElement()
> > > or ownerDocument() depending on what I am doing. I would like to have
> > > a consistent "pattern" for doing this. I would like to settle on
> > > passing the document object around, but it is annoying that I can't then
> > > use toString.
> >
> > I'm afraid I don't understand. When I run the original program I get the
> > results I would have expected: the first prints the XML without the
> > <?xml?>, the second prints it with it. What is going wrong for you?
<snip>
>
> And fails on my mac running OS X Snow Leopard. But the 'real' version is
> running with perl 5.12 on CentOS and also fails there. Not sure about the
> version of LibXML.
>
> Does it work for you?
Yes, it works as documented for me. Are you getting confused by the fact
that ->toString produces a byte string for whole documents, but a
character string for just an element? Read the 'ENCODINGS SUPPORT'
section in perldoc XML::LibXML: you don't want a :utf8 layer if you're
printing a whole document, because the document isn't necessarily in
UTF-8.
Ben
(Posted by ben6057 on 7/14/2012 4:20:47 PM)

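Editorial sketch, not part of the thread: the difference Ben describes can be handled by leaving STDOUT without an encoding layer and encoding only the character string yourself. This assumes the document is declared as UTF-8, as in the original post; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# Octets of a UTF-8 encoded document; \xc5\x81 is U+0141, as in the original post.
my $octets = qq{<?xml version="1.0" encoding="utf-8" standalone="no"?>\n}
           . qq{<plugin name="\xc5\x81"></plugin>\n};

my $doc = XML::LibXML->new->parse_string($octets);

# Element: toString returns a character string, so it must be encoded on output.
my $elem_chars = $doc->documentElement->toString(1);

# Document: toString returns octets already in the document's own encoding.
my $doc_octets = $doc->toString(1);

# No encoding layer on STDOUT; encode the character string explicitly.
binmode(STDOUT, ':raw');
print Encode::encode('UTF-8', $elem_chars), "\n";
print $doc_octets;
```

With a :utf8 layer on STDOUT, as in the original script, the octets from the document would be encoded a second time, which is the likely source of the mangled output.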
In <l906d9-niu2.ln1@anubis.morrow.me.uk>, on 07/13/2012
at 05:51 PM, Ben Morrow <ben@morrow.me.uk> said:
>It can certainly be difficult, given that Usenet officially doesn't
>support anything but ASCII.
What gives you that idea? RFC 5536 explicitly allows MIME-encoded
data, e.g.,
   1.2.  Scope

      This document specifies the syntax of Netnews articles in the
      context of the Internet Message Format [RFC5322] and Multipurpose
      Internet Mail Extensions (MIME) [RFC2045].  This document obsoletes
      [RFC1036],

   2.2.  Header Fields

      o  The character set for header fields is US-ASCII.  Where the
         use of non-ASCII characters is required, they MUST be encoded
         using the MIME mechanisms defined in [RFC2047] and [RFC2231].

   2.3.  MIME Conformance

      User agents MUST meet the definition of MIME conformance in
      [RFC2049] and MUST also support [RFC2231].  This level of MIME
      conformance provides support for internationalization and
      multimedia in message bodies [RFC2045], [RFC2046], and [RFC2231],
      and support for internationalization of header fields [RFC2047]
      and [RFC2231].  Note that [Errata] currently exist for [RFC2045],
      [RFC2046], [RFC2047] and [RFC2231].

   3.2.  Optional Header Fields

      The MIME header fields MIME-Version, Content-Type,
      Content-Transfer-Encoding, Content-Disposition, and
      Content-Language are used in Netnews articles in the same
      circumstances and with the same meanings as those specified in
      [RFC2045], [RFC2183], and [RFC3282], with the added restrictions
      detailed above in Section 2.2.

   4.  Internationalization Considerations

      Internationalization of Netnews article header fields and bodies
      is provided using the MIME mechanisms discussed in Section 2.3.
      Note that the generation of internationalized <newsgroup-name>s
      for use in header fields is not addressed in this document.

Now, admittedly there's still a lot of software written to RFC 1036
and son-of-1036, but 5536 is a few years old and is on the standards
track, so officially MIME is allowed.
> Unofficially, if you can get your newsreader to produce it,
>articles in UTF-8 with 'Content-type: text/plain; charset=UTF-8'
>seem to work perfectly well.
Not just de facto but de jure; see above for what is officially
allowed.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spamtrap@library.lspace.org
(Posted by spamtrap16 on 7/15/2012 2:28:22 AM)

Quoth Shmuel (Seymour J.) Metz <spamtrap@library.lspace.org.invalid>:
> In <l906d9-niu2.ln1@anubis.morrow.me.uk>, on 07/13/2012
> at 05:51 PM, Ben Morrow <ben@morrow.me.uk> said:
>
> >It can certainly be difficult, given that Usenet officially doesn't
> >support anything but ASCII.
>
> What gives you that idea? RFC 5536 explicitly allows MIME-encoded
> data, e.g.,
Ooh, they've actually published an update. I didn't know that.
Ben
(Posted by ben6057 on 7/15/2012 10:02:45 PM)

Ben Morrow <ben@morrow.me.uk> wrote:
> > What gives you that idea? RFC 5536 explicitly allows MIME-encoded
> > data, e.g.,
>
> Ooh, they've actually published an update. I didn't know that.
My newsreader does not properly support UTF-8. I guess lots of others still
don't either.
MacSoup - my soup's gone off!
(Posted by fergus76 on 7/17/2012 6:21:11 AM)

Ben Morrow <ben@morrow.me.uk> wrote:
> Yes, it works as documented for me. Are you getting confused by the fact
> that ->toString produces a byte string for whole documents, but a
> character string for just an element? Read the 'ENCODINGS SUPPORT'
> section in perldoc XML::LibXML: you don't want a :utf8 layer if you're
> printing a whole document, because the document isn't necessarily in
> UTF-8.
Duh!
Thanks, I don't know how I managed to miss that bit.
(Posted by fergus76 on 7/17/2012 6:21:13 AM)

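Editorial sketch, not part of the thread: one way to get the consistent "pattern" asked for earlier is a small wrapper that always returns a character string, whether it is handed a document or an element. The helper name node_to_chars is hypothetical; it assumes the document's declared encoding is one that Encode knows, and falls back to UTF-8 when none is declared.

```perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# Hypothetical helper: return a character string for any node, document or not.
sub node_to_chars {
    my ($node, $format) = @_;
    $format = 1 unless defined $format;
    if ($node->isa('XML::LibXML::Document')) {
        # Documents serialize to octets in their own encoding, so decode them.
        my $encoding = $node->encoding || 'UTF-8';
        return Encode::decode($encoding, $node->toString($format));
    }
    return $node->toString($format);    # other nodes already give characters
}

my $doc = XML::LibXML->new->parse_string(
    qq{<?xml version="1.0" encoding="utf-8"?>\n<plugin name="\xc5\x81"/>\n}
);

# An encoding layer is safe here because everything printed is characters.
binmode(STDOUT, ':encoding(UTF-8)');
print node_to_chars($doc->documentElement), "\n";
print node_to_chars($doc);
```

The decoded document string still carries its original XML declaration, so this suits logging and debugging. For writing a file, printing the octets from the document's toString to a handle with no encoding layer remains the simpler route.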
In <1knd7af.1e602xit3x7eN%fergus@twig-me-uk.not.here>, on 07/17/2012
at 07:21 AM, fergus@twig-me-uk.not.here (Fergus McMenemie) said:
>My newsreader does not properly support UTF-8. I guess lots of others
>still don't either.
Mine doesn't either[1], but that's an issue with my news client, not
with the Usenet standards.
[1] But at least it supports the ISO-8859-* character sets.
(Posted by spamtrap16 on 7/19/2012 2:43:43 PM)