XMLgawk question about UNPARSED text
Hi All,
The XMLgawk documentation says: "XMLUNPARSED indicates a text that did
not fit into any other category."
http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Sorting-out-all-kinds-of-data-from-an-XML-file
However, I have now noticed that basically all my XML documents contain
UNPARSED text: the newline character(s) after the XML declaration and
the empty line(s) at the end of the document are reported as UNPARSED
text, e.g. in this test.xml document:
> cat -A test.xml
<?xml version="1.0" encoding="UTF-8"?>$
<root>$
<parent>$
<child>Some character data here</child>$
</parent>$
</root>$
$
As far as I can see, the only way to avoid UNPARSED text is to reformat
the XML document like this:
> cat -A test2.xml
<?xml version="1.0" encoding="UTF-8"?><root>$
<parent>$
<child>Some character data here</child>$
</parent>$
Is this a feature or a bug? In particular, the other newline
characters in my test files above are reported as XMLCHARDATA
rather than XMLUNPARSED.
Thanks in advance, Hermann
Posted by peifer, 9/17/2007 9:47:00 AM

On Sep 17, 11:47 am, Hermann <pei...@gmx.net> wrote:
> > cat -A test2.xml
>
> <?xml version="1.0" encoding="UTF-8"?><root>$
> <parent>$
> <child>Some character data here</child>$
> </parent>$
The last line got lost in the above example. It should of course read:
> cat -A test2.xml
<?xml version="1.0" encoding="UTF-8"?><root>$
<parent>$
<child>Some character data here</child>$
</parent>$
</root>
Hermann
Posted by hpeifer, 9/17/2007 10:00:01 AM

Hermann wrote:
> Hi All,
>
> The XMLgawk documentation says: "XMLUNPARSED indicates a text that did
> not fit into any other category."
> http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Sorting-out-all-kinds-of-data-from-an-XML-file
>
> However, I saw now that basically all my XML documents have UNPARSED
> text. These are the newline character(s) after the XML declaration and
> empty line(s) at the end of an XML document which are reported as
> UNPARSED text, e.g. in this test.xml document:
>
>>cat -A test.xml
>
> <?xml version="1.0" encoding="UTF-8"?>$
> <root>$
> <parent>$
> <child>Some character data here</child>$
> </parent>$
> </root>$
> $
>
> As far as I can see, the only way to avoid UNPARSED text is to reformat
> the XML document like this:
>
>>cat -A test2.xml
>
> <?xml version="1.0" encoding="UTF-8"?><root>$
> <parent>$
> <child>Some character data here</child>$
> </parent>$
> </root>$
>
> Is this a feature or a bug (in particular as the other newline
> characters in my test files above are reported as being XMLCHARDATA
> rather than XMLUNPARSED
This is a feature of the Expat XML parser used by XMLgawk. The XML
standard states:
"2.4 Character Data and Markup
Text consists of intermingled character data and markup. [Definition:
Markup takes the form of start-tags, end-tags, empty-element tags,
entity references, character references, comments, CDATA section
delimiters, document type declarations, processing instructions, XML
declarations, text declarations, and any white space that is at the top
level of the document entity (that is, outside the document element and
not inside any other markup).]
[Definition: All text that is not markup constitutes the character data
of the document.]"
Please note that whitespace outside the document root element is defined
as markup, and not as character data.
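
For anyone who wants to see this classification outside XMLgawk, here is
a minimal Python sketch that drives Expat directly through the standard
xml.parsers.expat module. It assumes that XMLUNPARSED corresponds roughly
to whatever Expat passes to its default handler, which this thread
suggests but does not spell out. The names classify and unparsed, and the
event labels, are made up for illustration and are not XMLgawk API.

```python
import xml.parsers.expat


def classify(xml_text):
    """Return a list of (label, data) events as reported by Expat.

    The labels only imitate XMLgawk's event names (assumption).
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Dedicated handlers: anything they cover is NOT sent to the default handler.
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: events.append(("DECLARATION", version))
    )
    parser.StartElementHandler = lambda name, attrs: events.append(("STARTELEM", name))
    parser.EndElementHandler = lambda name: events.append(("ENDELEM", name))
    parser.CharacterDataHandler = lambda data: events.append(("CHARDATA", data))
    # Catch-all: text for which no other handler applies.
    parser.DefaultHandler = lambda data: events.append(("UNPARSED", data))
    parser.Parse(xml_text.encode("utf-8"), True)
    return events


def unparsed(events):
    """Concatenate everything that ended up in the catch-all bucket."""
    return "".join(data for label, data in events if label == "UNPARSED")


# test.xml from the original post: newline after the declaration,
# newline plus an empty line after </root>.
TEST_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root>\n"
    "<parent>\n"
    "<child>Some character data here</child>\n"
    "</parent>\n"
    "</root>\n"
    "\n"
)

# test2.xml: no white space outside the root element at all.
TEST2_XML = (
    '<?xml version="1.0" encoding="UTF-8"?><root>\n'
    "<parent>\n"
    "<child>Some character data here</child>\n"
    "</parent>\n"
    "</root>"
)

if __name__ == "__main__":
    for name, doc in (("test.xml", TEST_XML), ("test2.xml", TEST2_XML)):
        print(name, "unparsed:", repr(unparsed(classify(doc))))
```

On the two documents from this thread, the first should show only white
space in the catch-all bucket and the second nothing at all, while the
newlines inside <root> arrive through the character data handler.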
Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
Posted by Manuel, 9/17/2007 11:16:43 AM

On Sep 17, 1:16 pm, Manuel Collado <m.coll...@see-signature.ls.fi.upm.es> wrote:
>
> Please note that whitespace outside the document root element is defined
> as markup, and not as character data.
>
Thanks. I wasn't aware of this detail. Now I know better.
Hermann
Posted by hpeifer, 9/17/2007 4:10:39 PM