read an XML file in TCL

Hello
  I am looking to read an XML file into my TCL program and extract
information. How do we read an XML file in TCL? Are there man pages
that describe these?

Thanks
krithiga

krithiga81 (141)
1/20/2006 5:53:37 PM
comp.lang.tcl

one way is:

http://wiki.tcl.tk/tdom

1/20/2006 7:11:55 PM
another (and faster, but should be used only for valid stuff) is

http://wiki.tcl.tk/11020

Nite4Hawks (25)
1/20/2006 7:22:30 PM
Torsten Edler wrote:
> another (and faster, but should be used only for valid stuff) is
> 

I'm curious as to this assertion that the shallow regexp parser is 
faster than tdom, I would have expected things to be the other way 
around. Do you have a benchmark? Were you testing tdom building a DOM 
tree or just SAX parsing (the latter would be a fairer comparison)?

My own brief benchmark (parsing the XML 1.0 spec, ~200KB) produced these 
results:

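package require tdom

# Tokenise with the regexp shallow parser under discussion,
# discarding every token until EOF is reached.
proc ParseXMLRegexp xml {
     XML::Init $xml
     while 1 {
         lassign [XML::NextToken] type val attr etype
         if {$type eq "EOF"} { break }
     }
}
# Build a full DOM tree with tDOM. Because the document is bound to the
# local variable doc, it is destroyed when the proc returns.
proc ParseXMLTdom xml {
     dom parse $xml doc
}
# $xml holds the document text, read from the file beforehand.
puts "Regexp: [time { ParseXMLRegexp $xml } 10]"
puts "Tdom:   [time { ParseXMLTdom $xml } 10]"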

output:
Regexp: 2419835.7 microseconds per iteration
Tdom:   52872.1 microseconds per iteration

which makes tdom roughly 45X faster than the regular expression parser, 
and that's with tdom building and destroying a full DOM tree.
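
For the SAX-only side of the comparison, a rough sketch could look like the following. It assumes tDOM is installed and uses its expat command with a do-nothing handler; the proc name ParseXMLTdomSax and the generated sample document are made up for illustration and are not part of the benchmark above.

```tcl
package require tdom

# Hypothetical no-op handler, so the parser does some callback work.
proc Ignore args {}

# Stream the document through tDOM's expat parser without building a tree.
proc ParseXMLTdomSax xml {
    set parser [expat -elementstartcommand Ignore]
    $parser parse $xml
    $parser free
}

# Made-up sample input; substitute the document you actually want to time.
set xml "<doc>[string repeat {<item n="1">text</item>} 1000]</doc>"

puts "Tdom SAX: [time { ParseXMLTdomSax $xml } 10]"
```

The timing depends on the machine and the document, so no figures are given here.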

-- Neil
nem3909 (1000)
1/21/2006 3:31:08 PM
Hi Neil,

I typed my answer a bit too fast and mixed up TclXML with TclDOM.

A few months ago I had to read and extract all the data from an XML file. The
file was an export of an MS Access table with 17000 records of
roughly 50 columns each, 16 MB in total. (Aside: why not use
CSV files? Memo fields with lots of funny characters.) Basically I
wanted to import the data from Access into an Oracle DB. (Aside: there are
other and much faster ways, but in the time frame given this was my
first try; later I did it with tclodbc.)

So my task went beyond "parsing"; it was also "extracting" from the file
(which I guess is also the OP's task at hand). I started out with TclXML
and it proved to be very slow. So, searching the Tcl wiki for XML, I
found references to TclDOM and the page I mentioned.

I have to admit I was not at all interested in learning DOM, XML and
all the techniques around it; I just wanted to get my data out of that
file (I even considered regex-ing it myself). So I decided on the nifty
little interface of the script. The documentation for DOM and a little
sentence in the ActiveTcl help files

<<The DOM specification should be read in conjunction with this
reference manual, as it explains the meaning and purpose of the various
interfaces. This manual is not a tutorial on how to use the DOM>>

already gave me the creeps, as I was short on time and therefore not
willing to look at too many docs.

So I actually cannot say whether TclDOM will run faster or
slower. But when it comes to "learning time" and dodging the complexity
of TclDOM, I would still decide the same.

Nite4Hawks (25)
1/21/2006 4:44:14 PM
krithiga81@yahoo.com wrote:
>   I am looking to read an XML file into my TCL program and extract
> information. How do we read an XML file in TCL? Are there man pages
> that describe these?

Basically, there are three ways to do this:

1. as a stream of data (faster, uses less memory, but requires you to
write more code),
2. as a tree of nodes (i.e. DOM; not so fast, uses a lot of memory, but
good when you have to make multiple passes over the data and/or need to
analyse the document's structure),
3. using XSLT (generally most useful when you have to produce a
corresponding output document; if you are extracting data for use in
your Tcl program then this option would only be for advanced users ;-) )

For option 1 you would use SAX (or something like it).  TclXML provides
this level of interface.

For option 2 you would use DOM.  TclDOM provides this level of
interface.

For option 3 you would use TclXSLT.

For documentation and man pages on these three packages see
http://tclxml.sf.net/.

HTH,
Steve Ball

Steve.Ball (89)
1/22/2006 11:59:32 PM
Torsten Edler wrote:
> Hi Neil,
> 
....
> I have to admit i was not at all interested in learning DOM, XML and
> all the techniques around it, i just wanted to get my data out of that
> file (even considered regex-ing myself). 

A very sane attitude to take! :-)

> So i decided for the little
> nifty interface of the script. The documentation for DOM and a little
> sentence in the ActiveTcl help files
> 
> <<The DOM specification should be read in conjunction with this
> reference manual, as it explains the meaning and purpose of the various
> interfaces. This manual is not a tutorial on how to use the DOM>>
> 
> already gave me the creeps as i was short on time and therefore not
> willing to look at too many docs.

That's understandable. However, building a DOM tree is only one option. 
There is also a SAX-style interface (SAX = Simple API for XML) which 
both tdom and TclXML provide, where you get a series of events as a 
document is parsed that you can respond to. The regexp shallow parser on 
the wiki is another separate technique.
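
As a rough illustration of the event-based style, here is a minimal sketch using tDOM's expat command (assuming tDOM is installed). The sample document, the handler names onStart/onText/onEnd, and the choice to collect <name> elements are all made up for the example.

```tcl
package require tdom

# Hypothetical sample data; element and attribute names are invented.
set xml {<people>
  <person id="1"><name>Alice</name></person>
  <person id="2"><name>Bob</name></person>
</people>}

set names  {}   ;# collected contents of <name> elements
set inName 0    ;# true while the parser is inside a <name> element
set buffer ""   ;# character data may arrive in several chunks

# Called for every start tag with the tag name and attribute list.
proc onStart {tag attrs} {
    global inName buffer
    if {$tag eq "name"} {
        set inName 1
        set buffer ""
    }
}
# Called for runs of character data.
proc onText {data} {
    global inName buffer
    if {$inName} { append buffer $data }
}
# Called for every end tag.
proc onEnd {tag} {
    global inName buffer names
    if {$tag eq "name"} {
        lappend names $buffer
        set inName 0
    }
}

set parser [expat -elementstartcommand onStart \
                  -characterdatacommand onText \
                  -elementendcommand onEnd]
$parser parse $xml
$parser free

puts $names
```

Nothing is kept in memory beyond what the handlers store themselves, which is why this style suits large files such as database exports.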

> 
> So actually i am not able to say whether TclDOM will run faster or
> slower. But when it comes to "learning time" and dodging the complexity
> of TclDOM, i would still decide the same.

Just to clear up, there are 2 main "vendors" (for want of a better word) 
of XML processing packages for Tcl:

* The TclXML/TclDOM/TclXSLT project (http://tclxml.sourceforge.net/);
* The tDOM project (http://www.tdom.org/).

Both projects provide roughly the same functionality:

* A SAX event-based stream parser (fast, lightweight);
* A DOM tree parser (holds entire doc in memory, flexible);
* An XPath expression engine, for querying DOM trees;
* An XSLT processor for transforming DOM trees using an XML-based 
declarative transformation language (XSLT);
* Some sort of easy XML-generation library (xmlgen vs. tdom's 
createNodeCmd).

There are other bits and pieces, but those are the most fundamental 
technologies that people doing much XML work should be aware of. In 
tDOM, these are all just part of the one package. In TclXML the 
technologies are broken into separate packages: TclXML (SAX), TclDOM 
(DOM), TclXSLT (well, you get the picture :). The TclXML project also 
hosts a number of different implementations such as a pure-Tcl version 
and versions based on the Gnome libxml stuff.
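
To make the DOM and XPath items concrete, here is a minimal sketch with tDOM (assuming it is installed). The sample document and its element names are invented for the example; only the general pattern of parse, query, and delete is the point.

```tcl
package require tdom

# Hypothetical sample data; element and attribute names are invented.
set xml {<people>
  <person id="1"><name>Alice</name></person>
  <person id="2"><name>Bob</name></person>
</people>}

# Build the in-memory tree and get its root element.
set doc  [dom parse $xml]
set root [$doc documentElement]

# Query the tree with XPath and pull out an attribute and a child's text.
set result {}
foreach node [$root selectNodes {/people/person}] {
    set id   [$node getAttribute id]
    set name [$node selectNodes {string(name)}]
    lappend result $id $name
}

# Free the tree explicitly when finished with it.
$doc delete

puts $result
```

The whole document sits in memory while the tree exists, so for very large files the event-based parser may be the better fit.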

Now, next to all this is the regexp shallow parser on the wiki which is 
an entirely different approach. If it worked for you, great! However, 
the general advice to someone new to XML processing with Tcl would be to 
check out one of the above projects, which are known to work well and 
conform to the XML specifications. I certainly understand a reaction 
against the complexity and jargon surrounding XML technologies and just 
wanting to get things done. However, using a basic SAX parser isn't too 
difficult. For example, see http://wiki.tcl.tk/2741 .

Cheers,

-- Neil
nem3909 (1000)
1/23/2006 3:18:53 PM
Hi Neil
    In the tdom proc for parsing XML, where is the XML file specified? Is it
the argument to the procedure, or should the XML file be opened and read
into a variable first?

Thanks
kc
Neil Madden wrote:
> Torsten Edler wrote:
> > another (and faster, but should be used only for valid stuff) is
> >
>
> I'm curious as to this assertion that the shallow regexp parser is
> faster than tdom, I would have expected things to be the other way
> around. Do you have a benchmark? Were you testing tdom building a DOM
> tree or just SAX parsing (the latter would be a fairer comparison)?
>
> My own brief benchmark (parsing the XML 1.0 spec, ~200KB) produced these
> results:
>
> proc ParseXMLRegexp xml {
>      XML::Init $xml
>      while 1 {
>          lassign [XML::NextToken] type val attr etype
>          if {$type eq "EOF"} { break }
>      }
> }
> proc ParseXMLTdom xml {
>      dom parse $xml doc
> }
> puts "Regexp: [time { ParseXMLRegexp $xml } 10]"
> puts "Tdom:   [time { ParseXMLTdom $xml } 10]"
>
> output:
> Regexp: 2419835.7 microseconds per iteration
> Tdom:   52872.1 microseconds per iteration
>
> which makes tdom roughly 45X faster than the regular expression parser,
> and that's with tdom building and destroying a full DOM tree.
> 
> -- Neil

krithiga81 (141)
1/23/2006 7:26:52 PM
krithiga81@yahoo.com wrote:
> Hi Neil
>     In the proc dom for xml where is the xml file specified. Is it the
> argument to the procedure or should the xml file be reads as file open
> and set to a variable?

In the test I ran, the XML was read into a variable:

proc readfile file {
     set fid [open $file]
     set xml [read $fid]
     close $fid
     return $xml
}
set xml [readfile somexmlfile.xml]
ParseXMLTdom $xml

etc..

That's not ideal as it doesn't handle XML encodings etc properly. tDOM 
has a [tDOM::xmlReadFile] that you can use, and there are other options 
for dealing with channels etc.
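
A minimal sketch of those two routes, assuming tDOM is installed. The file name sample-people.xml and its contents are made up so that the example is self-contained; tDOM::xmlOpenFile is used here as the channel-based counterpart of tDOM::xmlReadFile.

```tcl
package require tdom

# Write a small, made-up file so the sketch is self-contained.
set path [file join [pwd] sample-people.xml]
set out [open $path w]
fconfigure $out -encoding utf-8
puts $out {<?xml version="1.0" encoding="UTF-8"?>}
puts $out {<people><person id="1"><name>Alice</name></person></people>}
close $out

# Option A: let tDOM read the file, honouring the declared encoding.
set doc   [dom parse [tDOM::xmlReadFile $path]]
set nameA [[$doc documentElement] selectNodes {string(person/name)}]
$doc delete

# Option B: parse straight from a channel, without holding the text
# in a Tcl variable first.
set fd    [tDOM::xmlOpenFile $path]
set doc   [dom parse -channel $fd]
close $fd
set nameB [[$doc documentElement] selectNodes {string(person/name)}]
$doc delete

file delete $path
puts "$nameA $nameB"
```

Either way the argument to dom parse is XML text (or a channel delivering it), not a file name.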

-- Neil
nem3909 (1000)
1/23/2006 9:05:28 PM