Need help on File parsing


I'm writing a C program that parses an XML file as its input and
performs specific operations on it.
What I have in mind is to declare a two-dimensional array and store
the XML file in it,

for example:  char country[][]={<countries>,
                                  <country>,
                                      <text>Norway</text>,
                                      <value>N</value>,
                                  </country>}, and so on

My question is: is there a better way to do this, i.e. is there
a better way to store the XML input?

Thanks

Reply Maxx 3/21/2011 8:35:01 PM

On 03/22/11 09:35 AM, Maxx wrote:
> I'm writing a C program which would parse a xml file as its input and
> perform specific operations...
> Now what i have in my mind is that i should declare a two dimensional
> array and store the xml file in it
>
> for example:::  char country[][]={<countries>,
>                                                   <country>,
>                                                       <text>Norway</
> text>,
>                                                      <value>N</value>,
>                                                  </country>}, and so on
>
>
> My question is... is there any better way to do this, i.e. is there
> any better way to store the xml input input..

That's more of a generic programming question than a C one.  Have a look 
at a common XML parser like libxml, the documentation will give you 
ideas even if you choose not to use the library.

-- 
Ian Collins
Reply Ian 3/21/2011 8:43:30 PM


Maxx <grungeddd.maxx@gmail.com> writes:

> I'm writing a C program which would parse a xml file as its input and
> perform specific operations...

What specific operations?  See below...

> Now what i have in my mind is that i should declare a two dimensional
> array and store the xml file in it
>
> for example:::  char country[][]={<countries>,
>                                                  <country>,
>                                                      <text>Norway</
> text>,
>                                                     <value>N</value>,
>                                                 </country>}, and so on
>
>
> My question is... is there any better way to do this, i.e. is there
> any better way to store the xml input input..

It's almost impossible to say without knowing how a piece of data is
going to be accessed (or manipulated).

A good place to post would be comp.programming.  If you say what you
propose to do with the XML you should get good help there.  Be prepared
to be told that you should use an existing XML parsing library (because
that is almost always the right answer).

-- 
Ben.
Reply Ben 3/21/2011 9:53:32 PM

On Mon, 21 Mar 2011 13:35:01 -0700, Maxx wrote:

> I'm writing a C program which would parse a xml file as its input and
> perform specific operations...
> Now what i have in my mind is that i should declare a two dimensional
> array and store the xml file in it

> My question is... is there any better way to do this, i.e. is there any
> better way to store the xml input input..

Yes. In fact, it would be hard to imagine a worse way.

First, I wouldn't recommend trying to actually parse the XML yourself, as
you're practically bound to get it wrong. Use an XML parsing library
instead.

XML parsing libraries come in two main flavours: DOM and SAX. DOM
constructs a parse tree for the entire file, which the application can
then query. SAX generates events (reported via callbacks) as it parses the
file; it's up to the application to actually store the data.

Which flavour to use and exactly how to do it depend upon the details of
the application.
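
To make the event-driven (SAX-like) flavour concrete, here is a minimal sketch in plain C. It is not the API of libxml or any other real library: `Handlers`, `scan_events`, `State` and the `on_*` callbacks are hypothetical names made up for this illustration, and the scanner only understands bare `<name>` and `</name>` tags with text in between (no attributes, entities, comments or nesting checks). The point is just that the scanner reports events and the application decides what to keep.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler set, in the spirit of a SAX interface. */
typedef struct {
    void (*start)(void *user, const char *name, size_t len);
    void (*text)(void *user, const char *s, size_t len);
    void (*end)(void *user, const char *name, size_t len);
} Handlers;

/* Walk the input once and report events. Returns 0 on success, or -1
   on an unterminated or empty tag. No tree is built. */
int scan_events(const char *p, const Handlers *h, void *user)
{
    assert(p != NULL && h != NULL);
    while (*p) {
        if (*p == '<') {
            int closing = (p[1] == '/');
            const char *name = p + 1 + closing;
            const char *gt = strchr(name, '>');
            if (gt == NULL || gt == name) return -1;
            size_t len = (size_t)(gt - name);
            if (closing) {
                if (h->end) h->end(user, name, len);
            } else if (h->start) {
                h->start(user, name, len);
            }
            p = gt + 1;
        } else {
            const char *lt = strchr(p, '<');
            size_t len = lt ? (size_t)(lt - p) : strlen(p);
            if (h->text) h->text(user, p, len);
            p += len;
        }
    }
    return 0;
}

/* Application side: it alone decides what to store. */
typedef struct {
    int in_text;     /* nonzero while inside a <text> element */
    int countries;   /* number of <country> start tags seen */
    char last[64];   /* content of the most recent <text> element */
} State;

int is_name(const char *name, size_t len, const char *want)
{
    return strlen(want) == len && memcmp(name, want, len) == 0;
}

void on_start(void *user, const char *name, size_t len)
{
    State *st = user;
    if (is_name(name, len, "country")) st->countries++;
    st->in_text = is_name(name, len, "text");
}

void on_text(void *user, const char *s, size_t len)
{
    State *st = user;
    if (!st->in_text) return;
    if (len >= sizeof st->last) len = sizeof st->last - 1;  /* truncate */
    memcpy(st->last, s, len);
    st->last[len] = '\0';
}

void on_end(void *user, const char *name, size_t len)
{
    (void)name; (void)len;
    ((State *)user)->in_text = 0;
}
```

A caller would fill a `Handlers` with the three callbacks, zero-initialise a `State`, and pass both to `scan_events`. A DOM-style library would instead hand back a complete tree to query afterwards.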

Reply Nobody 3/22/2011 3:41:31 AM

On Mar 21, 10:35 pm, Maxx <grungeddd.m...@gmail.com> wrote:
>
> My question is... is there any better way to do this, i.e. is there
> any better way to store the xml input input..
>
Think of the XML as a tree, and build what is known as a recursive
descent parser.

Basically it's the same problem as a mathematical expression with
deeply nested parentheses, in a slightly different form. You need one
token of lookahead.

Once you've converted the XML to a tree, you'll usually want to walk
the tree to convert to a set of nested arrays, but sometimes it will
be better to keep the data in tree form.
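
As a rough illustration of the idea, here is a small recursive descent parser in C for a deliberately tiny XML subset: elements with alphanumeric names, containing either child elements or text. The `Node` layout and the names `parse_element` and `free_tree` are assumptions made for this sketch. Attributes, entities, comments, self-closing tags and mixed content are all left out, so it is nowhere near a conforming XML parser. The "one token of lookahead" is the peek at what follows each `<`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tree node: children form a singly linked list. */
typedef struct Node {
    char *name;          /* element name */
    char *text;          /* character data, or NULL if none */
    struct Node *child;  /* first child element */
    struct Node *next;   /* next sibling */
} Node;

static char *copy_span(const char *s, size_t n)
{
    char *p = malloc(n + 1);
    if (p) { memcpy(p, s, n); p[n] = '\0'; }
    return p;
}

static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p)) p++;
    return p;
}

/* Frees a node, its siblings, and everything below them. */
void free_tree(Node *n)
{
    while (n) {
        Node *next = n->next;
        free_tree(n->child);
        free(n->name);
        free(n->text);
        free(n);
        n = next;
    }
}

/* element := '<' name '>' ( element* | text ) '</' name '>'
   On success returns the node and advances *pp past the end tag.
   On any syntax or allocation error returns NULL. */
Node *parse_element(const char **pp)
{
    assert(pp != NULL && *pp != NULL);
    const char *p = skip_ws(*pp);
    if (p[0] != '<' || !isalpha((unsigned char)p[1])) return NULL;
    const char *name = ++p;
    while (isalnum((unsigned char)*p)) p++;
    size_t len = (size_t)(p - name);
    if (*p++ != '>') return NULL;

    Node *n = calloc(1, sizeof *n);
    if (!n) return NULL;
    n->name = copy_span(name, len);

    Node **tail = &n->child;
    while (n->name) {
        p = skip_ws(p);
        if (p[0] == '<' && p[1] == '/') {         /* lookahead: end tag */
            if (strncmp(p + 2, n->name, len) != 0 || p[2 + len] != '>') break;
            *pp = p + 3 + len;
            return n;
        } else if (p[0] == '<') {                 /* lookahead: child element */
            if (n->text) break;                   /* mixed content unsupported */
            if ((*tail = parse_element(&p)) == NULL) break;
            tail = &(*tail)->next;
        } else {                                  /* character data */
            const char *end = strchr(p, '<');
            if (!end || n->text || n->child) break;
            size_t tlen = (size_t)(end - p);
            while (tlen > 0 && isspace((unsigned char)p[tlen - 1])) tlen--;
            if ((n->text = copy_span(p, tlen)) == NULL) break;
            p = end;
        }
    }
    free_tree(n);   /* reached only on error */
    return NULL;
}
```

Given the `<countries>` example from the original post, the result is a `countries` node whose first child is `country`, which in turn has `text` and `value` children carrying "Norway" and "N". Walking that tree into arrays or structs is then a separate, much simpler step.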


Reply Malcolm 3/22/2011 6:40:36 AM

Malcolm McLean <malcolm.mclean5@btinternet.com> writes:

> On Mar 21, 10:35 pm, Maxx <grungeddd.m...@gmail.com> wrote:
>>
>> My question is... is there any better way to do this, i.e. is there
>> any better way to store the xml input input..
>>
> Think of the XML as a tree, and build what is known as a recursive
> descent parser.
>
> Basically it's the same problem as a mathematical expression with
> deeply nested parentheses, in a slightly different form. You need one
> token of lookahead.
>
> Once you've converted the XML to a tree, you'll usually want to walk
> the tree to convert to a set of nested arrays, but sometimes it will
> be better to keep the data in tree form.

I did it the other way round.  First I wrote a good generic "values"
handling system that allowed me to have named strings, integers, lists,
string-indexed-arrays, all as recursive as you like.   That was the
difficult bit.
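
For readers wondering what such a "values" system might look like, here is one possible shape in C, sketched as a tagged union. This is not the code described above; the type `Value` and the functions `value_string`, `value_int`, `value_list`, `list_append`, `list_find` and `value_free` are hypothetical names chosen for the illustration, and string-indexed arrays are approximated by a linear search over named list items.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { V_STRING, V_INT, V_LIST } ValueKind;

/* A named value; lists hold further values, so nesting is unlimited. */
typedef struct Value {
    ValueKind kind;
    char *name;               /* NULL for anonymous values */
    union {
        char *str;            /* V_STRING */
        long num;             /* V_INT */
        struct Value *items;  /* V_LIST: first item */
    } u;
    struct Value *next;       /* next item in the enclosing list */
} Value;

static char *dup_str(const char *s)
{
    if (!s) return NULL;
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p) memcpy(p, s, n);
    return p;
}

static Value *value_new(ValueKind kind, const char *name)
{
    Value *v = calloc(1, sizeof *v);
    if (!v) return NULL;
    v->kind = kind;
    v->name = dup_str(name);
    if (name && !v->name) { free(v); return NULL; }
    return v;
}

Value *value_string(const char *name, const char *s)
{
    assert(s != NULL);
    Value *v = value_new(V_STRING, name);
    if (v && (v->u.str = dup_str(s)) == NULL) {
        free(v->name);
        free(v);
        return NULL;
    }
    return v;
}

Value *value_int(const char *name, long n)
{
    Value *v = value_new(V_INT, name);
    if (v) v->u.num = n;
    return v;
}

Value *value_list(const char *name)
{
    Value *v = value_new(V_LIST, name);
    if (v) v->u.items = NULL;
    return v;
}

/* Appends item to the end of list; the list takes ownership. */
void list_append(Value *list, Value *item)
{
    assert(list != NULL && list->kind == V_LIST);
    if (!item) return;
    Value **tail = &list->u.items;
    while (*tail) tail = &(*tail)->next;
    *tail = item;
}

/* Returns the first item with the given name, or NULL. */
Value *list_find(const Value *list, const char *name)
{
    assert(list != NULL && list->kind == V_LIST);
    for (Value *it = list->u.items; it; it = it->next)
        if (it->name && strcmp(it->name, name) == 0) return it;
    return NULL;
}

/* Frees one value and everything it owns (not its siblings). */
void value_free(Value *v)
{
    if (!v) return;
    if (v->kind == V_STRING) free(v->u.str);
    if (v->kind == V_LIST) {
        Value *it = v->u.items;
        while (it) { Value *next = it->next; value_free(it); it = next; }
    }
    free(v->name);
    free(v);
}
```

With something like this in place, each XML element can become a named list and each leaf a named string, whichever parser produces the events.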

Then I just hooked xmlparse up to it and it sucked the XML in nicely.

Think hard about what you want to do, if anything, to distinguish between:

<stuff>
<item>fred</item>
</stuff>

<stuff item="fred"/>

To summarise - you need more of a specification of the problem before
starting to find a solution.
-- 
Online waterways route planner            | http://canalplan.eu
Plan trips, see photos, check facilities  | http://canalplan.org.uk
Reply Dr 3/22/2011 7:01:02 AM

On Mar 21, 1:43 pm, Ian Collins <ian-n...@hotmail.com> wrote:
> On 03/22/11 09:35 AM, Maxx wrote:
>
> > I'm writing a C program which would parse a xml file as its input and
> > perform specific operations...
> > Now what i have in my mind is that i should declare a two dimensional
> > array and store the xml file in it
>
> > for example:::  char country[][]={<countries>,
> >                                   <country>,
> >                                       <text>Norway</text>,
> >                                       <value>N</value>,
> >                                   </country>}, and so on
>
> > My question is... is there any better way to do this, i.e. is there
> > any better way to store the xml input input..
>
> That's more of a generic programming question than a C one.  Have a look
> at a common XML parser like libxml, the documentation will give you
> ideas even if you choose not to use the library.
>
> --
> Ian Collins

Alright, I've looked up libxml and seem to have hit the jackpot... It does
contain the functions I need...
Thanks
Reply Maxx 3/22/2011 8:10:06 PM

On Mar 21, 8:41 pm, Nobody <nob...@nowhere.com> wrote:
> On Mon, 21 Mar 2011 13:35:01 -0700, Maxx wrote:
> > I'm writing a C program which would parse a xml file as its input and
> > perform specific operations...
> > Now what i have in my mind is that i should declare a two dimensional
> > array and store the xml file in it
> > My question is... is there any better way to do this, i.e. is there any
> > better way to store the xml input input..
>
> Yes. In fact, it would be hard to imagine a worse way.
>
> First, I wouldn't recommend trying to actually parse the XML yourself, as
> you're practically bound to get it wrong. Use an XML parsing library
> instead.
>
> XML parsing libraries come in two main flavours: DOM and SAX. DOM
> constructs a parse tree for the entire file, which the application can
> then query. SAX generates events (reported via callbacks) as it parses the
> file; it's up to the application to actually store the data.
>
> Which flavour to use and exactly how to do it depend upon the details of
> the application.

Actually the XML file that I'm going to give the program will
always have a predefined format, like the example I gave above. The
program will always parse the same format, simply extract the values from
the fields, and write another XML file using the same template... so I
was looking for the easiest way to solve it, rather than having to
call extensive library functions...

Anyway, thanks
Reply Maxx 3/22/2011 8:13:55 PM

On Mar 21, 11:40 pm, Malcolm McLean <malcolm.mcle...@btinternet.com>
wrote:
> On Mar 21, 10:35 pm, Maxx <grungeddd.m...@gmail.com> wrote:
>
> > My question is... is there any better way to do this, i.e. is there
> > any better way to store the xml input input..
>
> Think of the XML as a tree, and build what is known as a recursive
> descent parser.
>
> Basically it's the same problem as a mathematical expression with
> deeply nested parentheses, in a slightly different form. You need one
> token of lookahead.
>
> Once you've converted the XML to a tree, you'll usually want to walk
> the tree to convert to a set of nested arrays, but sometimes it will
> be better to keep the data in tree form.

Yeah, I had this concept in mind at first, but since I was going to write
a simple program that would just extract values from a set of
predefined fields, I kind of avoided going into trees. I
reckon a tree would be the best solution, but I'm still quite
inexperienced with trees.

Thanks
Reply Maxx 3/22/2011 8:17:15 PM

On Mar 22, 12:01 am, Dr Nick <3-nos...@temporary-address.org.uk>
wrote:
> Malcolm McLean <malcolm.mcle...@btinternet.com> writes:
> > On Mar 21, 10:35 pm, Maxx <grungeddd.m...@gmail.com> wrote:
>
> >> My question is... is there any better way to do this, i.e. is there
> >> any better way to store the xml input input..
>
> > Think of the XML as a tree, and build what is known as a recursive
> > descent parser.
>
> > Basically it's the same problem as a mathematical expression with
> > deeply nested parentheses, in a slightly different form. You need one
> > token of lookahead.
>
> > Once you've converted the XML to a tree, you'll usually want to walk
> > the tree to convert to a set of nested arrays, but sometimes it will
> > be better to keep the data in tree form.
>
> I did it the other way round.  First I wrote a good generic "values"
> handling system that allowed me to have named strings, integers, lists,
> string-indexed-arrays, all as recursive as you like.   That was the
> difficult bit.
>
> Then I just hooked xmlparse up to it and it sucked the XML in nicely.
>
> Think hard about what you want to do, if anything, to distinguish between:
>
> <stuff>
> <item>fred</item>
> </stuff>
>
> <stuff item="fred"/>
>
> To summarise - you need more of a specification of the problem before
> starting to find a solution.
> --
> Online waterways route planner            |http://canalplan.eu
> Plan trips, see photos, check facilities  |http://canalplan.org.uk

Yeah, a generic list of values would be helpful, but I need more
ideas on how to implement it. I'm trying to avoid library functions in
this program, as it will always parse the same fields over and over
again.

Reply Maxx 3/22/2011 8:20:26 PM

On Mar 22, 4:13 pm, Maxx <grungeddd.m...@gmail.com> wrote:
> On Mar 21, 8:41 pm, Nobody <nob...@nowhere.com> wrote:
>
>
>
> > On Mon, 21 Mar 2011 13:35:01 -0700, Maxx wrote:
> > > I'm writing a C program which would parse a xml file as its input and
> > > perform specific operations...
> > > Now what i have in my mind is that i should declare a two dimensional
> > > array and store the xml file in it
> > > > My question is... is there any better way to do this, i.e. is there any
> > > better way to store the xml input input..
>
> > Yes. In fact, it would be hard to imagine a worse way.
>
> > > First, I wouldn't recommend trying to actually parse the XML yourself, as
> > you're practically bound to get it wrong. Use an XML parsing library
> > instead.
>
> > XML parsing libraries come in two main flavours: DOM and SAX. DOM
> > constructs a parse tree for the entire file, which the application can
> > > then query. SAX generates events (reported via callbacks) as it parses the
> > file; it's up to the application to actually store the data.
>
> > > Which flavour to use and exactly how to do it depend upon the details of
> > the application.
>
> Actually the xml file that i was going to provide the program will
> always have a predefined format, like the one example i gave above.It
> will always parse the same format and simply extract the values from
> the fields and write another xml file having the same template... so i
> was looking for the easiest way to solve it, instead of requiring to
> call extensive library functions...

Note that it always starts this way.  It is easy to hand parse the XML
if it is in a truly fixed format, so why use a real parser?  But then
there are modifications/extensions/etc.  People hand edit the file and
add white space, which won't confuse a parser but messes up your less
flexible hand parse.  People write a mixture of <element></element>
instead of <element/>, which should parse as equivalent and somehow
don't when hand parsing.  People suddenly want validation.  etc.
Going with a real parser is very much the way to go in a real
application, much more future friendly even if not apparently needed
up front...
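
To show how this bites in practice, here is a minimal sketch of the kind of fixed-format extraction that tends to get written first. `extract_fixed` is a hypothetical helper invented for this example; it simply searches for the literal strings `<tag>` and `</tag>`. It works on the exact layout it was written for and fails on input that any real XML parser would treat as equivalent.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies the text between the literal "<tag>" and "</tag>" into out.
   Returns 0 on success, -1 if either literal is missing or the
   content does not fit. This is string matching, not XML parsing. */
int extract_fixed(const char *xml, const char *tag, char *out, size_t outsz)
{
    char open[64], close[64];
    assert(xml != NULL && tag != NULL && out != NULL);

    int n = snprintf(open, sizeof open, "<%s>", tag);
    if (n < 0 || n >= (int)sizeof open) return -1;
    n = snprintf(close, sizeof close, "</%s>", tag);
    if (n < 0 || n >= (int)sizeof close) return -1;

    const char *s = strstr(xml, open);
    if (!s) return -1;
    s += strlen(open);

    const char *e = strstr(s, close);
    if (!e || (size_t)(e - s) >= outsz) return -1;

    memcpy(out, s, (size_t)(e - s));
    out[e - s] = '\0';
    return 0;
}
```

On `<country><text>Norway</text></country>` this returns "Norway". On `<text >Norway</text >`, which is legal XML with the same meaning, the literal match fails. Likewise `<value></value>` yields an empty string while the equivalent `<value/>` is reported as missing. Each such case can be patched by hand, which is exactly the slow drift described above.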
Reply David 3/23/2011 6:45:47 PM

On Mar 23, 1:45 pm, David Resnick <lndresn...@gmail.com> wrote:
> On Mar 22, 4:13 pm, Maxx <grungeddd.m...@gmail.com> wrote:
>
>
>
>
>
> > On Mar 21, 8:41 pm, Nobody <nob...@nowhere.com> wrote:
>
> > > On Mon, 21 Mar 2011 13:35:01 -0700, Maxx wrote:
> > > > I'm writing a C program which would parse a xml file as its input and
> > > > perform specific operations...
> > > > Now what i have in my mind is that i should declare a two dimensional
> > > > array and store the xml file in it
> > > > My question is... is there any better way to do this, i.e. is there any
> > > > better way to store the xml input input..
>
> > > Yes. In fact, it would be hard to imagine a worse way.
>
> > > First, I wouldn't recommend trying to actually parse the XML yourself, as
> > > you're practically bound to get it wrong. Use an XML parsing library
> > > instead.
>
> > > XML parsing libraries come in two main flavours: DOM and SAX. DOM
> > > constructs a parse tree for the entire file, which the application can
> > > then query. SAX generates events (reported via callbacks) as it parses the
> > > file; it's up to the application to actually store the data.
>
> > > Which flavour to use and exactly how to do it depend upon the details of
> > > the application.
>
> > Actually the xml file that i was going to provide the program will
> > always have a predefined format, like the one example i gave above.It
> > will always parse the same format and simply extract the values from
> > the fields and write another xml file having the same template... so i
> > was looking for the easiest way to solve it, instead of requiring to
> > call extensive library functions...
>
> Note that it always starts this way.  It is easy to hand parse the XML
> if it is in a truly fixed format, so why use a real parser?  But then
> there are modifications/extensions/etc.  People hand edit the file and
> add white space, which won't confuse a parser but messes up your less
> flexible hand parse.  People write a mixture of <element></element>
> instead of <element/>, which should parse as equivalent and somehow
> don't when hand parsing.  People suddenly want validation.  etc.
> Going with a real parser is very much the way to go in a real
> application, much more future friendly even if not apparently needed
> up front...

Not to mention it's code that *you* don't have to write or test.

Figuring out how to use the library in your code will take less time
than writing a robust parser from scratch.  Yes, you can hand-hack a
minimal, non-validating, less-than-totally-robust XML parser in an
afternoon (I've done it), but you'll be tweaking that sucker
*constantly* (which I did as well).
Reply John 3/23/2011 9:57:15 PM

In article 
<05d2e0d8-44de-440c-b862-7e267a920dd9@r4g2000vbq.googlegroups.com>,
 David Resnick <lndresnick@gmail.com> wrote:

> On Mar 22, 4:13 pm, Maxx <grungeddd.m...@gmail.com> wrote:
> > On Mar 21, 8:41 pm, Nobody <nob...@nowhere.com> wrote:
> >
> >
> >
> > > On Mon, 21 Mar 2011 13:35:01 -0700, Maxx wrote:
> > > > I'm writing a C program which would parse a xml file as its input and
> > > > perform specific operations...
> > > > Now what i have in my mind is that i should declare a two dimensional
> > > > array and store the xml file in it
> > > > My question is... is there any better way to do this, i.e. is there any
> > > > better way to store the xml input input..
> >
> > > Yes. In fact, it would be hard to imagine a worse way.
> >
> > > First, I wouldn't recommend trying to actually parse the XML yourself, as
> > > you're practically bound to get it wrong. Use an XML parsing library
> > > instead.
> >
> > > XML parsing libraries come in two main flavours: DOM and SAX. DOM
> > > constructs a parse tree for the entire file, which the application can
> > > then query. SAX generates events (reported via callbacks) as it parses the
> > > file; it's up to the application to actually store the data.
> >
> > > Which flavour to use and exactly how to do it depend upon the details of
> > > the application.
> >
> > Actually the xml file that i was going to provide the program will
> > always have a predefined format, like the one example i gave above.It
> > will always parse the same format and simply extract the values from
> > the fields and write another xml file having the same template... so i
> > was looking for the easiest way to solve it, instead of requiring to
> > call extensive library functions...
> 
> Note that it always starts this way.  It is easy to hand parse the XML
> if it is in a truly fixed format, so why use a real parser?  But then
> there are modifications/extensions/etc.  People hand edit the file and
> add white space, which won't confuse a parser but messes up your less
> flexible hand parse.  People write a mixture of <element></element>
> instead of <element/>, which should parse as equivalent and somehow
> don't when hand parsing.  People suddenly want validation.  etc.
> Going with a real parser is very much the way to go in a real
> application, much more future friendly even if not apparently needed
> up front...

XML is the same as csh. Every time somebody raises a
problem with XML somebody else steps in and presents an
easy workaround. Eventually you are told not even to
try writing a parser. It is the death of a thousand
cuts. And for what? 

XML gives PHBs the illusion that they know about
programming; and adventurers a cozy berth. XML is a scam.

Has XML gotten to the point a universal Turing machine 
could be written in XML, or is it still singing "Daisy"?

-- 
Michael Press
Reply Michael 3/24/2011 8:45:25 AM

On Mar 24, 4:45 am, Michael Press <rub...@pacbell.net> wrote:
> In article
> <05d2e0d8-44de-440c-b862-7e267a920...@r4g2000vbq.googlegroups.com>,
>  David Resnick <lndresn...@gmail.com> wrote:
>
>
>
> > On Mar 22, 4:13 pm, Maxx <grungeddd.m...@gmail.com> wrote:
> > > On Mar 21, 8:41 pm, Nobody <nob...@nowhere.com> wrote:
>
> > > > On Mon, 21 Mar 2011 13:35:01 -0700, Maxx wrote:
> > > > > I'm writing a C program which would parse a xml file as its input and
> > > > > perform specific operations...
> > > > > Now what i have in my mind is that i should declare a two dimensional
> > > > > array and store the xml file in it
> > > > > My question is... is there any better way to do this, i.e. is there any
> > > > > better way to store the xml input input..
>
> > > > Yes. In fact, it would be hard to imagine a worse way.
>
> > > > First, I wouldn't recommend trying to actually parse the XML yourself, as
> > > > you're practically bound to get it wrong. Use an XML parsing library
> > > > instead.
>
> > > > XML parsing libraries come in two main flavours: DOM and SAX. DOM
> > > > constructs a parse tree for the entire file, which the application can
> > > > then query. SAX generates events (reported via callbacks) as it parses the
> > > > file; it's up to the application to actually store the data.
>
> > > > Which flavour to use and exactly how to do it depend upon the details of
> > > > the application.
>
> > > Actually the xml file that i was going to provide the program will
> > > always have a predefined format, like the one example i gave above.It
> > > will always parse the same format and simply extract the values from
> > > the fields and write another xml file having the same template... so i
> > > was looking for the easiest way to solve it, instead of requiring to
> > > call extensive library functions...
>
> > Note that it always starts this way.  It is easy to hand parse the XML
> > if it is in a truly fixed format, so why use a real parser?  But then
> > there are modifications/extensions/etc.  People hand edit the file and
> > add white space, which won't confuse a parser but messes up your less
> > flexible hand parse.  People write a mixture of <element></element>
> > instead of <element/>, which should parse as equivalent and somehow
> > don't when hand parsing.  People suddenly want validation.  etc.
> > Going with a real parser is very much the way to go in a real
> > application, much more future friendly even if not apparently needed
> > up front...
>
> XML is the same as csh. Every time somebody raises a
> problem with XML somebody else steps in and presents an
> easy workaround. Eventually you are told not even to
> try writing a parser. It is the death of a thousand
> cuts. And for what?
>
> XML gives PHBs the illusion that they know about
> programming; and adventurers a cozy berth. XML is a scam.
>
> Has XML gotten to the point a universal Turing machine
> could be written in XML, or is it still singing "Daisy"?
>

XML is great in its place.  Not a PHB, and don't believe
it to be a scam.  I love it for flatfiles that need
structured information and flexibility.  Easy to extend,
easy (with XPATH queries say) to get stuff out of.
Standard, everyone knows what it means, how to add
to it, how to parse and validate it.  Does it solve
all problems in the world?  Of course not...

-David



Reply David 3/24/2011 12:38:19 PM

On Wed, 23 Mar 2011 11:45:47 -0700, David Resnick wrote:

> Note that it always starts this way.  It is easy to hand parse the XML
> if it is in a truly fixed format,

If you restrict the application to reading a subset of XML, that defeats
the purpose of using XML in the first place.

You can find a wide range of tools which can process XML, but the range of
tools which can process a particular custom subset of XML is likely to be
much smaller (i.e. those tools which you write yourself).

If you think that you only need to support files written by a particular
program, you're likely to end up only supporting files which were directly
written by that program and not post-processed in any way. This often
makes your program less useful than you had originally assumed.

Reply Nobody 3/25/2011 7:40:31 AM

On Mar 24, 12:57 am, John Bode <jfbode1...@gmail.com> wrote:
> On Mar 23, 1:45 pm, David Resnick <lndresn...@gmail.com> wrote:
>
> Figuring out how to use the library in your code will take less time
> than writing a robust parser from scratch.  Yes, you can hand-hack a
> minimal, non-validating, less-than-totally-robust XML parser in an
> afternoon (I've done it), but you'll be tweaking that sucker
> *constantly* (which I did as well).
>
The problem is that it becomes harder to distribute the program. Even
if you have source to the library, it's often in messy files that are
hard to integrate and distract the reader from the actual logical core
of the program.

Reply Malcolm 3/25/2011 8:19:26 AM

On Mar 25, 3:40 am, Nobody <nob...@nowhere.com> wrote:
> On Wed, 23 Mar 2011 11:45:47 -0700, David Resnick wrote:
> > Note that it always starts this way.  It is easy to hand parse the XML
> > if it is in a truly fixed format,
>
> If you restrict the application to reading a subset of XML, that defeats
> the purpose of using XML in the first place.
>
> You can find a wide range of tools which can process XML, but the range of
> tools which can process a particular custom subset of XML is likely to be
> much smaller (i.e. those tools which you write yourself).
>
> If you think that you only need to support files written by a particular
> program, you're likely to end up only supporting files which were directly
> written by that program and not post-processed in any way. This often
> makes your program less useful than you had originally assumed.

Holy out of context quotes, Batman.  Your reply misses the entire
point of mine, which is that hand parsing is a bad idea.  Did you read the
rest of the post or just answer after the first 2 lines?

-David
Reply David 3/25/2011 11:31:43 AM

On Fri, 25 Mar 2011 04:31:43 -0700, David Resnick wrote:

> Holy out of context quotes, Batman.  Your reply misses the entire
> point of mine, which is that hand parsing is a bad idea.  Did you read the
> rest of the post or just answer after the first 2 lines?

I wasn't "replying" to your comments. I elaborated on your reply,
providing more reasons why it's a bad idea to assume that you only need
to handle a subset.

Reply Nobody 3/25/2011 6:15:07 PM

On Mar 25, 2:15 pm, Nobody <nob...@nowhere.com> wrote:
> On Fri, 25 Mar 2011 04:31:43 -0700, David Resnick wrote:
> > Holy out of context quotes, Batman.  Your reply misses the entire
> > point of mine, which is that hand parsing is a bad idea.  Did you read the
> > rest of the post or just answer after the first 2 lines?
>
> I wasn't "replying" to your comments. I elaborated on your reply,
> providing more reasons why it's a bad idea to assume that you only need
> to handle a subset.

Just seemed to be replying to my comments, as that was the only quoted
text being addressed.  My mistake.

-David
Reply David 3/25/2011 6:26:50 PM

On Mar 23, 11:45 am, David Resnick <lndresn...@gmail.com> wrote:
> On Mar 22, 4:13 pm, Maxx <grungeddd.m...@gmail.com> wrote:
>
>
>
> > On Mar 21, 8:41 pm, Nobody <nob...@nowhere.com> wrote:
>
> > > On Mon, 21 Mar 2011 13:35:01 -0700, Maxx wrote:
> > > > I'm writing a C program which would parse a xml file as its input and
> > > > perform specific operations...
> > > > Now what i have in my mind is that i should declare a two dimensional
> > > > array and store the xml file in it
> > > > My question is... is there any better way to do this, i.e. is there any
> > > > better way to store the xml input input..
>
> > > Yes. In fact, it would be hard to imagine a worse way.
>
> > > First, I wouldn't recommend trying to actually parse the XML yourself, as
> > > you're practically bound to get it wrong. Use an XML parsing library
> > > instead.
>
> > > XML parsing libraries come in two main flavours: DOM and SAX. DOM
> > > constructs a parse tree for the entire file, which the application can
> > > then query. SAX generates events (reported via callbacks) as it parses the
> > > file; it's up to the application to actually store the data.
>
> > > Which flavour to use and exactly how to do it depend upon the details of
> > > the application.
>
> > Actually the xml file that i was going to provide the program will
> > always have a predefined format, like the one example i gave above.It
> > will always parse the same format and simply extract the values from
> > the fields and write another xml file having the same template... so i
> > was looking for the easiest way to solve it, instead of requiring to
> > call extensive library functions...
>
> Note that it always starts this way.  It is easy to hand parse the XML
> if it is in a truly fixed format, so why use a real parser?  But then
> there are modifications/extensions/etc.  People hand edit the file and
> add white space, which won't confuse a parser but messes up your less
> flexible hand parse.  People write a mixture of <element></element>
> instead of <element/>, which should parse as equivalent and somehow
> don't when hand parsing.  People suddenly want validation.  etc.
> Going with a real parser is very much the way to go in a real
> application, much more future friendly even if not apparently needed
> up front...

I'm using the parser so that I can extract the necessary values from
specific fields... Anyway, I have decided to go with a real parser, as
it's becoming too cumbersome.


Thanks
Reply Maxx 3/25/2011 6:38:18 PM

John Bode wrote:

> Not to mention it's code that *you* don't have to write or test.

Not necessarily.  Depending on what the programmer intends to do, when he 
adopts a 3rd party parser instead of writing his own what he is doing is 
delegating only a portion of the work he must do in order to extract 
information from a given format, while being forced to do the legwork for
the remainder of the work.

More specifically, when a programmer employs a 3rd party parser, he is 
implicitly dividing the simple task of parsing a given format into two 
different tasks: 
- parsing the information described in a base format in order to build a 
data structure 
- parsing the data structure in order to extract the information he 
intended to extract

While the first task may be delegated to a parser developed by a 3rd 
party, which ends up being implemented by a small generic code snippet, 
the second task ends up being needlessly cumbersome, error-prone and 
needlessly wasting resources which, in some cases, the programmer may not 
have.  Yet, it still needs code which *you* have to write and, more 
importantly, *you* must test, with the added difficulty of consisting of a 
couple of layers of abstraction.


Rui Maciel
Reply Rui 3/25/2011 8:06:25 PM

Nobody wrote:

> If you restrict the application to reading a subset of XML, that defeats
> the purpose of using XML in the first place.

Every XML application is a language which is a subset of XML.  Every
application of XML is nothing more than the definition of languages which
are a subset of XML.  The main advantages of XML are that it's
human-readable, the languages based on it tend to be self-descriptive and it's a
common base format of a series of languages.  This means that it becomes 
easier to add support for other languages, even if you don't have the 
entire specification. 

Therefore, claiming that restricting the application to reading a subset 
of XML defeats the purpose of adopting a XML-based language doesn't make 
sense.  It doesn't make sense because the only purpose of XML is to reduce 
it to a subset.


> You can find a wide range of tools which can process XML, but the range
> of tools which can process a particular custom subset of XML is likely
> to be much smaller (i.e. those tools which you write yourself).

An image editor is a tool that can only process a particular custom subset 
of XML (for example, SVG).  The same applies to office applications, RSS 
readers, web browsers and other applications.  Therefore, there is no harm 
in that.  That's what programs are designed to do.


> If you think that you only need to support files written by a particular
> program, you're likely to end up only supporting files which were
> directly written by that program and not post-processed in any way. This
> often makes your program less useful than you had originally assumed.

That problem has absolutely nothing to do with XML and everything to do 
with adopting/creating open standards to exchange information.


Rui Maciel
0
Reply Rui 3/25/2011 8:33:32 PM

David Resnick wrote:

> Note that it always starts this way.  It is easy to hand parse the XML
> if it is in a truly fixed format, so why use a real parser?  But then
> there are modifications/extensions/etc.  People hand edit the file and
> add white space, which won't confuse a parser but messes up your less
> flexible hand parse.  

Adding white space can only mess up a parser if the parser wasn't 
developed to handle that language.  Therefore, you can't claim that 
writing parsers by hand is a bad thing to do if the only problem that you 
can point out is that your parser fails to parse the language it was 
intended to parse.


> People write a mixture of <element></element>
> instead of <element/>, which should parse as equivalent and somehow
> don't when hand parsing.  

Only if you failed to add support for that in your parser.


> People suddenly want validation.  etc.

The beautiful thing about parsers is that they automatically and 
implicitly validate a given language.  Therefore, it's a non-issue.


> Going with a real parser is very much the way to go in a real
> application, much more future friendly even if not apparently needed
> up front...

This idea that a parser developed by a programmer is somehow not a "real 
parser" is silly.  Either you misspoke or you don't know what you are 
talking about.


Rui Maciel
0
Reply Rui 3/25/2011 8:41:48 PM

Malcolm McLean wrote:

> Think of the XML as a tree, and build what is known as a recursive
> descent parser.
> 
> Basically it's the same problem as a mathematical expression with
> deeply nested parentheses, in a slightly different form. You need one
> token of lookahead.
> 
> Once you've converted the XML to a tree, you'll usually want to walk
> the tree to convert to a set of nested arrays, but sometimes it will
> be better to keep the data in tree form.

If someone goes through the trouble of writing a dedicated parser for a 
particular language then there is no need to parse it to an intermediate 
form.  That just forces the need to parse essentially the same information 
twice just to be able to access that information.  Just parse the document 
and handle the information in an appropriate way once it is parsed.


Rui Maciel
0
Reply Rui 3/25/2011 8:50:17 PM

Nobody wrote:

> I wasn't "replying" to your comments. I elaborated on your reply,
> providing more reasons why it's a bad idea to assume that you only need
> to handle a subset.

Let's say that we developed a new XML-based language intended to replace 
all documents encoded in the INI document format.  The language would be 
something like:


<?xml version="1.0" encoding="UTF-8" ?>
<document version="1.0">
  <section>
    <name> section name </name>
    <entry>
      <label> label name </label> <value> this label's value</value>
    </entry>
    ...
  </section>
  ...
</document>


In this XML-based language, the only accepted element name for the root 
element is the string "document".  The root element must have an attribute 
to declare the format's version number and may have zero or more "section" 
elements.  Each "section" element must have a "name" element, followed by 
zero or more "entry" elements.  Each "entry" element consists of a "label" 
element followed by a "value" element, whose content can only be character 
data.  Every other XML construct is either ignored or declared as an 
error.

Considering this, why do you believe it is a bad idea to write a parser 
that only accepts this subset of XML?
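
To make the question concrete, here is a minimal sketch of what a hand-written recursive-descent parser for this subset might look like in C.  It is an illustration under stated assumptions, not a finished implementation: every name in it (Parser, match, text_element, parse_document) is made up for this example, the version attribute is matched as a fixed literal, and whitespace inside tags, comments, entities and attributes on inner elements are rejected rather than handled.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Cursor over the input plus simple counters; every name here is hypothetical. */
typedef struct { const char *p; int sections, entries; } Parser;

static void skip_ws(Parser *ps) {
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* Look ahead: does the input (after whitespace) start with `lit`? */
static int peek(Parser *ps, const char *lit) {
    skip_ws(ps);
    return strncmp(ps->p, lit, strlen(lit)) == 0;
}

/* Consume `lit` if it is next; returns 1 on success, 0 otherwise. */
static int match(Parser *ps, const char *lit) {
    if (!peek(ps, lit)) return 0;
    ps->p += strlen(lit);
    return 1;
}

/* <tag> character data </tag>, with surrounding whitespace trimmed. */
static int text_element(Parser *ps, const char *tag, char *out, size_t cap) {
    char otag[32], ctag[32];
    snprintf(otag, sizeof otag, "<%s>", tag);
    snprintf(ctag, sizeof ctag, "</%s>", tag);
    if (!match(ps, otag)) return 0;
    skip_ws(ps);
    const char *start = ps->p;
    while (*ps->p && *ps->p != '<') ps->p++;
    const char *end = ps->p;
    while (end > start && isspace((unsigned char)end[-1])) end--;
    if ((size_t)(end - start) >= cap) return 0;   /* text too long for buffer */
    memcpy(out, start, (size_t)(end - start));
    out[end - start] = '\0';
    return match(ps, ctag);
}

/* entry = "<entry>" label value "</entry>" */
static int entry(Parser *ps) {
    char label[64], value[64];
    if (!match(ps, "<entry>")) return 0;
    if (!text_element(ps, "label", label, sizeof label)) return 0;
    if (!text_element(ps, "value", value, sizeof value)) return 0;
    if (!match(ps, "</entry>")) return 0;
    printf("  %s = %s\n", label, value);   /* handle the data as it is parsed */
    ps->entries++;
    return 1;
}

/* section = "<section>" name *entry "</section>" */
static int section(Parser *ps) {
    char name[64];
    if (!match(ps, "<section>")) return 0;
    if (!text_element(ps, "name", name, sizeof name)) return 0;
    printf("[%s]\n", name);
    while (peek(ps, "<entry>"))
        if (!entry(ps)) return 0;
    if (!match(ps, "</section>")) return 0;
    ps->sections++;
    return 1;
}

/* document = [xml declaration] "<document version=...>" *section "</document>" */
int parse_document(const char *text, int *sections, int *entries) {
    Parser ps = { text, 0, 0 };
    if (match(&ps, "<?xml")) {                   /* optional XML declaration */
        const char *end = strstr(ps.p, "?>");
        if (!end) return 0;
        ps.p = end + 2;
    }
    if (!match(&ps, "<document version=\"1.0\">")) return 0;
    while (peek(&ps, "<section>"))
        if (!section(&ps)) return 0;
    if (!match(&ps, "</document>")) return 0;
    skip_ws(&ps);
    if (*ps.p != '\0') return 0;                 /* trailing garbage */
    *sections = ps.sections;
    *entries = ps.entries;
    return 1;
}
```

Calling parse_document() on a document like the one above prints each section and entry in INI style and reports how many of each were seen.  The list of things this sketch rejects is exactly the kind of constraint the rest of this thread argues about.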


Rui Maciel

0
Reply Rui 3/25/2011 9:15:59 PM

On Fri, 25 Mar 2011 21:15:59 +0000, Rui Maciel wrote:

>> I wasn't "replying" to your comments. I elaborated on your reply,
>> providing more reasons why it's a bad idea to assume that you only need
>> to handle a subset.
> 
> Let's say that we developed a new XML-based language intended to replace 
> all documents encoded in the INI document format.  The language would be 
> something like:

[snip]

> In this XML-based language, the only accepted element name for the root
> element is the string "document".  The root element must have an
> attribute to declare the format's version number and may have zero or
> more "section" elements.  Each "section" element must have a "name"
> element, followed by zero or more "entry" elements.  Each "entry"
> element consists of a "label" element followed by a "value" element,
> whose content can only be character data.  Every other XML construct is
> either ignored or declared as an error.
> 
> Considering this, why do you believe it is a bad idea to write a parser
> that only accepts this subset of XML?

That isn't what we're talking about. Any validating parser rejects
invalid documents; that doesn't mean that such a parser only accepts a
subset of the language.

A subset of the /language/ implies that, for any given data, only a subset
of the valid representations are accepted, e.g. requiring <tag></tag>
rather than <tag/>, imposing constraints upon whitespace within
tags, requiring attributes to be specified in a particular order, etc.

Having said that, the main reasons why writing such a parser would be a
bad idea are:

1. It doesn't help. The parser wouldn't be significantly simpler than one
which parsed arbitrary XML; in fact, it would probably be more
complicated, as the parser would be performing checks which most
(non-validating) parsers leave to the application.

2. If you want to extend the format, you have to change the code for the
parser. With a generic non-validating parser, you don't have to change
anything; with a generic validating parser, you only have to change
the DTD. In either case, the application would only need to be changed if
it didn't just ignore unrecognised elements.

0
Reply Nobody 3/26/2011 9:12:53 AM

On Fri, 25 Mar 2011 20:41:48 +0000, Rui Maciel wrote:

> Adding white spaces can only mess up a parser if the parser wasn't developed 
> to handle that language.  Therefore, you can't claim that writing parsers 
> by hand is a bad thing to do if the only problem that you can point out is 
> that your parser fails to parse the language it was intended to parse.

Right. Which is exactly what we mean by a "subset" of XML.

I'm not sure whether you're playing devil's advocate or you actually
aren't aware of just how common a problem this is. I've lost track of the
number of times I've seen stuff like "sed 's!<title>\(.*\)</title>!\1!' ...".


0
Reply Nobody 3/26/2011 9:24:50 AM

Nobody wrote:

> On Fri, 25 Mar 2011 20:41:48 +0000, Rui Maciel wrote:
> 
>> Adding white spaces can only mess up a parser if the parser wasn't
>> developed to handle that language.  Therefore, you can't claim that writing
>> parsers by hand is a bad thing to do if the only problem that you can
>> point out is that your parser fails to parse the language it was
>> intended to parse.
> 
> Right. Which is exactly what we mean by a "subset" of XML.

This particular issue has nothing to do with a language being or not being 
a subset of XML.  It's a problem caused by adopting a poorly thought out 
language which fails to cover the intended use case.


> I'm not sure whether you're playing devil's advocate or you actually
> aren't aware of just how common a problem this is. I've lost track of
> the number of times I've seen stuff like "sed
> 's!<title>\(.*\)</title>!\1!' ...".

I've written a few parsers, including a couple of generic parsers for a 
markup language, and supporting white space between elements (or any 
equivalent nesting construct) is one of the most trivial things that one 
can add to a parser, particularly because it either represents a single 
terminal in the production or it doesn't even need to be supported in the 
language's grammar.  

In the case of XML, as an element may have character data between the 
element's start tag and end tag, then it would probably be better to add 
support for it in the production and, depending on how the language was 
designed, ignore it or throw some kind of error.


Rui Maciel
0
Reply Rui 3/26/2011 11:26:40 AM

Nobody wrote:

> That isn't what we're talking about. Any validating parser rejects
> invalid documents; that doesn't mean that such a parser only accepts a
> subset of the language.
> 
> A subset of the /language/ implies that, for any given data, only a
> subset of the valid representations are accepted, e.g. requiring
> <tag></tag> rather than <tag/>, imposing constraints upon whitespace
> within tags, requiring attributes to be specified in a particular order,
> etc.

A subset of a language is still a language on its own, which means that a 
parser designed to handle it either accepts a document as valid or rejects 
it.  

Knowing this, a subset of XML will only impose the constraints which it 
was designed to impose; no more, no less.  If you write a parser that 
rejects certain language constructs then you either failed to design your 
language or you failed to write your parser.  Your failure to do any of 
these things does not mean that it is a bad idea to develop parsers.  It 
only means that you failed to develop the language and/or parser that you 
needed.


> Having said that, the main reasons why writing such a parser would be a
> bad idea are:
> 
> 1. It doesn't help. The parser wouldn't be significantly simpler than
> one which parsed arbitrary XML; in fact, it would probably be more
> complicated, as the parser would be performing checks which most
> (non-validating) parsers leave to the application.

Keep in mind that a generic parser only manages to transform the 
information between two formats (i.e., parse a document and build up a 
data structure), and that you are still forced to parse the end-format to 
validate your format and extract the information (i.e., traverse the data 
structure, perform sanity checks according to the information found in the 
data structure, extract information, etc.).  

This means that once you adopt a generic parser to parse a document then, 
unless you intend to parse a home-brew format that will not be exchanged 
by anyone and will only be used by a specific version of a specific 
program, you are only fooling yourself to believe that you are simplifying 
things.  You aren't.  You are adding a new abstraction layer to your 
program that does nothing more than convert the information between 
formats, both of which you still have to parse.


> 2. If you want to extend the format, you have to change the code for the
> parser. With a generic non-validating parser, you don't have to change
> anything; with a generic validating parser, you only have to change
> the DTD. In either case, the application would only need to be changed
> if it didn't just ignore unrecognised elements.

Not quite.  The DTD only helps you set the generic parser to perform a set 
of sanity checks.  Meanwhile you are still forced to rely on two separate 
parsers to parse a single piece of information.

Adding to this, relying on generic parsers and DTDs won't help you with 
basic tasks such as adding support for multiple versions of the same 
language.  That means that if you rely on a generic parser and you are 
suddenly forced to tweak your document format and therefore support 
multiple versions of the same format then you are either screwed or you 
are forced to employ a scheme to convert all instances of a format into 
the new format, something which in some cases is impossible.


Rui Maciel
0
Reply Rui 3/26/2011 11:57:03 AM

On 03/27/11 12:57 AM, Rui Maciel wrote:
> Nobody wrote:
>
>> That isn't what we're talking about. Any validating parser rejects
>> invalid documents; that doesn't mean that such a parser only accepts a
>> subset of the language.
>>
>> A subset of the /language/ implies that, for any given data, only a
>> subset of the valid representations are accepted, e.g. requiring
>> <tag></tag> rather than <tag/>, imposing constraints upon whitespace
>> within tags, requiring attributes to be specified in a particular order,
>> etc.
>
> A subset of a language is still a language on its own, which means that a
> parser designed to handle it either accepts a document as valid or rejects
> it.
>
> Knowing this, a subset of XML will only impose the constraints which it
> was designed to impose; no more, no less.  If you write a parser that
> rejects certain language constructs then you either failed to design your
> language or you failed to write your parser.  Your failure to do any of
> these things does not mean that it is a bad idea to develop parsers.  It
> only means that you failed to develop the language and/or parser that you
> needed.

The problem of what to do with the data in an XML document (or any other 
structured document) is one of the reasons why there are two types of 
XML parser.  One can either use a SAX (stream) parser to process 
elements as they are encountered, or parse the complete document into a 
DOM (Document Object Model) tree.

I use both, depending on the problem at hand.  If the data has to be 
manipulated as a complete set, I use my (heavy) DOM parser.  If not 
(loading a configuration for example), I use my light SAX parser.

A SAX parser uses callback functions to handle various events triggered 
by the document, which makes it easy to translate elements of interest 
into application data structures or actions, which would be ideal for 
the OP's requirement.
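
As a rough illustration of the callback style described above, here is a self-contained toy in C.  It is not the API of any real library (Expat and libxml2's SAX interface offer the real thing); the Handlers table, scan() and load_countries() are hypothetical names, and the scanner only understands <name>, </name> and character data, with no attributes, comments or entities.  The input mirrors the OP's country list.

```c
#include <string.h>

/* Hypothetical handler table, loosely modelled on SAX-style callbacks. */
typedef struct {
    void (*start)(void *user, const char *name);
    void (*end)(void *user, const char *name);
    void (*text)(void *user, const char *chars, size_t len);
} Handlers;

/* Toy event scanner: fires a callback per tag or text run; 0 on bad input. */
int scan(const char *p, const Handlers *h, void *user) {
    while (*p) {
        if (*p == '<') {
            int closing = (p[1] == '/');
            const char *s = p + 1 + closing;
            const char *e = strchr(s, '>');
            char name[64];
            if (!e || e == s || (size_t)(e - s) >= sizeof name) return 0;
            memcpy(name, s, (size_t)(e - s));
            name[e - s] = '\0';
            if (closing) h->end(user, name); else h->start(user, name);
            p = e + 1;
        } else {
            const char *e = p;
            while (*e && *e != '<') e++;
            h->text(user, p, (size_t)(e - p));
            p = e;
        }
    }
    return 1;
}

/* Application data: the country names found inside <text> elements. */
typedef struct { int in_text; int count; char names[8][32]; } Countries;

static void on_start(void *u, const char *name) {
    Countries *c = u;
    c->in_text = (strcmp(name, "text") == 0);
}

static void on_end(void *u, const char *name) {
    (void)name;
    ((Countries *)u)->in_text = 0;
}

static void on_text(void *u, const char *chars, size_t len) {
    Countries *c = u;
    /* Ignore text outside <text>, and anything that does not fit. */
    if (!c->in_text || c->count == 8 || len >= sizeof c->names[0]) return;
    memcpy(c->names[c->count], chars, len);
    c->names[c->count][len] = '\0';
    c->count++;
}

/* Translate the elements of interest straight into the application struct. */
int load_countries(const char *xml, Countries *out) {
    Handlers h = { on_start, on_end, on_text };
    memset(out, 0, sizeof *out);
    return scan(xml, &h, out);
}
```

The point of the design is that no tree is ever built: the handlers copy what they care about into the application's own structure and drop everything else.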

-- 
Ian Collins
0
Reply Ian 3/26/2011 10:58:26 PM

Ian Collins wrote:

> The problem of what to do with the data in an XML document (or any other
> structured document) is one of the reasons why there are two types of
> XML parser.  One can either use a SAX (stream) parser to process
> elements as they are encountered, or parse the complete document into a
> DOM (Document Object Model) tree.
> 
> I use both, depending on the problem at hand.  If the data has to be
> manipulated as a complete set, I use my (heavy) DOM parser.  If not
> (loading a configuration for example), I use my light SAX parser.
> 
> A SAX parser uses callback functions to handle various events triggered
> by the document, which makes it easy to translate elements of interest
> into application data structures or actions, which would be ideal for
> the OP's requirement.

The SAX approach is basically a partially developed parser.  In essence, a 
SAX API provides a stream of terminal tokens while performing sanity 
checks on the base format.  To put it in other words, a SAX parser is 
basically a lexer that converts a set of terminal tokens from a base 
language (say, XML) to a single terminal token from a different language 
(say, SVG).  In this process, it also implicitly performs a set of sanity 
checks on the base language.

This means that when a programmer opts to parse a given document following 
the SAX approach, what he is doing is essentially picking up a specialized 
lexer and writing his own parser around that particular lexer.  So, 
although the programmer avoids parsing a much larger language (i.e., what 
the SAX lexer returns as "open element A" may be "terminal token '<' 
followed by terminal token text string, with the string 'A', followed by 
token '>'"), he still has to set a production for his language and develop 
a parser to parse his language.
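
A small sketch in C of what "writing a parser around the lexer" can look like.  The Event record and the function names are hypothetical, the event array stands in for whatever a SAX-style layer would deliver, and whitespace-only text events (which a real SAX layer would typically also report) are assumed to have been filtered out already.  The grammar is the "section" rule of the INI-like format discussed earlier in the thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical event record, standing in for what a SAX-style layer emits. */
typedef enum { EV_OPEN, EV_CLOSE, EV_TEXT, EV_END } Kind;
typedef struct { Kind kind; const char *s; } Event;

typedef struct { const Event *ev; } Stream;

/* Consume the next event if it has the given kind (and name, if non-NULL). */
static int take(Stream *st, Kind k, const char *name) {
    if (st->ev->kind != k) return 0;
    if (name && strcmp(st->ev->s, name) != 0) return 0;
    st->ev++;
    return 1;
}

/* leaf = OPEN(name) TEXT CLOSE(name) */
static int leaf(Stream *st, const char *name) {
    return take(st, EV_OPEN, name) && take(st, EV_TEXT, NULL)
        && take(st, EV_CLOSE, name);
}

/* entry = OPEN(entry) label value CLOSE(entry) */
static int entry(Stream *st) {
    return take(st, EV_OPEN, "entry") && leaf(st, "label")
        && leaf(st, "value") && take(st, EV_CLOSE, "entry");
}

/* section = OPEN(section) name *entry CLOSE(section) */
static int section(Stream *st) {
    if (!take(st, EV_OPEN, "section") || !leaf(st, "name")) return 0;
    while (st->ev->kind == EV_OPEN && strcmp(st->ev->s, "entry") == 0)
        if (!entry(st)) return 0;
    return take(st, EV_CLOSE, "section");
}

/* Returns 1 if the EV_END-terminated event sequence is exactly one section. */
int accepts_section(const Event *events) {
    Stream st = { events };
    return section(&st) && st.ev->kind == EV_END;
}
```

Each grammar rule becomes one function over events rather than over characters, which is the sense in which the SAX layer acts as the lexer and the application still supplies the parser.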


Rui Maciel
0
Reply Rui 3/27/2011 12:58:44 PM

On 03/28/11 01:58 AM, Rui Maciel wrote:
> Ian Collins wrote:
>
>> The problem of what to do with the data in an XML document (or any other
>> structured document) is one of the reasons why there are two types of
>> XML parser.  One can either use a SAX (stream) parser to process
>> elements as they are encountered, or parse the complete document into a
>> DOM (Document Object Model) tree.
>>
>> I use both, depending on the problem at hand.  If the data has to be
>> manipulated as a complete set, I use my (heavy) DOM parser.  If not
>> (loading a configuration for example), I use my light SAX parser.
>>
>> A SAX parser uses callback functions to handle various events triggered
>> by the document, which makes it easy to translate elements of interest
>> into application data structures or actions, which would be ideal for
>> the OP's requirement.
>
> The SAX approach is basically a partially developed parser.  In essence, a
> SAX API provides a stream of terminal tokens while performing sanity
> checks on the base format.  To put it in other words, a SAX parser is
> basically a lexer that converts a set of terminal tokens from a base
> language (say, XML) to a single terminal token from a different language
> (say, SVG).  In this process, it also implicitly performs a set of sanity
> checks on the base language.
>
> This means that when a programmer opts to parse a given document following
> the SAX approach, what he is doing is essentially picking up a specialized
> lexer and writing his own parser around that particular lexer.  So, this
> means that although the programmer avoids parsing a much larger language
> (i.e., what the SAX lexer returns as "open element A" may be "terminal
> token '<' followed by terminal token text string, with the string 'A',
> followed by token '>') he still has to set a production for his language
> and develop a parser to parse his language.

Which he will end up doing no matter what approach is used to parse the 
source document.

-- 
Ian Collins
0
Reply Ian 3/27/2011 7:44:20 PM

On Sat, 26 Mar 2011 11:26:40 +0000, Rui Maciel wrote:

>> I'm not sure whether you're playing devil's advocate or you actually
>> aren't aware of just how common a problem this is. I've lost track of
>> the number of times I've seen stuff like "sed
>> 's!<title>\(.*\)</title>!\1!' ...".
> 
> I've written a few parsers, including a couple of generic parsers for a 
> markup language, and supporting white spaces between elements (or any 
> equivalent nesting construct) is one of the most trivial things that one 
> can add to a parsers,

Dealing with whitespace may be trivial (unless the underlying I/O code is
line-oriented, as XML allows linefeeds within tags), but it's frequently
omitted.

It's less trivial to deal with the fact that attributes may appear in any
order.

0
Reply Nobody 3/28/2011 4:13:15 PM

Nobody wrote:

> Dealing with whitespace may be trivial (unless the underlying I/O code
> is line-oriented, as XML allows linefeeds within tags), but it's
> frequently omitted.

The implementation details of the IO part of a parser are irrelevant.  
Whether the IO is line-oriented or not, the IO code should never insert or 
omit information, which means that a parser only handles the information 
provided by a stream.


> It's less trivial to deal with the fact that attributes may appear in
> any order.

I don't believe that constitutes a real problem.  For example, consider an 
XML-based file format which consists of a single element "element" which 
may have a set of attributes labelled "alpha", "beta" and "gamma".  For 
that language, a valid document could be something like:

<element alpha="true" />


If the language accepts repeated attributes then a possible (and crude) 
production[1] would be something like:

<example>

document	= "<" "element" *tag "/" ">"

tag	= "alpha" "=" text_string
	/ "beta" "=" text_string
	/ "gamma" "=" text_string

</example>

The support for the tags specified in the above production in an LL parser, 
ignoring error handling, may be around 3 states (6, if we count a "ghost" 
state to push the attribute values into a data structure).
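
For illustration, the production above can be sketched in C roughly as follows.  This is a hedged example, not a reference implementation: the Attrs struct and the function names are hypothetical, values must be double-quoted, and a repeated attribute simply overwrites the earlier value.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical destination for the three attribute values. */
typedef struct { char alpha[32], beta[32], gamma[32]; } Attrs;

static const char *skip_ws(const char *p) {
    while (isspace((unsigned char)*p)) p++;
    return p;
}

/* tag = name "=" text_string; returns the position after the closing
   quote, or NULL if the input does not match any known attribute. */
static const char *attribute(const char *p, Attrs *a) {
    static const char *const names[] = { "alpha", "beta", "gamma" };
    char *slots[] = { a->alpha, a->beta, a->gamma };
    for (int i = 0; i < 3; i++) {
        size_t n = strlen(names[i]);
        if (strncmp(p, names[i], n) != 0) continue;
        const char *q = skip_ws(p + n);
        if (*q != '=') return NULL;
        q = skip_ws(q + 1);
        if (*q != '"') return NULL;
        const char *end = strchr(q + 1, '"');
        if (!end || (size_t)(end - q - 1) >= sizeof a->alpha) return NULL;
        memcpy(slots[i], q + 1, (size_t)(end - q - 1));
        slots[i][end - q - 1] = '\0';
        return end + 1;
    }
    return NULL;
}

/* document = "<" "element" *tag "/" ">" ; returns 1 if accepted. */
int parse_element(const char *p, Attrs *a) {
    memset(a, 0, sizeof *a);
    p = skip_ws(p);
    if (strncmp(p, "<element", 8) != 0) return 0;
    if (!isspace((unsigned char)p[8]) && p[8] != '/') return 0;
    p = skip_ws(p + 8);
    while (*p && *p != '/') {      /* *tag: any order, repeats allowed */
        p = attribute(p, a);
        if (!p) return 0;
        p = skip_ws(p);
    }
    return strncmp(p, "/>", 2) == 0;
}
```

Because the loop just tries each known name in turn, the order in which the attributes appear makes no difference to the code.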

If, instead, the attributes must follow a specific order (alpha, beta, 
gamma) where:
- each attribute can either be present or not
- an attribute appearing out of its rightful place is considered an error

then, the following production applies:

<example>

document	= "<" "element" *1alpha_tag *1beta_tag *1gamma_tag "/" ">"

alpha_tag	= "alpha" "=" text_string
beta_tag	= "beta" "=" text_string
gamma_tag	= "gamma" "=" text_string

</example>

The support for the tags specified in the above production in an LL parser, 
ignoring error handling, is yet again achieved by adding 3 states (6, with 
the "ghost" states).

If your language accepts any possible attribute combination then the 
production starts to become a bit more demanding.  Yet, you only need to 
deal with this if you specifically wish that your grammar accepts your 
attributes in any random order, which means that you are creating your own 
problem.
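
As a side note, XML itself requires attribute names to be unique within a tag, so "any order, each at most once" is the case a conforming parser has to handle.  In a hand-written parser this is often checked in code rather than spelled out in the production.  A minimal sketch in C, assuming the attribute names have already been lexed into an array (the function name and its interface are hypothetical):

```c
#include <string.h>

/* Returns 1 if every name is one of alpha/beta/gamma and none appears
   twice, regardless of order; returns 0 otherwise. */
int attrs_unique(const char *const *names, int count) {
    static const char *const known[] = { "alpha", "beta", "gamma" };
    unsigned seen = 0;                       /* one bit per known attribute */
    for (int i = 0; i < count; i++) {
        int k;
        for (k = 0; k < 3; k++)
            if (strcmp(names[i], known[k]) == 0) break;
        if (k == 3) return 0;                /* unknown attribute */
        if (seen & (1u << k)) return 0;      /* repeated attribute */
        seen |= 1u << k;
    }
    return 1;
}
```

The bitmask replaces the combinatorial growth of the production with one bit of state per attribute.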

Nonetheless, notice that you will be faced with the exact same problem if 
you wish to rely on a generic parser instead of one which you develop 
yourself.  In that case, you will be faced with a more demanding problem, 
as you are forced to deal with nodes in a tree structure instead of a 
simple stream of terminal tokens.


Rui Maciel

[1] http://tools.ietf.org/html/rfc5234
0
Reply rui.maciel (1746) 3/29/2011 12:04:54 AM

Ian Collins wrote:

> Which he will end up doing no matter what approach is used to parse the
> source document.

If a programmer opts for a DOM-type approach then he will be faced with a 
problem which is considerably (and needlessly) more complicated.

But considering that the programmer opts for a SAX-type approach, and 
knowing that the only thing that he gets is a tricked-out lexer and that 
he is still forced to develop his own parser, by adopting an XML library 
which provides SAX the programmer is essentially being forced to adopt a 
particular language which more often than not does not even fit the 
intended purpose.  

So, if a generic XML API doesn't eliminate the need to develop a parser to 
extract information then what's the point of adopting a generic parser to 
begin with, let alone base their document format on XML?


Rui Maciel
0
Reply rui.maciel (1746) 3/29/2011 12:15:09 AM

On 03/29/11 01:15 PM, Rui Maciel wrote:
> Ian Collins wrote:
>
>> Which he will end up doing no matter what approach is used to parse the
>> source document.
>
> If a programmer opts for a DOM-type approach then he will be faced with a
> problem which is considerably (and needlessly) more complicated.
>
> But considering that the programmer opts for a SAX-type approach, and
> knowing that the only thing that he gets is a tricked-out lexer and that
> he is still forced to develop his own parser, by adopting a XML library
> which provides SAX the programmer is essentially being forced to adopt a
> particular language which more often than not does not even fit the
> intended purpose.
>
> So, if a generic XML API doesn't eliminate the need to develop a parser to
> extract information then what's the point of adopting a generic parser to
> begin with, let alone base their document format on XML?

Indeed, that's one reason I prefer JSON.

But the choice of representation isn't always one the developer can 
make.  I have written a lot of code (in a variety of languages) to 
extract data from OpenOffice documents.  The client does not care that I 
have to work with an XML document, they just want the data from the 
document.

-- 
Ian Collins
0
Reply ian-news (9881) 3/29/2011 12:34:38 AM

On Tue, 29 Mar 2011 01:15:09 +0100, Rui Maciel wrote:

> But considering that the programmer opts for a SAX-type approach, and 
> knowing that the only thing that he gets is a tricked-out lexer and that 
> he is still forced to develop his own parser,

You make it sound as if it's a significant issue. Once you have the lexer,
XML is trivial to parse. There are no shift-reduce or reduce-reduce
conflicts, because every construct begins with a token which is unique to
that construct.

> So, if a generic XML API doesn't eliminate the need to develop a parser
> to extract information then what's the point of adopting a generic
> parser to begin with, let alone base their document format on XML?

The point is that you don't have to code dedicated utilities for common
tasks, as you can just use xslt, xquery, etc. You don't have to write
bindings for a variety of languages, as every common language already has
XML parsers (and more, e.g. tools which will generate class definitions
from a DTD or vice-versa).

In many cases, the only valid reason for /not/ using XML is efficiency (I
don't consider the vendor lock-in which proprietary formats offer to be a
"valid" reason).

0
Reply nobody (4805) 3/29/2011 1:14:28 AM

On Tue, 29 Mar 2011 01:04:54 +0100, Rui Maciel wrote:

>> Dealing with whitespace may be trivial (unless the underlying I/O code
>> is line-oriented, as XML allows linefeeds within tags), but it's
>> frequently omitted.
> 
> The implementation details of the IO part of a parser are irrelevant.  

Not if it constrains the data flow, i.e. when you don't get to carry
state over between lines, i.e. what happens when people try to parse XML
with grep/sed/perl/etc.


0
Reply nobody (4805) 3/29/2011 1:27:02 AM

Nobody wrote:

> Not if it constrains the data flow, i.e. when you don't get to carry
> state over between lines, i.e. what happens when people try to parse XML
> with grep/sed/perl/etc.

If people rely on grep to parse XML then they are intentionally creating 
their own problems.  No one decides to open a ditch with a screwdriver and 
complains that the job is simply too complicated to perform.  

The same applies to Perl if people try to employ it to parse XML as in the 
grep case.  This would, obviously, be stupid as it is quite possible to 
write parsers in Perl.


Rui Maciel
0
Reply rui.maciel (1746) 3/29/2011 10:28:36 AM

On Mar 28, 8:15 pm, Rui Maciel <rui.mac...@gmail.com> wrote:
> Ian Collins wrote:
> > Which he will end up doing no matter what approach is used to parse the
> > source document.
>
> If a programmer opts for a DOM-type approach then he will be faced with a
> problem which is considerably (and needlessly) more complicated.
>
> But considering that the programmer opts for a SAX-type approach, and
> knowing that the only thing that he gets is a tricked-out lexer and that
> he is still forced to develop his own parser, by adopting a XML library
> which provides SAX the programmer is essentially being forced to adopt a
> particular language which more often than not does not even fit the
> intended purpose.
>
> So, if a generic XML API doesn't eliminate the need to develop a parser to
> extract information then what's the point of adopting a generic parser to
> begin with, let alone base their document format on XML?
>
> Rui Maciel

You can use a DOM parser and a query language like XPath, which makes
getting information pretty simple: parse, then ask for what you need.
Of course, it is not appropriate for all uses, but it is nice for getting
what you want out of the doc.

-David
0
Reply lndresnick (326) 3/29/2011 11:58:43 AM

Nobody wrote:

> On Tue, 29 Mar 2011 01:15:09 +0100, Rui Maciel wrote:
> 
>> But considering that the programmer opts for a SAX-type approach, and
>> knowing that the only thing that he gets is a tricked-out lexer and
>> that he is still forced to develop his own parser,
> 
> You make it sound as if it's a significant issue. Once you have the
> lexer, XML is trivial to parse. There are no shift-reduce or
> reduce-reduce conflicts, because every construct begins with a token
> which is unique to that construct.

Writing your own parser is not a significant issue.  That's why people opt 
for the SAX approach.  And you only stumble on shift-reduce/reduce-reduce 
conflicts if you are trying to develop a parser for a language which 
suffers from ambiguity issues, which doesn't really apply to XML or any 
language based on XML.


>> So, if a generic XML API doesn't eliminate the need to develop a parser
>> to extract information then what's the point of adopting a generic
>> parser to begin with, let alone base their document format on XML?
> 
> The point is that you don't have to code dedicated utilities for common
> tasks, as you can just use xslt, xquery, etc. 

The point is that if a programmer tries to avoid developing a parser for 
his language because he believes it takes too much work, adopting layers 
of 3rd party libraries won't save him any work in the end, nor will it 
make his life any easier.  

Probably the only benefit a programmer gets from insisting on using those 
3rd party libraries is that he can pad his resume with lots of buzzwords, 
although in the end the only thing they demonstrate is that that 
programmer invests his time implementing bloated tools and forcing the 
wrong solutions onto jobs which otherwise would be considerably simpler 
and more efficient.


> You don't have to write
> bindings for a variety of languages

It's irrelevant.  Once you know how to develop a parser in a given 
language you are able to develop a parser in any language you know.


> , as every common language already
> has XML parsers (and more, e.g. tools which will generate class
> definitions from a DTD or vice-versa).

As I've stated before, adopting a 3rd party library that handles XML 
doesn't mean you avoided the need to develop your parser.  You are still 
forced to develop a parser, whether to parse a tree structure which is 
assembled by the 3rd party library or to implement a working parser from 
the glorified lexer which has been provided.

Adding to this, when someone mindlessly adopts a 3rd party library to 
process XML documents and does so not because he believes XML is the right 
tool for the job but simply because the 3rd party library is there and he 
doesn't know any better, that person tends to be forced to shoe-horn XML 
into an application for which it isn't suited.  This is one of the reasons 
we tend to see XML being forced into uses for which it clearly isn't the 
best tool.  Or even an adequate one.  It's one of those examples of "if 
all you have is a hammer, everything looks like a nail".


> In many cases, the only valid reason for /not/ using XML is efficiency
> (I don't consider the vendor lock-in which proprietary formats offer to
> be a "valid" reason).

There are plenty of reasons why XML is not the right tool for the job, and 
thanks to the "but there is an API for that" mentality, there are plenty 
of examples that demonstrate how XML is being forced into jobs it isn't 
fit to do.  For example, it doesn't make any sense to rely on XML to 
encode any data structure beyond trees.  And some people insist on 
pounding the XML hammer on that nail.


Rui Maciel
0
Reply rui.maciel (1746) 3/29/2011 5:18:14 PM
