There seems to be a stream of questions about doing some simple XML
scanning in AWK. There is a generalized, simple way to do it that's
white space independent. (Yes, the pun is intentional ;-)
Start your AWK program with this line:
BEGIN { RS="<"; FS=">" }
Now, as the file is processed, a record will consist of either:
- an "opening tag", in which case $1 will start with the element name
and contain all the attributes, and $2 will contain all the text until
the next tag, or
- a "closing tag", in which case $1 will consist of a "/" followed by
the element name, and $2 will contain all of the text up to the start
of the next tag.
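A minimal sketch of the record layout described above, assuming a small made-up document (the element names and the `scan` helper are hypothetical, not from the original post). It prints $1 and $2 for each record so the opening-tag and closing-tag cases can be seen side by side:

```shell
#!/bin/sh
# Hypothetical sample document, used only for illustration.
xml='<note id="1"><to>Ann</to><body>Hi there</body></note>'

# Split records on "<" and fields on ">", then show $1 (tag) and $2 (text).
# NR > 1 skips the empty record that precedes the first "<".
# printf '%s' feeds the data without a trailing newline, so the last
# record's $2 is empty rather than a lone newline.
scan() {
    printf '%s' "$1" | awk 'BEGIN { RS = "<"; FS = ">" }
        NR > 1 { printf "[%s] [%s]\n", $1, $2 }'
}

scan "$xml"
```

For this input the sketch should print `[note id="1"] []`, `[to] [Ann]`, `[/to] []`, `[body] [Hi there]`, `[/body] []` and `[/note] []`, one per line.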
This is only useful for scanning, as it makes no attempt to deal with
element nesting. And, remember, "all of the text" includes all white
space, including normal end-of-line terminators.
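Since $2 carries line breaks and indentation verbatim, a scan over pretty-printed input usually wants to trim it. The following is only a sketch: the sample data and the `items` helper are hypothetical, and the `$1 == "item"` test assumes the element has no attributes.

```shell
#!/bin/sh
# Hypothetical pretty-printed input; the newlines and indentation end up in $2.
xml='<list>
  <item>one</item>
  <item>
    two
  </item>
</list>'

# Print the text of each <item>, with surrounding white space trimmed.
items() {
    printf '%s\n' "$1" | awk 'BEGIN { RS = "<"; FS = ">" }
        $1 == "item" {
            text = $2
            gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", text)   # trim both ends
            print text
        }'
}

items "$xml"
```

For this input the sketch should print `one` and `two` on separate lines. An explicit bracket list is used instead of `[[:space:]]` because some older awk implementations do not support POSIX character classes.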
-jh
jhart (9)
6/13/2008 2:38:00 AM
On Jun 13, 4:38 am, jh...@mail.avcnet.org wrote:
> There seem to be a stream of questions about doing some simple XML
> scanning in AWK. There is a generalized, simple way to do it that's
> white space independent. (Yes, the pun is intentional ;-)
>
> Start your AWK program with this line:
>
> BEGIN { RS="<"; FS=">" }
>
> Now, as the file is processed, a record will consist of either:
>
> - an "opening tag", in which case $1 will start with the element name
> and contain all the attributes, and $2 will contain all the text until
> the next tag, or
>
> - a "closing tag", in which case $1 will consist of a "/" followed by
> the element name, and $2 will contain all of the text up to the start
> of the next tag.
>
> This is only useful for scanning, as it makes no attempt to deal with
> element nesting. And, remember, "all of the text" includes all white
> space, including normal end-of-line terminators.
>
> -jh
I am not quite sure what your actual question is but you might wish to
have a look at:
http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Reading-XML-Data-with-POSIX-AWK
Please note that, contrary to common belief, ">" is legal character data
and hence not really suitable as FS or RS.
> The right angle bracket (>) *may* be represented using the string "&gt;" ....
See: http://www.w3.org/TR/2006/REC-xml-20060816/#syntax
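To make the concern concrete, here is a small sketch with an unescaped ">" in the character data. The sample element and the `naive` and `whole` helpers are hypothetical. The second helper is just one possible workaround, cutting each record at its first ">" rather than relying on $2; it still assumes there is no ">" inside the tag itself.

```shell
#!/bin/sh
# Hypothetical input with an unescaped ">" in the character data (legal XML).
xml='<expr>a > b</expr>'

# With FS=">", the text is split across $2 and $3, so $2 alone is truncated.
naive() {
    printf '%s' "$1" | awk 'BEGIN { RS = "<"; FS = ">" }
        $1 == "expr" { print "[" $2 "]" }'
}

# Workaround sketch: cut the record at its first ">" instead of trusting $2.
whole() {
    printf '%s' "$1" | awk 'BEGIN { RS = "<" }
        NR > 1 {
            p = index($0, ">")              # position of the first ">"
            tag = substr($0, 1, p - 1)      # everything before it
            if (tag == "expr") print "[" substr($0, p + 1) "]"
        }'
}

naive "$xml"
whole "$xml"
```

For this input `naive` should print `[a ]`, while `whole` should print `[a > b]`.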
Hermann
Hermann
6/13/2008 7:33:24 AM
jhart@mail.avcnet.org wrote:
> BEGIN { RS="<"; FS=">" }
>
> Now, as the file is processed, a record will consist of either:
>
> - an "opening tag", in which case $1 will start with the element name
> and contain all the attributes, and $2 will contain all the text until
> the next tag, or
>
> - a "closing tag", in which case $1 will consist of a "/" followed by
> the element name, and $2 will contain all of the text up to the start
> of the next tag.
>
> This is only useful for scanning, as it makes no attempt to deal with
> element nesting. And, remember, "all of the text" includes all white
> space, including normal end-of-line terminators.
Unless the XML data has a fixed, predictable format, or you have very
specific requirements, IMHO using awk (or sed, or any other line-oriented
tool for that matter) for serious, generic parsing and processing of XML
is at best awkward (pardon the pun). As Hermann already said, many tools
exist that can reliably process XML (xmlgawk if you want an awk-like
interface), or that can turn it into a format processable with the usual
tools. Furthermore, ">" is a valid character *inside* a tag (for example,
in an attribute value), so this might make your method fragile.
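As an illustration of that fragility, the sketch below feeds the RS="<"; FS=">" setup an opening tag whose attribute value contains a literal ">". The sample element and the `fields` helper are hypothetical; the point is only to show where the fields end up.

```shell
#!/bin/sh
# Hypothetical input: a literal ">" inside an attribute value, which XML allows.
xml='<a title="x > y">link</a>'

# Show how the opening-tag record is split: the attribute is cut in half and
# the element text moves from $2 to $3.
fields() {
    printf '%s' "$1" | awk 'BEGIN { RS = "<"; FS = ">" }
        NR == 2 { for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

fields "$xml"
```

For this input the sketch should print `$1=[a title="x ]`, `$2=[ y"]` and `$3=[link]`, so neither $1 nor $2 holds what the simple scheme expects.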
--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o*o+O+x,x+x*o*o-O,X-O}'
Dave
6/13/2008 9:04:59 AM