[Q] Text vs Binary Files


Assume that disk space is not an issue 
(the files will be small < 5k in general for the purpose of storing
preferences)

Assume that transportation to another OS may never occur.


Are there any solid reasons to prefer text files over binary
files?

Some of the reasons I can think of are:

-- should transportation to another OS become useful or needed, 
   the text files would be far easier to work with

-- tolerant of basic data type size changes (enumerated types have been
   known to change in size from one version of a compiler to the next)

-- if a file becomes corrupted, it would be easier to find and repair
   the problem potentially avoiding the annoying case of just 
   throwing it out

I would like to begin using XML for the storage of application
preferences, but first I need to persuade others, who are convinced that
binary files are the superior method, that text files really are the way
to go.

Thoughts? Comments?
Reply egDfAusenetE5fz (78) 5/27/2004 1:23:11 AM

On Thu, 27 May 2004, Eric wrote:
>
> Assume that disk space is not an issue [...]
> Assume that transportation to another OS may never occur.
> Are there any solid reasons to prefer text files over binary files?
>
> Some of the reasons I can think of are:
>
> -- should transportation to another OS become useful or needed,
>    the text files would be far easier to work with

  I would guess this is wrong, in general.  Think of the difference
between a DOS/Win32 text file, a MacOS text file, and a *nix text
file (hint: linefeeds and carriage returns).  Now think of the
difference between the same systems' binary files (hint: nothing).
  There do exist many free tools to deal with line-ending troubles,
though, so this isn't really a disadvantage; just a counter to your
claim.

> -- tolerant of basic data type size changes (enumerated types have been
>    known to change in size from one version of a compiler to the next)

  It's about five minutes' work to write portable binary I/O functions
in most languages, if you're worried about the size of 'int' on your
next computer or something.  Check out any file-format standard for
ideas, and Google "network byte order."  If you're coming from a C
background, then you'll understand when I tell you that 'fwrite' should
never, ever be applied to anything but buffers of 'unsigned char'! :)
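
As a rough illustration of the "portable binary I/O" idea above, here is a
minimal sketch in Python (chosen only for brevity; the post itself has C in
mind). The record layout and the names pack_prefs and unpack_prefs are
hypothetical, not taken from any post in this thread. The point is that the
byte order and field widths are fixed by the format string, not by the
machine.

```python
import struct

# Hypothetical preference record: a version number, a window width,
# and a window height, each stored as an unsigned 32-bit integer.
# ">" selects big-endian ("network") byte order with standard sizes,
# so the layout does not depend on the machine running the code.
RECORD_FORMAT = ">III"

def pack_prefs(version, width, height):
    """Serialize three unsigned 32-bit integers to exactly 12 bytes."""
    return struct.pack(RECORD_FORMAT, version, width, height)

def unpack_prefs(data):
    """Inverse of pack_prefs; returns a (version, width, height) tuple."""
    return struct.unpack(RECORD_FORMAT, data)

blob = pack_prefs(1, 800, 600)
print(blob.hex())          # 000000010000032000000258
print(unpack_prefs(blob))  # (1, 800, 600)
```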

> -- if a file becomes corrupted, it would be easier to find and repair
>    the problem potentially avoiding the annoying case of just
>    throwing it out

  Yes, definitely.  Also, it's much easier to tell if text has been
corrupted in transmission --- it won't look like text anymore!
Binary always looks like binary; you need explicit checksums and
guards against corruption there.  (Again, see file-format standards,
especially my favorite, the PNG image standard.)
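
The "explicit checksums" point can be sketched as follows. This is only
loosely modelled on the idea of a length-prefixed, CRC-protected chunk; it is
not the PNG chunk format, and the names wrap and unwrap are hypothetical.

```python
import struct
import zlib

def wrap(payload):
    """Prefix payload with its length and append a CRC-32, both big-endian."""
    header = struct.pack(">I", len(payload))
    crc = struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)
    return header + payload + crc

def unwrap(blob):
    """Return the payload, or raise ValueError if the blob looks damaged."""
    if len(blob) < 8:
        raise ValueError("blob too short")
    (length,) = struct.unpack(">I", blob[:4])
    if len(blob) != length + 8:
        raise ValueError("length mismatch")
    payload = blob[4:4 + length]
    (stored_crc,) = struct.unpack(">I", blob[4 + length:])
    if zlib.crc32(payload) & 0xFFFFFFFF != stored_crc:
        raise ValueError("checksum mismatch")
    return payload

blob = wrap(b"width=800")
print(unwrap(blob))  # b'width=800'
```

Flipping any single bit of the payload makes unwrap raise ValueError instead
of silently returning bad data.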

> I would like to begin using XML for the storage of application
> preferences, but I need to convince others who are convinced that binary
> files are the superior method that text files really are the way to go.

  One major advantage of plain text is that it can be sent over HTTP
and other Web protocols without "armoring."  You can put plain text
in the body of a POST request, for example, where I doubt arbitrary
bytes would be accepted.  (I dunno, though.)
  Along the same lines, you can email your data files back and forth
in the body of an email message, rather than mucking about with
attachments.

  The disadvantage is size; but you don't seem worried about that.
Another possible disadvantage would be that text is easily read and
reverse-engineered, if you're worried about that (e.g., proprietary
config files or savefiles for a game) --- but then you can always
encrypt whatever you don't want read immediately.  [Whatever you
don't want read *ever*, you simply don't give to your users, because
they'll crack anything given enough time.]

HTH,
-Arthur


Reply ajo (1601) 5/27/2004 2:03:42 AM


Arthur J. O'Dwyer <ajo@nospam.andrew.cmu.edu> wrote:

> > -- should transportation to another OS become useful or needed,
> >    the text files would be far easier to work with
> 
> I would guess this is wrong, in general.  Think of the difference
> between a DOS/Win32 text file, a MacOS text file, and a *nix text
> file (hint: linefeeds and carriage returns). 

Which is why I mentioned at the end using a solid XML parser to deal
with such issues transparently. I likely wouldn't consider using a text
file if something like XML and solid parsers weren't available and free.

> Now think of the
> difference between the same systems' binary files (hint: nothing).

Well, you say 'same systems'...so, yes, in general, reading & writing a
binary file that will never be moved to another OS shouldn't present any
serious issues. (or am I wrong here?)

However, the point was that it could be moved, in which case dealing
with big/little endian issues would become important.

> > -- tolerant of basic data type size changes (enumerated types have
> >    been known to change in size from one version of a compiler to 
> > the next)
>
> It's about five minutes' work to write portable binary I/O functions
> in most languages

Ah, but it's five minutes I don't want to spend, especially since the
time would need to be spent every time something changed. I believe in
fixing a problem once.

Plus, the potential for spending time attempting to figure out why the
@#$%@$ isn't being read properly isn't accounted for here.

> Another possible disadvantage would be that text is easily read and
> reverse-engineered

In my case, this is a benefit.

-- 
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both 
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Reply egDfAusenetE5fz (78) 5/27/2004 3:55:57 AM

egDfAusenetE5fz@verizon.net (Eric) wrote in message news:<1geew2n.10d70ck1mpdeeN%egDfAusenetE5fz@verizon.net>...
> Assume that disk space is not an issue 
> (the files will be small < 5k in general for the purpose of storing
> preferences)
> 
> Assume that transportation to another OS may never occur.
> 
> 
> Are there any solid reasons to prefer text files over binary
> files?
> 
> Some of the reasons I can think of are:
> 
> -- should transportation to another OS become useful or needed, 
>    the text files would be far easier to work with
> 
> -- tolerant of basic data type size changes (enumerated types have been
>    known to change in size from one version of a compiler to the next)
> 
> -- if a file becomes corrupted, it would be easier to find and repair
>    the problem potentially avoiding the annoying case of just 
>    throwing it out

All good reasons...

> I would like to begin using XML for the storage of application
> preferences, but I need to convince others who are convinced that binary
> files are the superior method that text files really are the way to go.
> 
> Thoughts? Comments?

For your application I think you have it right.  Preferences in an XML
text file are more flexible for the user/admin (they can be edited by
hand as a last resort) and also for you as developer: a text file can
have entries listed 'out of order', and with the right tags and parsing
it won't really matter.  For the same reasons they can also be easier to
change and add to over time.
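
A minimal sketch of that flexibility, using Python's standard
xml.etree.ElementTree module. The <preferences>/<pref name="..."> layout, the
DEFAULTS table, and the load_prefs function are all assumptions made up for
this example, not a format anyone in the thread proposed. Entries may appear
in any order, missing ones fall back to defaults, and unknown ones are
ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical defaults for the preferences this reader knows about.
DEFAULTS = {"theme": "light", "font_size": "12"}

def load_prefs(xml_text):
    """Read known <pref name="..."> entries; ignore unknown ones."""
    prefs = dict(DEFAULTS)
    root = ET.fromstring(xml_text)
    for node in root.iter("pref"):
        name = node.get("name")
        if name in prefs:
            prefs[name] = (node.text or "").strip()
    return prefs

# An older file that predates the font_size setting.
OLD_FILE = """<preferences>
  <pref name="theme">dark</pref>
</preferences>"""

# A newer file: different order, plus an entry this reader does not know.
NEW_FILE = """<preferences>
  <pref name="autosave">on</pref>
  <pref name="font_size">14</pref>
  <pref name="theme">dark</pref>
</preferences>"""

print(load_prefs(OLD_FILE))  # {'theme': 'dark', 'font_size': '12'}
print(load_prefs(NEW_FILE))  # {'theme': 'dark', 'font_size': '14'}
```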

The main reasons for using binary files to store preferences are:

-security (but they're crackable, and text files can be encrypted
anyway)
-programming ease: it can be easier to just read and write a preference
structure than to attempt robust parsing of a given set of text
items; the text could be messed with, after all
-size: relevant if they need to be shuttled around a network a lot or
will take up lots of disk space

It sounds like they don't apply in your case.
Reply gswork (648) 5/27/2004 8:01:17 AM

On Thu, 27 May 2004, Eric wrote:
>
> Arthur J. O'Dwyer <ajo@nospam.andrew.cmu.edu> wrote:
> [Eric wrote]
> > > -- should transportation to another OS become useful or needed,
> > >    the text files would be far easier to work with
> >
> > I would guess this is wrong, in general.  Think of the difference
> > between a DOS/Win32 text file, a MacOS text file, and a *nix text
> > file (hint: linefeeds and carriage returns).
>
> Which is why I mentioned at the end using a solid XML parser to deal
> with such issues transparently. I likely wouldn't consider using a text
> file if something like XML and solid parsers weren't available and free.

  Ah, but what do you do when the XML standard changes? :)  Seriously,
this is something you really need to consider IMHO.  (Of course, this
is cross-posted to an XML group, and I don't know much about XML, so
don't take my word about anything...)  There are XML Version Foo parsers
available now, but when XML Version Bar comes out, there'll be lag time.
Think of the messes with HTML 4.0 [about which I know little] and C'99
[about which I know much].
  Free parsers *are* nice, though, no dispute there. :)

> > Now think of the
> > difference between the same systems' binary files (hint: nothing).
>
> Well, you say 'same systems'...so, yes, in general, reading & writing a
> binary file that will never be moved to another OS shouldn't present any
> serious issues. (or am I wrong here?)

  Misunderstood.  By "the same systems," I meant the systems I just
mentioned: DOS/Win32, Unix, and MacOS.  Their binary data formats are
identical.

> > > -- tolerant of basic data type size changes (enumerated types have
> > >    been known to change in size from one version of a compiler to
> > > the next)
> >
> > It's about five minutes' work to write portable binary I/O functions
> > in most languages
>
> Ah, but it's five minutes I don't want to spend,

  Versus five minutes trying to make your free XML parser compile?
I'd take five minutes with binary files any day. ;-)

> especially since the
> time would need to be spent every time something changed. I believe in
> fixing a problem once.

  So do I.  That's why you spend the five minutes writing your portable
binary I/O functions.  Then you never need to write them again.  For
a not-so-hot-but-portable-across-aforementioned-systems example, see
http://www.contrib.andrew.cmu.edu/~ajo/free-software/ImageFmtc.c,
functions 'fread_endian' and 'bwrite_endian'.  Write once, use many
times.
  The number of bits in a 32-bit integer is *never* going to change.
The number of bits in a machine word is *definitely* going to change.
This is why all existing file-format standards explicitly state that
they are dealing with 32-bit integers, not machine words: so the
file-format code never has to change, no matter where it runs.
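
To make the "write once, use many times" point concrete, here is a small
sketch of byte-at-a-time 32-bit I/O. It is not the code from the URL above
(which is not reproduced here); the names write_u32_be and read_u32_be are
hypothetical, and Python is used only to keep the example short. Because the
value is split into bytes with shifts and masks, the result does not depend
on the host's word size or byte order.

```python
import io

def write_u32_be(stream, value):
    """Write value as four bytes, most significant byte first."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    stream.write(bytes([(value >> 24) & 0xFF,
                        (value >> 16) & 0xFF,
                        (value >> 8) & 0xFF,
                        value & 0xFF]))

def read_u32_be(stream):
    """Read four bytes and reassemble them, most significant byte first."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("expected 4 bytes")
    return (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]

buf = io.BytesIO()
write_u32_be(buf, 0x12345678)
print(buf.getvalue().hex())   # 12345678
buf.seek(0)
print(hex(read_u32_be(buf)))  # 0x12345678
```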

> Plus, the potential for spending time attempting to figure out why the
> @#$%@$ isn't being read properly isn't accounted for here.

  Of course not.  I/O is trivial.  It's your *algorithms* that are
going to be broken; and they'd be broken no matter what output format
you used.

> > Another possible disadvantage would be that text is easily read and
> > reverse-engineered
>
> In my case, this is a benefit.

  Good. :)

-Arthur
Reply ajo (1601) 5/27/2004 2:07:22 PM

Arthur J. O'Dwyer <ajo@nospam.andrew.cmu.edu> wrote:

> On Thu, 27 May 2004, Eric wrote:
> >
> > Arthur J. O'Dwyer <ajo@nospam.andrew.cmu.edu> wrote:
> > [Eric wrote]
> > > > -- should transportation to another OS become useful or needed,
> > > >    the text files would be far easier to work with
> > >
> > > I would guess this is wrong, in general.  Think of the difference
> > > between a DOS/Win32 text file, a MacOS text file, and a *nix text
> > > file (hint: linefeeds and carriage returns).
> >
> > Which is why I mentioned at the end using a solid XML parser to deal
> > with such issues transparently. I likely wouldn't consider using a text
> > file if something like XML and solid parsers weren't available and free.
> 
>   Ah, but what do you do when the XML standard changes? :)  

Please correct me if I am wrong, but the design of XML already takes
this into account. In other words, the idea that it can and will change
is a part of the design - this is one reason why XML is such a nifty
technology.

> Misunderstood.  By "the same systems," I meant the systems I just
> mentioned: DOS/Win32, Unix, and MacOS.  Their binary data formats are
> identical.

What do you mean by 'their binary data formats are identical'?...this
would seem to imply that big/little endian issues are a thing of the
past...?

> > > > -- tolerant of basic data type size changes (enumerated types have
> > > >    been known to change in size from one version of a compiler to
> > > > the next)
> > >
> > > It's about five minutes' work to write portable binary I/O functions
> > > in most languages
> >
> > Ah, but it's five minutes I don't want to spend,
> 
>   Versus five minutes trying to make your free XML parser compile?

Binaries of the better parsers are available, so this is a non-issue.
:-) 

> > Plus, the potential for spending time attempting to figure out why the
> > @#$%@$ isn't being read properly isn't accounted for here.
> 
> Of course not.  I/O is trivial.  

Once you track down the problem...however, it would not be uncommon to
think the problem lies elsewhere first and spend hours before finding
the trivial fix.

> It's your *algorithms* that are
> going to be broken; and they'd be broken no matter what output format
> you used.

With XML, the risk of this is far lower (if it really exists at all),
as long as you're not changing the tag names or what they mean.
Reply egusenet (56) 5/27/2004 2:40:10 PM

On Thu, 27 May 2004, Eric wrote:
>
> Arthur J. O'Dwyer <ajo@nospam.andrew.cmu.edu> wrote:
> >
> >   Ah, but what do you do when the XML standard changes? :)
>
> Please correct me if I am wrong, but the design of XML already takes
> this into account. In other words, the idea that it can and will change
> is a part of the design - this is one reason why XML is such a nifty
> technology.

  Probably true.  I don't know much about XML's namespacing rules
(by which I mean the rules that say that <foo> is an okay tag for
a user to create, but <bar> could be given special meaning by
future standards).  [If anyone wants to give me a lecture, that's
fine; otherwise, I'll just look it up when I need to know. ;) ]

> > Misunderstood.  By "the same systems," I meant the systems I just
> > mentioned: DOS/Win32, Unix, and MacOS.  Their binary data formats are
> > identical.
>
> What do you mean by 'their binary data formats are identical'?...this
> would seem to imply that big/little endian issues are a thing of the
> past...?

  Yup.  The vast majority of computers these days use eight-bit
byte-oriented transmission and storage protocols.  Whatever bit-ordering
problems there are have moved "downstream" to those people involved in
the construction of hardware that has to choose whether to transmit
bit 0 or bit 7 first (and I'm sure they have their own relevant
standards in those fields, too).
  Again, I refer you to standards like RFCs 1950, 1951, and 1952
(Google "RFC 1950").  Note the utter lack of concern with the vagaries
of the machine.  We have indeed moved past big/little-endian wars;
now, whoever's[1] writing the relevant standard simply says, "All eggs
distributed according to the Fred protocol must be broken at the
big end," and that's the end of *that!*


> > > Plus, the potential for spending time attempting to figure out why the
> > > @#$%@$ isn't being read properly isn't accounted for here.
> >
> > Of course not.  I/O is trivial.
>
> Once you track down the problem...however, it would not be uncommon to
> think the problem lies elsewhere first and spend hours before finding
> the trivial fix.

  You misunderstand me.  I/O is trivial; thus, after the first five
minutes spent making sure the trivial code is correct (which is trivial
to prove), you never need to touch it or look at it again.  If you
never touch it, you can't possibly introduce bugs into it.  And if it
starts out bugfree (trivially proven), and never has any bugs introduced
into it (because it's never modified), then it will remain bugfree
forever.  (And thus you never need to fix it, trivially or not.)

  I'm completely serious and not using hyperbole at all when I say
I/O is trivial.  It really is.

-Arthur

[1] - In speech I'd say "who'sever writing...," but that looks
awful no matter how I spell it.  Whosever?  Whos'ever?  Who's-ever?
Yuck. :(
Reply ajo (1601) 5/27/2004 4:11:01 PM

A few things I've learned where computers and their inner workings are
involved:

They change constantly
Operating systems change
Formats change
Philosophies change
They change on a very regular basis
Over a period of years, the changes become more complex

No matter what you create today,  there's an extremely good chance that it
won't work tomorrow.  Binary file formats change.  Text file formats change.
(Side note: binary files are NOT all the same -- no one even has an accepted
definition of what a binary file is, for that matter.)

If you want your "stuff" to last as long as possible...  then "KISS",  "keep it
simple stupid".

To me...  that means....
   Do not use Binary.
   Do not use XML either.

If you don't think XML will change in ways you can't cope with....
then go ahead and use it.  A few years down the road, we will talk and see how
things are going.

Personally...  I'd go for simple text files.
They're less likely to undergo radical format changes than most other formats.
They're readable to humans.
You can make printed-out backup copies for the extreme event of total computer
file losses.

Oh...  if you think that your stuff won't ever be transferred to another OS... 
Are you using a version of Windows by chance??  Windows changes with every
single version.  Windows even changes with revisions of a single version of
itself.  Microsoft will surely continue creating more "advanced" versions of
its OS for the next decade or two on a regular basis.  And... you'll be pretty
much forced to upgrade (at least if you want to use newly developed programs).
Thus... you will be changing OS's.  Think about it...  in only 15 years...
how many different versions of Windows have we been subjected to?  And we've
all had compatibility problems.

As for XML...  I really don't know much about it.  So...  I really don't know
how nice it is to work with.   However,   I have seen some XML used in a major
piece of engineering software produced where I work.  The created input & data
files the program produces are in XML.  The files are Extreeemely long.  And
they are VERY difficult to read.   They take hundreds of lines to describe data
that could be very easily, cleanly and readably laid out in about 30 to 50
lines.  What's more... the data in the XML format is hard on the eyes.  Now... 
I don't know if this is typical XML or not... and  I am sure that the guys who
produced it are not XML aces.  I just know that the product produced at my work
place is IMHO... total insanity.  It smacks of "maintenance headaches" about
as much as bad spaghetti code.   If "good" XML is not a LOT better than what
I've seen,  then...  I would not personally go that route.

Dan  :-)

Reply dantex1 (525) 5/27/2004 5:37:38 PM

Dan  Tex1 <dantex1@aol.com> wrote:

> The files are Extreeemely long.  And
> they are VERY difficult to read.   

XML files are only meant to be read by human eyes as a last resort. You
should be using an app that will read and interpret the content for you.

> They take hundreds of lines to describe data
> that could be very easily, cleanly and readably laid out in about 30 to 50
> lines. 

Probably true, but you would lose the benefits that XML
provides...mostly the ease with which additional fields could be added.
Reply egusenet (56) 5/27/2004 6:08:41 PM

On Thu, 27 May 2004, Eric wrote:

> Assume that disk space is not an issue
> (the files will be small < 5k in general for the purpose of storing
> preferences)
>
> Assume that transportation to another OS may never occur.
>
>
> Are there any solid reasons to prefer text files over binary
> files?
>
> Some of the reasons I can think of are:
>
> -- should transportation to another OS become useful or needed,
>    the text files would be far easier to work with
>
> -- tolerant of basic data type size changes (enumerated types have been
>    known to change in size from one version of a compiler to the next)
>
> -- if a file becomes corrupted, it would be easier to find and repair
>    the problem potentially avoiding the annoying case of just
>    throwing it out
>
> I would like to begin using XML for the storage of application
> preferences, but I need to convince others who are convinced that binary
> files are the superior method that text files really are the way to go.
>
> Thoughts? Comments?

In favour of binary, if a customer has access to it, they will be more
likely to muck with a text file than a binary file.

In favour of text, will you ever need to diff the files (old version
against new version)? Will you need to source control and/or merge the
files? Easier to do as text.
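
A quick sketch of why diffing favours text, using Python's standard difflib
module. The two preference files and the diff_prefs helper are invented for
this example.

```python
import difflib

# Two hypothetical versions of a plain-text preferences file.
old = "theme=light\nfont_size=12\n"
new = "theme=dark\nfont_size=12\nautosave=on\n"

def diff_prefs(old_text, new_text):
    """Return a unified diff of two preference files as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="prefs.old", tofile="prefs.new", lineterm=""))

for line in diff_prefs(old, new):
    print(line)
```

The output marks the changed theme line and the added autosave line, which is
also what a source-control tool would show for the same two files.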

-- 
Send e-mail to: darrell at cs dot toronto dot edu
Don't send e-mail to vice.president@whitehouse.gov
Reply darrell13 (357) 5/27/2004 7:29:02 PM

Arthur J. O'Dwyer wrote:
> On Thu, 27 May 2004, Eric wrote:
>>Which is why I mentioned at the end using a solid XML parser to deal
>>with such issues transparently. I likely wouldn't consider using a text
>>file if something like XML and solid parsers weren't available and free.
> 
>   Ah, but what do you do when the XML standard changes? :)  Seriously,
> this is something you really need to consider IMHO.  (Of course, this
> is cross-posted to an XML group, and I don't know much about XML, so
> don't take my word about anything...)  There are XML Version Foo parsers
> available now, but when XML Version Bar comes out, there'll be lag time.
> Think of the messes with HTML 4.0 [about which I know little] and C'99
> [about which I know much].
>   Free parsers *are* nice, though, no dispute there. :)

XML was created to solve the problem of the HTML version mess. The 
specification itself is very flexible (yet precise) with the result that 
the language can be extended without needing a change to the 
specification (or parsers based on the specification).

It's so good it's almost magical.

>   The number of bits in a 32-bit integer is *never* going to change.
> The number of bits in a machine word is *definitely* going to change.
> This is why all existing file-format standards explicitly state that
> they are dealing with 32-bit integers, not machine words: so the
> file-format code never has to change, no matter where it runs.

IIRC in C++ (and I'm sure C) there is no such guarantee of a "32-bit 
integer" - the int type can be more than 32-bits.

>>Plus, the potential for spending time attempting to figure out why the
>>@#$%@$ isn't being read properly isn't accounted for here.
> 
>   Of course not.  I/O is trivial.  It's your *algorithms* that are
> going to be broken; and they'd be broken no matter what output format
> you used.

Unless you're using somebody else's parser, which may not be broken. 
Such as libxml2 which is *very* unlikely to be broken.

-- 
Ben M.
Reply saint_abroadremove (144) 5/27/2004 10:20:53 PM

Dan Tex1 wrote:
> If you don't think XML will change in ways you can't cope with....
> then go ahead and use it.  A few years down the road, we will talk and see how
> things are going.

XML can be extended without changing the specifications on which it is 
based. Thus XML can be extended without changing the parser.

Even if the specifications encounter a *major* upgrade (highly unlikely 
because of its current flexibility) all parsers based on previous 
specifications will work with XML based on later specifications.

Besides, I doubt you'd even remember me (or this post) in a few years. ;-)

> Personally...  I'd go for simple text files.
> They're less likely to undergo radical format changes than most other formats.
> They're readable to humans.
> You can make printed-out backup copies for the extreme event of total computer
> file losses.

IMO XML has all those going for it (arguably stronger on the first
point than simple text preferences).

> As for XML...  I really don't know much about it.  So...  I really don't know
> how nice it is to work with.   However,   I have seen some XML used in a major
> piece of engineering software produced where I work.  The created input & data
> files the program produces are in XML.  The files are Extreeemely long.  And
> they are VERY difficult to read.   They take hundreds of lines to describe data
> that could be very easily, cleanly and readably laid out in about 30 to 50
> lines.  What's more... the data in the XML format is hard on the eyes.  Now... 
> I don't know if this is typical XML or not... and  I am sure that the guys who
> produced it are not XML aces.  I just know that the product produced at my work
> place is IMHO... total insanity.  It smacks of "maintenance headaches" about
> as much as bad spaghetti code.   If "good" XML is not a LOT better than what
> I've seen,  then...  I would not personally go that route.

I suggest either
a) they designed it more complex than it needed to be; or
b) the data relationships really are complex.

The XML file need only be as complex as the data relationships.

-- 
Ben M.
Reply saint_abroadremove (144) 5/27/2004 10:50:44 PM

On Thu, 27 May 2004, Ben Measures wrote:
>
> XML was created to solve the problem of the HTML version mess. The
> specification itself is very flexible (yet precise) with the result that
> the language can be extended without needing a change to the
> specification (or parsers based on the specification).
>
> It's so good it's almost magical.

  Okay, I'm convinced, then. :)


> >   The number of bits in a 32-bit integer is *never* going to change.
> > The number of bits in a machine word is *definitely* going to change.
> > This is why all existing file-format standards explicitly state that
> > they are dealing with 32-bit integers, not machine words: so the
> > file-format code never has to change, no matter where it runs.
>
> IIRC in C++ (and I'm sure C) there is no such guarantee of a "32-bit
> integer" - the int type can be more than 32-bits.

  More is better.  A 33-bit integer can hold all the values that a
32-bit integer can, and then some.  If the particular algorithms in
question are defined not to use the "and then some" part of the integer,
that's fine.  (The at-least-32-bit type in C and C++ is 'long int'.
When I use the word 'integer', I'm using it in the same sense as the
C standard: to mean "any integral type," not to mean "'int' type."
Just in case that was confusing you.)

  *Again* I urge the consultation of the RFCs defining any standard
binary file format, and the notice of the complete lack of regard
for big-endian/little-endian/19-bit-int/37-bit-int issues.  At the
byte level, these things simply never come up.


> >>Plus, the potential for spending time attempting to figure out why the
> >>@#$%@$ isn't being read properly isn't accounted for here.
> >
> >   Of course not.  I/O is trivial.  It's your *algorithms* that are
> > going to be broken; and they'd be broken no matter what output format
> > you used.
>
> Unless you're using somebody else's parser, which may not be broken.
> Such as libxml2 which is *very* unlikely to be broken.

  I don't see the connection between my statement and your reply.
What is the antecedent of your "Unless"?  (Literally, you're saying
that if you use libxml2 for I/O, then your non-I/O-related algorithms
will have no bugs.  This is what used to be called "spooky action at a
distance," and I don't think it applies to code. :)

-Arthur
Reply ajo (1601) 5/28/2004 2:05:35 PM

Dan  Tex1 writes:

> As for XML...  [...] I have seen some XML used in a major piece of
> engineering software produced where I work.  The created input &
> data files the program produces are in XML.

This sounds like my new world (my new position).  Enterprise Application
Integration (sounds grand, don't it :-), and a lot of the data exchange
is done with XML.

> And they are VERY difficult to read.  They take hundreds of lines to
> describe data that could be very easily, cleanly and readably laid out
> in about 30 to 50 lines.

Ah, friend, you know not whereof you speak.  XML is structured and
tagged data.  That is, the data is arranged in a tree structure, which
turns out to be very useful, and each datum has a name.

It's actually kinda cool.

The hierarchical organization could be quite useful for storing prefs,
I'd think.

> What's more... the data in the XML format is hard on the eyes.

Not unlike RTF or HTML, it's not meant to be seen naked.  (-:


-- 
|_ CJSonnack <Chris@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL  |
|_____________________________________________|_______________________|
Reply Chris7 (2511) 5/28/2004 2:38:14 PM

Eric writes:
Arthur J. O'Dwyer writes:

E> ...the files will be [...] for the purpose of storing preferences)
E> 
E> Assume that transportation to another OS may never occur.
E> [...]
E> -- should transportation to another OS become useful or needed,
E>    the text files would be far easier to work with

A> I would guess this is wrong, in general.  Think of the difference
A> between a DOS/Win32 text file, a MacOS text file, and a *nix text
A> file (hint: linefeeds and carriage returns).  Now think of the
A> difference between the same systems' binary files (hint: nothing)

Sizes are different.  Endian-ness is different.  Formats may be
different (think: floating point and other more exotic formats).

Consider finding the file in five years and not having any of the
previous tools that used it.  Which is likely to be easier to get
the data out of: text or binary?

How often have we had people come here to ask for help in deciphering
a binary file?

A> The vast majority of computers these days use eight-bit
A> byte-oriented transmission and storage protocols.  Whatever
A> bit-ordering problems there are have moved "downstream" to
A> those people involved in the construction of hardware that
A> has to choose whether to transmit bit 0 or bit 7 first...

So what happens when I transmit a binary floating point number to
a machine with a different format?

I agree these issues are quite solvable, but I think they are
more *Easily* solvable with text as an intermediate format.
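
As a small illustration of text as an intermediate format for floating-point
values, the sketch below relies on the fact that Python 3's repr() of a float
produces a decimal string that float() converts back to exactly the same
value. The helper names float_to_text and text_to_float are hypothetical.

```python
def float_to_text(x):
    """Render a float as a decimal string that round-trips exactly."""
    return repr(x)

def text_to_float(s):
    """Parse a decimal string back into a float."""
    return float(s)

for value in (0.1, 1.0 / 3.0, 1e308, -2.5):
    text = float_to_text(value)
    # The receiving side only needs to parse decimal text; it never sees
    # the sender's in-memory representation.
    assert text_to_float(text) == value

print(float_to_text(0.1))  # 0.1
```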

A> It's about five minutes' work to write portable binary I/O
A> functions in most languages, if you're worried about the
A> size of 'int' on your next computer or something.

Might be a little more than five minutes, but I agree it's not hard.

But what IS five minutes' work is a CR/CRLF/LF converter!  (-:

I know this 'cause I've done it several times over the years.
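
For what it's worth, such a converter might look something like this minimal
sketch (Python, working on bytes; the function name normalize_newlines is
made up for the example):

```python
def normalize_newlines(data, newline=b"\n"):
    """Convert CRLF, lone CR, and LF line endings in bytes to newline."""
    # Replace CRLF first so its CR is not also treated as a lone CR.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)

mixed = b"dos line\r\nold mac line\runix line\n"
print(normalize_newlines(mixed))
# b'dos line\nold mac line\nunix line\n'
print(normalize_newlines(mixed, b"\r\n"))
# b'dos line\r\nold mac line\r\nunix line\r\n'
```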



FOOD FOR THOUGHT:
=================
Consider: The Rosetta Stone.

Now consider the bestest, most *useful* binary format you can name.
Think it stands any chance AT ALL of surviving that long?

If you want the broadest, most robust, most portable format
possible, there is only one answer: TEXT!

Accept no substitutes!  (-:

-- 
|_ CJSonnack <Chris@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL  |
|_____________________________________________|_______________________|
Reply Chris7 (2511) 5/28/2004 2:38:14 PM

Programmer Dude wrote:
> Ah, friend, you know not whereof you speak.  XML is structured and
> tagged data.  That is, the data is arranged in a tree structure, which
> turns out to be very useful, and each datum has a name.
> 
> It's actually kinda cool.
> 
> The hierarchical organization could be quite useful for storing prefs,
> I'd think.

Since you are into Lisp and all... Did you know that s-expressions
are equivalent to XML syntax? And that they also come with the
full power of Lisp? It seems that XML has been around since the
fifties ;)
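
To make the correspondence concrete, here is a rough sketch (Python, not
Lisp, and only one of several possible mappings) that prints an XML tree
as a nested s-expression. The function name to_sexpr and the sample
preference document are invented for the example, and it ignores details
such as tail text and string escaping:

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Render an element as (tag (@ (attr "value") ...) "text" children...)."""
    parts = [elem.tag]
    if elem.attrib:
        attrs = " ".join(f'({key} "{val}")' for key, val in elem.attrib.items())
        parts.append(f"(@ {attrs})")
    text = (elem.text or "").strip()
    if text:
        parts.append(f'"{text}"')
    # Child elements become nested lists, exactly like nested tags.
    parts.extend(to_sexpr(child) for child in elem)
    return "(" + " ".join(parts) + ")"


prefs = ET.fromstring(
    '<prefs><window width="800" height="600"/><font>Courier</font></prefs>'
)
sexpr = to_sexpr(prefs)
print(sexpr)
```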

-- 
Thomas.

Reply tstegen (281) 5/28/2004 3:34:11 PM

Thomas stegen writes:

>> The hierarchical organization could be quite useful for storing prefs,
>> I'd think.
> 
> Since you are into Lisp and all... Did you know that s-expressions
> are equivalent to XML syntax? And that they also come with the
> full power of Lisp? It seems that XML has been around since the
> fifties ;)

I hadn't really thought about it, but you're quite right!
-- 
|_ CJSonnack <Chris@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL  |
|_____________________________________________|_______________________|
Reply Chris7 (2511) 5/28/2004 5:13:41 PM

On Fri, 28 May 2004, Programmer Dude wrote:
>
> Arthur J. O'Dwyer writes:
> > I would guess this is wrong, in general.  Think of the difference
> > between a DOS/Win32 text file, a MacOS text file, and a *nix text
> > file (hint: linefeeds and carriage returns).  Now think of the
> > difference between the same systems' binary files (hint: nothing)
>
> Sizes are different.  Endian-ness is different.  Formats may be
> different (think: floating point and other more exotic formats).

  [For --hopefully-- the last time: I wasn't talking about sizes,
or endianness, or floating-point formats.  I was talking about the
format in which a binary file is stored.  Binary means bytes.  On
the vast majority of modern computers, that's eight bits per byte.
I refer you to the file format standard for ANYTHING EVER, but
especially PNG, because it's very cool and quite possibly *more*
modular than XML. :-) ]

> Consider finding the file in five years and not having any of the
> previous tools that used it.  Which is likely to be easier to get
> the data out of: text or binary?

  Without any of the computers that used it?  Pretty close to zero,
even with the help of an electron microscope.  Assuming you have
no hex editor, but you do have a computer and a text editor, then
obviously text will be easier to display.  Contrariwise, if you
have no text editor but do have a hex editor, binary will be easier
to display.  Neither will necessarily be easier to interpret unless
you have a copy of the relevant file format standard, and then the
point is pretty much moot anyway.

> How often have we had people come here to ask help in deciphering
> a binary file?

  How often have people come here to ask help in writing "Hello
world!" programs?  How often have people come to sci.crypt to
ask help in "deciphering" cryptograms?  If you're saying that a
lot of people are stupid, I'm inclined to agree with you.

> A> The vast majority of computers these days use eight-bit
> A> byte-oriented transmission and storage protocols.  Whatever
> A> bit-ordering problems there are have moved "downstream" to
> A> those people involved in the construction of hardware that
> A> has to choose whether to transmit bit 0 or bit 7 first...
>
> So what happens when I transmit a binary floating point number to
> a machine with a different format?

  Ick, floating point! ;)  Seriously, I don't have much experience
with floating point, but I would expect you'd either use a fixed-point
representation (common in the domains in which I work), or you'd
convert to some IEEE format (about which I know little, and your
point about relevant standards' becoming extinct may well apply).

> I agree these issues are quite solvable, but I think they are
> more *Easily* solvable with text as an intermediate format.

  How do you save a floating-point number to a text file?
Losslessly?  How many lines of <your PLOC here> code is that? :)
Once I've seen a compelling answer to that, I may start thinking
in earnest about how to save floating-point numbers losslessly in
binary.  And we'll see who comes out on top. ;)


> FOOD FOR THOUGHT:
> =================
> Consider: The Rosetta Stone.
>
> Now consider the bestest, most *useful* binary format you can name.
> Think it stands any chance AT ALL of surviving that long?
>
> If you want the broadest, most robust, most portable format
> possible, there is only one answer: TEXT!

  Written on STONE TABLETS!  And then BURIED IN THE DESERT!

> Accept no substitutes!  (-:

  Absolutely 100% agreed!  (-:

-Arthur
Reply ajo (1601) 5/28/2004 7:19:56 PM

In article <Pine.LNX.4.58-035.0405271156530.3494@unix45.andrew.cmu.edu>, "Arthur J. O'Dwyer" <ajo@nospam.andrew.cmu.edu> writes:
> On Thu, 27 May 2004, Eric wrote:
> > Arthur J. O'Dwyer <ajo@nospam.andrew.cmu.edu> wrote:
> 
>   Probably true.  I don't know much about XML's namespacing rules
> (by which I mean the rules that say that <foo> is an okay tag for
> a user to create, but <bar> could be given special meaning by
> future standards).

Shall we say it doesn't appear that this will pose a problem, and
leave it at that?  This probably isn't the best place to get into a
long-winded discussion of XML.  XML has its drawbacks, and is
certainly not an ideal representation for everything, but it is
designed to avoid problems with upward compatibility.  (For the
particular case of namespaces: XML has explicit namespaces, and a
policy for registering ones for public use.)

>   You misunderstand me.  I/O is trivial; thus, after the first five
> minutes spent making sure the trivial code is correct (which is trivial
> to prove), you never need to touch it or look at it again.

While this is arguably true for simple file I/O, there's a lot more
to I/O than that.  Indexes.  Transactional I/O.  Databases.  Robust
network I/O.  And so forth.

> [1] - In speech I'd say "who'sever writing...," but that looks
> awful no matter how I spell it.  Whosever?  Whos'ever?  Who's-ever?
> Yuck. :(

"whoever's writing", as you had in the original text, is indeed
preferred.  It's a contraction of "whoever is writing", after all,
not "who is ever writing" - "ever" here modifies "who", not
"writing", so it shouldn't be moved into the middle of the verb
phrase.  "who'sever writing" is an artifact of speech, like "a
whole nother" for "another whole".

The easiest thing to do when faced with a contraction that seems
awkward, though, is to uncontract it (assuming it's a contraction
for which the uncontracted form is still used in English).  Try
"whoever is writing".

-- 
Michael Wojcik                  michael.wojcik@microfocus.com

I will shoue the world one of the grate Wonders of the world in 15
months if Now man mourders me in Dors or out Dors
   -- "Lord" Timothy Dexter, _A Pickle for the Knowing Ones_
Reply mwojcik (1874) 5/29/2004 3:48:35 AM

In article <9Rttc.4587$Eu3.44552281@news-text.cableinet.net>, Ben Measures <saint_abroadremove@removehotmail.com> writes:
> Arthur J. O'Dwyer wrote:
> 
> XML was created to solve the problem of the HTML version mess.

No it wasn't.  It was created to provide an SGML binding that would
be suitable for a wide range of structured document types, with a
view toward producing a platform-agnostic structured data format.
Look at the original XML Working Draft[1], not to mention any of the
official W3C XML documents since.

XHTML was created to clean up HTML - almost three years after XML 1.0
was released.

> It's so good it's almost magical.

It's just a popular SGML binding, really - the base technology was
there in the 1980s.  XML is a clever use of SGML, true; for example,
XML Schema was a smart idea, since it replaces DTDs (which are not
themselves written in XML) with an equivalent which *is* written in
XML, so you only need one parser rather than two in order to process
both a document and its specification.  And the peripheral standards
like XPath and XSLT and so forth are very handy.  The real success
story of XML is the marketing, though.

I like XML as much as the next fellow, assuming he believes it's a
decent format for structured data in some situations, with a wide
range of tools available, and sufficiently popular that I have to
support it anyway.  But it is not, of course, a silver bullet.

> Unless you're using somebody else's parser, which may not be broken. 
> Such as libxml2 which is *very* unlikely to be broken.

I can't speak for libxml2, but many, many people use the Xerces
parser, and anyone who subscribes to the Xerces mailing lists
knows that there are still bugs aplenty there.  Full-featured
XML parsers, particularly ones like Xerces that try to incorporate
everything allowed by the standard and provide multiple APIs in
the bargain, are big and tough to get right.

Hmm.  Just checked, and libxml2 has had bug fixes released publicly
as recently as 17 May.[2]


1. http://www.w3.org/TR/WD-xml-961114.html
2. http://www.xmlsoft.org/news.html

-- 
Michael Wojcik                  michael.wojcik@microfocus.com

Art is our chief means of breaking bread with the dead ... but the social
and political history of Europe would be exactly the same if Dante and
Shakespeare and Mozart had never lived.  -- W. H. Auden
Reply mwojcik (1874) 5/29/2004 4:19:10 AM

Without taking a stand pro or con binary or text in this discussion, I'd like
to point out that XML files ARE stored in binary format, conformant to the
encoding attribute in the XML declaration. Now, not all encodings are
ASCII-like; think of the various EBCDIC character sets. If you must view an
EBCDIC-encoded XML file on your PC at home, you need code conversion
(implemented in XML parsers). A simple text editor like NotePad will not be
very helpful.

When XML data is transmitted over networks it should be sent in binary mode,
not in text mode, because in text mode the data may be translated to some
other encoding scheme. But the encoding attribute, being part of the data,
will not be adjusted. The result is no longer valid XML.
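
A small sketch of what goes wrong, in Python with the standard-library
ElementTree parser. The sample document is invented, and the "text mode
transfer" is simulated here by simply re-encoding the bytes; a real
transfer tool may behave differently:

```python
import xml.etree.ElementTree as ET

# A document that is declared as, and really is, ISO-8859-1.
original = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<name>caf\u00e9</name>"
).encode("latin-1")

# Simulated text-mode transfer: the bytes are converted to UTF-8,
# but the declaration inside the data still claims ISO-8859-1.
transcoded = original.decode("latin-1").encode("utf-8")

good = ET.fromstring(original).text
bad = ET.fromstring(transcoded).text
print(repr(good))
print(repr(bad))
```

The parser trusts the declaration, so the two UTF-8 bytes of the accented
letter are read as two separate ISO-8859-1 characters. In the opposite
direction (declared UTF-8, actual bytes in another encoding) a conforming
parser would normally reject the document outright.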

Piet
Reply Piet 5/29/2004 5:37:52 AM

Piet Blok wrote:
> Without taking a stand pro or con binary or text in this discussion, I like 
> to point out that XML files ARE stored in binary format, conformant to the 
> encoding attribute in the XML declaration. Now, not all encodings are ASCII 
> like, think of the various EBCDIC character sets. If you must view an 
> EBCDIC encoded XML file on your PC at home you need code conversion 
> (implemented in XML parsers). A simple text editor like NotePad will not be 
> very helpful.
> 
> When XML data is transmitted over networks it should be done binary, not in 
> text mode, because in text mode, the data may be translated to some other 
> encoding scheme. But the encoding attribute, being part of the data, will 
> not be adjusted. The result is no XML anymore.
> 
> Piet

Good point not yet considered.

-- 
Ben M.
Reply saint_abroadremove (144) 5/30/2004 1:58:11 PM

Since this post is solely about XML I have set followup to ctx.

Michael Wojcik wrote:
> Ben Measures writes:
> 
>>XML was created to solve the problem of the HTML version mess.
> 
> No it wasn't.  It was created to provide an SGML binding that would
> be suitable for a wide range of structured document types

"The goal is to enable generic SGML to be served, received, and 
processed on the Web in the way that is now possible with HTML."

Although I have read that sentence many times I hadn't actually viewed 
it quite the way you put it. I just viewed HTML as loosely based on SGML 
(though only in appearance) and XML as the finalisation of SGML 
conformance. However what you say is correct and whilst XML wasn't 
designed to replace HTML, it does offer an alternative.

In any case it has none of the problems HTML has with versioning 
incompatibilities.

> XHTML was created to clean up HTML - almost three years after XML 1.0
> was released.

Strictly speaking that's very true. However prior to XHTML there wasn't 
much stopping you from doing the same in XML.

> I like XML as much as the next fellow, assuming he believes it's a
> decent format for structured data in some situations, with a wide
> range of tools available, and sufficiently popular that I have to
> support it anyway.  But it is not, of course, a silver bullet.

No language ever is. It does however offer much of the power and 
flexibility of SGML with very little of SGML's complexity.

-- 
Ben M.
Reply saint_abroadremove (144) 5/30/2004 2:23:40 PM

hi,

I agree, you can make the same mistakes encoding information whether you use
plain text or a regulated plain-text format like XML. But you have more tools
at hand if you don't invent your own format. XML uses more abstraction layers.
If you think it over, you often have to invent such abstraction layers in your
proprietary format too, with the difference that every new programmer has to
reinvent them every time he needs to encode a new type of information. If you
write down a rule for how such encoding should work (e.g. every new tuple is
terminated by a <cr> character), you are following a path similar to the one
the XML developers took.

greetings

Sammy


Reply ni_h (1) 6/1/2004 1:05:34 PM

Ben Pfaff <blp@cs.stanford.edu> wrote:

> dantex1@aol.com (Dan  Tex1) writes:
> 
> > On the flip side...  there is an advantage to creating your own format.
> > You own it and you have no dependency on anyone else or how some
> > committee somewhere changes its format.  If your application type is
> > relatively straightforward...  I think having your own format is
> > infinitely better than adopting someone else's  ( because no one else can
> > break your application ).
> 
> It depends on what you're trying to do.  Inventing a new graphics
> format, instead of using PNG or JPEG or whatever, is not a good
> idea.

Not necessarily. 

For example, one of the great benefits of PNG is the ability to store
your own custom data in it for use by your application. 

However, not all applications that read the PNG format will maintain
that custom information even though they are supposed to...in which
case, you may need your own graphics format so that your user doesn't
suddenly discover vital information that your app has stored with the
image has been lost just because they needed to use Photoshop (or last I
heard Photoshop does a terrible job of maintaining this meta-data) to do
some image editing.
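
For anyone who hasn't looked at how PNG carries such data: every chunk is
a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over
type plus data. The sketch below builds and re-reads one private chunk in
Python. The chunk type "prFs" and its payload are hypothetical, chosen
only so the letter case marks the chunk as ancillary, private, and safe
to copy; a real application would pick its own name:

```python
import struct
import zlib


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def parse_chunk(raw: bytes):
    """Split one chunk back into (type, data), verifying the CRC."""
    (length,) = struct.unpack(">I", raw[:4])
    chunk_type = raw[4:8]
    data = raw[8:8 + length]
    (crc,) = struct.unpack(">I", raw[8 + length:12 + length])
    if crc != zlib.crc32(chunk_type + data) & 0xFFFFFFFF:
        raise ValueError("CRC mismatch")
    return chunk_type, data


payload = b"threshold=0.75"          # hypothetical application data
chunk = make_chunk(b"prFs", payload)
print(parse_chunk(chunk))
```

Whether another editor keeps such a chunk when it rewrites the file is
exactly the problem described above; the format allows it, but nothing
forces an application to do it.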
 
Reply egusenet (56) 6/1/2004 5:37:19 PM

Eric wrote:
> Ben Pfaff <blp@cs.stanford.edu> wrote:
> 
> 
>>dantex1@aol.com (Dan  Tex1) writes:
>>
>>
>>>On the flip side...  there is an advantage to creating your own format.
>>>You own it and you have no dependency on anyone else or how some
>>>committee somewhere changes its format.  If your application type is
>>>relatively straightforward...  I think having your own format is
>>>infinitely better than adopting someone else's  ( because no one else can
>>>break your application ).
>>
>>It depends on what you're trying to do.  Inventing a new graphics
>>format, instead of using PNG or JPEG or whatever, is not a good
>>idea.
> 
> 
> Not necessarily. 
> 
> For example, one of the great benefits of PNG is the ability to store
> your own custom data in it for use by your application. 
> 
> However, not all applications that read the PNG format will maintain
> that custom information even though they are supposed to...in which
> case, you may need your own graphics format so that your user doesn't
> suddenly discover vital information that your app has stored with the
> image has been lost just because they needed to use Photoshop (or last I
> heard Photoshop does a terrible job of maintaining this meta-data) to do
> some image editing.
>  

And lose all that compatibility?

If it's really that important, put it in an xml file and distribute with 
the PNG (in a zip file). No new formats necessary, no reinvention necessary.

-- 
Ben M.
Reply saint_abroadremove (144) 6/1/2004 7:02:15 PM

Ben Measures <saint_abroadremove@removehotmail.com> wrote:

> Eric wrote:
> > Ben Pfaff <blp@cs.stanford.edu> wrote:
> > 
> > 
> >>dantex1@aol.com (Dan  Tex1) writes:
> >>
> >>
> >>>On the flip side...  there is an advantage to creating your own format.
> >>>You own it and you have no dependency on anyone else or how some
> >>>committee somewhere changes its format.  If your application type is
> >>>relatively straightforward...  I think having your own format is
> >>>infinitely better than adopting someone else's  ( because no one else can
> >>>break your application ).
> >>
> >>It depends on what you're trying to do.  Inventing a new graphics
> >>format, instead of using PNG or JPEG or whatever, is not a good
> >>idea.
> > 
> > 
> > Not necessarily. 
> > 
> > For example, one of the great benefits of PNG is the ability to store
> > your own custom data in it for use by your application. 
> > 
> > However, not all applications that read the PNG format will maintain
> > that custom information even though they are supposed to...in which
> > case, you may need your own graphics format so that your user doesn't
> > suddenly discover vital information that your app has stored with the
> > image has been lost just because they needed to use Photoshop (or last I
> > heard Photoshop does a terrible job of maintaining this meta-data) to do
> > some image editing.
> >  
> 
> And lose all that compatibility?

Yes. 

Far better to lose the compatibility than to lose results that may have
taken hours to generate. 

As long as one provides a method to export/import into PNG, etc, the
user can easily do that and then bring it back into the application
where the custom data or the image data can then be copied over and have
everything saved in a single file. 

When all applications that support PNG start fully implementing the PNG
standard, it would be a great solution.

> If it's really that important, put it in an xml file and distribute with
> the PNG (in a zip file). No new formats necessary, no reinvention necessary.

This really isn't a good idea. The potential for user error in failing
to copy all of the files is staggering.

However, MacOSX has solved this problem with the implementation of
Document Bundles, but unfortunately, no other OS (as far as I know)
supports this idea - Document Bundles look like folders under Windows
and that's really not what you want - back to user error again.



 
Reply egusenet (56) 6/1/2004 7:40:11 PM

Eric wrote:
> Ben Measures <saint_abroadremove@removehotmail.com> wrote:
> 
>>If it's really that important, put it in an xml file and distribute with
>>the PNG (in a zip file). No new formats necessary, no reinvention necessary.
> 
> This really isn't a good idea. The potential for user error in failing
> to copy all of the files is staggering.

Then make your application output zip files containing PNG and XML data. 
No new formats necessary, no reinvention necessary, no extra risk of 
user-error.

Seriously, creating your own format from scratch is foolish when there 
are many standards around (which are created and discussed by people far 
smarter than you or me). Pick, mix.
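
As a rough sketch of what such a bundle could look like, here it is with
Python's standard zipfile module. The member names image.png and
metadata.xml and the sample contents are assumptions made for the
example; the PNG bytes are a placeholder, not a real image:

```python
import io
import zipfile


def write_bundle(png_bytes: bytes, xml_text: str) -> bytes:
    """Pack the image and its metadata into a single zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("image.png", png_bytes)
        bundle.writestr("metadata.xml", xml_text)  # str is stored as UTF-8
    return buffer.getvalue()


def read_bundle(blob: bytes):
    """Return (png_bytes, xml_text) from an archive made by write_bundle."""
    with zipfile.ZipFile(io.BytesIO(blob)) as bundle:
        return bundle.read("image.png"), bundle.read("metadata.xml").decode("utf-8")


fake_png = b"\x89PNG\r\n\x1a\n" + b"placeholder image data"
metadata = "<meta><analysis hours='12'/></meta>"
blob = write_bundle(fake_png, metadata)
print(read_bundle(blob) == (fake_png, metadata))
```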

-- 
Ben M.
Reply saint_abroadremove (144) 6/1/2004 9:10:48 PM

Ben Measures <saint_abroadremove@removehotmail.com> wrote:

> Eric wrote:
> > Ben Measures <saint_abroadremove@removehotmail.com> wrote:
> > 
> >>If it's really that important, put it in an xml file and distribute with
> >>the PNG (in a zip file). No new formats necessary, no reinvention necessary.
> > 
> > This really isn't a good idea. The potential for user error in failing
> > to copy all of the files is staggering.
> 
> Then make your application output zip files containing PNG and XML data.
> No new formats necessary, no reinvention necessary, no extra risk of 
> user-error.

Yes, this works if there is never a need to use another application to
modify the image data.

However, if there is, the staggering potential for user error remains as
they would have to uncompress the archive to get at the PNG file, modify
the file and then recompress all of the files, making sure all of the
files remain in the archive. These extra steps for the user would almost
certainly generate complaints themselves.

I truly do wish there was a good solution in this case that did not
involve creating a new file format, but there just isn't...at least not
any that I am aware of.
Reply egusenet (56) 6/1/2004 9:33:31 PM

Eric wrote:
> Ben Measures <saint_abroadremove@removehotmail.com> wrote:
> 
>>Eric wrote:
>>
>>>Ben Measures <saint_abroadremove@removehotmail.com> wrote:
>>>
>>>
>>>>If it's really that important, put it in an xml file and distribute with
>>>>the PNG (in a zip file). No new formats necessary, no reinvention necessary.
>>>
>>>This really isn't a good idea. The potential for user error in failing
>>>to copy all of the files is staggering.
>>
>>Then make your application output zip files containing PNG and XML data.
>>No new formats necessary, no reinvention necessary, no extra risk of 
>>user-error.
> 
> 
> Yes, this works if there is never a need to use another application to
> modify the image data.
> 
> However, if there is, the staggering potential for user error remains as
> they would have to uncompress the archive to get at the PNG file, modify
> the file and then recompress all of the files, making sure all of the
> files remain in the archive. These extra steps for the user would almost
> certainly generate complaints themselves.
> 
> I truly do wish there was a good solution in this case that did not
> involve creating a new file format, but there just isn't...at least not
> any that I am aware of.

Tell me, how would inventing your own format help "use another 
application to modify the image data"?

What you say here is that if you use standard formats there is the 
option of application compatibility. If you really think it too 
difficult for them, don't offer that option - an option which is not 
available with a proprietary format.

So tell me, how would "creating a new file format" offer "a good 
solution in this case"?

-- 
Ben M.
Reply saint_abroadremove (144) 6/2/2004 9:50:45 AM

Ben Measures <saint_abroadremove@removehotmail.com> wrote:

> Eric wrote:
> > Ben Measures <saint_abroadremove@removehotmail.com> wrote:
> > 
> >>Eric wrote:
> >>
> >>>Ben Measures <saint_abroadremove@removehotmail.com> wrote:
> >>>
> >>>
> >>>>If it's really that important, put it in an xml file and distribute
> >>>>with the PNG (in a zip file). No new formats necessary, no reinvention
> >>>>necessary.
> >>>
> >>>This really isn't a good idea. The potential for user error in failing
> >>>to copy all of the files is staggering.
> >>
> >>Then make your application output zip files containing PNG and XML data.
> >>No new formats necessary, no reinvention necessary, no extra risk of
> >>user-error.
> > 
> > 
> > Yes, this works if there is never a need to use another application to
> > modify the image data.
> > 
> > However, if there is, the staggering potential for user error remains as
> > they would have to uncompress the archive to get at the PNG file, modify
> > the file and then recompress all of the files, making sure all of the
> > files remain in the archive. These extra steps for the user would almost
> > certainly generate complaints themselves.
> > 
> > I truly do wish there was a good solution in this case that did not
> > involve creating a new file format, but there just isn't...at least not
> > any that I am aware of.
> 
> Tell me, how would inventing your own format help "use another 
> application to modify the image data"?

There are two competing requirements here. 

The first is the requirement to maintain custom data which may have
taken hours, days or weeks to generate. 

The second is to allow the user to modify the image data using another
application.

When taking into account the potential for user error in separating the
custom meta data from the image data and the fact that other
applications may or may not maintain custom data no matter what the
standard based format (TIFF, PNG, etc..), the first requirement can only
lead to a single file, custom file format.

To handle the second requirement, the application need only be able to
import/export the image data to a TIFF, PNG, etc. Once the new image
data is imported, the custom meta data can be easily associated with the
new image data.

This offers the best solution available until every company that works
with the standard image formats starts caring about fully implementing
the standards. 

It's not surprising that TIFF is not fully supported since that format
is basically a nightmare to work with (and HDF is not much better), but
I believe there is hope for PNG if everyone starts demanding that
companies like Adobe fully support the standard.



-- 
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both 
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Reply egDfAusenetE5fz (78) 6/2/2004 1:07:23 PM

Arthur J. O'Dwyer writes:

>> Sizes are different.  Endian-ness is different.  Formats may be
>> different (think: floating point and other more exotic formats).
> 
>  [For --hopefully-- the last time: I wasn't talking about sizes,
> or endianness, or floating-point formats.  I was talking about the
> format in which a binary file is stored.  Binary means bytes.

Usually.  Only usually.  (-:

> On the vast majority of modern computers, that's eight bits per byte.

Usually.  Only usually.  (-: (-:

> I refer you to the file format standard for ANYTHING EVER,..

And folks who write code that deals with these formats need to be
fully up to speed on the format, don't they.  And in the case of
evolving formats, need to consider upgrading so they can continue
to read newer formats.  This thread has touched on many of the
*tools* (e.g. network transport layers) available to deal with these
binary formats, AND THAT'S THE POINT: you need all this *stuff* and
knowledge.

Text is simple.  You stop even *thinking* about a lot of stuff.
And it has the advantage of easy human readability, a "nice to have"
for debugging and maintenance purposes.

Binary, in comparison, is a headache.  (-:

>> Consider finding the file in five years and not having any of the
>> previous tools that used it.  Which is likely to be easier to get
>> the data out of: text or binary?
> 
> Without any of the computers that used it?  Pretty close to zero,
> even with the help of an electron microscope.

No, it would be silly of me to mean that.

> Assuming you have no hex editor,...

Hey, I'll even grant you the hex editor!

> ...but you do have a computer and a text editor, then
> obviously text will be easier to display.

Even if you can examine the hex, do you see the hassle required to
analyse what all those bits *mean*?  Compare that to a text file
that very likely *tags* (labels) the data!  I mean, come on, how
can you beat named, trivial-to-view data?

And ya see that?  Even given equal ability to examine the raw file
(that is, sans intelligent interpreter), text is a monster winner.

> Contrariwise, if you have no text editor but do have a hex editor,
> binary will be easier to display.

Ummmm, you're winging it here.  (-:  First, really, a hex viewer, but
no text viewer?  I think that'd be a first in computing history, but
stranger things have happened. :-)

Second, doncha think viewing the text in the hex viewer would still
be a lot more obvious (given those labels) than the raw bin bits?

Even when you tilt the playing field insanely, text still wins!  (-:

> Neither will necessarily be easier to interpret unless you have a
> copy of the relevant file format standard, and then the point is
> pretty much moot anyway.

Well, right, we're assuming the file format is lost or unavailable.
And even if we somehow lost the "format" to text/plain, the pattern
of text lines with repeating delimiters is a red flag.  Consider too
that at this extreme--where we've forgotten ASCII--how much harder
would it be to figure out binary storage formats (remember there's
likely no clue where object boundaries are)?

>> How often have we had people come here to ask help in deciphering
>> a binary file?
> 
>   How often have people come here to ask help in writing "Hello
> world!" programs?  How often have people come to sci.crypt to
> ask help in "deciphering" cryptograms?  If you're saying that a
> lot of people are stupid, I'm inclined to agree with you.

No (well, actually, yes that's true, but not my point right now :-).

I'm pointing out--comparing like with like--no one stumbling on a
text file containing important data comes begging interpretation.
Cryptograms are play, and I doubt the urgent, often work-related
situation happens in s.crypt.

>> So what happens when I transmit a binary floating point number to
>> a machine with a different format?
> 
>   Ick, floating point! ;)

[bwg] Exactly my point!  Which would you rather deal with:

	"99.1206"   0x42c63dbf


> Seriously, I don't have much experience with floating point, but I
> would expect you'd either use a fixed-point representation (common
> in the domains in which I work),...

Let me guess.  CAD/CAM or NC or something involving physical coords?
Fixed point isn't uncommon in environments where you know the range
of values expected.  When you don't and need the largest range possible
(or when you DO and need a huge range), you need floating point.

>> I agree these issues are quite solvable, but I think they are
>> more *Easily* solvable with text as an intermediate format.
> 
>   How do you save a floating-point number to a text file?

As you'd expect.  printf("%g") ... strtod()

> Losslessly?

Within certain parameters, close enough.  Once you're dealing with FP,
you sorta have to give up the concept of lossless.  Experts in FP know
how to deal with it to make the pain as low as possible, but FP is all
about approximation.

If you need absolute precision, you could always save the bytes as a
hex string.  Fast and easy in and out.

> How many lines of <your PLOC here> code is that? :)

Only a few surrounding strtod() if you don't mind a little edge loss.
(IIRC, within precision limits, text<=>FP *is* fully deterministic?)
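
To put some numbers on this, here is a small sketch in Python, whose
floats are IEEE doubles on the usual platforms (an assumption, though a
safe one in practice). It compares a default-precision "%g", a
17-significant-digit decimal, and a hex string, and also checks the
single-precision bit pattern quoted earlier:

```python
import struct

value = 99.1206

# 17 significant digits are enough to round-trip any IEEE double.
decimal_text = "%.17g" % value
restored_decimal = float(decimal_text)

# A hex string stores the exact bits in a text-friendly form.
hex_text = value.hex()
restored_hex = float.fromhex(hex_text)

# The default "%g" keeps only 6 significant digits, so it can lose data.
precise = 0.1234567890123
lossy = float("%g" % precise)

# The same number as a big-endian single-precision float, shown as hex.
single_bits = struct.pack(">f", value).hex()

print(decimal_text, hex_text, single_bits)
print(restored_decimal == value, restored_hex == value, lossy == precise)
```

So the "little edge loss" comes from how many digits you print, not from
the text format as such; print enough digits (or the hex form) and the
round trip is exact.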


-- 
|_ CJSonnack <Chris@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL  |
|_____________________________________________|_______________________|
Reply Chris7 (2511) 6/8/2004 1:22:52 PM

Arthur J. O'Dwyer wrote:

> 
> On Thu, 27 May 2004, Eric wrote:
>>
>> Assume that disk space is not an issue [...]
>> Assume that transportation to another OS may never occur.
>> Are there any solid reasons to prefer text files over binary files?
>>
>> Some of the reasons I can think of are:
>>
>> -- should transportation to another OS become useful or needed,
>>    the text files would be far easier to work with
> 
>   I would guess this is wrong, in general.  Think of the difference
> between a DOS/Win32 text file, a MacOS text file, and a *nix text
> file (hint: linefeeds and carriage returns).

Linefeeds and carriage returns don't matter in XML. The other
differences are ruled out by specifying the encoding. Any XML parser
should understand utf-8.

> Now think of the difference between the same systems' binary files
> (hint: nothing). 

That's wrong. Under most (but not all) DOS compilers, int is 16 bits;
under Windows, it's 32 bits. Under Linux on x86, long double is 80 bits;
under Windows, it's 64 bits. And the OS is not the only thing that matters.
On the Motorola CPUs, data is stored big-endian; on x86, little-endian.
A 64-bit CPU might use a 64-bit type for long (or it might not),
while on most 32-bit CPUs, long is 32 bits. Some systems have special
alignment requirements, others don't. And there are a lot of other
potential problems with binary data. Those problems can all be worked
around, but it's a lot easier with text, especially XML.
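
For comparison, this is roughly what the binary workaround looks like
when sizes and byte order are pinned down explicitly instead of left to
the compiler. It is a sketch using Python's struct module; the record
layout (a 32-bit int, a 16-bit unsigned int, and a 64-bit double) is
invented for the example:

```python
import struct

# "<" means little-endian, standard sizes, and no padding, so the layout
# is the same on every platform: 4 + 2 + 8 = 14 bytes.
RECORD = struct.Struct("<iHd")


def save_record(width: int, flags: int, scale: float) -> bytes:
    """Serialize one record in the fixed on-disk layout."""
    return RECORD.pack(width, flags, scale)


def load_record(raw: bytes):
    """Read one record back, whatever the host's native sizes are."""
    return RECORD.unpack(raw)


raw = save_record(800, 3, 1.5)
print(RECORD.size, load_record(raw))

# The same integer written with the other byte order differs on disk.
little = struct.pack("<i", 1)
big = struct.pack(">i", 1)
print(little, big)
```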

Reply ramagnus (3485) 6/9/2004 12:31:18 PM

Rolf Magnus wrote:
> Arthur J. O'Dwyer wrote:
> 
>>On Thu, 27 May 2004, Eric wrote:
>>
>>>Assume that disk space is not an issue [...]
>>>Assume that transportation to another OS may never occur.
>>>Are there any solid reasons to prefer text files over binary files?
>>>
>>>Some of the reasons I can think of are:
>>>
>>>-- should transportation to another OS become useful or needed,
>>>   the text files would be far easier to work with
>>
>>  I would guess this is wrong, in general.  Think of the difference
>>between a DOS/Win32 text file, a MacOS text file, and a *nix text
>>file (hint: linefeeds and carriage returns).
> 
> Linefeeds and carriage returns don't matter in XML. The other
> differences are ruled out by specifying the encoding. Any XML parser
> should understand utf-8.

Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
has byte ordering issues. Writing a UTF-16 file on different CPUs can
result in text files that are different. This can be resolved because of
the encoding the UTF standards use, but it means that any true XML
parser must deal with big-endian, little-endian issues.

Most people consider having to write code in a way that translates the
format to your specific CPU as the measure for data not being portable.
XML does have this issue, so if that's your definition of portable then
XML isn't portable.

"All XML processors MUST accept the UTF-8 and UTF-16 encodings of 
Unicode 3.1"
- http://www.w3.org/TR/REC-xml/#charsets

"The primary feature of Unicode 3.1 is the addition of 44,946 new 
encoded characters. These characters cover several historic scripts, 
several sets of symbols, and a very large collection of additional CJK 
ideographs.

For the first time, characters are encoded beyond the original 16-bit 
codespace or Basic Multilingual Plane (BMP or Plane 0). These new 
characters, encoded at code positions of U+10000 or higher, are 
synchronized with the forthcoming standard ISO/IEC 10646-2."
- http://www.unicode.org/reports/tr27/

The majority of XML parsers only use 16-bit characters. This means that 
the majority of XML parsers can't actually read XML.

Jeff Brooks
0
Reply Jeff 6/9/2004 8:09:19 PM

Jeff Brooks wrote:

> Rolf Magnus wrote:
<snip>
>>
>> Linefeeds and carriage returns don't matter in XML. The other
>> differences are ruled out by specifying the encoding. Any XML parser
>> should understand utf-8.
> 
> Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
> has byte ordering issues. Writing a UTF-16 file on different CPUs can
> result in text files that are different. This can be resolved because of
> the encoding the UTF standards use, but it means that any true XML
> parser must deal with big-endian, little-endian issues.

Don't want to be seen to be supporting XML here, but doesn't the UTF-16 
standard define byte ordering?  I was under the impression (without 
having done any work with it) that a UTF-16 multi-byte sequence could be 
parsed as a byte stream.

-- 
Corey Murtagh
The Electric Monk
"Quidquid latine dictum sit, altum viditur!"
0
Reply emonk3 (287) 6/9/2004 9:09:17 PM

Jeff Brooks (jeff_brooks@nospam.com) wrote:
: Rolf Magnus wrote:
: > Arthur J. O'Dwyer wrote:
: > 
: >>On Thu, 27 May 2004, Eric wrote:
: >>
: >>>Assume that disk space is not an issue [...]
: >>>Assume that transportation to another OS may never occur.
: >>>Are there any solid reasons to prefer text files over binary files?
: >>>
: >>>Some of the reasons I can think of are:
: >>>
: >>>-- should transportation to another OS become useful or needed,
: >>>   the text files would be far easier to work with
: >>
: >>  I would guess this is wrong, in general.  Think of the difference
: >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
: >>file (hint: linefeeds and carriage returns).
: > 
: > Linefeeds and carriage returns don't matter in XML. The other
: > differences are ruled out by specifying the encoding. Any XML parser
: > should understand utf-8.

: Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16 
: has byte ordering issues. 

You can only have byte order issues when you store the UTF-16 as 8 bit
bytes.  But a stream of 8 bit bytes is _not_ UTF-16, which by definition
is a stream of 16 bit entities, so it is not the UTF-16 that has byte
order issues.

However, even the storage issue should have been trivial to solve: it
would simply have required 8 bit streams encoding 16 bit unicode values
to use network byte order, as is required in similar situations within
internet protocols (which are used with no interoperability issues
between all sorts of endians).  Failing to specify and require this,
and instead using zero width non-breaking spaces to help the reader
"guess" which byte ordering was used in the translation from 16 bit
information units into 8 bit storage units, is by far one of the
biggest kludges ever.

0
Reply yf110 (571) 6/9/2004 10:11:30 PM
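
The "always use network byte order" approach described above could look roughly like this. It is only a sketch of that idea, not something any UTF-16 standard mandates; `utf16_to_network_bytes` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Serialise 16-bit code units into octets, most significant
   octet first (network byte order), regardless of host CPU.
   out must have room for 2 * count bytes. */
void utf16_to_network_bytes(const uint16_t *units, size_t count,
                            unsigned char *out)
{
    size_t i;

    for (i = 0; i < count; i++) {
        out[2 * i]     = (unsigned char)(units[i] >> 8);
        out[2 * i + 1] = (unsigned char)(units[i] & 0xFF);
    }
}
```

Because the shifts operate on values rather than on memory, the output is the same on big-endian and little-endian hosts.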

In article <1086815397.803910@radsrv1.tranzpeer.net>,
Corey Murtagh  <emonk@slingshot.no.uce> wrote:
>Don't want to be seen to be supporting XML here

???

> but doesn't the UTF-16 standard define byte ordering?

No.  There are names for the encodings corresponding to
big-endian-UTF-16 and little-endian-UTF-16, but UTF-16 itself can be
stored in either order.

XML processors can distinguish between them easily because any XML
document not in UTF-8 must begin with a less-than or a byte-order mark
(unless some external indication of encoding is given).

-- Richard
0
Reply richard91 (3683) 6/9/2004 10:16:48 PM
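
The "less-than or byte-order mark" rule above can be turned into a small classifier over the first two bytes of a document. A hedged sketch: `sniff_xml_encoding` and the enum are hypothetical names, and the function deliberately ignores UTF-32 and other encodings that a real XML processor also has to consider.

```c
#include <stddef.h>

enum xml_guess { XML_UTF16_BE, XML_UTF16_LE, XML_UTF8_OR_OTHER };

/* Guess the UTF-16 byte order of an XML document from its first
   two bytes: either a byte-order mark (U+FEFF) or '<' (U+003C). */
enum xml_guess sniff_xml_encoding(const unsigned char *p, size_t n)
{
    if (n >= 2) {
        if (p[0] == 0xFE && p[1] == 0xFF) return XML_UTF16_BE; /* BOM */
        if (p[0] == 0xFF && p[1] == 0xFE) return XML_UTF16_LE; /* BOM */
        if (p[0] == 0x00 && p[1] == 0x3C) return XML_UTF16_BE; /* '<' */
        if (p[0] == 0x3C && p[1] == 0x00) return XML_UTF16_LE; /* '<' */
    }
    return XML_UTF8_OR_OTHER;
}
```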

In article <40c77d02@news.victoria.tc.ca>,
Malcolm Dew-Jones <yf110@vtn1.victoria.tc.ca> wrote:

>You can only have byte order issues when you store the UTF-16 as 8 bit
>bytes.

Which is to say, always in practice.

-- Richard
0
Reply richard91 (3683) 6/9/2004 10:18:00 PM

Corey Murtagh wrote:

> Jeff Brooks wrote:
> 
>> Rolf Magnus wrote:
> 
> <snip>
> 
>>>
>>> Linefeeds and carriage returns don't matter in XML. The other
>>> differences are ruled out by specifying the encoding. Any XML parser
>>> should understand utf-8.
>>
>>
>> Actually, to be an XML parser it must support UTF-8, and UTF-16. 
>> UTF-16 has byte ordering issues. Writing a UTF-16 file on different
>> CPUs can result in text files that are different. This can be resolved
>> because of the encoding the UTF standards use, but it means that
>> any true XML parser must deal with big-endian, little-endian issues.
> 
> Don't want to be seen to be supporting XML here, but doesn't the UTF-16 
> standard define byte ordering?  I was under the impression (without 
> having done any work with it) that a UTF-16 multi-byte sequence could be 
> parsed as a byte stream.

Unicode FAQ
http://www.unicode.org/unicode/faq/utf_bom.html#37

Jeff Brooks
0
Reply Jeff 6/10/2004 3:32:41 AM

Malcolm Dew-Jones wrote:

> Jeff Brooks (jeff_brooks@nospam.com) wrote:
> : Rolf Magnus wrote:
> : > Arthur J. O'Dwyer wrote:
> : > 
> : >>On Thu, 27 May 2004, Eric wrote:
> : >>
> : >>>Assume that disk space is not an issue [...]
> : >>>Assume that transportation to another OS may never occur.
> : >>>Are there any solid reasons to prefer text files over binary files?
> : >>>
> : >>>Some of the reasons I can think of are:
> : >>>
> : >>>-- should transportation to another OS become useful or needed,
> : >>>   the text files would be far easier to work with
> : >>
> : >>  I would guess this is wrong, in general.  Think of the difference
> : >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
> : >>file (hint: linefeeds and carriage returns).
> : > 
> : > Linefeeds and carriage returns don't matter in XML. The other
> : > differences are ruled out by specifying the encoding. Any XML parser
> : > should understand utf-8.
> 
> : Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16 
> : has byte ordering issues. 
> 
> You can only have byte order issues when you store the UTF-16 as 8 bit
> bytes.  But a stream of 8 bit bytes is _not_ UTF-16, which by definition
> is a stream of 16 bit entities, so it is not the UTF-16 that has byte
> order issues.

http://www.unicode.org/unicode/faq/utf_bom.html#37

Jeff Brooks
0
Reply Jeff 6/10/2004 3:38:25 AM

Jeff Brooks wrote:
> Rolf Magnus wrote:
> 
>> Linefeeds and carriage returns don't matter in XML. The other
>> differences are ruled out by specifying the encoding. Any XML parser
>> should understand utf-8.
> 
> Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
> has byte ordering issues. Writing a UTF-16 file on different CPUs can
> result in text files that are different. This can be resolved because of
> the encoding the UTF standards use, but it means that any true XML
> parser must deal with big-endian, little-endian issues.
> 
> "All XML processors MUST accept the UTF-8 and UTF-16 encodings of 
> Unicode 3.1"
> - http://www.w3.org/TR/REC-xml/#charsets

"Entities encoded in UTF-16 MUST [snip] begin with the Byte Order Mark 
described by section 2.7 of [Unicode3]"
http://www.w3.org/TR/REC-xml/#charencoding

This makes it trivial to overcome any endian issues, and since endian 
issues are so fundamental I don't see it as making XML any less portable.

-- 
Ben M.
0
Reply saint_abroadremove (144) 6/10/2004 9:03:31 AM
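
To illustrate how little code the byte-order mark requires, here is a rough sketch of a decoder that reads the mark and then assembles code units accordingly. `decode_utf16_with_bom` is a hypothetical function, not part of any XML library, and it returns (size_t)-1 when no mark is present.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a BOM-prefixed UTF-16 byte stream into 16-bit code units.
   Returns the number of units written to out (at most out_cap),
   or (size_t)-1 if the stream does not start with a BOM. */
size_t decode_utf16_with_bom(const unsigned char *bytes, size_t len,
                             uint16_t *out, size_t out_cap)
{
    int big_endian;
    size_t i, n = 0;

    if (len < 2) return (size_t)-1;
    if (bytes[0] == 0xFE && bytes[1] == 0xFF) big_endian = 1;
    else if (bytes[0] == 0xFF && bytes[1] == 0xFE) big_endian = 0;
    else return (size_t)-1;

    /* Skip the two BOM bytes, then combine each byte pair. */
    for (i = 2; i + 1 < len && n < out_cap; i += 2) {
        unsigned hi = big_endian ? bytes[i] : bytes[i + 1];
        unsigned lo = big_endian ? bytes[i + 1] : bytes[i];
        out[n++] = (uint16_t)((hi << 8) | lo);
    }
    return n;
}
```

The same source works unchanged on either kind of CPU, since it never reinterprets memory as a 16-bit integer.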

[Followups restricted to comp.programming.]

In article <P7Kxc.681713$Pk3.125780@pd7tw1no>, Jeff Brooks <jeff_brooks@nospam.com> writes:
> 
> "All XML processors MUST accept the UTF-8 and UTF-16 encodings of 
> Unicode 3.1"
> - http://www.w3.org/TR/REC-xml/#charsets
> 
> "The primary feature of Unicode 3.1 is the addition of 44,946 new 
> encoded characters. ...
> 
> For the first time, characters are encoded beyond the original 16-bit 
> codespace or Basic Multilingual Plane (BMP or Plane 0). These new 
> characters, encoded at code positions of U+10000 or higher, are 
> synchronized with the forthcoming standard ISO/IEC 10646-2."
> - http://www.unicode.org/reports/tr27/
> 
> The majority of XML parsers only use 16-bit characters. This means that 
> the majority of XML parsers can't actually read XML.

I don't believe this is correct.  UTF-16 encodes characters in U+10000
- U+10FFFF as surrogate pairs.  None of the surrogate code points match
any of the scalar code points, so there's no ambiguity - all surrogate
pairs are composed of 16-bit values that can't be mistaken for scalar
UTF-16 characters.

As long as the parser processes the surrogate pair without altering
it and recognizes it unambiguously, the parser would seem to be
complying with the XML specification.  None of those characters (in
their surrogate-pair UTF-16 representation or any other) has any
special meaning in XML, so a parser that treated the surrogate pair
as a pair of 16-bit characters should do just fine.

In other words, the parser doesn't have to recognize that characters
from U+10000 and up (in their surrogate-pair encoding) are special,
because to it they aren't special.

The only case that immediately comes to mind where the distinction
would matter is if the parser had an API that returned data character-
by-character, which should have special provisions for surrogate
pairs (or be documented as returning them in halves).  However, I've
not seen such a parser, AFAIK, and I don't know why one would provide
such an API.

Or, I suppose, if the parser offered to transform the document data
among various supported encodings.  In that case, not handling UTF-16
surrogate pairs would indeed be a bug.  On the other hand, I'm not
sure such transformations are necessarily the job of an XML parser;
that could be considered a bug in a set of additional utilities
provided alongside the parser.

-- 
Michael Wojcik                  michael.wojcik@microfocus.com

Even though there may be some misguided critics of what we're trying
to do, I think we're on the wrong path.  -- Reagan
0
Reply mwojcik (1874) 6/10/2004 6:58:21 PM
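
For reference, the surrogate-pair arithmetic discussed above is small. A sketch with hypothetical helper names, using the standard UTF-16 ranges (high surrogates 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF):

```c
#include <stdint.h>

/* A high (leading) surrogate lies in 0xD800..0xDBFF. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* A low (trailing) surrogate lies in 0xDC00..0xDFFF. */
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low pair into a code point in
   U+10000..U+10FFFF. The caller must check the ranges first. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000UL
         + (((uint32_t)(high - 0xD800)) << 10)
         + (uint32_t)(low - 0xDC00);
}
```

A parser that merely passes 16-bit units through never needs `combine_surrogates`; only code that works with whole code points, such as a transcoder, does.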

On Fri, 28 May 2004 10:05:35 -0400, Arthur J. O'Dwyer wrote:

> 
>   *Again* I urge the consultation of the RFCs defining any standard
> binary file format, and the notice of the complete lack of regard
> for big-endian/little-endian/19-bit-int/37-bit-int issues.  At the
> byte level, these things simply never come up.
> 
> 

Try (for example) RFC 1314.

These things certainly do come up, and they're handled by encoding the
rules in a header of the format.


0
Reply droby2 (108) 8/28/2004 12:49:02 AM
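
RFC 1314 builds on TIFF, whose header is a concrete case of "encoding the rules in a header": the first two bytes are "II" (little-endian) or "MM" (big-endian), followed by the 16-bit number 42 stored in that order. A minimal sketch of checking it; `tiff_byte_order` is a hypothetical name and the return codes are this example's own convention.

```c
#include <stddef.h>

/* Inspect the first four bytes of a TIFF header.
   Returns 1 for little-endian ("II"), 2 for big-endian ("MM"),
   or 0 if the marker or the magic number 42 does not match. */
int tiff_byte_order(const unsigned char *h, size_t n)
{
    unsigned magic;

    if (n < 4) return 0;
    if (h[0] == 0x49 && h[1] == 0x49) {           /* "II" */
        magic = h[2] | ((unsigned)h[3] << 8);
        return magic == 42 ? 1 : 0;
    }
    if (h[0] == 0x4D && h[1] == 0x4D) {           /* "MM" */
        magic = ((unsigned)h[2] << 8) | h[3];
        return magic == 42 ? 2 : 0;
    }
    return 0;
}
```

Note that the reader handles both orders with byte-level arithmetic, so it does not depend on the byte order of the machine it runs on.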

On Sat, 28 Aug 2004, Donald Roby wrote:
>
> On Fri, 28 May 2004 10:05:35 -0400, Arthur J. O'Dwyer wrote:
>>   *Again* I urge the consultation of the RFCs defining any standard
>> binary file format, and the notice of the complete lack of regard
>> for big-endian/little-endian/19-bit-int/37-bit-int issues.  At the
>> byte level, these things simply never come up.
>
> Try (for example) RFC 1314.

[RFC defining among other things a subset(?) of the TIFF image
file format]

> These things certainly do come up, and they're handled by
> encoding the rules in a header of the format.

   Not really.  TIFF /is/ weird in that it explicitly provides
both a "big-endian" format and a "little-endian" format, and TIFF
readers have to provide routines to read both formats.  But the
endianness/word size of the machine never comes up.  If it did,
we wouldn't be able to write TIFF writers or readers that worked
on platforms with different endiannesses.  (IIRC, this whole thread
was started way back in the mists of time with the idea that

     fputs("42000\n", fp);

produces different results on different machines (because of the
embedded newline, which produces different bytes on different
systems; not to mention the possibility of EBCDIC!), while

     unsigned long result = 42000UL;  /* long is at least 32 bits wide */
     unsigned char buffer[4];         /* four octets, high byte first */
     buffer[0] = (result>>24)&0xFF;
     buffer[1] = (result>>16)&0xFF;
     buffer[2] = (result>>8)&0xFF;
     buffer[3] = (result>>0)&0xFF;
     fwrite(buffer, 1, 4, fp);

produces the exact same bytes on every platform.  Thus "binary
is better than text" if you care about portability more than
human-readability.)

   But since we already had that discussion (several months ago,
IIRC), I'm not going to get back into it.

-Arthur,
signing off
0
Reply ajo (1601) 8/28/2004 4:06:46 PM
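
The writing side shown above has an equally short reading side. This sketch packages both as functions so the round trip can be checked; `put_u32_be` and `get_u32_be` are hypothetical names, and unsigned long is used because it is guaranteed to hold at least 32 bits.

```c
/* Store the low 32 bits of value as four octets, most significant first. */
void put_u32_be(unsigned long value, unsigned char out[4])
{
    out[0] = (unsigned char)((value >> 24) & 0xFF);
    out[1] = (unsigned char)((value >> 16) & 0xFF);
    out[2] = (unsigned char)((value >> 8) & 0xFF);
    out[3] = (unsigned char)(value & 0xFF);
}

/* Rebuild the value from four octets, most significant first. */
unsigned long get_u32_be(const unsigned char in[4])
{
    return ((unsigned long)in[0] << 24)
         | ((unsigned long)in[1] << 16)
         | ((unsigned long)in[2] << 8)
         |  (unsigned long)in[3];
}
```

For 42000 (hexadecimal 0xA410) the four octets are 00 00 A4 10 on any host.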

In article <Pine.LNX.4.60-041.0408281156250.2873@unix42.andrew.cmu.edu>,
Arthur J. O'Dwyer <ajo@nospam.andrew.cmu.edu> wrote:

% on platforms with different endiannesses.  (IIRC, this whole thread
% was started way back in the mists of time with the idea that
% 
%      fputs("42000\n", fp);
% 
% produces different results on different machines (because of the
% embedded newline, which produces different bytes on different
% systems; not to mention the possibility of EBCDIC!), while
% 
%      unsigned long result = 42000UL;
%      unsigned char buffer[4];
%      buffer[0] = (result>>24)&0xFF;
%      buffer[1] = (result>>16)&0xFF;
%      buffer[2] = (result>>8)&0xFF;
%      buffer[3] = (result>>0)&0xFF;
%      fwrite(buffer, 1, 4, fp);
% 
% produces the exact same bytes on every platform.  Thus "binary
% is better than text" if you care about portability more than
% human-readability.

Your binary example includes a new-line, which will be written
as \r\n on the same systems which write 42000\n as 42000\r\n.

The beauty of text encoding is that things like new-line differences
don't really matter, while in binary formats, they result in corruption.

-- 

Patrick TJ McPhee
East York  Canada
ptjm@interlog.com
0
Reply ptjm (515) 8/29/2004 3:30:29 AM
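
Whether newline translation can touch a binary payload depends on the stream mode and on whether the payload happens to contain the newline byte. A minimal sketch, assuming an ASCII-based system where '\n' is 0x0A; `contains_newline_byte` and `open_prefs_binary` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if any byte in buf equals '\n', the byte a text-mode
   stream may expand or translate on some systems; 0 otherwise. */
int contains_newline_byte(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == (unsigned char)'\n') return 1;
    }
    return 0;
}

/* Open a file for binary output; "wb" asks the C library not to
   perform any newline translation on the bytes written. */
FILE *open_prefs_binary(const char *path)
{
    return fopen(path, "wb");
}
```

For the value 42000 the four big-endian octets are 00 00 A4 10, so that particular payload has no 0x0A byte; a value such as 10 (00 00 00 0A) would. Opening the stream with "wb" sidesteps the translation either way.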
