Is there a definition of "binary file" or "text file"?

Even a so-called "text file" is actually stored in
binary format.
niy38
6/23/2004 7:11:20 PM
comp.unix.shell

niy38@hotmail.com (Niy) writes:

> Even a so-called "text file" is actually stored in
> binary format.

There isn't a difference on unix. Unix sees a file as simply a
sequence of bytes, which it does not try to interpret. 

Interpretation of file formats is done at the application level. For
example, a database program would expect a certain format, a text
editor another format, but the standard I/O routines just deal with
streams of bytes.
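
To illustrate the point, here is a minimal Python sketch (the sample bytes and the choice of Latin-1 are assumptions made only for this example). The same bytes come back unchanged from a binary read; they only become "text" when the application decides how to decode them.

```python
import os
import tempfile

def read_both_ways(raw: bytes):
    """Write raw bytes to a scratch file, then read them back two ways."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Binary read: the bytes exactly as stored.
        with open(path, "rb") as f:
            as_bytes = f.read()
        # "Text" read: the application chooses to treat the bytes as Latin-1.
        # newline="" turns off newline translation so nothing else changes.
        with open(path, "r", encoding="latin-1", newline="") as f:
            as_text = f.read()
        return as_bytes, as_text
    finally:
        os.remove(path)

as_bytes, as_text = read_both_ways(b"hi\n\x01\xff")
print(as_bytes)
print(repr(as_text))
```

Nothing in the file itself says which of the two readings is the "right" one; that choice lives entirely in the reading program.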

What's the issue you're trying to figure out?

Joe
-- 
Folks who don't know why America is the Land of Promise should be here
during an election campaign. 
  -- Milton Berle
joe
6/23/2004 7:22:15 PM
joe@invalid.address wrote:
> niy38@hotmail.com (Niy) writes:
> 
>>Even a so-called "text file" is actually stored in
>>binary format.
> 
> There isn't a difference on unix. Unix sees a file as simply a
> sequence of bytes, which it does not try to interpret. 

True, on OS level.

> Interpretation of file formats is done at the application level. For
> example, a database program would expect a certain format, a text
> editor another format, but the standard I/O routines just deal with
> streams of bytes.

On programming language library level there are functions that regard
line terminators (NL, CR/NL, or CR, depending on OS) on input and output.

*Informally* I'd say text files comprise lines that contain no literal
control codes within the line, and the line terminators at the end.
There are surely better definitions of binary vs. text files (I once
was pointed to a POSIX definition; outside POSIX I did not find any
other definition, yet).

Problems arise when some application level tools don't terminate lines
by one of the above terminators, but just *delimit* one line from the
other, such that there is a missing terminator at the end of the file.

cat'ing or wc'ing (or many other Unix typical text oriented operations)
such files will lead to... hmm... unexpected results.
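
A small Python sketch of why this happens, assuming (as POSIX specifies for `wc -l`) that lines are counted by counting newline characters. The function name and sample data are made up for illustration.

```python
def count_lines_like_wc(data: bytes) -> int:
    """Count newline bytes, which is how `wc -l` counts lines."""
    return data.count(b"\n")

terminated = b"one\ntwo\n"   # every line ends with a newline
delimited = b"one\ntwo"      # newline only *between* the lines

print(count_lines_like_wc(terminated))   # 2
print(count_lines_like_wc(delimited))    # 1

# Concatenating two "delimited" files, as cat would, glues the last line
# of the first file onto the first line of the second.
print(delimited + delimited)             # b'one\ntwoone\ntwo'
```

Both samples look like two lines in an editor, but only the terminated one is counted as two.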

> What's the issue you're trying to figure out?
> 
> Joe

Janis
Janis
6/23/2004 7:56:26 PM
Janis Papanagnou <Janis_Papanagnou@hotmail.com> writes:

> joe@invalid.address wrote:
>> niy38@hotmail.com (Niy) writes:
>>
>>>Even a so-called "text file" is actually stored in
>>>binary format.

>> There isn't a difference on unix. Unix sees a file as simply a
>> sequence of bytes, which it does not try to interpret.
>
> True, on OS level.

Err, isn't unix at the OS level?

>> Interpretation of file formats is done at the application
>> level. For example, a database program would expect a certain
>> format, a text editor another format, but the standard I/O routines
>> just deal with streams of bytes.
>
> On programming language library level there are functions that
> regard line terminators (NL, CR/NL, or CR, depending on OS) on input
> and output.

Yes, but that's not Unix, that's the application. That's why it's in
the standard library, not the OS.

Another way of saying that is that the standard library is userland
code which interprets the stream of bytes provided to it by the
OS. The OS does not interpret the stream of bytes, it simply gives the
stdio routines however many of those bytes it asked for (or however
many there are left in the disk file).
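
As a rough sketch of that split, assuming a Unix-like system and using Python's `os.read` as a stand-in for the `read` system call and text-mode `open` as a stand-in for a userland library (the sample data is invented):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"one\r\ntwo\r\n")
    os.close(fd)

    # System-call level: os.read() hands back the stored bytes untouched.
    fd = os.open(path, os.O_RDONLY)
    raw = os.read(fd, 1024)
    os.close(fd)

    # Library level: text mode decodes the bytes and, by default,
    # translates "\r\n" line endings to "\n".
    with open(path, "r", encoding="ascii") as f:
        cooked = f.read()
finally:
    os.remove(path)

print(raw)           # b'one\r\ntwo\r\n'
print(repr(cooked))  # 'one\ntwo\n'
```

The kernel-level read returns the CR bytes as-is; it is the library layer that decides they are part of a line ending.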

> *Informally* I'd say text files comprise lines that contain no
> literal control codes within the line, and the line terminators at
> the end. 

A line terminator *is* a control code as you define above :-) Or so
the application level libraries think.

> There are surely better definitions of binary vs. text files (I once
> was pointed to a POSIX definition; outside POSIX I did not find any
> other definition, yet).

The usual distinction between binary and text is OS-specific, as are
your comments above. VMS has different ideas about file structure
than Unix does, for example.

> Problems arise when some application level tools don't terminate
> lines by one of the above terminators, but just *delimit* one line
> from the other, such that there is a missing terminator at the end
> of the file.

That's not something that Unix cares about though. Applications might
care about it.

> cat'ing or wc'ing (or many other Unix typical text oriented
> operations) such files will lead to... hmm... unexpected results.

Ok, but those things are not the OS, they're utilities provided with
the OS. Which is to say, they're application level.

Joe
-- 
Folks who don't know why America is the Land of Promise should be here
during an election campaign. 
  -- Milton Berle
joe
6/23/2004 8:32:25 PM
Janis Papanagnou wrote:
> joe@invalid.address wrote:
> 
>> niy38@hotmail.com (Niy) writes:
>>
>>> Even a so-called "text file" is actually stored in
>>> binary format.
>>
>>
>> There isn't a difference on unix. Unix sees a file as simply a
>> sequence of bytes, which it does not try to interpret. 
> 
> 
> True, on OS level.
> 
>> Interpretation of file formats is done at the application level. For
>> example, a database program would expect a certain format, a text
>> editor another format, but the standard I/O routines just deal with
>> streams of bytes.
> 
> 
> On programming language library level there are functions that regard
> line terminators (NL, CR/NL, or CR, depending on OS) on input and output.
> 
> *Informally* I'd say text files comprise lines that contain no literal
> control codes within the line, and the line terminators at the end.
> There are surely better definitions of binary vs. text files (I once
> was pointed to a POSIX definition; outside POSIX I did not find any
> other definition, yet).
> 
> Problems arise when some application level tools don't terminate lines
> by one of the above terminators, but just *delimit* one line from the
> other, such that there is a missing terminator at the end of the file.
> 
> cat'ing or wc'ing (or many other Unix typical text oriented operations)
> such files will lead to... hmm... unexpected results.
> 

Right: cat a.out, and the xterm display gets corrupted.

But

more a.out

******** a.out: Not a text file ********


So it seems "more" has its own definition of a "text file".
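
The thread does not say how "more" decides, so the following Python sketch is only one common heuristic (look for a NUL byte near the start of the file) and not a description of what any particular "more" does. The function names, the 512-byte probe size, and the sample bytes are all assumptions for illustration.

```python
import os
import tempfile

def looks_binary(path: str, probe_size: int = 512) -> bool:
    """Heuristic: call a file binary if its first bytes contain a NUL."""
    with open(path, "rb") as f:
        return b"\0" in f.read(probe_size)

def probe(contents: bytes) -> bool:
    """Write contents to a scratch file and run the heuristic on it."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(contents)
        return looks_binary(path)
    finally:
        os.remove(path)

print(probe(b"plain text\n"))               # False
print(probe(b"\x7fELF\x02\x01\x01\x00"))    # True
```

A heuristic like this can be fooled in both directions, which is part of why the question of a real definition comes up.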


niy38
6/23/2004 8:36:31 PM
On 2004-06-23, Niy wrote:
> Even a so-called "text file" is actually stored in
> binary format.

   Yes, all files are binary files; text files are a subset of binary
   files. Colloquially, a binary file is a non-text file.

<http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_392>

 3.392 Text File

      A file that contains characters organized into one or more
      lines. The lines do not contain NUL characters and none can
      exceed {LINE_MAX} bytes in length, including the
      <newline>. Although IEEE Std 1003.1-2001 does not distinguish
      between text files and binary files (see the ISO C standard),
      many utilities only produce predictable or meaningful output
      when operating on text files. The standard utilities that have
      such restrictions always specify "text files" in their STDIN or
      INPUT FILES sections.


-- 
    Chris F.A. Johnson         http://cfaj.freeshell.org/shell
    ==========================================================
    -+   Everything in moderation -- including moderation   +-
Chris
6/23/2004 8:47:03 PM
joe@invalid.address wrote:
> Janis Papanagnou <Janis_Papanagnou@hotmail.com> writes:
> 
>>True, on OS level.
> 
> Err, isn't unix at the OS level?

Yes, I had been agreeing to what you said for OS level.

>>*Informally* I'd say text files comprise lines that contain no
>>literal control codes within the line, and the line terminators at
>>the end. 
> 
> A line terminator *is* a control code as you define above :-) Or so
> the application level libraries think.

I've written "within the line" as opposed to "at the end" to make
clear what I think to be the essence of a text file.

Something like: <non-control-code>-sequence <control-code>

Janis
Janis
6/23/2004 9:22:44 PM
Janis Papanagnou <Janis_Papanagnou@hotmail.com> writes:

> joe@invalid.address wrote:
>> Janis Papanagnou <Janis_Papanagnou@hotmail.com> writes:
>>
>>>True, on OS level.
>> Err, isn't unix at the OS level?
>
> Yes, I had been agreeing to what you said for OS level.

So we're violently agreeing? :-)

>>>*Informally* I'd say text files comprise lines that contain no
>>>literal control codes within the line, and the line terminators at
>>>the end.

>> A line terminator *is* a control code as you define above :-) Or so
>> the application level libraries think.
>
> I've written "within the line" as opposed to "at the end" to make
> clear what I think to be the essence of a text file.
>
> Something like: <non-control-code>-sequence <control-code>

Well, I was a little confused about the distinction between a
literal control code and a non-literal one. But the line terminator
normally used by applications in unix is a control code, and only the
application cares about that.

The system calls used to retrieve bytes from disk simply retrieve
bytes from disk. The kernel doesn't interpret them at all.

Interpretation of the bytes retrieved from disk is done at the
application level, not the OS.

Joe
-- 
Folks who don't know why America is the Land of Promise should be here
during an election campaign. 
  -- Milton Berle
joe
6/23/2004 9:35:09 PM
Janis Papanagnou wrote:
 > joe@invalid.address wrote:
 >> Janis Papanagnou <Janis_Papanagnou@hotmail.com> writes:
 >>> *Informally* I'd say text files comprise lines that contain no
 >>> literal control codes within the line, and the line terminators at
 >>> the end.
 >>
 >> A line terminator *is* a control code as you define above :-) Or so
 >> the application level libraries think.
 >
 > I've written "within the line" as opposed to "at the end" to make
 > clear what I think to be the essence of a text file.
 >
 > Something like: <non-control-code>-sequence <control-code>

But you have yet to define control-code.  Does it include \t etc.?

The Open Group definition cited by Chris (which only prohibits NUL)
seems adequate.

-- 
Kevin Rodgers

Kevin
6/23/2004 9:41:56 PM
2004-06-23, 20:47(+00), Chris F.A. Johnson:
[...]
>  3.392 Text File
>
>       A file that contains characters organized into one or more
>       lines. The lines do not contain NUL characters and none can
>       exceed {LINE_MAX} bytes in length, including the
>       <newline>. Although IEEE Std 1003.1-2001 does not distinguish
>       between text files and binary files (see the ISO C standard),
>       many utilities only produce predictable or meaningful output
>       when operating on text files. The standard utilities that have
>       such restrictions always specify "text files" in their STDIN or
>       INPUT FILES sections.
[...]

I thought it was "zero or more lines". With "one or more lines" as
quoted, an empty file is not a text file.

So a file is not a text file when ("or"ed):
- it's empty
- it contains at least one '\0' character
- it contains a sequence of non-'\n' characters longer than
  LINE_MAX-1.
- it ends in a character different from '\n'.
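
The four conditions above can be sketched as a Python predicate. This is only an illustration: the function name is made up, and the default of 2048 for `line_max` is an assumption (it is the minimum POSIX allows for LINE_MAX; the real value is system-specific).

```python
def is_text_file(data: bytes, line_max: int = 2048) -> bool:
    """Apply the four listed conditions to a file's contents."""
    if not data:                    # it's empty
        return False
    if b"\0" in data:               # it contains a NUL
        return False
    if not data.endswith(b"\n"):    # last line is not newline-terminated
        return False
    # data ends with "\n", so split() leaves one empty trailing piece;
    # drop it and check each line, counting its newline, against line_max.
    lines = data.split(b"\n")[:-1]
    return all(len(line) + 1 <= line_max for line in lines)

print(is_text_file(b"hello\n"))            # True
print(is_text_file(b"hello"))              # False
print(is_text_file(b""))                   # False
print(is_text_file(b"x" * 2048 + b"\n"))   # False
```

Note that the check works on bytes and never asks about encodings or printable characters.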

-- 
Stephane
Stephane
6/24/2004 10:19:21 AM
"Chris F.A. Johnson" <c.fa.johnson@rogers.com> wrote in message
news:2ju8i7F168ebnU1@uni-berlin.de...
>    Yes, all files are binary files; text files are a subset of binary
>    files. Colloquially, a binary file is a non-text file.

This is only true on binary computers.


Craig
6/24/2004 2:47:48 PM
"Stephane CHAZELAS" <this.address@is.invalid> wrote in message
news:slrncdlal9.dc.stephane.chazelas@spam.is.invalid...
> 2004-06-23, 20:47(+00), Chris F.A. Johnson:
> So a file is not a text file when ("or"ed):
> - it's empty
> - it contains at least one '\0' character
> - it contains a sequence of non-'\n' characters longer than
>   LINE_MAX-1.
> - it ends in a character different from '\n'.

And so, to tie this into Janis and Joe's argument...  By this definition, a
text file can contain "control characters" (0x1..0x1f range), and also
non-ASCII characters (0x80..0xff range).  This allows UTF-8 encoded text
files, for example.  The *only* character that cannot be in a text file is
NUL.
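
A quick Python check of that claim, using an invented sample string: UTF-8 produces bytes in the 0x80..0xff range but no NUL bytes, whereas the same string in UTF-16 is full of NULs and so would fail the definition.

```python
sample = "naïve café\n"

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")

print(any(b >= 0x80 for b in utf8))   # True: non-ASCII bytes are present
print(b"\0" in utf8)                  # False: still no NUL anywhere
print(b"\0" in utf16)                 # True: every ASCII character gets a NUL byte
```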

I question the value of this definition, however, even if you restrict it to
UNIX systems (which is the intended scope of OpenGroup's specifications).
Exactly what is LINE_MAX defined to be?  Since it can be different from
system to system, there exist files which are text files on one UNIX system
but not on another.  Nice.
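
For reference, a system's LINE_MAX can be queried at run time. The Python sketch below assumes a Unix-like platform where `os.sysconf` knows the name "SC_LINE_MAX", and returns None otherwise; the function name is made up.

```python
import os

def line_max():
    """Return this system's LINE_MAX, or None if it cannot be queried."""
    sysconf = getattr(os, "sysconf", None)
    if sysconf is None:               # os.sysconf is absent on non-Unix platforms
        return None
    try:
        return sysconf("SC_LINE_MAX")
    except (ValueError, OSError):     # name unknown or query failed
        return None

print(line_max())
```

From a shell, `getconf LINE_MAX` reports the same limit.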


Craig
6/24/2004 3:00:34 PM