file command: "XML document text" vs "XML  document text"

I've just used the file command on four files of RDF-XML with the
following output:

augtfidf.rdf: XML document text
kyoto.rdf:    XML document text
stuff.rdf:    XML  document text
tfidf.rdf:    XML document text

What does it mean that one of them has an extra space between "XML"
and "document"?


-- 
The kid's a hot prospect. He's got a good head for merchandising, an
agent who can take you downtown and one of the best urine samples I've
seen in a long time.                           [Dead Kennedys t-shirt]
a240613 (153)
10/16/2012 2:06:38 PM
comp.unix.shell

On 16.10.2012 16:06, Adam Funk wrote:
> I've just used the file command on four files of RDF-XML with the
> following output:
> 
> augtfidf.rdf: XML document text
> kyoto.rdf:    XML document text
> stuff.rdf:    XML  document text
> tfidf.rdf:    XML document text
> 
> What does it mean that one of them has an extra space between "XML"
> and "document"?

Hard to tell without further information.

What does   file *.rdf | od -c   show you?

Janis

10/16/2012 3:20:14 PM
Janis Papanagnou wrote:

> On 16.10.2012 16:06, Adam Funk wrote:
>> I've just used the file command on four files of RDF-XML with the
>> following output:
>> 
>> augtfidf.rdf: XML document text
>> kyoto.rdf:    XML document text
>> stuff.rdf:    XML  document text
>> tfidf.rdf:    XML document text
>> 
>> What does it mean that one of them has an extra space between "XML"
>> and "document"?
> 
> Hard to tell without further information.
> 
> What does   file *.rdf | od -c   show you?
> 
> Janis

I've found something strange. I don't know why this happens (perhaps a 
bug in file/libmagic?), but maybe it helps with investigating further.

I've used locate+file+grep to look for .rdf files on my system with the two 
spaces and found some. It seems that the double space appears when the XML 
declaration uses single quotes, as in <?xml version='1.0'?>. See this:

mirko@WizBox:~$ cp /etc/kompozer/profile/localstore.rdf .
mirko@WizBox:~$ cp localstore.rdf localstore_orig.rdf
mirko@WizBox:~$ vim localstore.rdf 
mirko@WizBox:~$ file localstore*
localstore_orig.rdf: XML document text
localstore.rdf:      XML  document text
mirko@WizBox:~$ head -n1 localstore*
==> localstore_orig.rdf <==
<?xml version="1.0"?>

==> localstore.rdf <==
<?xml version='1.0'?>

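For anyone who wants to try this without editing an existing file, here is a
minimal sketch. The file names double.rdf and single.rdf and the <root/> body
are made up for illustration, and whether the extra space shows up will depend
on the version of file and its magic database.

```shell
# Build two XML files that differ only in the quotes of the declaration.
dir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0"?>' '<root/>' > "$dir/double.rdf"
printf '%s\n' "<?xml version='1.0'?>" '<root/>' > "$dir/single.rdf"

# Show the first line next to the description; -b drops the "name:" prefix.
# The brackets make leading or doubled spaces easier to spot.
for f in "$dir/double.rdf" "$dir/single.rdf"; do
    printf '%s -> [%s]\n' "$(head -n1 "$f")" "$(file -b "$f")"
done
```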



10/16/2012 5:23:55 PM
Adam Funk wrote:

> I've just used the file command on four files of RDF-XML with the
> following output:
> 
> augtfidf.rdf: XML document text
> kyoto.rdf:    XML document text
> stuff.rdf:    XML  document text
> tfidf.rdf:    XML document text
> 
> What does it mean that one of them has an extra space between "XML"
> and "document"?

file(1) is free software.  UTSL: <ftp://ftp.astron.com/pub/file/>.

-- 
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
PointedEars (2489)
10/16/2012 6:37:41 PM
On 2012-10-16, Janis Papanagnou wrote:

> On 16.10.2012 16:06, Adam Funk wrote:
>> I've just used the file command on four files of RDF-XML with the
>> following output:
>> 
>> augtfidf.rdf: XML document text
>> kyoto.rdf:    XML document text
>> stuff.rdf:    XML  document text
>> tfidf.rdf:    XML document text
>> 
>> What does it mean that one of them has an extra space between "XML"
>> and "document"?
>
> Hard to tell without further information.
>
> What does   file *.rdf | od -c   show you?

Not much.  I tried od -x too, and that confirmed that the two spaces
are both 0x20.
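
The same check can be sketched on a single description. The sample string is
copied from the output above rather than produced by running file, and the
variable names desc and hex are only illustrative.

```shell
desc='XML  document text'

# Character view: every byte is shown separately, so consecutive spaces stand out.
printf '%s\n' "$desc" | od -An -c

# Hex view: one byte per column; an ordinary space is 20.
hex=$(printf '%s' "$desc" | od -An -tx1)
printf '%s\n' "$hex"
```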


-- 
The history of the world is the history of a privileged few.
                                            --- Henry Miller
a240613 (153)
10/16/2012 8:45:51 PM
On 16.10.2012 22:45, Adam Funk wrote:
> On 2012-10-16, Janis Papanagnou wrote:
> 
>> On 16.10.2012 16:06, Adam Funk wrote:
>>> I've just used the file command on four files of RDF-XML with the
>>> following output:
>>>
>>> augtfidf.rdf: XML document text
>>> kyoto.rdf:    XML document text
>>> stuff.rdf:    XML  document text
>>> tfidf.rdf:    XML document text
>>>
>>> What does it mean that one of them has an extra space between "XML"
>>> and "document"?
>>
>> Hard to tell without further information.
>>
>> What does   file *.rdf | od -c   show you?
> 
> Not much.  I tried od -x too, and that confirmed that the two spaces
> are both 0x20.

The file(1) man page points to magic(5). Depending on the actual file
characteristics there seem to be more than one entry possible for a
file, independent of its extension. As a wild guess: maybe depending
on byte order or any such file characteristic.

It may be helpful if you inspect that 'magic' file on your system and
see what entries are present for XML and what's the difference in the
definition of the respective entries with those two text strings that
you observed.

Janis

> 
> 

10/16/2012 9:28:51 PM
Janis Papanagnou wrote:

> The file(1) man page points to magic(5). Depending on the actual file
> characteristics there seem to be more than one entry possible for a
> file, independent of its extension. As a wild guess: maybe depending
> on byte order or any such file characteristic.

There are indeed multiple possible entries for a "file type". The point of 
'file' and 'magic' is exactly that it tries to determine the type (and 
possible sub-types/versions/etc) from the file's content (certain special 
byte/string sequences near the beginning of the file, aka "magic numbers"), 
instead of the extension, and actually, 'file' completely ignores the 
extension.

Since there is an unlimited number of file formats, 'file' depends on a 
heuristic which is easy to trick.
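
A small sketch of the point that the name plays no part: the same bytes are
stored under two made-up names (sample.rdf and sample.jpg are placeholders)
and file is asked to describe both.

```shell
dir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0"?>' '<root/>' > "$dir/sample.rdf"
cp "$dir/sample.rdf" "$dir/sample.jpg"

# -b prints only the description, without the file name.
as_rdf=$(file -b "$dir/sample.rdf")
as_jpg=$(file -b "$dir/sample.jpg")

if [ "$as_rdf" = "$as_jpg" ]; then
    printf 'same description for both names: %s\n' "$as_rdf"
else
    printf 'descriptions differ: [%s] vs [%s]\n' "$as_rdf" "$as_jpg"
fi
```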

> It may be helpful if you inspect that 'magic' file on your system and
> see what entries are present for XML and what's the difference in the
> definition of the respective entries with those two text strings that
> you observed.

Unfortunately, there is no single "magic" text file on Ubuntu (and Debian) 
anymore. It has been replaced with some binary format 
(/usr/share/misc/magic.mgc). One has to install the source of the 'file' (or 
libmagic1) package to view the different magic files that this magic.mgc is 
compiled from. The relevant file (in the source) is "Magdir/sgml". I cannot 
read it fluently, but I didn't find an obvious explanation for the symptoms 
described by the OP.

Still, the observation in my other post should be a helpful start for 
asking the file/libmagic devs about this. :-)

HTH
10/16/2012 11:54:01 PM
On 17.10.2012 01:54, Mirko K. wrote:
> 
>> It may be helpful if you inspect that 'magic' file on your system and
>> see what entries are present for XML and what's the difference in the
>> definition of the respective entries with those two text strings that
>> you observed.
> 
> Unfortunately, there is no single "magic" text file on Ubuntu (and Debian) 
> anymore. It has been replaced with some binary format 
> (/usr/share/misc/magic.mgc). One has to install the source of the 'file' (or 
> libmagic1) package to view the different magic files that this magic.mgc is 
> compiled from). [...]

The binary and text files are, on my Xubuntu, both under /usr/share/file,
also referenced through soft-links in /usr/share/misc. There's nothing
you'd have to install separately.

 magic:      magic text file for file(1) cmd
 magic.mgc:  magic binary file for file(1) cmd (version 7) (little endian)
 magic.mime: magic text file for file(1) cmd

The file called magic was the one that I inspected.

Janis

10/17/2012 12:32:32 AM
Janis Papanagnou wrote:

> The binary and text files are, on my Xubuntu, both under /usr/share/file,
> also referenced through soft-links in /usr/share/misc. There's nothing
> you'd have to install separately.
> 
>  magic:      magic text file for file(1) cmd
>  magic.mgc:  magic binary file for file(1) cmd (version 7) (little endian)
>  magic.mime: magic text file for file(1) cmd
> 
> The file called magic was the one that I inspected.
> 
> Janis

That seems to have changed somewhere between Ubuntu 10.04 and 12.04, and also 
in Debian 6. On my old 10.04 installation these files are still there; on this 
12.04 installation there is certainly no single plain-text magic file anymore, 
and I had to download the source (which does not rule out that Xubuntu 12.04 
still has them).
10/17/2012 12:51:08 PM
On 2012-10-16, Mirko K. wrote:

> I've found out something strange. I don't know why this happens (perhaps a 
> bug in file/libmagic?), maybe it helps investigating further.
>
> I've used locate+file+grep to look for .rdf files in my system with the two 
> spaces and found some. It seems that the double space appears when using 
> single quotes in the <?xml version='1.0'?> start tag. See this:
>
> mirko@WizBox:~$ cp /etc/kompozer/profile/localstore.rdf .
> mirko@WizBox:~$ cp localstore.rdf localstore_orig.rdf
> mirko@WizBox:~$ vim localstore.rdf 
> mirko@WizBox:~$ file localstore*
> localstore_orig.rdf: XML document text
> localstore.rdf:      XML  document text
> mirko@WizBox:~$ head -n1 localstore*
>==> localstore_orig.rdf <==
><?xml version="1.0"?>
>
>==> localstore.rdf <==
><?xml version='1.0'?>


I get the same results --- harmless, I guess, but very strange.


-- 
No sport is less organized than Calvinball!
a240613 (153)
10/19/2012 1:04:59 PM
Adam Funk <a24061@ducksburg.com> writes:

> On 2012-10-16, Mirko K. wrote:
>
>> I've found out something strange. I don't know why this happens (perhaps a 
>> bug in file/libmagic?), maybe it helps investigating further.
>>
>> I've used locate+file+grep to look for .rdf files in my system with the two 
>> spaces and found some. It seems that the double space appears when using 
>> single quotes in the <?xml version='1.0'?> start tag. See this:
>>
>> mirko@WizBox:~$ cp /etc/kompozer/profile/localstore.rdf .
>> mirko@WizBox:~$ cp localstore.rdf localstore_orig.rdf
>> mirko@WizBox:~$ vim localstore.rdf 
>> mirko@WizBox:~$ file localstore*
>> localstore_orig.rdf: XML document text
>> localstore.rdf:      XML  document text
>> mirko@WizBox:~$ head -n1 localstore*
>>==> localstore_orig.rdf <==
>><?xml version="1.0"?>
>>
>>==> localstore.rdf <==
>><?xml version='1.0'?>
>
>
> I get the same results --- harmless, I guess, but very strange.

I chased this down yesterday, but the end result was not very
satisfactory.  The magic patterns allow subsequent matches to add text
to previous results and the 'file' code adds a space between them.  All
good so far.

The double space comes from two rules.  The first adds "XML" and the
second adds " document text" so you get two spaces.  The tricky part is
that the rule that adds " document text" is this:

  >15   search/1      >\0      %.3s document text

where the %.3s expands to nothing.  Simply writing this

  >15   search/1      >\0      document text

fixes the problem but the %.3s is very suggestive.  Maybe this change
would break other file calls?
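
The effect of an empty %.3s can be imitated with the shell's own printf. This
is only a model of the behaviour described above, not the actual file(1) code:
the variable names are made up, and the single joining space is taken from the
description earlier in this post.

```shell
first='XML'

# %.3s prints at most three characters of its argument. With an empty
# argument it prints nothing, which leaves the literal leading space.
second=$(printf '%.3s document text' '')
fixed='document text'

# Join each variant to the first description with one separating space.
with_format="$first $second"
without_format="$first $fixed"

printf '[%s]\n' "$with_format"
printf '[%s]\n' "$without_format"
```

The first line comes out with two spaces between "XML" and "document", the
second with one.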

I suspect that maybe the code should be passing the old string as an
argument to the formatted print call, but then the result would be "XML
XML document text" so that alone is not the problem.  Maybe the file
code is supposed to treat strings with formats differently by not
appending the old and the new strings?  I could not tell.

I thought the situation complex enough that I didn't think I could do
anything beyond simply reporting the effect.  Has anyone done that?

-- 
Ben.
ben.usenet (6790)
10/19/2012 4:05:43 PM
Ben Bacarisse wrote:

> I chased this down yesterday, but the end result was not very
> satisfactory.  The magic patterns allow subsequent matches to add text
> to previous results and the 'file' code adds a space between them.  All
> good so far.
> 
> The double space comes from two rules.  The first adds "XML" and the
> second adds " document text" so you get two spaces.  The tricky part is
> that the rules that adds " document text" is this:
> 
>   >15   search/1      >\0      %.3s document text
> 
> where the %.3s expands to nothing.  Simply writing this
> 
>   >15   search/1      >\0      document text
> 
> fixes the problem but the %0.3s is very suggestive.  Maybe this change
> would break other file calls?
> 
> I suspect that maybe the code should be passing the old string as an
> argument to the formatted print call, but then the result would be "XML
> XML document type" so that alone is not the problem.  Maybe the file
> code is supposed to treat strings with formats differently by not
> appending the old and the new strings?  I could not tell.
> 
> I thought the situation complex enough that I didn't think I could
> anything over simply reporting the effect.  Has anyone done that?
> 

Not yet. I found a different (partial) fix. First there is:

0	string/t	\<?xml\ version="	XML

And a few lines later:

0	string		\<?xml\ version='	XML

Changing that second field to string/t seems to fix it.

However, this seems to break it for UTF-x files (I think the t flag 
restricts the test to ASCII text files).

Feel free to report it if you think that's enough info. Otherwise I'll 
report it in the next few days; I want to play around with this a little more.
10/19/2012 6:08:52 PM
"Mirko K. " <mirkok.lists@googlemail.com> writes:

> Ben Bacarisse wrote:
>
>> I chased this down yesterday, but the end result was not very
>> satisfactory.  The magic patterns allow subsequent matches to add text
>> to previous results and the 'file' code adds a space between them.  All
>> good so far.
>> 
>> The double space comes from two rules.  The first adds "XML" and the
>> second adds " document text" so you get two spaces.  The tricky part is
>> that the rules that adds " document text" is this:
>> 
>>   >15   search/1      >\0      %.3s document text
>> 
>> where the %.3s expands to nothing.  Simply writing this
>> 
>>   >15   search/1      >\0      document text
>> 
>> fixes the problem but the %0.3s is very suggestive.  Maybe this change
>> would break other file calls?
>> 
>> I suspect that maybe the code should be passing the old string as an
>> argument to the formatted print call, but then the result would be "XML
>> XML document type" so that alone is not the problem.  Maybe the file
>> code is supposed to treat strings with formats differently by not
>> appending the old and the new strings?  I could not tell.
>> 
>> I thought the situation complex enough that I didn't think I could
>> anything over simply reporting the effect.  Has anyone done that?
>> 
>
> Not yet. I found a different (partial) fix. First there is:
>
> 0	string/t	\<?xml\ version="	XML
>
> And a few lines later:
>
> 0	string		\<?xml\ version='	XML
>
> Changing that second field to string/t seems to fix it.
>
> However, this seems to break it for UTF-x files (the t makes the test only 
> for ASCII files I think).
>
> Feel free to report it if you think that's enough info. Otherwise I'll 
> report it in the next days, I want to play around with this a little
> more.

I don't think I'll have time, but I'll post here if I am about to -- no
point in duplicating.

-- 
Ben.
ben.usenet (6790)
10/19/2012 9:16:13 PM
In article <s6m2l9-blg.ln1@WizBox.localnet>,
Mirko K.  <mirkok.lists@googlemail.com> wrote:
>
>That seems to have changed somewhere between Ubuntu 10.04 and 12.04, also 
>Debian 6. On my old U10.04 installation these files are still there, on this 
>U12.04 there is certainly no single plain text magic file anymore and I had 
>to download the source (does not mean, that Xubuntu 12.04 might not have 
>them.)

I complained about this last year. It's Debian bug#625259. Nothing has been
done about it.

-- 
Alan Curry
pacman5 (289)
10/22/2012 8:55:50 PM