Reading Text File Encoding and converting to Perl's internal UTF-8 encoding

Need help from Unicode gurus or anybody with some knowledge of the subject.

Say I have a text (character) file that I simply open. I don't know its encoding, so I
can't open it with any encoding attribute.

It would appear to me that at the start of a text file there is an encoding mark (or none),
a sort of BOM sequence of octets that marks what its encoding is.

Given that I am a module and might be passed only a file descriptor, if I rewind the position
to the start of the file, is there any way I can tell the encoding? If I could, and
it's not UTF-8, I could decode() the rest of the file as octets, i.e. do an in-memory decode,
create a decoded temp file, or possibly re-open it with the proper encoding.

I think the encoding will be one of the usual 8/16/32-bit UTFs, but covering many locales (character sets).

I am still sketchy on where to find a list of encoding markers that would let me find out
this information, and on the methods available for analysis and transformation.

I know Perl has a massive 'use Encode' lib; nevertheless, this is what I need to do to finalize
a module I'm working on.

Thanks for the help.
-sln
sln
4/17/2009 9:27:02 PM
comp.lang.perl.misc

sln@netherlands.com wrote:

> Given that I am a module and might be passed only a file descriptor, if I rewind the position
> to the start of the file, is there any way I can tell the encoding? If I could, and
> it's not UTF-8, I could decode() the rest of the file as octets, i.e. do an in-memory decode,
> create a decoded temp file, or possibly re-open it with the proper encoding.

As I understand it, and I have just written some Perl code that happily
mixes two dozen languages in one web page, there isn't a really good way
of doing what you want. Part of the reason for this is that given a big
block of text encoded as plain ASCII, the same text in UTF8 is exactly,
bit for bit, the same. It's only when you introduce "wide" characters in
other alphabets that UTF8 does anything.
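
A minimal sketch of that point, assuming only the Encode module that ships with Perl (the sample strings are made up for illustration): 7-bit ASCII octets decode to the same characters under either encoding, and a difference only shows up once a character outside ASCII is involved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets that use only 7-bit ASCII values.
my $octets = "Hello, world\n";

# Decoding as ASCII or as UTF-8 yields the same characters.
my $as_ascii = decode('ascii', $octets);
my $as_utf8  = decode('UTF-8', $octets);
print "ASCII and UTF-8 decodes agree\n" if $as_ascii eq $as_utf8;

# A non-ASCII character is where the encodings diverge:
# U+00E9 is one octet in ISO-8859-1 but two octets in UTF-8.
my $e_acute = "\x{E9}";
printf "ISO-8859-1: %d octet(s), UTF-8: %d octet(s)\n",
    length(encode('ISO-8859-1', $e_acute)),
    length(encode('UTF-8', $e_acute));
```

So a file that happens to contain only ASCII gives a detector nothing to work with, but it also does not need one.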

In some cases it may be possible to make an intelligent guess at the
encoding, but no more.
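
One case where the guess is fairly safe is a file that starts with a byte order mark. The sketch below is only an illustration: `sniff_bom` is a hypothetical helper name, it assumes the handle is seekable, and it covers only the UTF-8, UTF-16 and UTF-32 marks. A file without a BOM tells you nothing this way.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Known byte order marks, longest first so UTF-32LE wins over UTF-16LE.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# sniff_bom is a hypothetical helper: it returns (encoding name, BOM length),
# or an empty list when the data does not start with a known BOM.
# Side effect: the caller's handle is switched to raw octets.
sub sniff_bom {
    my ($fh) = @_;
    binmode $fh;                       # read raw octets
    seek $fh, 0, 0 or return;          # rewind; fails on pipes and sockets
    my $got = read $fh, my $head, 4;
    return unless defined $got;
    for my $bom (@BOMS) {
        my ($mark, $name) = @$bom;
        return ($name, length $mark) if index($head, $mark) == 0;
    }
    return;
}

# Demonstration on an in-memory file holding a UTF-16LE BOM followed by "hi".
my $data = "\xFF\xFEh\x00i\x00";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my ($enc, $skip) = sniff_bom($fh);
if (defined $enc) {
    seek $fh, $skip, 0;                          # position just past the BOM
    my $rest = do { local $/; <$fh> };           # slurp the remaining octets
    my $text = decode($enc, $rest);              # in-memory decode
    print "Detected $enc: $text\n";
}
```

The demonstration uses the in-memory decode from the original question; re-opening the file with an `:encoding(...)` layer would be another way to use the detected name.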

Incidentally, and somewhat off-topic, is there anyone else for whom the
letters UTF automatically mean 'use the force'?

-- 
I am Robert Billing, Christian, author, inventor, traveller, cook and
animal lover. "It burned me from within. It quickened; I was with book
as a woman is with child."

Quality e-books for portable readers: http://www.alex-library.com
Robert
4/17/2009 10:48:10 PM
On Fri, 17 Apr 2009 23:48:10 +0100, Robert Billing <unclebob@tnglwood.demon.co.uk> wrote:

>sln@netherlands.com wrote:
>
>> Given that I am a module and might be passed only a file descriptor, if I rewind the position
>> to the start of the file, is there any way I can tell the encoding? If I could, and
>> it's not UTF-8, I could decode() the rest of the file as octets, i.e. do an in-memory decode,
>> create a decoded temp file, or possibly re-open it with the proper encoding.
>
>As I understand it, and I have just written some Perl code that happily
>mixes two dozen languages in one web page, there isn't a really good way
>of doing what you want. Part of the reason for this is that given a big
>block of text encoded as plain ASCII, the same text in UTF8 is exactly,
>bit for bit, the same. It's only when you introduce "wide" characters in
>other alphabets that UTF8 does anything.
>
>In some cases it may be possible to make an intelligent guess at the
>encoding, but no more.
>
>Incidentally, and somewhat off-topic, is there anyone else for whom the
>letters UTF automatically mean 'use the force'?

I'm sorry, 'I exist and therefore I am' doesn't seem to work.

-sln
sln
4/17/2009 11:22:47 PM
Quoth Robert Billing <unclebob@tnglwood.demon.co.uk>:
> sln@netherlands.com wrote:
> 
> > Given that I am a module and might be passed only a file descriptor,
> > if I rewind the position to the start of the file, is there any way
> > I can tell the encoding? If I could, and it's not UTF-8, I could
> > decode() the rest of the file as octets, i.e. do an in-memory decode,
> > create a decoded temp file, or possibly re-open it with the proper
> > encoding.
> 
> As I understand it, and I have just written some Perl code that happily
> mixes two dozen languages in one web page, there isn't a really good way
> of doing what you want. Part of the reason for this is that given a big
> block of text encoded as plain ASCII, the same text in UTF8 is exactly,
> bit for bit, the same. It's only when you introduce "wide" characters in
> other alphabets that UTF8 does anything.

In that particular case it doesn't matter: whether you decode it as
UTF-8 or ASCII you still get the same characters. What's more important
is that any piece of UTF-8 text is also valid (though probably
nonsensical) in any of the 8-bit ISO-8859-X encodings, and anything in
one of those encodings is also valid in all the others, give or take a
few byte values that some parts of the standard leave unassigned.
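
A small sketch of why this matters, assuming the core Encode module (the sample word is made up): UTF-8 octets decode "successfully" as ISO-8859-1 and simply come out wrong, whereas the reverse direction is stricter, which is about the only cheap test available.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# "cafe" with an e-acute, encoded as UTF-8: the accent becomes octets C3 A9.
my $utf8_octets = encode('UTF-8', "caf\x{E9}");

# Decoding those octets as ISO-8859-1 cannot fail, because every octet
# value maps to some character. The result is just wrong: two characters
# (U+00C3, U+00A9) appear where one was intended.
my $misread = decode('ISO-8859-1', $utf8_octets, FB_CROAK | LEAVE_SRC);
printf "Decoded as ISO-8859-1: %d characters\n", length $misread;

# The reverse direction is stricter: a lone E9 octet is not well-formed
# UTF-8, so a strict decode dies and eval returns undef.
my $latin1_octets = encode('ISO-8859-1', "caf\x{E9}");
my $ok = eval { decode('UTF-8', $latin1_octets, FB_CROAK | LEAVE_SRC); 1 };
print "Strict UTF-8 decode of ISO-8859-1 octets ",
    ($ok ? "succeeded" : "failed"), "\n";
```

In practice this suggests trying a strict UTF-8 decode first and treating failure as evidence of some 8-bit encoding, without being able to say which one.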

There is the Encode::Guess module that attempts to guess the encoding of
a given string, but as you can tell from the name it's only a heuristic.
Depending on what you're decoding it may be good enough, though.
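
For illustration, a minimal sketch of using Encode::Guess with its default suspects (the sample text is made up). `guess_encoding` returns an encoding object when exactly one suspect fits, and a plain error string otherwise, so the return value has to be checked with `ref`.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding

# Sample octets: UTF-8 text containing two non-ASCII characters.
my $octets = encode('UTF-8', "na\x{EF}ve caf\x{E9}");

# With no extra suspects, the module tries ASCII, UTF-8 and
# BOM-marked UTF-16/32.
my $decoder = guess_encoding($octets);

if (ref $decoder) {
    # Success: $decoder is an encoding object that can decode the data.
    my $text = $decoder->decode($octets);
    printf "Guessed %s, %d characters\n", $decoder->name, length $text;
}
else {
    # Failure: $decoder is an error message, e.g. when suspects are ambiguous.
    print "Could not guess: $decoder\n";
}
```

Extra suspects can be passed as further arguments to `guess_encoding`, but listing several 8-bit encodings at once tends to end in an ambiguity error, for the reason given above.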

> Incidentally, and somewhat off-topic, is there anyone else for whom the
> letters UTF automatically mean 'use the force'?

Err... I doubt it? :)

Ben

Ben
4/18/2009 12:58:50 AM