How do I create a new text file with UTF-8 encoding?

I use ActivePerl version 5.8.8.817 on Windows XP.

I am trying to create a new text file and add some content, but when I open it
in Notepad, it says it's an ANSI-encoded file. Why?

Here is my code snippet:

open my $fh, '>:encoding(UTF-8)', "testfile.txt";
print $fh "Welcome to Muppet Show\n";
close $fh;

What am I doing wrong?

bk (3)
5/10/2007 12:37:39 PM
comp.lang.perl.misc

3 Replies
1079 Views


bk@docstream.no wrote:
> I use ActivePerl version 5.8.8.817 on Windows XP.
>
> I am trying to create a new text file and add some content, but when I open it
> in Notepad, it says it's an ANSI-encoded file. Why?
>
> open my $fh, '>:encoding(UTF-8)', "testfile.txt";
> print $fh "Welcome to Muppet Show\n";
> close $fh;
>
> What am I doing wrong?

Your sample text has the identical byte sequence in ASCII, Windows-1252 (aka 
ANSI), UTF-8, ISO-8859-1, ISO-8859-15, and probably a dozen other encodings. 
Therefore your sample is useless for testing for the correct encoding.
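
A minimal sketch of that point, assuming the core Encode module is available: it prints the bytes of the sample text in three encodings, then does the same for a string with one non-ASCII character. The helper name hex_bytes and the second sample string are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded bytes of $text as a hex string, for easy comparison.
# (hex_bytes is a hypothetical helper, not part of any module.)
sub hex_bytes {
    my ($encoding, $text) = @_;
    return unpack 'H*', encode($encoding, $text);
}

my $ascii     = "Welcome to Muppet Show\n";  # pure ASCII, as in the question
my $non_ascii = "J\x{FC}rgen\n";             # contains U+00FC (u with diaeresis)

for my $text ($ascii, $non_ascii) {
    for my $encoding (qw(UTF-8 iso-8859-1 cp1252)) {
        printf "%-10s %s\n", $encoding, hex_bytes($encoding, $text);
    }
    print "\n";
}
```

For the first string all three lines should show the same bytes; for the second, the UTF-8 line should differ from the other two, which is what makes such text usable for testing.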

Notepad relies on the byte order mark (BOM) to identify Unicode files, 
including UTF-8, where the BOM is of course meaningless and is not used except 
by Notepad itself. In short: Notepad has no clue what it is 
talking about. But for your sample text, neither would any other tool.

Step 1: use some sample text that contains characters that are encoded as 
different byte sequences in each encoding.
Step 2: don't use Notepad. Write to a (trivial) HTML file and then use a web 
browser to view that file. There you can change the encoding and determine 
whether those characters are displayed correctly for the desired encoding.
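
The two steps above could be sketched roughly as follows. The file name encoding-test.html and the sample text are assumptions chosen for illustration; the script writes a trivial HTML file through the UTF-8 layer and then dumps the raw bytes so they can be compared with what the browser shows.

```perl
use strict;
use warnings;

# Hypothetical file name and sample text, chosen for illustration only.
my $file = 'encoding-test.html';
my $text = "Gr\x{FC}\x{DF}e, \x{20AC}5";   # u-diaeresis, sharp s, euro sign

# Write a trivial HTML page; the layer encodes the characters as UTF-8.
open my $out, '>:encoding(UTF-8)', $file or die "Cannot write $file: $!";
print {$out} "<html><body><p>$text</p></body></html>\n";
close $out or die "Cannot close $file: $!";

# Read the file back as raw bytes to see what was actually written.
open my $in, '<:raw', $file or die "Cannot read $file: $!";
my $bytes = do { local $/; <$in> };
close $in;

print unpack('H*', $bytes), "\n";
```

The page deliberately carries no charset declaration, so the encoding selected in the browser decides how the bytes are interpreted.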

In over 8 years as a software localization engineer and international program 
manager, this has proven to be the only practical and reliable way to 
identify the actual encoding of a file.

jue




J
5/10/2007 2:05:09 PM
On May 10, 3:05 pm, "Jürgen Exner" <jurge...@hotmail.com> wrote:
> b...@docstream.no wrote:
> > I use ActivePerl version 5.8.8.817 on Windows XP.
>
> > I am trying to create a new text file and add some content, but when I open it
> > in Notepad, it says it's an ANSI-encoded file. Why?
>
> > open my $fh, '>:encoding(UTF-8)', "testfile.txt";
> > print $fh "Welcome to Muppet Show\n";
> > close $fh;
>
> > What am I doing wrong?
>
> Your sample text has the identical byte sequence in ASCII, Windows-1252 (aka
> ANSI), UTF-8, ISO-8859-1, ISO-8859-15, and probably a dozen other encodings.
> Therefore your sample is useless for testing for the correct encoding.
>
> Notepad relies on the byte order mark (BOM) to identify Unicode files,
> including UTF-8, where the BOM is of course meaningless and is not used except
> by Notepad itself.

You mean Windows, not Notepad. Most Windows programs will recognise a
file with a UTF-8 BOM at the start as UTF-8.

In a situation where you've got a mixture of Windows-1252 and UTF-8
files knocking about, it's not a bad way to distinguish them. I'm
not saying I particularly liked Microsoft's unilateral adoption of a BOM
in UTF-8, but I have to admit it makes the best of a bad job.

In Perl I'd like to be able to say something like

open my $fh, '>:encoding(UTF-8 BOM)', "testfile.txt";

But AFAIK I can't, and I just have to

print $fh "\x{FEFF}"; # BOM
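
Put together with the code from the question, a minimal sketch might look like the following. It writes U+FEFF as the first character through the UTF-8 layer and then reads the first three bytes back raw to confirm what landed on disk. The error handling and the read-back check are additions for illustration, not part of the original snippet.

```perl
use strict;
use warnings;

my $file = 'testfile.txt';   # same name as in the question

open my $fh, '>:encoding(UTF-8)', $file or die "Cannot write $file: $!";
print {$fh} "\x{FEFF}";      # U+FEFF; the UTF-8 layer encodes it as EF BB BF
print {$fh} "Welcome to Muppet Show\n";
close $fh or die "Cannot close $file: $!";

# Confirm the first three bytes on disk.
open my $in, '<:raw', $file or die "Cannot read $file: $!";
read $in, my $head, 3;
close $in;

print uc(unpack('H*', $head)), "\n";
```

If the three bytes come back as EFBBBF, the file starts with the UTF-8 signature that BOM-aware Windows programs look for.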


Brian
5/10/2007 5:59:57 PM
Brian McCauley wrote:
> In a situation where you've got a mixture of Windows-1252 and UTF-8
> files knocking about, it's not a bad way to distinguish them. I'm
> not saying I particularly liked Microsoft's unilateral adoption of a BOM
> in UTF-8, but I have to admit it makes the best of a bad job.

Fair enough, you've got a point.
However, calling it a _Byte_Order_ Mark in the context of UTF-8 is a misnomer if 
there ever was one ;-)

jue 


J
5/10/2007 8:50:27 PM