Hi all !
I have written a Tcl script that writes text files containing non-ASCII
characters (byte values of 128 and above). Should I open the text file with
the "w" option or with the "wb" option (binary)?
Using an ActiveState distro 8.5.6, both work fine, but using CVS HEAD
(and compiling my own tclsh) only "wb" works.
Any suggestion ? Thanks !
Thomas
Thomas | 12/13/2010 10:58:49 AM

On 13 Dec, 11:58, Thomas MENEZ <menez.tho...@gmail.com> wrote:
> Hi all !
>
> I have written a tcl script that aims to write text files holding non
> ascii characters (higher 128 bytes). Shall I open the text file with
> the "w" option or with the "wb" option (binary) ?
> Using an ActiveState distro 8.5.6, both work fine, but using CVS HEAD
> (and compiling my own tclsh) only "wb" works.
It is *always* necessary to put the channel in -translation binary to
get sane behavior when writing binary, either with the 'b' or BINARY
access parameter to [open], or with explicit [fconfigure -translation
binary] (personal preference, because it works also for stdchannels,
[socket] and [chan pipe]). It might work without it in cases where the
system encoding is 8-bit *and* the default translation doesn't play
with CR/LFs, but that is at best an accident.
The interplay of platform defaults, buggy tolerance in older versions
of Tcl, and chance may give the illusion of "working". Don't trust
that :)
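
For illustration, here is a minimal sketch of the two routes to a binary channel mentioned above. The file name and the four-byte payload are made up for the example; only [open], [fconfigure], [puts] and [binary format] are assumed.

```tcl
# Hypothetical payload: four raw bytes, including a CR, an LF and a high byte.
set payload [binary format H* 0d0aff00]
set path [file join [pwd] binary-demo.bin]   ;# illustrative file name

# Variant 1: ask [open] for a binary channel via the "b" access flag (Tcl 8.5+).
set fd [open $path wb]
puts -nonewline $fd $payload
close $fd
set sizeWithFlag [file size $path]

# Variant 2: open normally, then switch the channel to binary explicitly.
# The same fconfigure call also works on sockets, pipes and stdout.
set fd [open $path w]
fconfigure $fd -translation binary
puts -nonewline $fd $payload
close $fd
set sizeWithFconfigure [file size $path]

puts "bytes written: $sizeWithFlag and $sizeWithFconfigure"
file delete $path
```

Both variants should leave exactly four bytes on disk, whatever the platform's default line-ending translation or system encoding.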
-Alex
Alexandre | 12/13/2010 11:32:35 AM

At Mon, 13 Dec 2010 02:58:49 -0800 (PST) Thomas MENEZ <menez.thomas@gmail.com> wrote:
>
> Hi all !
>
> I have written a tcl script that aims to write text files holding non
> ascii characters (higher 128 bytes). Shall I open the text file with
> the "w" option or with the "wb" option (binary) ?
> Using an ActiveState distro 8.5.6, both work fine, but using CVS HEAD
> (and compiling my own tclsh) only "wb" works.
>
> Any suggestion ? Thanks !
>
> Thomas
What operating system?
The 'b' suffix on open is only meaningful under MS-Windows and only
means that CRLF handling is turned off (without the 'b' suffix, '\n'
gets written as "\r\n" and "\r\n" gets read as "\n"). At least at the C
interface level. Under Tcl, there is no 'b' suffix, or at least it is not
documented in the 8.4 docs (which is what I have on my CentOS box). I thought
this sort of thing is handled by the fconfigure command under Tcl:
-translation mode
-translation {inMode outMode}
    In Tcl scripts the end of a line is always represented using a
    single newline character (\n). However, in actual files and
    devices the end of a line may be represented differently on
    different platforms, or even for different devices on the same
    platform. For example, under UNIX newlines are used in files,
    whereas carriage-return-linefeed sequences are normally used in
    network connections. On input (i.e., with gets and read) the
    Tcl I/O system automatically translates the external end-of-line
    representation into newline characters. Upon output (i.e., with
    puts), the I/O system translates newlines to the external
    end-of-line representation. The default translation mode, auto,
    handles all the common cases automatically, but the -translation
    option provides explicit control over the end of line
    translations.
(man fconfigure for the full details)
And it is my understanding that Tcl I/O is 8-bit clean otherwise.
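
To make the end-of-line translation described in the man page excerpt concrete, here is a small sketch. The file name is invented for the example; it forces a crlf output translation and then reads the raw bytes back untranslated.

```tcl
set path [file join [pwd] eol-demo.txt]   ;# illustrative file name

# Write two lines with an explicit CRLF output translation.
set fd [open $path w]
fconfigure $fd -translation crlf
puts $fd "one"
puts $fd "two"
close $fd

# Read the raw bytes back with no translation at all and dump them as hex.
set fd [open $path r]
fconfigure $fd -translation binary
set raw [read $fd]
close $fd
binary scan $raw H* hex
puts $hex   ;# 6f6e650d0a74776f0d0a
file delete $path
```

Each "\n" written by puts shows up as the byte pair 0d 0a in the file, which is exactly what -translation binary (or lf) would avoid.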
>
--
Robert Heller -- 978-544-6933 / heller@deepsoft.com
Deepwoods Software -- http://www.deepsoft.com/
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
Robert | 12/13/2010 11:42:43 AM

On Dec 13, 11:42 am, Robert Heller <hel...@deepsoft.com> wrote:
> The 'b' suffix on open is only meaningful under MS-Windows and only
> means that CFLF handling is turned off (without the 'b' suffix, '\n'
> gets written as "\r\n" and "\r\n" gets read as "\n").
No. In 8.5 it means that (the equivalent of) [fconfigure $chan
-translation binary] is applied. That is meaningful on all platforms,
though the effects are particularly noticeable on Windows.
Donal.
Donal | 12/13/2010 11:56:12 AM

On Dec 13, 3:58 am, Thomas MENEZ <menez.tho...@gmail.com> wrote:
> Hi all !
>
> I have written a tcl script that aims to write text files holding non
> ascii characters (higher 128 bytes). Shall I open the text file with
> the "w" option or with the "wb" option (binary) ?
> Using an ActiveState distro 8.5.6, both work fine, but using CVS HEAD
> (and compiling my own tclsh) only "wb" works.
>
> Any suggestion ? Thanks !
>
> Thomas
If these are really text files and not binary files, then just open
them with the w option and configure them to the proper encoding. For
example, to create a file in UTF-8 format:
set fhandle [open $filename w]
fconfigure $fhandle -encoding utf-8
puts $fhandle $unicodeText
close $fhandle
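
A quick way to check what actually lands on disk is to read the file back as raw bytes. This is only a sketch: the file name and the two-character sample text are made up, and -translation lf is added so that no line-ending bytes get in the way.

```tcl
set filename [file join [pwd] utf8-demo.txt]   ;# illustrative file name
set unicodeText "a\u00e4"                      ;# "a" followed by a-umlaut

set fhandle [open $filename w]
fconfigure $fhandle -encoding utf-8 -translation lf
puts -nonewline $fhandle $unicodeText
close $fhandle

# Inspect the bytes on disk: a-umlaut should occupy two bytes in UTF-8.
set fd [open $filename r]
fconfigure $fd -translation binary
binary scan [read $fd] H* hex
close $fd
puts $hex   ;# 61c3a4
file delete $filename
```

The "a" is stored as the single byte 61 and the a-umlaut as the pair c3 a4, so the file is genuine UTF-8.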
Aric | 12/13/2010 6:34:02 PM

Sorry Alex,
but if internally held UTF-8 text is written in binary mode to a file,
shouldn't the result be a UTF-8 encoded file?
Since strings are held internally in UTF-8 encoding, I would expect
that storing a text with binary translation or identity encoding
produces a file containing UTF-8 encoded text.
So I experimented a bit:
# Write a test file containing 8-bit (non-ASCII) characters, which Tcl
# holds internally as UTF-8 encoded characters.
#
set file [file join $env(TEMP) utf8Test.txt]
set fd [open $file wb]
puts $fd "implicit binary (open $file wb):"
puts $fd "(-translation [fconfigure $fd -translation] -encoding [fconfigure $fd -encoding])"
puts $fd ""
puts $fd "\taouäöü"
puts $fd ""
fconfigure $fd -translation binary
puts $fd "explicit binary (fconfigure -translation binary):"
puts $fd "(-translation [fconfigure $fd -translation] -encoding [fconfigure $fd -encoding])"
puts $fd ""
puts $fd "\taouäöü"
puts $fd ""
puts $fd "tests with setting the encoding via fconfigure"
# Switch the encoding on the same channel and write the same text each time.
foreach encoding {ascii cp1252 identity utf-8 unicode} {
    puts $fd "\t$encoding:"
    fconfigure $fd -encoding $encoding
    puts $fd "\t(-translation [fconfigure $fd -translation] -encoding [fconfigure $fd -encoding])"
    puts $fd "\t\taouäöü"
    puts $fd ""
}
close $fd
The output was quite surprising to me:
implicit binary (open C:/Users/MARTIN~1/AppData/Local/Temp/utf8Test.txt wb):
(-translation lf -encoding binary)
    aouäöü
explicit binary (fconfigure -translation binary):
(-translation lf -encoding binary)
    aouäöü
tests with setting the encoding via fconfigure
    ascii:
    (-translation lf -encoding ascii)
        aou???
    cp1252:
    (-translation lf -encoding cp1252)
        aouäöü
    identity:
    (-translation lf -encoding identity)
        aouÃ¤Ã¶Ã¼
    utf-8:
    (-translation lf -encoding utf-8)
        aouÃ¤Ã¶Ã¼
    unicode:
    ...
So writing an internally UTF-8 encoded string to a file configured for
(implicit or explicit) binary "translation" uses an encoding that is
not listed ...
% expr {"binary" in [encoding names]}
0
... and the UTF-8 encoded string is converted to whatever single-byte
encoding that is (ascii, cp1252, iso8859-1, etc.), so the file does not
contain multi-byte characters (ä, ö and ü are each two bytes long in UTF-8).
Only when the encoding is set directly to "identity" rather than "binary"
is real UTF-8 written from the internally held UTF-8 text.
Is this behavior really documented, wanted, correct?
I've tested this with 8.6b1.1 and 8.5.7.
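
The byte counts can also be checked without touching a file. The sketch below is an illustration, not part of the original experiment: it uses [encoding convertto] with the two encodings that are built into Tcl (cp1252 is left out because it needs the encoding files to be installed).

```tcl
set text "\u00e4\u00f6\u00fc"   ;# the three umlauts from the test above

foreach enc {iso8859-1 utf-8} {
    # [encoding convertto] returns the bytes a channel with that
    # -encoding would write for this string.
    binary scan [encoding convertto $enc $text] H* hex
    puts [format "%-10s %s" $enc $hex]
    set bytes($enc) $hex
}
```

This should print e4f6fc for iso8859-1 (one byte per character) and c3a4c3b6c3bc for utf-8 (two bytes per character).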
Best regards,
Martin
MartinLemburg | 12/14/2010 9:57:22 AM

On 14 Dec, 10:57, "MartinLemburg@Siemens-PLM"
<martin.lemburg.siemens-...@gmx.net> wrote:
> Sorry Alex,
>
> but writing internal hold UTF-8 text binary to a file ... shouldn't be
> that a UTF-8 encoded file?
Don't equate Tcl's internal pseudo-utf8 string representations with
the external utf8 encoding. They are different, for example 00 vs
C080, and above all they don't live in the same universe:
- universe 1 is string reps in pseudo-utf8
- universe 2 is the Byte Array object
- universe 3 is the String object, in unicode and/or utf-8
When you read from a [-encoding binary], the result starts its life as
a pure Byte Array. Then if its later fate induces string rep
computation, it is done by using iso-8859-1 as the mapping between the
byte array's single bytes and the pseudo-utf8 string rep, with byte 0
mapped to (normally invalid utf8) sequence C080.
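
The byte-to-character mapping described here can be observed from script level. A minimal sketch, using a made-up one-byte value:

```tcl
# A one-byte byte array holding 0xE4.
set bytes [binary format H* e4]

# Treated as a string, the byte becomes the character with the same
# code point, which is how iso8859-1 maps bytes to characters.
set sameAsLatin1 [expr {$bytes eq "\u00e4"}]
set sameAsDecoded [expr {[encoding convertfrom iso8859-1 $bytes] eq "\u00e4"}]
puts "$sameAsLatin1 $sameAsDecoded"   ;# 1 1
```

Both comparisons succeed: the byte array's string form and an explicit iso8859-1 decode yield the same character, a-umlaut (U+00E4).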
When you read from a [-encoding SOMEENCODING], the result is a String
object (with no string rep initially), containing a Unicode
representation of the original characters. This way the String object
may serve as an interlingua between various encodings, with no
shimmering, like when you [puts] the value to another [-encoding
SOMETHINGELSE] channel.
In all cases of shimmering, you can reason as though everything went
through universe 1 (string rep).
> So writing an internally UTF-8 encoded string to a file configured for
> (implicit and explicit) binary "translation" uses an encoding, that is
> not present ...
Yes, see above, it is iso-8859-1. It has to be constant for Byte
Array sanity (not system dependent), and 8-bit. ASCII was thus not an
option; iso-latin1 doesn't look like a bad idea in hindsight ;-)
> Setting the encoding directly to "identity" the "binary" encoding,
> than real UTF-8 is written from hold UTF-8 encoded text.
"identity" is an internal encoding to play tricks. Don't use it :D
> Is this behavior really documented, wanted, correct?
Everything is wanted and correct; only identity is not meant for
script-level use.
-Alex
Alexandre | 12/14/2010 10:58:55 AM

Thank you for all your answers.
My OS is Win XP.
I have found something interesting:
1- open a file in w mode (not wb);
2- call fconfigure $myFileHandler -encoding
Result with ActiveState 8.5.7: cp1252
Result with a custom-built tclsh: iso8859-1
Does it mean I have to use --with-encoding=cp1252 to have it by
default within my custom-built tclsh? Why isn't it set by default?
Thanks !
Thomas | 12/17/2010 5:05:26 PM

Well, after fixing two things, it now seems to be OK:
1- I got the system encoding right by exporting the TK_LIBRARY and
TCL_LIBRARY env vars before initializing Tcl;
2- I used the Tcl_ExternalToUtf(...) function to convert my non-ASCII
data from source files before sending the script lines to the
interpreter. Not doing this resulted in very strange behaviour, but I
finally figured it out myself.
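
For anyone hitting the same symptom, a few script-level checks may help tell whether the interpreter found its library and encoding files. This is only a sketch. It assumes, as point 1 suggests, that a missing library path is what made the custom build fall back to iso8859-1.

```tcl
# Which encoding does Tcl think the platform uses?
set sysEnc [encoding system]
puts "system encoding : $sysEnc"

# Where is the script library? Encoding definition files normally live in
# an "encoding" subdirectory below it.
puts "library         : [info library]"

# Is the encoding expected on a Western European Windows box available?
set hasCp1252 [expr {"cp1252" in [encoding names]}]
puts "cp1252 available: $hasCp1252"
```

If the last line reports 0 on a build that should have cp1252, the encoding files are probably not being found.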
Thomas
menez.thomas | 1/4/2011 10:54:29 PM