How can I use Tcl to read files written in GBK or GB18030 encoding?

I know that Tcl has quite a large list of encodings that it supports.
However, I've a request for guidance by someone who needs to read
files using either GBK or GB18030 (I think these are alternate names
for the same encoding...).

Has anyone worked out what one needs to do for this?
lvirden (1938)
1/31/2008 3:26:43 PM
comp.lang.tcl

On 31 Jan., 16:26, "Larry W. Virden" <lvir...@gmail.com> wrote:
> I know that Tcl has quite a large list of encodings that it supports.
> However, I've a request for guidance by someone who needs to read
> files using either GBK or GB18030 (I think these are alternate names
> for the same encoding...).
>
> Has anyone worked out what one needs to do for this?

From Wikipedia I see that GB18030 has a structure vaguely similar to
UTF-8, but more complicated:

1st byte 00..7F: ASCII, 1 byte
1st byte 81..FE: 2nd byte 40..7E or 80..FE: GBK Chinese character, 2 bytes
                 2nd byte 30..39: Extended character, 4 bytes

Extended characters have the byte ranges 81..FE 30..39 81..FE 30..39
and represent all Unicode code points that aren't in ASCII or the
two-byte set. For code points within the BMP, the relation between
Unicode and Extended cannot be computed, but must come from a lookup
table; code points beyond the BMP map linearly.
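
The 1/2/4 byte decision above can be sketched in a few lines. This is only an illustration, written in Python rather than Tcl so that it is self-contained; the function name split_gb18030 is made up for this example, and it only splits the byte stream into per-character sequences without doing any table lookup. It assumes 7F is not a valid second byte, as in the standard.

```python
def split_gb18030(data):
    """Split GB18030 bytes into one bytes object per encoded character."""
    pieces = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            size = 1                      # ASCII, 1 byte
        elif 0x81 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                size = 4                  # extended character, 4 bytes
            elif 0x40 <= second <= 0xFE and second != 0x7F:
                size = 2                  # two-byte character
            else:
                raise ValueError("bad second byte at offset %d" % (i + 1))
        else:
            raise ValueError("bad or truncated sequence at offset %d" % i)
        chunk = data[i:i + size]
        if len(chunk) < size:
            raise ValueError("truncated sequence at offset %d" % i)
        # Bytes 3 and 4 of an extended character repeat the 81..FE 30..39 pattern.
        if size == 4 and not (0x81 <= chunk[2] <= 0xFE and 0x30 <= chunk[3] <= 0x39):
            raise ValueError("bad four-byte sequence at offset %d" % i)
        pieces.append(chunk)
        i += size
    return pieces
```

Each piece could then be looked up in a 2-byte or 4-byte table as appropriate.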

Apart from the four-byte codes, the difference between GB18030 and GBK
is trifling and concerns only the Euro sign: 0x80 in Microsoft's later
versions of GBK and a two-byte code of A2 E3 in GB18030.

http://en.wikipedia.org/wiki/GB_18030 has a link to the "authoritative
mapping table".

Whether Tcl's encoding mechanism can deal with this 1/2/4 byte pattern
directly (so that only an .enc file would have to be produced), I
can't tell. As a last resort one might always implement the decision
mechanism sketched above, and use a 2-byte and a 4-byte lookup table.
1/31/2008 3:50:48 PM
Larry W. Virden wrote:
> I know that Tcl has quite a large list of encodings that it supports.
> However, I've a request for guidance by someone who needs to read
> files using either GBK or GB18030 (I think these are alternate names
> for the same encoding...).
>
> Has anyone worked out what one needs to do for this?
According to Wikipedia ( http://en.wikipedia.org/wiki/GB_18030 ) it's
an encoding that mandates support for non-BMP characters. As Tcl currently
only supports the BMP, you're out of luck if you need full compliance.
But you might be able to get support for the BMP part of the encoding.

To do that, you could come up with a limited mapping file to feed to the
Tcl encoding system (like those in the Tcl source dirs; there is a tool
in the tools/ subdir to convert the unicode.org txt files to Tcl enc
files). After that you need to put the .enc file in the right place
(or register it later, as the various texts for starkits describe).

Michael
schlenk (1615)
1/31/2008 3:51:57 PM
Larry W. Virden wrote:
> I know that Tcl has quite a large list of encodings that it supports.
> However, I've a request for guidance by someone who needs to read
> files using either GBK or GB18030 (I think these are alternate names
> for the same encoding...).
> 
> Has anyone worked out what one needs to do for this?

If there is documentation for the format of the encoding as a map to
Unicode characters, you can do it.  The only thing you need to build is
the map "compiler", whose source is located at
/tools/encoding/txt2enc.c

The readme in that directory explains it, and it seems incredibly
simple to do:

"
On Unix, use "make" to compile all the encoding files (*.txt,*.esc)
into the format that Tcl can use (*.enc).  It is the caller's
responsibility to move the generated .enc files to the appropriate
place (the $TCL_LIBRARY/encoding directory).
"

I can't describe the format of the mapping infile, but it should come
straight from the Unicode standard.

It looks to be a simple matter of:
1) get or create the mapping infile
2) build txt2enc.c
3) compile the mapping infile into a .enc outfile
4) move the .enc to where $TCL_LIBRARY/encoding resides
5) start tclsh, call 'encoding names', and observe your new addition
6) ???
7) make lots of money!

I wouldn't be surprised if the mapping infile exists already on
www.unicode.org somewhere for you to download...

ftp://ftp.unicode.org/Public/MAPPINGS/
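
For what it's worth, the mapping files in that directory are generally plain text with one "0xBYTES 0xUNICODE # comment" entry per line. Below is a rough sketch of reading such a table, in Python purely for illustration (txt2enc.c is the real tool for Tcl). The function name parse_mapping and the sample lines are assumptions for this example, not taken from any particular file.

```python
def parse_mapping(text):
    """Parse unicode.org-style mapping text into {encoded value: code point}."""
    table = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # entry with no Unicode mapping, e.g. an undefined byte
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Hypothetical sample in the assumed two-column format.
SAMPLE = """\
# encoded  unicode  name
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xD6D0\t0x4E2D\t# CJK UNIFIED IDEOGRAPH
0x80\t\t#UNDEFINED
"""

if __name__ == "__main__":
    for encoded, code_point in sorted(parse_mapping(SAMPLE).items()):
        print("%04X -> U+%04X" % (encoded, code_point))
```

A table like this covers the 1-byte and 2-byte cases; whether the same .enc format can express the 4-byte codes is the open question raised earlier in the thread.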
davygrvy (1376)
1/31/2008 4:02:50 PM
iconv handles these two encodings, so you might be able just to run
iconv as a child process. Failing that, the iconv source may be
helpful.
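
A rough sketch of the child-process idea, shown in Python only to keep it self-contained (from Tcl one would presumably use exec or open with a pipeline instead). It assumes an iconv executable on the PATH that accepts the usual -f and -t options; the helper names iconv_command and read_via_iconv are made up for this example.

```python
import subprocess


def iconv_command(path, source_encoding="GB18030"):
    """Build the argument list for converting a file to UTF-8 with iconv."""
    return ["iconv", "-f", source_encoding, "-t", "UTF-8", path]


def read_via_iconv(path, source_encoding="GB18030"):
    """Run iconv as a child process and return the converted text."""
    result = subprocess.run(
        iconv_command(path, source_encoding),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if iconv reports a failure
    )
    return result.stdout.decode("utf-8")
```

The text returned this way may still contain code points outside the BMP, so it does not by itself settle what Tcl can do with them.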
billposer (379)
2/1/2008 6:01:43 AM
billposer@alum.mit.edu wrote:
> iconv handles these two encodings, so you might be able just to run
> iconv as a child process. Failing that, the iconv source may be
> helpful.

How is Tcl to handle the code-points in the 4-byte region, though?  I
didn't know it went outside UCS-2.

I'll be watching this thread as I'd like to know how this resolves.


davygrvy (1376)
2/2/2008 9:00:26 PM
On Feb 2, 1:00 pm, David Gravereaux <davyg...@pobox.com> wrote:
> billpo...@alum.mit.edu wrote:
> > iconv handles these two encodings, so you might be able just to run
> > iconv as a child process. Failing that, the iconv source may be
> > helpful.
>
> How is Tcl to handle the code-points in the 4-byte region, though?  I
> didn't know it went outside UCS-2.


Tcl as it is usually compiled can't handle the codepoints beyond the
BMP, so the output of iconv would have to be filtered. An alternative
is to compile Tcl to handle full Unicode. If the application is plain
Tcl, that is, does not use Tk, that should work.
billposer (379)
2/3/2008 5:25:19 AM
billposer@alum.mit.edu wrote:
> On Feb 2, 1:00 pm, David Gravereaux <davyg...@pobox.com> wrote:
>> billpo...@alum.mit.edu wrote:
>>> iconv handles these two encodings, so you might be able just to run
>>> iconv as a child process. Failing that, the iconv source may be
>>> helpful.
>> How is Tcl to handle the code-points in the 4-byte region, though?  I
>> didn't know it went outside UCS-2.
>
>
> Tcl as it is usually compiled can't handle the codepoints beyond the
> BMP, so the output of iconv would have to be filtered. An alternative
> is to compile Tcl to handle full Unicode. If the application is plain
> Tcl, that is, does not use Tk, that should work.


Is it just that easy to add -DTCL_UTF_MAX=6 when you compile the core to
get it to work for UCS-4?

I look at line 2122 in tcl.h and I see: "UCS-4 is experimental and not
recommended." and makes me question how easy it really is to get
surrogate support.

We should wait for Jeff or Kevin to chime in here.


davygrvy (1376)
2/3/2008 7:38:52 AM
On Feb 2, 11:38 pm, David Gravereaux <davyg...@pobox.com> wrote:

> Is it just that easy to add -DTCL_UTF_MAX=6 when you compile the core to
> get it to work for UCS-4?
>
> I look at line 2122 in tcl.h and I see: "UCS-4 is experimental and not
> recommended." and makes me question how easy it really is to get
> surrogate support.

For details, I agree that it would be a good idea to hear from one of
the core team. I did try compiling Tcl for full Unicode a couple years
ago, and it passed my basic tests. I didn't pursue this, though,
because it didn't do me much good without Tk also supporting full
Unicode.
billposer (379)
2/3/2008 6:02:03 PM
I should add that I have real experience with the other approach I
mentioned, namely filtering. My uninum library
(http://wiki.tcl.tk/17692), which converts between numerals in most
known writing systems
and integers, generates output that may include characters outside the
BMP when going from integers to strings. Most of the numeral systems
are entirely within the BMP, but a few are outside. The Tcl interface
code checks whether each character in the output string is within the
BMP. If it isn't, it passes to Tcl a string consisting of just a
Unicode Replacement Character and sets an error flag. (The Tcl program
should be checking this flag anyhow since there are various other
conditions that could result in a failed conversion, e.g. a request
for output in an unsupported numeral system or input of an integer
outside the range supported by the numeral system.)
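
The check described above amounts to something like the following. This is not the uninum interface code, just a minimal sketch of the same idea in Python; the name bmp_or_replacement is hypothetical.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # Unicode Replacement Character


def bmp_or_replacement(text):
    """Return (string, error_flag): the text itself if every character is
    within the BMP, otherwise a lone replacement character and True."""
    if any(ord(ch) > 0xFFFF for ch in text):
        return REPLACEMENT_CHARACTER, True
    return text, False
```

The caller is expected to look at the flag rather than inspect the returned string, since a replacement character could in principle also appear in legitimate input.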
billposer (379)
2/3/2008 6:10:46 PM
David Gravereaux wrote:
> I look at line 2122 in tcl.h and I see: "UCS-4 is experimental and not 
> recommended." and makes me question how easy it really is to get 
> surrogate support.
> 
> We should wait for Jeff or Kevin to chime in here.
> 

There are a fair number of places in the code where there is an
incorrect assumption that Tcl_UniChar and the native wchar_t are
one and the same, or that UCS-2 and UTF-16 are one and the same.
The UCS-4 definitions break the code near these
interfaces.  I'm afraid that tracking all of them down and
replacing them may well prove to be a bit of a slog.
I seem to recall that the regexp engine also stumbles when
asked to deal with non-UCS-2 characters.

Given the amount of grunt work that would be needed to audit
the code, my personal belief is that the most effective way
forward would be to replace the String internal representation
with an index structure for locating the byte offset for
a character count. Even the regex engine could be adapted
to that sort of structure (although the implementation of
'.' and '[^...]' gets a trifle ugly, it's doable).  But
I've surely not had the time to take it on.
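
To make the index-structure idea a little more concrete, here is one possible reading of it, sketched in Python: keep the string as UTF-8 bytes and record the byte offset of every Nth character, so that finding character i means jumping to the nearest checkpoint and scanning forward a few characters. The class name, the stride parameter, and the details are all assumptions for illustration and not a description of anything in the Tcl core.

```python
class IndexedUtf8:
    """UTF-8 bytes plus sparse checkpoints mapping character counts to byte offsets."""

    def __init__(self, text, stride=4):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.checkpoints = []  # byte offsets of characters 0, stride, 2*stride, ...
        count = 0
        for offset, byte in enumerate(self.data):
            if (byte & 0xC0) != 0x80:  # not a continuation byte: a character starts here
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count

    def byte_offset(self, index):
        """Return the byte offset at which character number index starts."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        offset = self.checkpoints[index // self.stride]
        for _ in range(index % self.stride):
            offset += 1  # step past the lead byte, then past its continuation bytes
            while (self.data[offset] & 0xC0) == 0x80:
                offset += 1
        return offset

    def char_at(self, index):
        """Return character number index as a string."""
        start = self.byte_offset(index)
        end = start + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")
```

A smaller stride trades memory for shorter scans; characters beyond the BMP need no special handling because nothing assumes a fixed width per character.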

There's still a lot of other i18n work to be done.  I'd see
an extension beyond the BMP as only one of several big
projects; another is bidi and combining forms for Tk.
(I'd really, really like to see Tk able to display Hebrew,
Aramaic and Syriac with vowel points...)  I suspect that
we could borrow substantially from ICU4C as we attack
this one.

-- 
73 de ke9tv/2, Kevin
kennykb (564)
2/3/2008 7:20:42 PM
On Feb 3, 12:25 am, billpo...@alum.mit.edu wrote:
> Tcl as it is usually compiled can't handle the codepoints beyond the
> BMP, so the output of iconv would have to be filtered. An alternative
> is to compile Tcl to handle full Unicode. If the application is plain
> Tcl, that is, does not use Tk, that should work.

Interesting note. After passing along the early part of this thread,
the developer asked me why, after converting the text via iconv,
Tcl/Tk 8.4.7 was core dumping with the converted text! I suspect that
this "filtering" is probably something that is needed.

When he moved to using Tcl/Tk 8.5, he no longer was getting core
dumps, but he WAS getting strange characters displayed (they were
boxes of "dots" in various formations...).

lvirden (1938)
2/4/2008 12:42:20 PM
The "filtering" is really very simple. If you are using iconv via the
underlying library, get the result as an array of 32 bit integers and
check to see if any exceed 0xFFFF. If none do, convert down to 16
bits.  If there are any codepoints outside the BMP, do whatever is
appropriate for an error. If calling iconv as a child process,
basically the same thing can be done, though the data conversion would
actually be a little more complicated.
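
As a sketch of that check, in Python for illustration only (a C version against the iconv library would follow the same shape): assume the conversion produced little-endian UTF-32 with no byte order mark, unpack it into 32-bit integers, reject anything above 0xFFFF, and repack as 16-bit units. The function name narrow_utf32_to_ucs2 is made up for this example.

```python
import struct


def narrow_utf32_to_ucs2(utf32le):
    """Convert little-endian UTF-32 bytes to 16-bit units, or raise ValueError
    if the data is malformed or contains a code point outside the BMP."""
    if len(utf32le) % 4:
        raise ValueError("length is not a multiple of 4 bytes")
    count = len(utf32le) // 4
    code_points = struct.unpack("<%dI" % count, utf32le)
    for position, code_point in enumerate(code_points):
        if code_point > 0xFFFF:
            raise ValueError(
                "code point U+%X at index %d is outside the BMP" % (code_point, position)
            )
    return struct.pack("<%dH" % count, *code_points)
```

Raising an exception is just one choice of "whatever is appropriate for an error"; substituting a replacement character would be another.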
billposer (379)
2/4/2008 6:16:30 PM