Hi,
GAWK's printf has a %c format:
> This prints a number as an ASCII character;
> thus, `printf "%c", 65' outputs the letter `A'.
I am looking for a somewhat similar format that would print a Unicode
character based on a hex value. Any idea?
Thanks in advance, Hermann
[peifer, 12/16/2007 3:05:18 PM]

Hello Hermann,
> GAWK's printf has a %c format:
>> This prints a number as an ASCII character;
>> thus, `printf "%c", 65' outputs the letter `A'.
>
> I am looking for a somewhat similar format that would print a Unicode
> character based on a hex value. Any idea?
1. Notice that there is no bijective mapping (one-to-one)
between the symbols and the byte sequence. It all
depends on the character encoding that is used.
For example, the euro-currency sign ₠ (U+20A0; the euro sign €
itself is U+20AC) has the following properties:
http://www.decodeunicode.org/de/u+20a0/properties
DECIMAL VALUE:    8352
UTF-8 HEX VALUE:  0xE282A0
UTF-16 HEX VALUE: 0x20A0
UTF-32 HEX VALUE: 0x000020A0
2. GAWK's %c is specified to work on ASCII characters.
I don't see any way of handling Unicode conversion in GAWK.
3. Have a look at the POSIX tool iconv.
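The encoding point can be checked from the shell with iconv. This is only a sketch: it assumes an iconv that accepts the encoding names UTF-16BE and UTF-8 (glibc and libiconv do), and the variable name utf8_hex is made up for the illustration.

```shell
# U+20A0 as its two UTF-16BE bytes (0x20 0xA0). Octal escapes are used
# because POSIX printf does not define a \x escape.
utf8_hex=$(printf '\040\240' | iconv -f UTF-16BE -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"    # prints e282a0
```

The same code point is two bytes in UTF-16 and three bytes in UTF-8, which is why a single %c-style number cannot describe "the" byte sequence.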
[UTF, 12/16/2007 4:19:00 PM]

On Sun, 16 Dec 2007 07:05:18 -0800 (PST), Hermann Peifer
<peifer@gmx.net> wrote:
>Hi,
>
>GAWK's printf has a %c format:
>> This prints a number as an ASCII character;
>> thus, `printf "%c", 65' outputs the letter `A'.
>
>I am looking for a somewhat similar format that would print a Unicode
>character based on a hex value. Any idea?
>
>Thanks in advance, Hermann
Primitive but it works.
I'm sure someone will try to optimize this.
Advantage: the algorithm is clear.
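BEGIN {
    hexchar = "0123456789ABCDEF"
    # 4-bit binary patterns for 0..F, concatenated (must stay on one line)
    binchar = "0000000100100011010001010110011110001001101010111100110111101111"
    # bin2ord: 4-bit pattern -> value 0..15
    i = 0
    for (g = 1; g < length(binchar); g += 4) {
        bin2ord[substr(binchar, g, 4)] = i
        i++
    }
    # ord2chr: byte value -> character; chr2hex: character -> 2 hex digits.
    # Note: in a UTF-8 locale, recent gawk versions encode %c values above
    # 127 as multibyte characters; run with LC_ALL=C (or gawk -b) so that
    # every table entry is a single byte.
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            entry = (16 * i) + j
            ord2chr[entry] = sprintf("%c", entry)
            chr2hex[sprintf("%c", entry)] = \
                substr(hexchar, i + 1, 1) substr(hexchar, j + 1, 1)
        }
    }
    printf("\xEF\xBB\xBF")  # BOM
}
#
# x = string of 4 hex digits
#     ISO character code HHHH
#     (usually defined as \uHHHH or 0xHHHH)
# returns UTF-8
#
function hex2ut8(x,    b, q, r, t, u) {
    x = toupper(x)  # hexchar holds upper-case digits only
    b = ""
    for (q = 1; q <= length(x); q++) {
        b = b substr(binchar, 1 + 4 * (index(hexchar, substr(x, q, 1)) - 1), 4)
    }
    # calculate nr of significant bits: 0 when no bit is set,
    # 17 (out of range) when x is not exactly 4 hex digits
    q = index(b, "1")
    q = (length(b) != 16) ? 17 : (q ? 17 - q : 0)
    if (q <= 7) {
        t = substr(b, 9, 8)
    } else if (q <= 11) {
        t = "110" substr(b, 6, 5) "10" substr(b, 11, 6)
    } else if (q <= 16) {
        t = "1110" substr(b, 1, 4) "10" substr(b, 5, 6) \
            "10" substr(b, 11, 6)
    } else {
        # awk has no abort(); report the problem and return nothing.
        print "UTF-8 support is limited to 16 bits." > "/dev/stderr"
        return ""
        # could be extended quite easily.
    }
    r = ""
    for (u = 1; u < length(t); u += 8) {
        r = r ord2chr[(16 * bin2ord[substr(t, u, 4)]) \
            + bin2ord[substr(t, (u + 4), 4)]]
    }
    return r
}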
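For comparison, the bit-string tables can be replaced by plain arithmetic on the code point. The following is a sketch, not the poster's method: the shell function name hex2utf8 is hypothetical, the 16-bit limit is kept from the function above, and LC_ALL=C is assumed to be enough to make %c emit single bytes in the awk at hand.

```shell
# hex2utf8 HHHH: hypothetical helper; writes the UTF-8 bytes for a code
# point given as hex digits (U+0000..U+FFFF only).
hex2utf8() {
    # LC_ALL=C keeps %c byte-oriented even in locale-aware awks.
    LC_ALL=C awk -v hex="$1" 'BEGIN {
        hex = toupper(hex)
        n = 0
        for (i = 1; i <= length(hex); i++) {
            d = index("0123456789ABCDEF", substr(hex, i, 1))
            if (d == 0) exit 1          # not a hex digit
            n = n * 16 + d - 1
        }
        if (n > 65535) exit 1           # same 16-bit limit as above
        if (n < 128)                    # 0xxxxxxx
            printf "%c", n
        else if (n < 2048)              # 110xxxxx 10xxxxxx
            printf "%c%c", 192 + int(n / 64), 128 + n % 64
        else                            # 1110xxxx 10xxxxxx 10xxxxxx
            printf "%c%c%c", 224 + int(n / 4096), 128 + int(n / 64) % 64, 128 + n % 64
    }'
}

hex2utf8 20AC | od -An -tx1    # euro sign: bytes e2 82 ac
```

The hex digits are parsed with index() rather than gawk's strtonum(), so the sketch does not depend on gawk extensions. Surrogate code points (D800..DFFF) are not rejected here.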
References:
http://en.wikipedia.org/wiki/UTF-8
http://www.unicode.org/faq/utf_bom.html
--
( Kees
)
c[_] The desire to become a politician should bar you
for life from ever becoming one. (Billy Connolly) (#232)
[Kees, 12/16/2007 10:38:01 PM]

On Dec 16, 11:38 pm, Kees Nuyt <k.n...@nospam.demon.nl> wrote:
> Primitive but it works.
> [quoted code snipped]

Thanks for this solution.
To be honest, I was hoping that something simpler would be possible,
perhaps along the lines of how Unicode characters can be printed
with GNU printf, say for the euro currency sign, as mentioned by
Jürgen:
$ /usr/bin/printf "\u20AC\n"
EURO
But obviously, this is not possible with GAWK printf.
Hermann
[Hermann, 12/17/2007 1:30:00 PM]

Hermann Peifer wrote:
>
> $ /usr/bin/printf "\u20AC\n"
> EURO
>
Posting via Google Groups is perhaps not the best idea. This should of
course read:
$ /usr/bin/printf "\u20AC\n"
€
Hermann
[Hermann, 12/17/2007 6:22:46 PM]