UTF-8 string length() in gawk
How do we correctly handle UTF-8 string length() in gawk if characters are
used that require more than one byte in UTF-8 encoding?
BEGIN {print length("Südwestwind")} ## length=12 - ???
BEGIN {print length("Sudwestwind")} ## length=11
I'd expect a length of 11 in both cases. The gawk length() function returns
the number of bytes. Any ideas how to return the number of characters instead?
BTW, there's a similar issue with split() and probably with other string
functions as well.
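
To make the byte/character difference concrete, here is a minimal shell sketch that uses no awk at all. It assumes the string is UTF-8 encoded and writes the "ü" as its two bytes in octal (\303\274) so the result does not depend on the terminal. The variable names are only illustrative.

```shell
# "Südwestwind" with the "ü" (U+00FC) spelled as its two UTF-8 bytes.
s=$(printf 'S\303\274dwestwind')

# wc -c counts bytes: the two-byte "ü" makes this one more than the 11 letters.
nbytes=$(printf '%s' "$s" | wc -c | tr -d ' ')

# Dropping UTF-8 continuation bytes (10xxxxxx, octal 200-277) leaves exactly
# one byte per character, so the remaining byte count is the character count.
nchars=$(printf '%s' "$s" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' ')

echo "bytes=$nbytes chars=$nchars"   # prints: bytes=12 chars=11
```

The continuation-byte trick only holds for valid UTF-8 input; it is a workaround for counting, not a replacement for locale-aware string functions.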
Janis
Janis
3/29/2010 3:51:00 AM

Janis Papanagnou wrote:
> How do we correctly handle UTF-8 string length() in gawk if characters are
> used that require more than one byte in UTF-8 encoding?
>
> BEGIN {print length("Südwestwind")} ## length=12 - ???
> BEGIN {print length("Sudwestwind")} ## length=11
>
> I'd expect a length of 11 in both cases. The gawk length() function
> returns the number of bytes. Any ideas how to return the number of characters
> instead?
>
> BTW, there's a similar issue with split() and probably with other string
> functions as well.
My understanding is that if you set your locale to a UTF-8 locale, then gawk
does the right thing:
$ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
12
$ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
11
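
The same comparison can be scripted. This is only a sketch under a few assumptions: that a locale named C.UTF-8 is installed (`locale -a` lists what is available, and any UTF-8 locale can be substituted), and that gawk is on the PATH. The fallback to plain `awk` and the variable names are my own additions; a byte-oriented awk will report 12 in both locales.

```shell
# Prefer gawk; fall back to the system awk, which may count bytes in every locale.
AWK=$(command -v gawk || command -v awk)

# \303\274 is the UTF-8 byte pair for "ü", written in octal to keep this ASCII-only.
prog='BEGIN { print length("S\303\274dwestwind") }'

bytes=$(LC_ALL=C "$AWK" "$prog")        # C locale: one byte is one character
chars=$(LC_ALL=C.UTF-8 "$AWK" "$prog")  # UTF-8 locale (assumed to be installed)

echo "C: $bytes  UTF-8: $chars"
```

With gawk and the locale present, this should reproduce the 12 and 11 shown in the transcript. If the second number is also 12, the locale name is probably not installed or the awk in use is not multibyte-aware.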
pk
3/29/2010 9:10:04 AM

pk wrote:
> Janis Papanagnou wrote:
>
>> How do we correctly handle UTF-8 string length() in gawk if characters are
>> used that require more than one byte in UTF-8 encoding?
>>
>> BEGIN {print length("Südwestwind")} ## length=12 - ???
>> BEGIN {print length("Sudwestwind")} ## length=11
>>
>> I'd expect a length of 11 in both cases. The gawk length() function
>> returns the number of bytes. Any ideas how to return the number of characters
>> instead?
>>
>> BTW, there's a similar issue with split() and probably with other string
>> functions as well.
>
> My understanding is that if you set your locale to a UTF-8 locale, then gawk
> does the right thing:
>
> $ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
> 12
> $ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
> 11
>
And I thought I had tried that without success.
Hmm.. - it works. I must have done something wrong tonight.
Thanks!
Janis
Janis
3/29/2010 9:58:42 AM

Janis Papanagnou wrote:
> pk wrote:
>> Janis Papanagnou wrote:
>>
>>> How do we correctly handle UTF-8 string length() in gawk if characters are
>>> used that require more than one byte in UTF-8 encoding?
>>>
>>> BEGIN {print length("Südwestwind")} ## length=12 - ???
>>> BEGIN {print length("Sudwestwind")} ## length=11
>>>
>>> I'd expect a length of 11 in both cases. The gawk length() function
>>> returns the number of bytes. Any ideas how to return the number of characters
>>> instead?
>>>
>>> BTW, there's a similar issue with split() and probably with other string
>>> functions as well.
>> My understanding is that if you set your locale to a UTF-8 locale, then gawk
>> does the right thing:
>>
>> $ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
>> 12
>> $ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
>> 11
>>
>
> And I thought I had tried that without success.
> Hmm.. - it works. I must have done something wrong tonight.
>
From my experience, I can confirm that gawk's string functions and FIELDWIDTHS work as expected, as long as your locale and the data encoding are in sync.
And also, gawk's printf counts characters, rather than bytes, e.g.:
$ LC_ALL=en_US.UTF-8 gawk 'BEGIN {printf "|%-12s|\n", "Südwestwind"}'
|Südwestwind |
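
A quick way to check several string functions at once is sketched below. It assumes a C.UTF-8 locale is installed (substitute any UTF-8 locale you have) and falls back to plain `awk` if gawk is missing; both choices, and the variable names, are my own. Splitting on an empty separator to get single characters is an extension, not POSIX behaviour.

```shell
# Prefer gawk; a byte-oriented awk will give different numbers (see below).
AWK=$(command -v gawk || command -v awk)

out=$(LC_ALL=C.UTF-8 "$AWK" 'BEGIN {
    s = "S\303\274dwestwind"          # "Südwestwind", ü as UTF-8 octal escapes
    n = split(s, ch, "")              # empty separator: one array element per character
    printf "chars=%d second=%s\n", n, ch[2]
    printf "substr=%s index=%d\n", substr(s, 1, 3), index(s, "d")
    printf "|%-12s|\n", s             # gawk pads to 12 characters, not 12 bytes
}')
echo "$out"
```

With gawk in a UTF-8 locale I would expect chars=11, substr=Süd and index=3. An awk that works on bytes would instead report chars=12, cut the substring after the "ü" and give index=4, which is an easy way to spot a locale and data encoding that are out of sync.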
Hermann
Hermann
3/29/2010 11:30:40 AM