UTF-8 string length() in gawk


How do we correctly handle UTF-8 string length() in gawk if characters are
used that require more than one byte in UTF-8 encoding?

  BEGIN {print length("Südwestwind")}   ## length=12 - ???
  BEGIN {print length("Sudwestwind")}   ## length=11

I'd expect a length of 11 in both cases. The gawk length() function returns
the number of bytes. Any ideas how to return the number of characters instead?

BTW, there's a similar issue with split() and probably with other string
functions as well.

Janis
Reply Janis 3/29/2010 3:51:00 AM

Janis Papanagnou wrote:

> How do we correctly handle UTF-8 string length() in gawk if characters are
> used that require more than one byte in UTF-8 encoding?
> 
>   BEGIN {print length("Südwestwind")}   ## length=12 - ???
>   BEGIN {print length("Sudwestwind")}   ## length=11
> 
> I'd expect a length of 11 in both cases. The gawk length() function
> returns the number of bytes. Any ideas how to return the number of characters
> instead?
> 
> BTW, there's a similar issue with split() and probably with other string
> functions as well.

My understanding is that if you set your locale to a UTF-8 locale, then gawk 
does the right thing:

$ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
12
$ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
11
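
To make the byte/character difference easy to reproduce, here is a minimal sketch of two shell helpers. The names AWK, byte_len and char_len are hypothetical (not from this thread), and the sketch assumes the UTF-8 locale you pass is actually installed (check with `locale -a`). It falls back to plain awk if gawk is not found, in which case the character count depends on that awk implementation.

```shell
#!/bin/sh
# Hypothetical helpers; AWK, byte_len and char_len are illustrative names.
# Prefer gawk, fall back to whatever awk is on the PATH.
AWK=$(command -v gawk || command -v awk)

# Bytes: in the C locale, length() counts bytes.
byte_len() {
    LC_ALL=C "$AWK" 'BEGIN { print length(ARGV[1]) }' "$1"
}

# Characters: needs an installed UTF-8 locale, named in $2
# (default en_GB.utf8, the locale used in this thread).
char_len() {
    LC_ALL=${2:-en_GB.utf8} "$AWK" 'BEGIN { print length(ARGV[1]) }' "$1"
}

# Usage (results as reported in this thread for gawk):
#   byte_len "Südwestwind"              # 12
#   char_len "Südwestwind" en_GB.utf8   # 11
```

The string is passed through ARGV rather than -v so that awk does not interpret backslash escapes in it.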

Reply pk 3/29/2010 9:10:04 AM


pk wrote:
> Janis Papanagnou wrote:
> 
>> How do we correctly handle UTF-8 string length() in gawk if characters are
>> used that require more than one byte in UTF-8 encoding?
>>
>>   BEGIN {print length("Südwestwind")}   ## length=12 - ???
>>   BEGIN {print length("Sudwestwind")}   ## length=11
>>
>> I'd expect a length of 11 in both cases. The gawk length() function
>> returns the number of bytes. Any ideas how to return the number of characters
>> instead?
>>
>> BTW, there's a similar issue with split() and probably with other string
>> functions as well.
> 
> My understanding is that if you set your locale to a UTF-8 locale, then gawk 
> does the right thing:
> 
> $ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
> 12
> $ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
> 11
> 

And I thought I had tried that without success.
Hmm... it works. I must have done something wrong tonight.

Thanks!

Janis
Reply Janis 3/29/2010 9:58:42 AM

Janis Papanagnou wrote:
> pk wrote:
>> Janis Papanagnou wrote:
>>
>>> How do we correctly handle UTF-8 string length() in gawk if characters are
>>> used that require more than one byte in UTF-8 encoding?
>>>
>>>   BEGIN {print length("Südwestwind")}   ## length=12 - ???
>>>   BEGIN {print length("Sudwestwind")}   ## length=11
>>>
>>> I'd expect a length of 11 in both cases. The gawk length() function
>>> returns the number of bytes. Any ideas how to return the number of characters
>>> instead?
>>>
>>> BTW, there's a similar issue with split() and probably with other string
>>> functions as well.
>> My understanding is that if you set your locale to a UTF-8 locale, then gawk 
>> does the right thing:
>>
>> $ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
>> 12
>> $ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
>> 11
>>
> 
> And I thought I had tried that without success.
> Hmm... it works. I must have done something wrong tonight.
> 

In my experience, gawk's string functions and FIELDWIDTHS work as expected, as long as your locale and the data encoding are in sync.

Also, gawk's printf counts characters rather than bytes, e.g.:

$ LC_ALL=en_US.UTF-8 gawk 'BEGIN {printf "|%-12s|\n", "Südwestwind"}'
|Südwestwind |
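
For comparison, the same printf can be run under two locales to see the padding change. This is only a sketch: pad12 and AWK are hypothetical names, and the UTF-8 case assumes gawk and an installed en_US.UTF-8 locale.

```shell
#!/bin/sh
# Hypothetical helper; pad12 and AWK are illustrative names.
# Prefer gawk, fall back to whatever awk is on the PATH.
AWK=$(command -v gawk || command -v awk)

# Left-justify $1 in a 12-wide field under the locale named in $2 (default C).
pad12() {
    LC_ALL=${2:-C} "$AWK" 'BEGIN { printf "|%-12s|\n", ARGV[1] }' "$1"
}

# Usage:
#   pad12 "Südwestwind" C             # no padding: the string is already 12 bytes
#   pad12 "Südwestwind" en_US.UTF-8   # one space of padding: 11 characters (gawk)
```

If the width is counted in bytes, the two-byte "ü" uses up the one column of padding, so columns containing non-ASCII text end up misaligned.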

Hermann
Reply Hermann 3/29/2010 11:30:40 AM
