Using a locale other than English in Windows gawk

I'm trying to figure out how to use non-ASCII characters in gawk on Windows.
I googled with no luck so far. Some people suggest using "set LC_ALL=UTF-8",
but that didn't work. I also tried "set LC_ALL=utf8", which didn't work either.

How do I use a locale other than English in Windows gawk?
chano
11/21/2016 5:52:28 AM
comp.lang.awk

On Monday, 21 November 2016 at 06:52:30 UTC+1, chano wrote:
> I'm trying to figure out how to use non ascii characters in gawk in windows
> I googled no luck so far.  some suggests use "set LC_ALL=UTF-8" but didn't
> work.  I also tried "set LC_ALL=utf8" which didn't work either.
>
> How do I use different locale than English in windows gawk?

It does not work, see
https://lists.gnu.org/archive/html/bug-gawk/2016-02/msg00017.html

Marc
11/21/2016 6:27:02 AM
On 2016-11-21, Marc de Bourget <marcdebourget@gmail.com> wrote:
> On Monday, 21 November 2016 at 06:52:30 UTC+1, chano wrote:
>> I'm trying to figure out how to use non ascii characters in gawk in windows
>> I googled no luck so far.  some suggests use "set LC_ALL=UTF-8" but didn't 
>> work.  I also tried "set LC_ALL=utf8" which didn't work either.
>> 
>> How do I use different locale than English in windows gawk?
>
> It does not work, see
> https://lists.gnu.org/archive/html/bug-gawk/2016-02/msg00017.html

This must be referring to the MinGW-based Gawk port.

Of course Microsoft's WWII era C library isn't fucking
going to parse "LC_ALL=en_US.UTF-8" and behave accordingly.

Everything works for me in the Cygnal-based port:

  C:\winawk>gawk -f celine.awk
  7
  ▒
  |Céline     |

  C:\winawk>SET LC_ALL=en_US.UTF-8

  C:\winawk>gawk -f celine.awk
  6
  é
  |Céline      |

Displays fine in the CMD.EXE console on Windows 7, from which I copied
and pasted the above.
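
The celine.awk script behind that session was not posted in the thread. What follows is only a guess at a script that would give output of the same shape; the string, the %-12s width, the `celine` wrapper and the `AWK` variable are all illustrative assumptions, not Kaz's code. It sketches why the numbers change: in a byte-oriented locale length() and substr() count bytes, while in a UTF-8 locale gawk counts characters.

```sh
# Hypothetical stand-in for celine.awk (the original was not posted).
# Assumes this file is saved as UTF-8, so "é" is the two bytes C3 A9.
celine() {
    # $1 is the locale to run under; AWK defaults to gawk.
    LC_ALL="$1" "${AWK:-gawk}" 'BEGIN {
        s = "Céline"
        print length(s)         # bytes in a single-byte locale, characters in UTF-8
        print substr(s, 2, 1)   # second byte or second character, respectively
        printf "|%-12s|\n", s   # padding is derived from the same notion of length
    }'
}

celine C             # byte semantics: "é" counts as two units
celine en_US.UTF-8   # character semantics, if that locale is available
```

If the second call prints the same as the first, the runtime most likely did not accept the locale name, which is the situation the original question describes.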

Your problems are self-inflicted, caused by the inability to recognize that
MinGW is a half-broken pile of crap that needs to be discarded.

All these "X doesn't work on Windows Awk" reports are duplicates of the
same issue.
Kaz
11/21/2016 3:53:56 PM
On 21/11/2016 6:52, chano wrote:
> I'm trying to figure out how to use non ascii characters in gawk in windows
> I googled no luck so far.  some suggests use "set LC_ALL=UTF-8" but didn't
> work.  I also tried "set LC_ALL=utf8" which didn't work either.
>
> How do I use different locale than English in windows gawk?

Short answer: Cygwin
Simple answer: don't use locales; use the native Windows character sets
(CP1252 and the like) instead. They support some non-ASCII characters.

Need more help? Please post short sample input data, awk code and 
expected output.

-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Manuel
11/21/2016 5:50:14 PM
On 11/21/2016 11:50 AM, Manuel Collado wrote:
> On 21/11/2016 6:52, chano wrote:
>> I'm trying to figure out how to use non ascii characters in gawk in windows
>> I googled no luck so far.  some suggests use "set LC_ALL=UTF-8" but didn't
>> work.  I also tried "set LC_ALL=utf8" which didn't work either.
>>
>> How do I use different locale than English in windows gawk?
>
> Short answer: Cygwin

Do locales work in Cygwin now? That used to be a well-known issue...

	Ed.

> Simple answer: don't use locales, but the native Windows charsets (CP1252 and
> the like). They support some non-ascii characters.
>
> Need more help? Please post short sample input data, awk code and expected output.
>

Ed
11/21/2016 6:20:54 PM
On 21/11/2016 19:20, Ed Morton wrote:
> On 11/21/2016 11:50 AM, Manuel Collado wrote:
>> On 21/11/2016 6:52, chano wrote:
>>> I'm trying to figure out how to use non ascii characters in gawk in
>>> windows
>>> I googled no luck so far.  some suggests use "set LC_ALL=UTF-8" but
>>> didn't
>>> work.  I also tried "set LC_ALL=utf8" which didn't work either.
>>>
>>> How do I use different locale than English in windows gawk?
>>
>> Short answer: Cygwin
>
> Do locales work in cygwin now? Used to be a well known issue...
>

Well, somewhat.

$ echo $LANG
es_ES.UTF-8
$ gawk --version > gawk.txt
$ LANG=es_ES gawk --version > gawk2.txt
$ file gawk*
gawk.txt:  UTF-8 Unicode text
gawk2.txt: ISO-8859 text

>
>> Simple answer: don't use locales, but the native Windows charsets
>> (CP1252 and
>> the like). They support some non-ascii characters.
>>
>> Need more help? Please post short sample input data, awk code and
>> expected output.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Manuel
11/21/2016 8:50:21 PM
On Monday, 21 November 2016 at 16:53:59 UTC+1, Kaz Kylheku wrote:
> [...]
> Everything works for me in the Cygnal-based port:
> [...]
> Your problems are self-inflicted, caused by the inability to recognize
> MinGW has half-broken pile of crap that needs to be discarded.
>
> All these "X doesn't work on Windows Awk" reports are duplicates of the
> same issue.

There is nothing self-inflicted. I haven't written the GAWK MinGW port.
I wasn't aware that your GAWK version was able to solve the UTF-8 issue.

Cygnal GAWK native Windows port:
http://www.kylheku.com/cygnal/winawk.tar.gz

How did you manage to get Unix locales to work properly on Windows?
Anyway, I have done some tests and it looks really good. Great work so far!
We should test your version thoroughly, and if no one finds an issue we may
use your version as the best GAWK Windows port.
Marc
11/21/2016 9:57:03 PM
On 2016-11-21, Marc de Bourget <marcdebourget@gmail.com> wrote:
> There is nothing self-inflicted. I haven't written the GAWK MinGW port.
> I wasn't aware that your GAWK version was able to solve the UTF-8 issue.
>
> Cygnal GAWK native Windows port:
> http://www.kylheku.com/cygnal/winawk.tar.gz
>
> How do you achieve to get Unix locales to work properly on Windows?

Very simply: by (hopefully) not breaking any locale stuff in Cygnal,
relative to upstream Cygwin.

The run-time support libraries from Cygwin have the locale support,
according to POSIX. If you patch some things in Cygwin to make it
more native-Windows-like here and there, and don't break the locale
stuff, then the locale stuff continues to work.

> Anyway, I have done some tests and it looks really good. Great work so far!

Great almost-no-work really; some of what you're testing might not
even be different between Cygwin and Cygnal.

> We should test your version thoroughly and if no one finds an issue we may
> use your version as best GAWK Windows port.

If you want to validate a Cygnal-based awk for yourself, the focus
should probably be on all the things that are different with regard
to that same awk executable running on Cygwin.

Mainly that would be in the area of path handling and also running
external processes.

Windows paths should work, and also the concept of a current working
directory per drive letter. If you're currently in C:\Users
but your D: drive is in D:\whatever, and you pass a path like
D:foo.txt to the Cygnal gawk, it should open D:\whatever\foo.txt,
and not D:\foo.txt.

Commands run through system("...") and the pipe syntax should use the
CMD.EXE command interpreter under Cygnal. Under Cygwin, they look for
a /bin/sh shell.
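
One rough way to check which interpreter system() goes through is to pass it a command that the two shells treat differently. This is only a sketch: it takes the description above at face value (CMD.EXE under Cygnal, /bin/sh under Cygwin), and `which_shell` and `AWK` are names made up for the example. CMD.EXE expands %COMSPEC% to the path of the command interpreter, whereas a POSIX shell echoes the text literally.

```sh
# Hypothetical probe: which shell does awk's system() hand its argument to?
which_shell() {
    # AWK defaults to gawk; override it to test another awk binary.
    "${AWK:-gawk}" 'BEGIN { system("echo %COMSPEC%") }'
}

which_shell   # a POSIX shell echoes the text as-is; CMD.EXE expands the variable
```

The same probe works for the pipe syntax by replacing the system() call with `"echo %COMSPEC%" | getline line; print line`.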

Cygnal isn't likely going to break anything internal to Gawk.

Because Gawk doesn't use stdio streams, it doesn't benefit from
Cygnal defaulting text mode streams to Windows mode (CR-LF).
This is the downside. When you do printf("foo\n") in gawk, it
puts out a Unix newline.

One useful feature in Gawk is that the RT variable is set to the piece
of text which matched the RS record separator. So with that, if you
have a record separator regex that matches either LF or CR-LF, RT
holds the actual separator text which occurred. If you
explicitly use RT, you can write code that preserves the line
termination style.
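
A minimal sketch of that RT technique, assuming gawk (RT is a gawk extension) and a record separator regex that accepts LF or CR-LF; the `keep_eol` name and the upper-casing are just placeholders for whatever per-record processing is needed.

```sh
# Transform each record but write back the terminator that was actually read.
keep_eol() {
    gawk 'BEGIN { RS = "\r?\n"; ORS = "" }   # RS is a regex: LF or CR-LF
          { print toupper($0) RT }'          # RT is the terminator actually read
}

# Mixed input: the first line ends in CR-LF, the second in LF.
printf 'one\r\ntwo\n' | keep_eol | od -c
```

Setting ORS to the empty string matters here: otherwise print would append its own newline after the RT that is already being written.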
Kaz
11/22/2016 1:18:27 AM
On Tuesday, 22 November 2016 at 02:18:31 UTC+1, Kaz Kylheku wrote:
> [...]
> Because Gawk doesn't use stdio streams, it doesn't benefit from the
> Cygnal having text mode streams in Windows mode (CR-LF) as default.
> This is the down side. When you do printf("foo\n") in gawk, it
> puts out a Unix newline.
> [...]

One hint: the Cygnal GAWK version prints \n instead of \r\n with celine.awk
(the print command). I would have preferred CR-LF, but this is no big issue.
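
If CR-LF output from print is wanted, one possible workaround is to set the output record separator explicitly. This is a sketch, not something tested on the Cygnal port: `to_crlf` and `AWK` are illustrative names, and it assumes the output stream is written in binary mode, so that no second CR is added on the way out.

```sh
# Make print end every record with CR-LF, whatever the input used.
to_crlf() {
    "${AWK:-gawk}" 'BEGIN { ORS = "\r\n" }
          { sub(/\r$/, ""); print }'   # drop an existing CR so it is not doubled
}

printf 'foo\n' | to_crlf | od -c
```

This only affects print; a printf call still has to spell out "\r\n" in its format string.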
Marc
11/22/2016 12:03:55 PM