percent encoding URLs

I work with many different languages and would like a reliable URL encode/decode function that is pure awk.

For example, given the name "Władysław" (Polish) with these environment settings:

LANGUAGE=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_ALL=en_US.UTF-8

And using some standard awk encode/decode functions:

http://www.shelldorado.com/scripts/cmds/urlencode
https://www.rosettacode.org/wiki/URL_encoding#AWK

It doesn't work because the environment is not Polish, the "ł" is not recognized.

So I call Python or PHP and that works, but it creates an external dependency to the awk program.

Is there a solution to do encoding/decoding across all languages purely in awk? I think I know the answer ("no") but wanted to verify before spending any more time. Maybe there is some shell and environment trickery.
charlemagne
11/15/2016 12:36:36 AM
comp.lang.awk


On Tuesday, 15 November 2016 at 01:36:37 UTC+1, charlemagn...@gmail.com wrote:
> I work with many different languages and would like a reliable url encode/decode function that is pure awk.
>
> For example given the name "Władysław" (Polish) with these environment settings
>
> LANGUAGE=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> LC_ALL=en_US.UTF-8
>
> And using some standard awk encode/decode functions:
>
> http://www.shelldorado.com/scripts/cmds/urlencode
> https://www.rosettacode.org/wiki/URL_encoding#AWK
>
> It doesn't work because the environment is not Polish, the "ł" is not recognized.
>
> So I call Python or PHP and that works, but it creates an external dependency to the awk program.
>
> Is there a solution to do encoding/decoding across all languages purely in awk? I think I know the answer ("no") but wanted to verify before spending any more time. Maybe there is some shell and environment trickery.

In my opinion, AWK needs its own encoding for UTF-8 etc. like Python does.
Marc
11/15/2016 9:13:15 AM
On 2016-11-15, charlemagne.steampipe@gmail.com <charlemagne.steampipe@gmail.com> wrote:

> I work with many different languages and would like a reliable
> url encode/decode function that is pure awk.
>
> For example given the name "Władysław" (Polish) with these
> environment settings 
>
> LANGUAGE=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> LC_ALL=en_US.UTF-8
>
> And using some standard awk encode/decode functions:
>
> http://www.shelldorado.com/scripts/cmds/urlencode
> https://www.rosettacode.org/wiki/URL_encoding#AWK
>
> It doesn't work because the environment is not Polish, the
> "ł" is not recognized.
>
> So I call Python or PHP and that works, but it creates an
> external dependency to the awk program.
>
> Is there a solution to do encoding/decoding across all
> languages purely in awk? I think I know the answer ("no") but
> wanted to verify before spending any more time. Maybe there is
> some shell and environment trickery.

How does it "not work" ? What do you get and what would you like
to get instead ?

-- 
André Majorel http://www.teaser.fr/~amajorel/
"Pauvreté n'est pas vice ! Parbleu ! Un vice est agréable."
  -- _Passe-temps_
Andre
11/16/2016 1:01:01 AM
On Wednesday, November 16, 2016 at 3:06:51 PM UTC-5, Andre Majorel wrote:

> How does it "not work" ? What do you get and what would you like
> to get instead ?

Test string: ł

urlencodeawk()    = %00
urlencodepython() = %C5%82

Python encodes correctly.

----

#
#  Credit: Rosetta Code, May 2015
#
function urlencodeawk(str,  c, len, res, i, ord) {

        # Map each single-byte character to its numeric value (0-255)
        for (i = 0; i <= 255; i++)
                ord[sprintf("%c", i)] = i
        len = length(str)
        res = ""
        for (i = 1; i <= len; i++) {
                c = substr(str, i, 1)
                if (c ~ /[0-9A-Za-z]/)
                        res = res c
                else
                        res = res "%" sprintf("%02X", ord[c])
        }
        return res
}

#
# url-encode via Python
#  Credit: https://askubuntu.com/questions/53770/how-can-i-encode-and-decode-percent-encoded-strings-on-the-command-line
#
function urlencodepython(str,   command, safe) {

   safe = str
   gsub(/'/, "'\"'\"'", safe)     # make safe for shell
   gsub(/’/, "'\"’\"'", safe)

   command = "python -c \"import urllib, sys; print urllib.quote(sys.argv[1])\" '" safe "'"
   return sys2var(command)
}

function sys2var(command        ,fish, scale, ship) {

         # command = command " 2>/dev/null"
         while ( (command | getline fish) > 0 ) {
             if ( ++scale == 1 )
                 ship = fish
             else
                 ship = ship "\n" fish
         }
         close(command)
         return ship
}

function testurlendecode(str) {

  print "Test string: " str
  print ""
  print "urlencodeawk()    = " urlencodeawk(str)
  print "urlencodepython() = " urlencodepython(str)

}

BEGIN {

testurlendecode("ł")

}

charlemagne
11/17/2016 12:59:04 AM
On 2016-11-17, charlemagne.steampipe@gmail.com <charlemagne.steampipe@gmail.com> wrote:
> On Wednesday, November 16, 2016 at 3:06:51 PM UTC-5, Andre Majorel wrote:
>
>> How does it "not work" ? What do you get and what would you like
>> to get instead ?
>
> Test string: ł
>
> urlencodeawk()    = %00
> urlencodepython() = %C5%82
>
> Python encodes correctly

Prints "%00" for "\xc5\x82" ? That's a good one.

urlencodeawk() works here, but I'm not using UTF-8. I suspect
what's happening is that, because your locale indicates UTF-8
encoding, substr() treats str as a string of Unicode characters
instead of a string of bytes. Therefore substr("\xc5\x82", 1, 1)
is not "\xc5" but U+0142 and ord[U+0142] does not exist, hence
"%00".

Possible avenues :
- forcing the locale to C,
- using Gawk's --characters-as-bytes option,
- fixing the %-encoding function to look for code points greater
  than 0x7f, work out what their UTF-8 encoding is and dump that
  instead (but this is not completely reliable as it assumes
  that the original byte sequence was canonical UTF-8, which is
  likely but not guaranteed).
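
The diagnosis and the arithmetic behind the third avenue can be sketched as follows. This is only an illustration, assuming a POSIX shell and gawk or mawk as `awk`; the function names `count_chars_c_locale` and `cp_to_percent` are made up for the example. The octal escapes `\305\202` are the two UTF-8 bytes of U+0142. The sketch does not solve the hard part, which is obtaining the code point from the character in the first place, and it does no range validation.

```shell
# Diagnosis: in the C locale each byte counts as one character, so the
# two-byte UTF-8 sequence for U+0142 (octal \305\202) has length 2.
count_chars_c_locale() {
    LC_ALL=C awk 'BEGIN { s = "\305\202"; print length(s) }'
}

# Third avenue, arithmetic part only (hypothetical helper): percent-encode
# one Unicode code point, given as a decimal number, as its UTF-8 bytes.
# Only integer arithmetic is used, so the locale does not matter here.
cp_to_percent() {
    awk -v cp="$1" 'BEGIN {
        cp = int(cp)
        if (cp < 128)            # 1 byte:  0xxxxxxx
            out = sprintf("%%%02X", cp)
        else if (cp < 2048)      # 2 bytes: 110xxxxx 10xxxxxx
            out = sprintf("%%%02X%%%02X",
                          192 + int(cp / 64), 128 + cp % 64)
        else if (cp < 65536)     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out = sprintf("%%%02X%%%02X%%%02X",
                          224 + int(cp / 4096),
                          128 + int(cp / 64) % 64, 128 + cp % 64)
        else                     # 4 bytes: 11110xxx followed by three 10xxxxxx
            out = sprintf("%%%02X%%%02X%%%02X%%%02X",
                          240 + int(cp / 262144),
                          128 + int(cp / 4096) % 64,
                          128 + int(cp / 64) % 64, 128 + cp % 64)
        print out
    }'
}

count_chars_c_locale    # 2 with gawk or mawk
cp_to_percent 322       # %C5%82 (322 is U+0142)
```

Under a UTF-8 locale, gawk would be expected to report a length of 1 for the same string, which is the situation described above.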

-- 
André Majorel http://www.teaser.fr/~amajorel/
"Pauvreté n'est pas vice ! Parbleu ! Un vice est agréable."
  -- _Passe-temps_
Andre
11/17/2016 8:54:42 AM
On Thursday, November 17, 2016 at 3:54:44 AM UTC-5, Andre Majorel wrote:

>Possible avenues :
>- forcing the locale to C,
>- using Gawk's --characters-as-bytes option,
>- fixing the %-encoding function

I haven't tried the first or third option, because the second works:

awk -b -f urlencode.awk

Test string: ł

urlencodeawk()    = %C5%82
urlencodepython() = %C5%82

That is great. Thank you very much. I'm going to run more in-depth testing of URLs but hopefully this will be the solution, and a simple one.
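
Since `-b` is a gawk option, the first avenue (forcing the C locale) may be worth a look for other awks. Below is a rough sketch of both directions, encode and decode, under that assumption. It was not part of the thread: the names `urlencode_c` and `urldecode_c` are invented, it handles single-line input only, and it assumes gawk or mawk as `awk`. Like the Rosetta Code function, it escapes everything except ASCII letters and digits.

```shell
# Percent-encode every byte that is not an ASCII letter or digit.
# LC_ALL=C makes awk work on bytes, similar in effect to gawk -b.
urlencode_c() {
    printf '%s' "$1" | LC_ALL=C awk '
    BEGIN { for (i = 1; i <= 255; i++) ord[sprintf("%c", i)] = i }
    {
        res = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c ~ /[0-9A-Za-z]/)
                res = res c
            else
                res = res sprintf("%%%02X", ord[c])
        }
        print res
    }'
}

# Turn each %XX sequence back into the byte it stands for.
urldecode_c() {
    printf '%s' "$1" | LC_ALL=C awk '
    BEGIN { hex = "0123456789ABCDEF" }
    {
        res = ""
        s = $0
        while (match(s, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            hi = index(hex, toupper(substr(s, RSTART + 1, 1))) - 1
            lo = index(hex, toupper(substr(s, RSTART + 2, 1))) - 1
            res = res substr(s, 1, RSTART - 1) sprintf("%c", hi * 16 + lo)
            s = substr(s, RSTART + 3)
        }
        print res s
    }'
}

# The name from the original post, written with octal escapes so the
# example does not depend on the encoding of this file.
urlencode_c "$(printf 'W\305\202adys\305\202aw')"
urldecode_c 'W%C5%82adys%C5%82aw'
```

The decoder avoids gawk's `strtonum()` and looks up hex digits with `index()` instead, so it should not depend on gawk extensions.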
charlemagne
11/17/2016 6:02:58 PM
On 2016-11-17, charlemagne.steampipe@gmail.com <charlemagne.steampipe@gmail.com> wrote:
> On Thursday, November 17, 2016 at 3:54:44 AM UTC-5, Andre Majorel wrote:
>
>>Possible avenues :
>>- forcing the locale to C,
>>- using Gawk's --characters-as-bytes option, 
>>- fixing the %-encoding function
>
> I haven't tried the first or third option, because the second works:
>
> awk -b -f urlencode.awk
>
> Test string: ł
>
> urlencodeawk()    = %C5%82
> urlencodepython() = %C5%82
>
> That is great. Thank you very much. I'm going to run more in-depth testing of URLs but hopefully this will be the solution, and a simple one.

$ txr -p '(url-encode "ł")'
"%C5%82"

Not influenced by any ISO C/Unix locale garbage:

$ LANG=hottentot_ZA.UTF-13 txr -p '(url-encode "ł")'
"%C5%82"
Kaz
11/17/2016 8:05:14 PM