sorting of awk arrays (hashes) function


It took me some time to understand why awk's arrays cannot easily be
sorted. Anyway, somewhere I got the hint to use the Unix shell's sort by
dumping the hash, sorting it in the Unix shell and reading it in
again. So I wrote a little function to encapsulate that. Here it is...

#
# print an array in sorted order by value
# 20091022, Johannes Mainusch
# don't blame me if it doesn't work on you
# :-)
#
function print_sorted_by_value (prefix, array, scale, norm,
significance,      i, sum, tmpfile, cmd) {
	printf ("\nprinting in sorted order by value\n");
	for (i in array) n++; # get length of the array

	tmpfile=sprintf("del.me.%d",1000000*rand());
	#print "filename = ",tmpfile;

	sum = 0;
	for (key in array) {
		value = "nan";
		sum += array[key];
		if (array[key] > significance) value = 100*array[key]/norm;
		printf ("%s%-30s %8.1f %f\n", prefix, key, scale*array[key], value) >>tmpfile;
		}
	close (tmpfile);
	delete array;

	cmd = sprintf ("sort -n -r -k3 %s", tmpfile);
	while (cmd | getline myline)   {
		print myline;
		# split (myline, tmp);
		# array[tmp[2]]=tmp[1];
	}
	close (tmpfile);
	system ("rm "tmpfile);

	printf ("%s%-30s %8.1f\n", prefix, "sum:", sum);
	printf ("-----------------------------------------\n");
}
Reply johannes 10/23/2009 1:49:25 PM

johannes.mainusch wrote:
> It took me some time to understand why awk's arrays cannot easily be
> sorted. Anyway, somewhere I got the hint to use the Unix shell's sort by
> dumping the hash, sorting it in the Unix shell and reading it in
> again. So I wrote a little function to encapsulate that. Here it is...

A couple of things pop out:

- You need to add "n" to the pseudo-argument list.
- The use of getline isn't a safe syntax (see
  http://awk.info/?tip/getline).
- You'd need to run it from a dir where you have write permission so,
  since you're assuming UNIX, put your tmp file in /usr/tmp or similar.
- Deleting a whole array is a gawk-ism, but gawk already has built-in
  array sorting (asort() and asorti()).
- There's no need to use sprintf() to create the "tmpfile" and "cmd"
  strings.
- You could use a co-process instead of a tmp file if you're assuming
  gawk.
- You could use length() instead of a loop to get the array size if
  you're assuming gawk.
- Instead of repeating the same format string in two printfs you should
  define a format variable and use that.
- All the trailing semicolons are redundant.
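
Several of these points can be sketched together. This is only an illustration, not the original poster's code: it assumes a POSIX awk and sort(1), swaps the temp file for an output pipe into sort (a plain-awk alternative to the gawk co-process), and uses made-up sample data (the hits array and its URLs are hypothetical). All working variables sit in the pseudo-argument list, and the row format is kept in one variable.

```shell
out=$(awk '
# Print array rows ordered by value (descending) by piping them straight
# into sort(1); no temp file is created. Everything after the wide gap in
# the parameter list is a local variable.
function print_sorted_by_value(prefix, array, scale, norm, significance,    key, sum, pct, cmd, fmt) {
    fmt = "%s%-30s %8.1f %s\n"
    cmd = "sort -k2,2nr"        # field 2 is the scaled value (assumes no blanks in prefix or keys)
    sum = 0
    for (key in array) {
        sum += array[key]
        pct = (array[key] > significance) ? sprintf("%.1f", 100 * array[key] / norm) : "nan"
        printf fmt, prefix, key, scale * array[key], pct | cmd
    }
    close(cmd)                  # sends EOF to sort and waits for its output
    printf "%s%-30s %8.1f\n", prefix, "sum:", sum
}
BEGIN {
    # hypothetical sample data
    hits["/index.html"] = 50
    hits["/about.html"] = 5
    hits["/news.html"]  = 45
    print_sorted_by_value("", hits, 1, 100, 10)
}')
printf "%s\n" "$out"
```

Because sort only writes once the pipe is closed, anything awk printed before the loop would need to be flushed first if the order of the output matters.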

Could you show some sample input, a small script that uses that function 
plus the output it produces so we can see how to use it?

Regards,

     Ed.
Reply Ed 10/23/2009 2:06:29 PM


In article <c33c3055-7f83-4f24-acf8-f5a14dee6f29@o36g2000vbl.googlegroups.com>,
johannes.mainusch <johannes.mainusch@gmx.de> wrote:
>It took me some time to understand why awk's arrays cannot easily be
>sorted. Anyway, somewhere I got the hint to use the Unix shell's sort by
>dumping the hash, sorting it in the Unix shell and reading it in
>again. So I wrote a little function to encapsulate that. Here it is...

A couple of notes (objections):

1) The need for this is pretty much obsolete today, given the built-in
    sorting capabilities of GAWK and TAWK (and if you're not using one
    or the other of these, then you really should be).

2) IME, you rarely need to sort the *values*.  My applications have
    always been the need for sorting the keys.  TAWK does this
    automatically, of course, as does GAWK if you're a sufficiently
    whiny user (hint, hint).

3) Incidentally, I've never used GAWK's asort() or asorti() functions.
    They look somewhat interesting, but I've never seen the need...

Reply gazelle 10/23/2009 2:09:38 PM

Kenny McCormack wrote:
<snip>
> 3) Incidentally, I've never used GAWK's asort() or asorti() functions.
>     They look somewhat interesting, but I've never seen the need...

I don't use them much but once in a while they're useful. In fact, I 
used one of them in a script just yesterday. I had multiple files of 
measurements for various types of processor, e.g. this kind of format in 
file "FILE1":

	type=foo id=3
		count1 = 7
		count2 = 5

	type=bar id=54
		count1 = 3
		count3 = 6

	type=foo id=12
		count4 = 5
		count2 = 9

and I had to produce tabular output that was sorted by processor type+id 
and with a blank line between each type:

     FILE1:
	bar_54 3 0 6 0

	foo_03 7 5 0 0
	foo_12 0 9 0 5

     FILE2:
	....

so it was convenient to initially store the data indexed by processor 
type+id, then sort the list using asorti() before printing. I could've 
piped an interim result per file to UNIX sort, but then I'd have had to 
add yet another pipe to a second awk to introduce the blank lines 
between processor types, and I'd have had to introduce a shell loop to 
feed awk one file at a time (instead of just handling all the files on 
the awk command line) or otherwise jump through hoops. Just using 
asorti() in a single script was quite a bit simpler.
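
The store-then-sort approach might look roughly like the sketch below. It is not the script described above: asorti() is gawk-only, so a hand-written insertion sort stands in for it here to keep the example runnable on any POSIX awk. It reads a single input stream rather than several files, and it assumes four counters named count1 to count4 and type names without underscores.

```shell
out=$(awk '
# Insertion sort of keys[1..n] in ascending string order
# (a stand-in for the asorti() function in gawk).
function sort_keys(keys, n,    i, j, tmp) {
    for (i = 2; i <= n; i++) {
        tmp = keys[i]
        for (j = i - 1; j >= 1 && keys[j] > tmp; j--)
            keys[j + 1] = keys[j]
        keys[j + 1] = tmp
    }
}
/type=/ {                       # e.g. "type=foo id=3"
    split($1, t, "="); split($2, d, "=")
    cur = sprintf("%s_%02d", t[2], d[2])
    keys[++n] = cur
    next
}
/count[1-4] *=/ {               # e.g. "count2 = 5"
    val[cur, substr($1, 6)] = $3
}
END {
    sort_keys(keys, n)
    for (i = 1; i <= n; i++) {
        split(keys[i], t, "_")
        if (i > 1 && t[1] != prev) print ""   # blank line between types
        prev = t[1]
        line = keys[i]
        for (c = 1; c <= 4; c++)
            line = line " " (((keys[i], c) in val) ? val[keys[i], c] : 0)
        print line
    }
}' <<EOF
type=foo id=3
	count1 = 7
	count2 = 5

type=bar id=54
	count1 = 3
	count3 = 6

type=foo id=12
	count4 = 5
	count2 = 9
EOF
)
printf "%s\n" "$out"
```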

     Ed.
Reply Ed 10/23/2009 2:24:21 PM

johannes.mainusch wrote:

> It took me some time to understand why awk's arrays cannot easily be
> sorted.

Why? AFAICT, an array sort function can be written in awk just as easily as 
in any other language.

> Anyway, somewhere I got the hint to use the Unix shell's sort by
> dumping the hash, sorting it in the Unix shell and reading it in
> again. So I wrote a little function to encapsulate that. Here it is...

If you're using GNU awk you don't need that because it has built-in 
functions to sort arrays by value and by index (hash).

> # print an array in sorted order by value
> # 20091022, Johannes Mainusch
> # don't blame me if it doesn't work on you
> # :-)
> #
> function print_sorted_by_value (prefix, array, scale, norm,
> significance,      i, sum, tmpfile, cmd) {
>         printf ("\nprinting in sorted order by value\n");
>         for (i in array) n++; # get length of the array
>
>         tmpfile=sprintf("del.me.%d",1000000*rand());
>         #print "filename = ",tmpfile;
>
>         sum = 0;
>         for (key in array) {
>                 value = "nan";
>                 sum += array[key];
>                 if (array[key] > significance) value = 100*array[key]/norm;
>                 printf ("%s%-30s %8.1f %f\n", prefix, key,
>                 scale*array[key], value) >>tmpfile;
>                 }
>         close (tmpfile);
>         delete array;
>
>         cmd = sprintf ("sort -n -r -k3 %s", tmpfile);
>         while (cmd | getline myline)   {
>                 print myline;
>                 # split (myline, tmp);
>                 # array[tmp[2]]=tmp[1];
>         }
>         close (tmpfile);
>         system ("rm "tmpfile);

You should check that getline returns a positive value, and probably you 
should also close(cmd) "just in case".
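
A minimal sketch of that idiom, assuming a POSIX awk; the command string here is a made-up stand-in for the real sort command:

```shell
out=$(awk '
BEGIN {
    cmd = "(echo 3; echo 1; echo 2) | sort -n"   # hypothetical command
    # getline returns 1 per record, 0 at end of input and -1 on error,
    # so a bare "while (cmd | getline line)" never terminates on -1.
    while ((cmd | getline line) > 0)
        total += line
    close(cmd)      # close the command string itself, not a file name
    print "total=" total
}')
printf "%s\n" "$out"
```

The argument to close() has to be the same string that was used to open the pipe, which is why keeping it in a variable helps.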

>
>         printf ("%s%-30s %8.1f\n", prefix, "sum:", sum);
>         printf ("-----------------------------------------\n");
> }


Reply pk 10/23/2009 2:48:08 PM

On Fri, 23 Oct 2009 14:09:38 +0000 (UTC), gazelle@shell.xmission.com (Kenny McCormack) wrote:

>3) Incidentally, I've never used GAWK's asort() or asorti() functions.
>    They look somewhat interesting, but I've never seen the need...

I use gawk's asort and asorti heaps:

grant@deltree:/usr/local/bin$ grep asort *|grep -v "Binary file"
cc2ip-logview:  asort(tsdiff, tssort)
cc2ip-logview:  asort(qslen, qssort)
cc2ip-logview:  asort(rtime, rsort)
cc2ip-quota-lockout-view:       n = asorti(query, sort)
get-web-blocks: numip = asorti(ip, ipnum_sort)
get-web-blocks: n = asorti(list_name, list_name_sorted)
ipblockmerge:# requires recent gawk with 'asorti' (tested with gawk-3.1.5)
ipblockmerge:   n = asorti(list_input, list_sorted) # sort by start addr, blocksize
ipblockmerge:   x = asorti(list_out, list_out_sorted)
junkview:       pf = asort(xp)
junkview:               j = asort(kk)
junkview:       j = asort(kk)
junkview:       sort_addr_port_len = asort(sort_addr_port)
junkview:       sort_hits_port_len = asort(sort_hits_port)
junkview:       addr_hits_port_len = asort(addr_hits_port)
junkview:       hits_addr_port_len = asort(hits_addr_port)
junkview:       hits_netw_addr_len = asort(hits_netw_addr)
junkview:               m = asort(hl)
junkview:               asort(nl)
junkview:               sort_src_hit_dst_len = asort(sort_src_hit_dst)
logfilter:      tcpsize = asorti(tcp, tcpsort)  # sort by IP address
logfilter:      nettcpsize = asorti(nettcp, nettcpsort)         # sort by net address
pak-web-scan:           n = asorti(list, sorted)
show-browsers:  n = asorti(sort, sorted)
spam-net-finder:        count = asort(output, sorted)
spam-net-finder-db:     count = asort(output, sorted)

Grant.
-- 
http://bugsplatter.id.au
Reply Grant 10/23/2009 5:32:25 PM

On Oct 23, 10:06 am, Ed Morton <mortons...@gmail.com> wrote:
> A couple of things pop out: you need to add "n" to the pseudo-argument
> list, the use of getline isn't a safe syntax (see
> http://awk.info/?tip/getline),
>
>      Ed.

Ah, both thanks and drat as well, for that link. I was using getline
in my first awk program, but I'm pretty sure that by following the
above link I can eliminate it. Good information.

Reply Da_Gut 10/30/2009 6:06:42 PM

On Oct 30, 7:06 pm, Da_Gut <googlegro...@gutcup.com> wrote:
> On Oct 23, 10:06 am, Ed Morton <mortons...@gmail.com> wrote:
> > A couple of things pop out: you need to add "n" to the pseudo-argument
> > list, the use of getline isn't a safe syntax (see
> > http://awk.info/?tip/getline),
>
> >      Ed.
>
> Ah, both thanks and drat as well, for that link. I was using getline
> in my first awk program, but I'm pretty sure that by following the
> above link I can eliminate it. Good information.

Thanks for all the good discussion. I'll try to digest that link and
understand getline better, and then I'll clean up my code (I have in
fact done that already). The reason I don't use gawk is simply that I
develop on a Mac and deploy on Debian. And I am just a part-time
developer; it is in fact a hobby besides line management. So *awk is
not really an option. As for the one remark about the possibility of
sorting hashes in awk: I did not understand it, and I do not believe
it's possible, as sorting always involves swapping elements, and that
requires some kind of reference to elements, which I do not have in an
awk hash. Anyway, I might be mistaken, so please do prove me wrong with
a code sample :-)
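
One way such a code sample might look, as a sketch only and assuming nothing beyond POSIX awk: the hash itself is never reordered. Its keys are copied into a second array indexed 1..n, and only those positions are shuffled while the hash is used for lookups. The function name, the hits data and the order array are all made up for illustration.

```shell
out=$(awk '
# Fill keys[1..n] with the indices of arr, ordered by descending value.
# No references are needed: entries of keys[] are moved around, and
# arr[] is only read.
function sort_by_value(arr, keys,    n, i, j, k) {
    n = 0
    for (k in arr) keys[++n] = k
    for (i = 2; i <= n; i++) {          # plain insertion sort
        k = keys[i]
        for (j = i - 1; j >= 1 && arr[keys[j]] + 0 < arr[k] + 0; j--)
            keys[j + 1] = keys[j]
        keys[j + 1] = k
    }
    return n
}
BEGIN {
    # hypothetical sample data
    hits["/index.html"] = 50
    hits["/about.html"] = 5
    hits["/news.html"]  = 45
    split("", order)                    # start with an empty array
    n = sort_by_value(hits, order)
    for (i = 1; i <= n; i++) print order[i], hits[order[i]]
}')
printf "%s\n" "$out"
```

Insertion sort is quadratic, so for very large hashes piping to sort(1) may still be the faster route.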

Btw, I use awk to analyze custom webserver logs and histogram data
and to get sorted cross references... It's fast and nasty, and yes, I know
about the existence of perl, ruby or open source log analyzers. But as
someone recently put it: "awk is a nice chainsaw..."
Cheers
Johannes
Reply johannes 11/5/2009 9:35:03 PM

