Hi, Awk beginner here. I found this article about how to emulate arrays of
arrays in Awk
http://www.billposer.org/Linguistics/Computation/Miscnotes/Lists.html
unfortunately, it does not cut it for me.
What I need is:
array1["a"] -> array2["a_1"], array2["a_2"],array2["a_3"],array2["a_4"],
:
array1["z"] -> array2["z_1"], array2["z_2"],array2["z_3"],array2["z_4"],
and so far so good:
the problem is when I need to loop on part of array2[] (say all "a_*").
How do I achieve that?
For now I have worked around the problem by rescanning the whole big
array and filtering out what I do not need with:
if (index(idxsub,idx) != 0)
but it goes without saying that this is slow as hell for large amounts of data.
Ideas?
Thanks
Luca
luca | 9/6/2010 6:14:29 PM
On 9/6/2010 1:14 PM, luca wrote:
>
> Hi, Awk beginner here. I found this article about how to emulate arrays of
> arrays in Awk
>
> http://www.billposer.org/Linguistics/Computation/Miscnotes/Lists.html
>
> unfortunately, it does not cut it for me.
>
> What I need is:
>
> array1["a"] -> array2["a_1"], array2["a_2"],array2["a_3"],array2["a_4"],
> :
> array1["z"] -> array2["z_1"], array2["z_2"],array2["z_3"],array2["z_4"],
>
> and so far so good:
>
> the problem is when I need to loop on part of array2[] (say all "a_*").
>
> How do I achieve that?
>
> I have actually worked the problem around momentarily by rescanning the full big
> matrix and filtering out what I do not need with:
>
> if (index(idxsub,idx) != 0)
>
> but it goes without saying that this is slow as hell for large amounts of data.
>
> Ideas?
>
> Thanks
>
> Luca
>
Is this what you're looking for:
$ cat array.awk
BEGIN {
    array1["a"] = "a_1 a_2 a_3 a_4"
    array2["a_1"] = "the"
    array2["a_2"] = "quick"
    array2["a_3"] = "brown"
    array2["a_4"] = "fox"
    c = split(array1["a"],indices)
    for (i=1; i<=c; i++) {
        print array2[indices[i]]
    }
}
$ awk -f array.awk
the
quick
brown
fox
If not, clarify....
Ed.
Ed | 9/6/2010 10:53:54 PM
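Ed's example above hard-codes the key list in array1["a"]. A minimal sketch of building that list while the input is being read, assuming (hypothetically) that each input line is "key value" and that the group is the part of the key before the first underscore. The function name group_words and the sample words are made up for illustration:

```shell
# Hypothetical helper: group keys by the text before the first "_".
group_words() {
  awk '
  {
      split($1, parts, "_")                   # parts[1] is the group, e.g. "a"
      grp = parts[1]
      if (!(grp in array1)) order[++n] = grp  # remember first-seen group order
      array1[grp] = array1[grp] " " $1        # append this key to the group list
      array2[$1] = $2
  }
  END {
      for (g = 1; g <= n; g++) {
          grp = order[g]
          c = split(array1[grp], indices)     # keys of this group only
          line = grp ":"
          for (i = 1; i <= c; i++) line = line " " array2[indices[i]]
          print line
      }
  }'
}

printf 'a_1 the\nz_1 lazy\na_2 quick\nz_2 dog\n' | group_words
# a: the quick
# z: lazy dog
```

The inner loop runs over the keys stored for one group, so its cost depends on the size of that group and not on the size of array2.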
On 07/09/2010 0.53, Ed Morton wrote:
> If not, clarify....
>
> Ed.
Thank you, Ed. And apologies for not having been sufficiently clear in the first
place. I'll go into more detail about my particular case.
I have a column of strings which should become the indices of my array, with one
extra caveat ("token" = a constant string that is part of every string and can be
made to work as a kind of delimiter):
common_root1_token
common_root1_token_variation1
common_root1_token_variation2
common_root1_token_variation3
common_root1_token_variation4
:
common_root4_token
common_root4_token_variation1
common_root4_token_variation2
common_root4_token_variation3
:
and so on.
I need to sum the values of other fields on the same line for all
"common_rootN_token_*" entries (including the bare "common_root1_token"). But I
also want to collect the values for each sub-root individually.
What I am doing now is:
index_root = index($1,"_token");
root_id = substr($1,0,index_root+5);
#track "roots" collectively
if ($9 == 0) zeros[root_id]++
if ($9 == 1) ones[root_id]++
#but also track single elements
if ($9 == 0) zerossub[$1]++
if ($9 == 1) onessub[$1]++
Basically, I see no alternative to duplicating arrays
Now, if I had real arrays of arrays I could probably just have something like:
zeros[root_id]++
zerossub[root_id].array[substr($1,index_root+5)]++
at that point, I could do something like:
for (idx in zeros) {
    #go for the roots
    for (subidx in zerossub[idx].array) {
        #go for subelements
    }
}
I hope I have been clear this time and thank you for your attention.
Luca
luca | 9/7/2010 2:02:18 PM
On 07/09/10 16:02, luca wrote:
> On 07/09/2010 0.53, Ed Morton wrote:
>
>> If not, clarify....
>>
>> Ed.
>
>
> Thank you, Ed. And apologies for not having been sufficiently clear in
> the first place. I'll go into more detail about my particular case.
>
> I have a column with strings which should become the index of my array.
> With an extra caveat ("token" = constant string part of all strings
> which can be made work as some kind of delimiter):
>
> common_root1_token
> common_root1_token_variation1
> common_root1_token_variation2
> common_root1_token_variation3
> common_root1_token_variation4
> :
> common_root4_token
> common_root4_token_variation1
> common_root4_token_variation2
> common_root4_token_variation3
> :
>
> and so on.
>
> I need to sum the values of other fields on the same line of all
What values? The '0' and '1' in field $9 ? - Why don't you just name it?
> "common_rootN_token_*" (including "common_root1_token"). But I also want
> to collect the subroots values singularly.
>
> What I am doing now is:
>
> index_root = index($1,"_token");
> root_id = substr($1,0,index_root+5);
Why not...
root_id = substr($1,1,index_root-1)
....which would make, e.g., common_root4 a root_id; do you need the
constant "_token" part for something?
>
> #track "roots" collectively
> if ($9 == 0) zeros[root_id]++
> if ($9 == 1) ones[root_id]++
>
> #but also track single elements
> if ($9 == 0) zerossub[$1]++
> if ($9 == 1) onessub[$1]++
>
> Basically, I see no alternative to duplicating arrays
Nothing's wrong with storing the root values in a separate array.
BTW, more typically in awk we write...
$9 == 0 { zeros[root_id]++ }
$9 == 1 { ones[root_id]++ }
$9 == 0 { zerossub[$1]++ }
$9 == 1 { onessub[$1]++ }
>
> Now, if I had real arrays of arrays I could probably just have something
> like:
>
> zeros[root_id]++
> zerossub[root_id].array[substr($1,index_root+5)]++
>
> at that point, I could do something like:
>
> for (idx in zeros) {
>
> #go for the roots
>
> for (subidx in zerossub[idx].array) {
Why don't you iterate over...
for (subidx in zerossub[idx]) {
....if you're interested in your summed up values?
>
> #go for subelements
>
> }
> }
>
> I hope I have been clear this time and thank you for your attention.
>
> Luca
This posting doesn't really make your intention any clearer, if you ask me.
It would be clearer if you provided actual, meaningful sample
data and the associated _expected output_.
Janis
Janis | 9/7/2010 2:44:42 PM
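On the two substr() calls discussed above: awk numbers string positions from 1, and a start position of 0 is a known portability trap (as far as I know, gawk then returns one character fewer than the length asked for). Starting at 1 avoids the question. A small sketch using one of the placeholder IDs from the thread; the function names root_with_token and root_only are made up for illustration:

```shell
# Hypothetical helpers; "_token" is the constant delimiter from the thread.
root_with_token() {
  awk '{ i = index($1, "_token"); print substr($1, 1, i + 5) }'
}
root_only() {
  awk '{ i = index($1, "_token"); print substr($1, 1, i - 1) }'
}

echo common_root1_token_variation1 | root_with_token   # common_root1_token
echo common_root1_token_variation1 | root_only         # common_root1
```

Neither helper checks for IDs that lack the token (index() returns 0 in that case), so real input would probably need a guard.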
On 07/09/2010 16.44, Janis Papanagnou wrote:
>>
>> I need to sum the values of other fields on the same line of all
>
> What values? The '0' and '1' in field $9 ? - Why don't you just name it?
yes, I need to know how many entries have 0s in field 9 and how many have 1s
>
>> "common_rootN_token_*" (including "common_root1_token"). But I also want
>> to collect the subroots values singularly.
>>
>> What I am doing now is:
>>
>> index_root = index($1,"_token");
>> root_id = substr($1,0,index_root+5);
>
> Why not...
>
> root_id = substr($1,1,index_root-1)
the token is an integral part of the ID which I will also need for other
purposes at a later stage. Anyway, this doesn't change my problem, I think.
> BTW, more typically in awk we write...
>
> $9 == 0 { zeros[root_id]++ }
> $9 == 1 { ones[root_id]++ }
> $9 == 0 { zerossub[$1]++ }
> $9 == 1 { onessub[$1]++ }
thank you for the idiom
> This posting makes your intention not really clearer if you ask me.
>
> It would make it clearer if you'd provided actual significant sample
> data and the associated _expected output_.
Let's see:
Data (token = "ver1"):
nokia_2680s_ver1 Nokia2680s-2/1.0 (05.28) Profile/MIDP-2.1
Configuration/CLDC-1.1
"http://nds1.nds.nokia.com/uaprof/N2680s-2r100.xml",
"1-CPkWdm0QMfqv0FRqDwbeBg==", "2-XnrTOLDzBJdZHN2vSasoNA==",
"3-HC5l5j+eQ9tPRpdhsseJIQ=="NULL NULL NULL NULL NULL 1
nokia_2680s_ver1_sub2b Nokia2680s-2b/1.0 (06.17) Profile/MIDP-2.1
Configuration/CLDC-1.1
"http://nds1.nds.nokia.com/uaprof/N2680s-2br100.xml" NULLNULL NULL NULL
NULL 1
nokia_2680s_ver1 Nokia2680s-2/1.0 (06.17) Profile/MIDP-2.1
Configuration/CLDC-1.1
"http://nds1.nds.nokia.com/uaprof/N2680s-2r100.xml" NULLNULL NULL NULL
NULL 1
nokia_2680s_ver1_sub2b Nokia2680s-2b/1.0 (06.17) Profile/MIDP-2.1
Configuration/CLDC-1.1 "http://nds1.nds.nokia.com/uaprof/N2680s-2br100.xml"
NULL NULL NULL NULL NULL 0
nokia_2680s_ver1_sub2b Nokia2680s-2b/1.0 (06.17) Profile/MIDP-2.1
Configuration/CLDC-1.1
"http://nds1.nds.nokia.com/uaprof/N2680s-2br100.xml" NULL NULL NULL
NULL NULL 1
expected result:
nokia_2680s_ver1 => (50,86)
nokia_2680s_ver1_sub2b (5,4)
nokia_2680s_ver1_subua (1,)
nokia_2680s_ver1 (44,82)
Thanks
Luca
luca | 9/7/2010 3:06:29 PM
On 07/09/2010 17.06, luca wrote:
> On 07/09/2010 16.44, Janis Papanagnou wrote:
>
is it safe to assume that what I want to do cannot be done in Awk?
luca | 9/8/2010 11:16:21 AM
luca wrote:
> On 07/09/2010 17.06, luca wrote:
>> On 07/09/2010 16.44, Janis Papanagnou wrote:
>>
>
> is it safe to assume that what I want to do cannot be done in Awk?
Nope. Paste correct data. In your previous example, your expected output was
nokia_2680s_ver1 => (50,86)
nokia_2680s_ver1_sub2b (5,4)
nokia_2680s_ver1_subua (1,)
nokia_2680s_ver1 (44,82)
but nokia_2680s_ver1_subua does not appear in the input (and sub2b appears
three times, etc.). Also, the lines were wrapped, which makes it unclear
where lines begin and end, so you had probably better paste the data
in some pastebin and then include the URL in the message.
pk | 9/8/2010 11:45:14 AM
On 08/09/2010 13.45, pk wrote:
> luca wrote:
>
>> On 07/09/2010 17.06, luca wrote:
>>> On 07/09/2010 16.44, Janis Papanagnou wrote:
>>>
>>
>> is it safe to assume that what I want to do cannot be done in Awk?
>
> Nope. Paste correct data. In your previous example, your expected output was
>
> nokia_2680s_ver1 => (50,86)
> nokia_2680s_ver1_sub2b (5,4)
> nokia_2680s_ver1_subua (1,)
> nokia_2680s_ver1 (44,82)
>
> but nokia_2680s_ver1_subua does not appear in the input (and sub2b appears
> three times, etc.). Also, lines were wrapping, which does not make it clear
> where lines are beginning and ending, so you probably better paste the data
> in some pastebin and then include the URL in the message.
Fair enough, here is an abridged version of the data which should be good enough
for the point I am trying to make (separator is "\t"):
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 0
nokia_2680s_ver1 1
nokia_2680s_ver1 0
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 0
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 0
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 1
nokia_2680s_ver1 0
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 0
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1 0
The code I posted before is still valid with the exception that the field number
has changed ($9 -> $2)
#track "roots" collectively
if ($2 == 0) zeros[root_id]++
if ($2 == 1) ones[root_id]++
#but also track single elements
if ($2 == 0) zerossub[$1]++
if ($2 == 1) onessub[$1]++
Thank you for your help
Luca
luca | 9/8/2010 2:01:16 PM
On 08/09/10 16:01, luca wrote:
> On 08/09/2010 13.45, pk wrote:
>> luca wrote:
>>
>>> On 07/09/2010 17.06, luca wrote:
>>>> On 07/09/2010 16.44, Janis Papanagnou wrote:
>>>>
>>>
>>> is it safe to assume that what I want to do cannot be done in Awk?
>>
>> Nope. Paste correct data. In your previous example, your expected
>> output was
>>
>> nokia_2680s_ver1 => (50,86)
>> nokia_2680s_ver1_sub2b (5,4)
>> nokia_2680s_ver1_subua (1,)
>> nokia_2680s_ver1 (44,82)
>>
>> but nokia_2680s_ver1_subua does not appear in the input (and sub2b
>> appears
>> three times, etc.). Also, lines were wrapping, which does not make it
>> clear
>> where lines are beginning and ending, so you probably better paste the
>> data
>> in some pastebin and then include the URL in the message.
>
> fair enough, here is an abridged version of the data which should be
> good enough for the point I am trying to make (separator is "\t"):
>
> [snip sample data]
>
> The code I posted before is still valid with the exception that the
> field number has changed ($9 -> $2)
>
>
> #track "roots" collectively
> if ($2 == 0) zeros[root_id]++
> if ($2 == 1) ones[root_id]++
>
> #but also track single elements
> if ($2 == 0) zerossub[$1]++
> if ($2 == 1) onessub[$1]++
>
> Thank you for your help
>
> Luca
>
>
Good start, but you forgot to post the expected output that
corresponds to your sample data above. Well, I am not quite sure,
but here's what I think you might want...
{
    index_root = index ($1, "_ver")
    root_id = substr ($1, 1, index_root+3)
    ri[root_id]
}
!($1 in ri) { si[$1] }
$NF == 0 { zeros[root_id]++ }
$NF == 1 { ones[root_id]++ }
$NF == 0 { zerossub[$1]++ }
$NF == 1 { onessub[$1]++ }
END {
    for (idx in ri) {
        print idx, zeros[idx]+0, ones[idx]+0
        for (subidx in si) {
            print " ", subidx, zerossub[subidx]+0, onessub[subidx]+0
        }
    }
}
The code has most of what was proposed already. The new parts are
mainly the sets ri[] and si[] that carry the ID's. And I replaced
the field $9 or $2 respectively by $NF (which is the last field on
the line). And I've added the "+0" so that you see "0" in case a
field is still uninitialized and would otherwise produce nothing.
For your sample data above the program will output...
nokia_2680s_ver 21 21
nokia_2680s_ver1_sub2a 14 0
nokia_2680s_ver1_subua 1 11
nokia_2680s_ver1 6 10
If that's not what you expect then provide expected output.
Janis
Janis | 9/8/2010 6:21:36 PM
On 08/09/10 20:21, Janis Papanagnou wrote:
> On 08/09/10 16:01, luca wrote:
>> [...]
Please ignore my previous posting. Being in a hurry, I just hacked it in
without checking. There's a wrong duplicate count in there, and the id
bounds are wrong. (If no one provides a solution in the meantime, I'll
have another look at it when I'm back tonight...)
Janis
Janis | 9/8/2010 6:35:51 PM
On 08/09/10 20:21, Janis Papanagnou wrote:
> On 08/09/10 16:01, luca wrote:
>> On 08/09/2010 13.45, pk wrote:
>>> luca wrote:
>>>
>>>> On 07/09/2010 17.06, luca wrote:
>>>>> On 07/09/2010 16.44, Janis Papanagnou wrote:
>>>>>
>>>>
>>>> is it safe to assume that what I want to do cannot be done in Awk?
>>>
>>> Nope. Paste correct data. In your previous example, your expected
>>> output was
>>>
>>> nokia_2680s_ver1 => (50,86)
>>> nokia_2680s_ver1_sub2b (5,4)
>>> nokia_2680s_ver1_subua (1,)
>>> nokia_2680s_ver1 (44,82)
>>>
>>> but nokia_2680s_ver1_subua does not appear in the input (and sub2b
>>> appears
>>> three times, etc.). Also, lines were wrapping, which does not make it
>>> clear
>>> where lines are beginning and ending, so you probably better paste the
>>> data
>>> in some pastebin and then include the URL in the message.
>>
>> fair enough, here is an abridged version of the data which should be
>> good enough for the point I am trying to make (separator is "\t"):
>>
>> [snip sample data]
>>
>> The code I posted before is still valid with the exception that the
>> field number has changed ($9 -> $2)
>>
>>
>> [snip code fragment]
>>
>> Thank you for your help
>>
>> Luca
>>
>>
>
> Good start, but you missed to post the expected output data that is
> corresponding to your sample data above. Well, I am not quite sure,
> but here's what I think you might want...
>
> [snip first version]
>
> The code has most of what was proposed already. The new parts are
> mainly the sets ri[] and si[] that carry the ID's. And I replaced
> the field $9 or $2 respectively by $NF (which is the last field on
> the line). And I've added the "+0" so that you see "0" in case a
> field is still uninitialized and would otherwise produce nothing.
>
> For your sample data above the program will output...
>
> nokia_2680s_ver 21 21
> nokia_2680s_ver1_sub2a 14 0
> nokia_2680s_ver1_subua 1 11
> nokia_2680s_ver1 6 10
>
> If that's not what you expect then provide expected output.
>
> Janis
Here's a re-written version that seems to have a simpler structure than
the one you had in your original code (which was based on separate arrays
for zeroes and ones). This one uses just one array with a 2-element index.
Explanations as comments in the code.
# a function to encapsulate the root extraction from ID
function root(id)
{
    match(id,/[^_]*_[^_]*_[^_]*/)
    return substr(id,RSTART,RLENGTH)
}
# count in two arrays, one for the "subs" and one for the "roots";
# this is done for every record, since you want the "roots" to be
# counted for both, individually and in the summary heading, and
# also memorize the "subs" IDs in si[]
{ s[$1,$NF]++ ; r[root($1),$NF]++ ; si[$1] }
# the "root" IDs, entries that have not three "_" in their ID, will
# be memorized separately in ri[]
$1 !~ /_.*_.*_/ { ri[$1] }
# finally print out the accumulated values by iterating over the
# memorized IDs, use of "+0" to force formatting an uninitialized
# array element as "0"
END {
    for (idx in ri) {
        print idx, r[idx,0]+0, r[idx,1]+0
        for (subidx in si) {
            print " ", subidx, s[subidx,0]+0, s[subidx,1]+0
        }
    }
}
Janis
Janis | 9/9/2010 12:16:49 AM
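A note on the 2-element index used above: in POSIX awk, s[i,j] is one ordinary element whose key is the string i SUBSEP j, so the two parts can be recovered with split() and tested with the (i,j) in s form. A minimal sketch; the function name subsep_demo and the stored value are made up for illustration:

```shell
# Sketch: a[i, j] is stored under the single string key  i SUBSEP j.
subsep_demo() {
  awk 'BEGIN {
      s["nokia_2680s_ver1_sub2a", 0] = 2
      for (key in s) {                    # a single element, so order is no issue
          n = split(key, part, SUBSEP)    # recover the two index parts
          print n, part[1], part[2], s[key]
      }
      if (("nokia_2680s_ver1_sub2a", 0) in s)    print "has 0-count"
      if (!(("nokia_2680s_ver1_sub2a", 1) in s)) print "no 1-count"
  }'
}

subsep_demo
# 2 nokia_2680s_ver1_sub2a 0 2
# has 0-count
# no 1-count
```

Because the key is a flat string, a plain "for (key in s)" still visits every element of s; there is no built-in way to visit only the keys that share a first part.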
On 08/09/2010 20.35, Janis Papanagnou wrote:
> On 08/09/10 20:21, Janis Papanagnou wrote:
>> On 08/09/10 16:01, luca wrote:
>>> [...]
>
> Please ignore my previous posting. Being in a hurry I just hacked it in
> without checking. There's a wrong duplicate count in there, and the id
> bounds are wrong. (If no one provides a solution in the meantime, I'll
> have a look on it again, when I'm back tonight...)
Thanks a lot, Janis. I am really curious to see how this can be solved, since
lists of lists seem like such a surprising omission from the language.
I wonder why, with so many awk extensions around (gawk, nawk) nobody took the
decision to go for AWKOS (AWK On Steroids) :)
Performance reasons of some kind?
Luca
luca | 9/9/2010 7:29:37 AM
On 09/09/10 09:29, luca wrote:
> On 08/09/2010 20.35, Janis Papanagnou wrote:
>> On 08/09/10 20:21, Janis Papanagnou wrote:
>>> On 08/09/10 16:01, luca wrote:
>>>> [...]
>>
>> Please ignore my previous posting. Being in a hurry I just hacked it in
>> without checking. There's a wrong duplicate count in there, and the id
>> bounds are wrong. (If no one provides a solution in the meantime, I'll
>> have a look on it again, when I'm back tonight...)
>
>
> Thanks a lot Janis. I am really curious to see how this can be solved,
You've seen my second proposal posted a few hours before your posting?
> since lists of lists seems such like a surprising omission from the
> language.
The ratio of language simplicity on the one side to expressive power
on the other is what makes awk an interesting choice, despite its lacking
features of a more complete language.
Janis
>
> I wonder why, with so many awk extensions around (gawk, nawk) nobody
> took the decision to go for AWKOS (AWK On Steroids) :)
>
> Performance reasons of some kind?
>
> Luca
Janis | 9/9/2010 10:58:00 AM
On 9/8/2010 9:01 AM, luca wrote:
> On 08/09/2010 13.45, pk wrote:
>> luca wrote:
>>
>>> On 07/09/2010 17.06, luca wrote:
>>>> On 07/09/2010 16.44, Janis Papanagnou wrote:
>>>>
>>>
>>> is it safe to assume that what I want to do cannot be done in Awk?
>>
>> Nope. Paste correct data. In your previous example, your expected output was
>>
>> nokia_2680s_ver1 => (50,86)
>> nokia_2680s_ver1_sub2b (5,4)
>> nokia_2680s_ver1_subua (1,)
>> nokia_2680s_ver1 (44,82)
>>
>> but nokia_2680s_ver1_subua does not appear in the input (and sub2b appears
>> three times, etc.). Also, lines were wrapping, which does not make it clear
>> where lines are beginning and ending, so you probably better paste the data
>> in some pastebin and then include the URL in the message.
>
> fair enough, here is an abridged version of the data which should be good enough
> for the point I am trying to make (separator is "\t"):
>
> [snip sample data]
Good sample input, but I wish you'd posted fewer lines of input and included
the expected output for that input too. As posted, we'd have to read your code
snippet, figure out what it does and then guess whether that's what you wanted it
to do.
What would the expected output be from this input:
nokia_2680s_ver1 1
nokia_2680s_ver1 0
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1_sub2a 0
Regards,
Ed.
Ed | 9/9/2010 12:50:17 PM
On 9/9/2010 2:29 AM, luca wrote:
> On 08/09/2010 20.35, Janis Papanagnou wrote:
>> On 08/09/10 20:21, Janis Papanagnou wrote:
>>> On 08/09/10 16:01, luca wrote:
>>>> [...]
>>
>> Please ignore my previous posting. Being in a hurry I just hacked it in
>> without checking. There's a wrong duplicate count in there, and the id
>> bounds are wrong. (If no one provides a solution in the meantime, I'll
>> have a look on it again, when I'm back tonight...)
>
>
> Thanks a lot Janis. I am really curious to see how this can be solved, since
> lists of lists seems such like a surprising omission from the language.
I think the problem is trivial to solve, we're just having difficulty precisely
understanding the requirements from your postings.
Here's 2 "lists":
list1 = "abc def ghi klm"
list2 = "nop qrs tuv"
Here's one implementation of a list of lists:
lol[1] = list1
lol[2] = list2
If you have a list of lists implemented as above and want to print the 3rd
element in list2, it's just:
split(lol[2],list)
print list[3]
>
> I wonder why, with so many awk extensions around (gawk, nawk) nobody took the
> decision to go for AWKOS (AWK On Steroids) :)
>
> Performance reasons of some kind?
>
> Luca
There's just rarely a need for anything more than the above for text processing
and if you do need something more (e.g. to be able to modify the 3rd field of
lol[2]), it's simple to implement it yourself using the existing language
constructs, e.g. one way would be:
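function xform(old,fld,val,   i,c,new,sep,list)
{
    c = split(old,list)          # break the list into its elements
    list[fld] = val              # replace the requested element
    for (i=1; i<=c; i++) {       # rebuild the space-separated string
        new = new sep list[i]
        sep = " "
    }
    return new
}
lol[2] = xform(lol[2],3,"xyz")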
Regards,
Ed.
Ed | 9/9/2010 1:28:20 PM
On 09/09/2010 12.58, Janis Papanagnou wrote:
> On 09/09/10 09:29, luca wrote:
>> On 08/09/2010 20.35, Janis Papanagnou wrote:
>>> On 08/09/10 20:21, Janis Papanagnou wrote:
>>>> On 08/09/10 16:01, luca wrote:
>>>>> [...]
>>>
>>> Please ignore my previous posting. Being in a hurry I just hacked it in
>>> without checking. There's a wrong duplicate count in there, and the id
>>> bounds are wrong. (If no one provides a solution in the meantime, I'll
>>> have a look on it again, when I'm back tonight...)
>>
>>
>> Thanks a lot Janis. I am really curious to see how this can be solved,
>
> You've seen my second proposal posted a few hours before your posting?
yes, but it still is not making sense (or at least, it does not do what I want
to do.)
I want to be able to loop on all elements of the form:
array[root_token_*]
the array may contain 500k entries, while given a "root", the "root_token_*"
subarray is made of 1 to 10 entries.
I have managed to implement a script that does the job (so Turing completeness is
not the problem here :), but only at the price of looping over the whole
500k-item array and skipping every entry that does not match the "root_token_*"
pattern (matching entries are rare).
Again, this works, but it is very, very inefficient when processing large amounts
of data.
I read what you and Ed suggest, but I don't see an answer on how to identify the
range of items I am interested in at each subloop, and just loop over that range.
Thanks
Luca
luca | 9/10/2010 4:37:31 PM
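One way to loop over just the "root_token_*" entries in POSIX awk is to record, while reading, which IDs belong to each root, using a per-root counter as a second index. A minimal sketch under stated assumptions: "_ver1" stands in for the token, members_of is a made-up name, and the moto_v3 line is hypothetical data added to show a second root:

```shell
# Sketch: list the IDs that belong to one root without scanning every key.
members_of() {
  awk -v want="$1" '
  {
      pos = index($1, "_ver1")              # "_ver1" plays the token role
      if (pos == 0) next                    # skip IDs without the token
      r = substr($1, 1, pos + 4)            # root including the token
      if (!($1 in seen)) {                  # register each ID once
          seen[$1]
          member[r, ++count[r]] = $1        # append to the list of this root
      }
  }
  END {
      for (i = 1; i <= count[want]; i++)    # visits only the wanted group
          print member[want, i]
  }'
}

printf '%s\n' 'nokia_2680s_ver1 1' 'nokia_2680s_ver1_sub2a 0' \
  'moto_v3_ver1_subx 1' 'nokia_2680s_ver1_subua 1' \
  'nokia_2680s_ver1_sub2a 0' | members_of nokia_2680s_ver1
# nokia_2680s_ver1
# nokia_2680s_ver1_sub2a
# nokia_2680s_ver1_subua
```

The END loop runs count[want] times (1 to 10 in the case described), however many entries the arrays hold in total.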
In article <f9tio.239013$813.230923@tornado.fastwebnet.it>,
luca <luca_remove@alice.it> wrote:
....
>I want to be able to loop on all elements of the form:
>
>array[root_token_*]
>
>the array may contain 500k entries, while given a "root", the "root_token_*"
>subarray is made of 1 to 10 entries.
>
>I have managed to implement a script that does the job (so Turing
>completeness is not the problem here :), but only at the price of
>looping on the whole 500k-items array and filtering out entries which
>do not match the "root_token_*" pattern (a rare occurrence). Again,
>this works, but it is very very inefficient when processing large
>amounts of data.
Right. And that's the tragedy. There is no way to do that in standard
AWK or GAWK, since they don't have true (or in any sense real)
multi-dimensional arrays. Don't blame them - it's not a feature present
in any traditional AWK, nor is it required by any "standard".
You really, really want TAWK. Unfortunately, I checked their website
just now (www.tasoft.com) and there's no change. I really wish we could
find Pat Thompson and get him to come out of the woodwork. It's a true
shame that such a fine program is not available to the general public -
and may not ever be.
Failing that, we should continue to urge Aharon Robbins to implement
true multi-dimensional arrays in GAWK - that's a feature that might just
get me to "upgrade"...
And, finally, failing that, AWK may just not be the language for this
particular problem of yours. I'm pretty sure that other, more "modern"
languages, like Perl/Python/Ruby/etc do have ways to handle this.
--
"The anti-regulation business ethos is based on the charmingly naive notion
that people will not do unspeakable things for money." - Dana Carpender
Quoted by Paul Ciszek (pciszek at panix dot com). But what I want to know
is why is this diet/low-carb food author doing making pithy political/economic
statements?
Nevertheless, the above quote is dead-on, because, the thing is - business
in one breath tells us they don't need to be regulated (which is to say:
that they can morally self-regulate), then in the next breath tells us that
corporations are amoral entities which have no obligations to anyone except
their officers and shareholders, then in the next breath they tell us they
don't need to be regulated (that they can morally self-regulate) ...
gazelle | 9/10/2010 4:50:57 PM
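For readers arriving later: gawk gained true arrays of arrays in version 4.0, which appeared after this thread was written. A minimal sketch of the nested loop asked for earlier, for gawk 4.0 or later only (it will not run in other awks). The function name aoa_report and the use of "_ver1" as the token are assumptions for illustration:

```shell
# Sketch for gawk 4.0 or later only; aoa_report is a hypothetical name.
aoa_report() {
  gawk '
  {
      pos = index($1, "_ver1")
      if (pos == 0) next                       # skip IDs without the token
      r = substr($1, 1, pos + 4)               # root including the token
      cnt[r][$1][$NF]++                        # true nested arrays
  }
  END {
      PROCINFO["sorted_in"] = "@ind_str_asc"   # deterministic loop order
      for (r in cnt)
          for (id in cnt[r])                   # only the IDs under this root
              printf "%s %s (%d,%d)\n", r, id, cnt[r][id][0], cnt[r][id][1]
  }'
}
```

Piping the data into aoa_report prints one line per ID with its (zeros,ones) counts, grouped under its root.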
On 9/10/2010 11:37 AM, luca wrote:
> On 09/09/2010 12.58, Janis Papanagnou wrote:
>> On 09/09/10 09:29, luca wrote:
>>> On 08/09/2010 20.35, Janis Papanagnou wrote:
>>>> On 08/09/10 20:21, Janis Papanagnou wrote:
>>>>> On 08/09/10 16:01, luca wrote:
>>>>>> [...]
>>>>
>>>> Please ignore my previous posting. Being in a hurry I just hacked it in
>>>> without checking. There's a wrong duplicate count in there, and the id
>>>> bounds are wrong. (If no one provides a solution in the meantime, I'll
>>>> have a look on it again, when I'm back tonight...)
>>>
>>>
>>> Thanks a lot Janis. I am really curious to see how this can be solved,
>>
>> You've seen my second proposal posted a few hours before your posting?
>
>
> yes, but it still is not making sense (or at least, it does not do what I want
> to do.)
>
> I want to be able to loop on all elements of the form:
>
> array[root_token_*]
Then just do that. Seriously - I'm REALLY struggling to understand what it is
you're trying to do that you're finding so difficult. The problem's probably on
my side but I think it'd help a lot to have specific input and output from that
input to look at so please tell us what your expected output would be given this
input:
nokia_2680s_ver1 1
nokia_2680s_ver1 0
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1_sub2a 0
or some other small, representative input set.
>
> the array may contain 500k entries, while given a "root", the "root_token_*"
> subarray is made of 1 to 10 entries.
>
> I have managed to implement a script that does the job (so Turing completeness is
> not the problem here :), but only at the price of looping on the whole
> 500k-items array and filtering out entries which do not match the "root_token_*"
> pattern (a rare occurrence).
> Again, this works, but it is very very inefficient when processing large amounts
> of data.
That's absolutely NOT necessary as far as I can tell from our postings.
>
> I read what you and Ed suggest, but I don't see an answer on how to identify the
> range of items I am interested in at each subloop, and just loop on that range.
and I, at least, don't understand why our various postings haven't provided the
answer you're looking for.
Ed.
Ed | 9/10/2010 5:04:37 PM
In article <i6dof6$mho$1@news.eternal-september.org>,
Ed Morton <mortonspam@gmail.com> wrote:
....
>That's absolutely NOT necessary as far as I can tell from our postings.
I am pretty sure that I have decoded the problem - and basically, you
have to listen to what he said in his most recent post and ignore most
of the code samples. As is often the case, the code samples only
obscure the underlying issue - said issue is with language design, not
any specific coding problem.
See my other post (from about 15 minutes ago).
--
Is God willing to prevent evil, but not able? Then he is not omnipotent.
Is he able, but not willing? Then he is malevolent.
Is he both able and willing? Then whence cometh evil?
Is he neither able nor willing? Then why call him God?
~ Epicurus
|
|
0
|
|
|
|
Reply
|
gazelle
|
9/10/2010 5:14:38 PM
|
|
On 10/09/2010 19.04, Ed Morton wrote:
>
> Then just do that. Seriously - I'm REALLY struggling to understand what it is
> you're trying to do that you're finding so difficult. The problem's probably on
> my side but I think it'd help a lot to have specific input and output from that
> input to look at so please tell us what your expected output would be given this
> input:
>
> nokia_2680s_ver1 1
> nokia_2680s_ver1 0
> nokia_2680s_ver1 1
> nokia_2680s_ver1_sub2a 0
> nokia_2680s_ver1_subua 1
> nokia_2680s_ver1_sub2a 0
sure
nokia_2680s_ver1 => (3,3)
nokia_2680s_ver1_sub2a (1,0)
nokia_2680s_ver1_subua (0,1)
nokia_2680s_ver1 (1,2)
> That's absolutely NOT necessary as far as I can tell from our postings.
I don't want to abuse any of your time (in fact, I am grateful that you and
other knowledgeable people here spent so much time to look at my problem).
Having said this, if you really want, I can send you the script and the file for
you to look at off-line.
>> I read what you and Ed suggest, but I don't see an answer on how to identify the
>> range of items I am interested into at each subloop, and just loop on that range.
>
> and I, at least, don't understand why our various postings haven't provided the
> answer you're looking for.
because you have provided a way to emulate multidimensional arrays and lists,
but not a way to loop on a subset of such multi-dimensional arrays.
Thanks
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/10/2010 5:48:03 PM
|
|
On 10/09/2010 18.50, Kenny McCormack wrote:
> And, finally, failing that, AWK may just not be the language for this
> particular problem of yours. I'm pretty sure that other, more "modern"
> languages, like Perl/Python/Ruby/etc do have ways to handle this.
That is probably true, but I sort of like Awk. I have heard of Awk many times
since my university days, years ago, but never really played around with it until
it dawned on me that it could be the right tool for this job. And it was! I read
the primer quickly (I used to know Perl, which helped) and the script was doing
what I needed in a matter of minutes.
It was just the way I had to duplicate arrays, and the way I had to loop over
them, that left me perplexed.
Anyway, I am happy that someone agrees that this is a glaring omission from the
language.
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/10/2010 5:56:57 PM
|
|
On 9/10/2010 12:48 PM, luca wrote:
> On 10/09/2010 19.04, Ed Morton wrote:
>
>>
>> Then just do that. Seriously - I'm REALLY struggling to understand what it is
>> you're trying to do that you're finding so difficult. The problem's probably on
>> my side but I think it'd help a lot to have specific input and output from that
>> input to look at so please tell us what your expected output would be given this
>> input:
>>
>> nokia_2680s_ver1 1
>> nokia_2680s_ver1 0
>> nokia_2680s_ver1 1
>> nokia_2680s_ver1_sub2a 0
>> nokia_2680s_ver1_subua 1
>> nokia_2680s_ver1_sub2a 0
>
> sure
>
> nokia_2680s_ver1 => (3,3)
> nokia_2680s_ver1_sub2a (1,0)
> nokia_2680s_ver1_subua (0,1)
> nokia_2680s_ver1 (1,2)
I surrender. I can't see what that output represents. I expected to see a
relationship between the input lines and the output, but why there's both of these:
nokia_2680s_ver1 => (3,3)
nokia_2680s_ver1 (1,2)
in the same output and why "subua 1" which appeared once in the input and "sub2a
0" which appeared twice lead to this:
nokia_2680s_ver1_sub2a (1,0)
nokia_2680s_ver1_subua (0,1)
just escapes me.
Good luck with your project.
Regards,
Ed.
>
>
>> That's absolutely NOT necessary as far as I can tell from our postings.
>
> I don't want to abuse any of your time (in fact, I am grateful that you and
> other knowledgeable people here spent so much time to look at my problem).
> Having said this, if you really want, I can send you the script and the file for
> you to look at off-line.
>
>
>>> I read what you and Ed suggest, but I don't see an answer on how to identify the
>>> range of items I am interested into at each subloop, and just loop on that
>>> range.
>>
>> and I, at least, don't understand why our various postings haven't provided the
>> answer you're looking for.
>
> because you have provided a way to emulate multidimensional arrays and lists,
> but not a way to loop on a subset of such multi-dimensional arrays.
>
> Thanks
>
> Luca
>
|
|
0
|
|
|
|
Reply
|
Ed
|
9/10/2010 6:09:20 PM
|
|
In article <i6dv9h$cbt$1@news.eternal-september.org>,
Ed Morton <mortonspam@gmail.com> wrote:
>On 9/10/2010 12:14 PM, Kenny McCormack wrote:
>> In article<i6dof6$mho$1@news.eternal-september.org>,
>> Ed Morton<mortonspam@gmail.com> wrote:
>> ...
>>> That's absolutely NOT necessary as far as I can tell from our postings.
>>
>> I am pretty sure that I have decoded the problem - and basically, you
>> have to listen to what he said in his most recent post and ignore most
>> of the code samples. As is often the case, the code samples only
>> obscures the underlying issue - said issue is with language design, not
>> any specific coding problem.
>>
>> See my other post (from about 15 minutes ago).
>>
>
>I do understand he wants to solve his problem with true
>multi-dimensional arrays. I also understand the tool he wants to use
>doesn't support true multi-dimensional arrays. Now, if I just
>understood what the problem actually IS I think I or any other regular
>here could show him how to solve it easily in the tool he wants to use
>without using multi-dimensional arrays.
My point was this: Forget about what actual "business problem" you might
think he is trying to solve. That's irrelevant.
The point is he wants to *efficiently* loop through all array subscripts
that match a certain pattern. And as you and I and a cast of thousands
have said, this just ain't possible to do in "standard" AWK (or in GAWK).
Just to put this more concretely (just in case you still aren't on the
right page with this):
Imagine an array with 500,000 elements (i.e., subscripts). Imagine that
25,000 of them (i.e., 25,000 of the subscripts) start with
"root_token" (or whatever). We want to quickly (emphasis on quickly
and/or "efficiently") pull out those 25,000 entries (without - and this
is the key - having to iterate through all 500,000).
Got it now?
And if you don't, go back and re-read the first paragraph that I posted
above (the one that starts with "My point").
--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.
- John Kenneth Galbraith -
|
|
0
|
|
|
|
Reply
|
gazelle
|
9/10/2010 7:27:55 PM
|
|
On 9/10/2010 2:27 PM, Kenny McCormack wrote:
> In article<i6dv9h$cbt$1@news.eternal-september.org>,
> Ed Morton<mortonspam@gmail.com> wrote:
>> On 9/10/2010 12:14 PM, Kenny McCormack wrote:
>>> In article<i6dof6$mho$1@news.eternal-september.org>,
>>> Ed Morton<mortonspam@gmail.com> wrote:
>>> ...
>>>> That's absolutely NOT necessary as far as I can tell from our postings.
>>>
>>> I am pretty sure that I have decoded the problem - and basically, you
>>> have to listen to what he said in his most recent post and ignore most
>>> of the code samples. As is often the case, the code samples only
>>> obscures the underlying issue - said issue is with language design, not
>>> any specific coding problem.
>>>
>>> See my other post (from about 15 minutes ago).
>>>
>>
>> I do understand he wants to solve his problem with true
>> multi-dimensional arrays. I also understand the tool he wants to use
>> doesn't support true multi-dimensional arrays. Now, if I just
>> understood what the problem actually IS I think I or any other regular
>> here could show him how to solve it easily in the tool he wants to use
>> without using multi-dimensional arrays.
>
> My point was this: Forget about what actual "business problem" you might
> think he is trying to solve. That's irrelevant.
>
> The point is he wants to *efficiently* loop through all array subscripts
> that match a certain pattern. And as you and I and a cast of thousands
> have said, this just ain't possible to do in "standard" AWK (or in GAWK).
The thing I'm getting stuck on is why THAT specific approach to getting at the
data you want is so important. I mean, if you want to do something with the
fields of an array, then you COULD loop through all the array subscripts and
select those that match a specific pattern or you could do something different.
> Just to put this more concretely (just in case you still aren't on the
> right page with this):
>
> Imagine an array with 500,000 elements (i.e., subscripts). Imagine that
> 25,000 of them (i.e., 25,000 of the subscripts) start with
> "root_token" (or whatever). We want to quickly (emphasis on quickly
> and/or "efficiently") pull out those 25,000 entries (without - and this
> is the key - having to iterate through all 500,000).
>
> Got it now?
I think so. So, given this type of array you want to be able to produce the
output below:
$ cat tst.awk
BEGIN {
    arr["a"]=10
    arr["root_token_1"]=20
    arr["b"]=30
    arr["root_token_2"]=40
    arr["c"]=50
    arr["root_token_3"]=60
    arr["d"]=70
    arr["root_token_4"]=70
    for (i in arr) {
        if (i ~ /^root_token/) {
            print i, arr[i]
        }
    }
}
$ awk -f tst.awk
root_token_1 20
root_token_2 40
root_token_3 60
root_token_4 70
without having to iterate through every element of the array. Is that all it is?
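One way to get that kind of output without scanning all of arr, as a minimal sketch: register each key under a group at the moment it is stored, then loop over the registered keys only. The helper put() and the arrays cnt[] and keys[] are made-up names for this illustration, not something posted in this thread, and the sketch assumes the group of every key is known when the key is stored.

```shell
out=$(awk '
# Store a value and, in the same step, register its key under a group.
# put(), cnt[] and keys[] are hypothetical names used only for this sketch.
function put(group, key, val) {
    arr[key] = val
    keys[group, ++cnt[group]] = key
}
BEGIN {
    put("other", "a", 10)
    put("root_token", "root_token_1", 20)
    put("other", "b", 30)
    put("root_token", "root_token_2", 40)
    put("other", "c", 50)
    put("root_token", "root_token_3", 60)
    # Visit only the members of one group, in insertion order.
    for (i = 1; i <= cnt["root_token"]; i++) {
        k = keys["root_token", i]
        print k, arr[k]
    }
}')
printf "%s\n" "$out"
```

The loop runs once per member of the group (three times here), however many other elements arr holds. The price is one extra array entry per stored key.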
Ed.
> And if you don't, go back and re-read the first paragraph that I posted
> above (the one that starts with "My point").
>
|
|
0
|
|
|
|
Reply
|
Ed
|
9/10/2010 8:59:53 PM
|
|
In article <i6e68b$dat$1@news.eternal-september.org>,
Ed Morton <mortonspam@gmail.com> wrote:
...
> for (i in arr) { # Iterate through every element of the array...
> if (i ~ /^root_token/) {
> print i, arr[i]
> }
...
>without having to iterate through every element of the array.
But you did.
--
> No, I haven't, that's why I'm asking questions. If you won't help me,
> why don't you just go find your lost manhood elsewhere.
CLC in a nutshell.
|
|
0
|
|
|
|
Reply
|
gazelle
|
9/10/2010 9:07:36 PM
|
|
On 9/10/2010 4:07 PM, Kenny McCormack wrote:
> In article<i6e68b$dat$1@news.eternal-september.org>,
> Ed Morton<mortonspam@gmail.com> wrote:
> ...
>> for (i in arr) { # Iterate through every element of the array...
>> if (i ~ /^root_token/) {
>> print i, arr[i]
>> }
> ...
>> without having to iterate through every element of the array.
>
> But you did.
>
I know I did, but I didn't have to. I just want to know if I understood the problem.
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
9/10/2010 9:13:27 PM
|
|
On 10/09/10 21:01, Ed Morton wrote:
> On 9/10/2010 12:14 PM, Kenny McCormack wrote:
>> In article<i6dof6$mho$1@news.eternal-september.org>,
>> Ed Morton<mortonspam@gmail.com> wrote:
>> ...
>>> That's absolutely NOT necessary as far as I can tell from our postings.
>>
>> I am pretty sure that I have decoded the problem - and basically, you
>> have to listen to what he said in his most recent post and ignore most
>> of the code samples. As is often the case, the code samples only
>> obscures the underlying issue - said issue is with language design, not
>> any specific coding problem.
>>
>> See my other post (from about 15 minutes ago).
>>
>
> I do understand he wants to solve his problem with true
> multi-dimensional arrays. I also understand the tool he wants to use
> doesn't support true multi-dimensional arrays. Now, if I just understood
> what the problem actually IS I think I or any other regular here could
> show him how to solve it easily in the tool he wants to use without
> using multi-dimensional arrays.
>
> It's probably just me needing a coffee, though.....
No, Ed, no more coffee necessary.
They have data IN, want data OUT, have a programming concept PC in mind,
and wish a transformation: IN --PC-> OUT, where their PC relies on
storing all data in memory and then iterating over it. In a stream processor
like awk we typically avoid that. We try to store as little as necessary
and do as much on the fly as possible, or we use the few data structures
that awk has instead of relying on structs, unions, classes, and complex
compositions of those types of arbitrarily large complexity and hierarchy.
Those guys are so focussed on iterating on the in-memory data that they
don't see that we do not need to build complex data structures for most
of the tasks, and, as it seems, also not for the task of this thread. If
they would only drop their "iteration" obsession and just state accurately
what the result should be; the OP's last posting still hasn't provided
that. @luca: Consider that there can be solutions IN --PC_alt-> OUT that
don't require your sort of "iteration".
Janis
>
> Ed.
|
|
0
|
|
|
|
Reply
|
Janis
|
9/10/2010 9:34:19 PM
|
|
On 10/09/10 19:48, luca wrote:
> On 10/09/2010 19.04, Ed Morton wrote:
>
>>
>> Then just do that. Seriously - I'm REALLY struggling to understand
>> what it is
>> you're trying to do that you're finding so difficult. The problem's
>> probably on
>> my side but I think it'd help a lot to have specific input and output
>> from that
>> input to look at so please tell us what your expected output would be
>> given this
>> input:
>>
>> nokia_2680s_ver1 1
>> nokia_2680s_ver1 0
>> nokia_2680s_ver1 1
>> nokia_2680s_ver1_sub2a 0
>> nokia_2680s_ver1_subua 1
>> nokia_2680s_ver1_sub2a 0
>
> sure
>
> nokia_2680s_ver1 => (3,3)
> nokia_2680s_ver1_sub2a (1,0)
> nokia_2680s_ver1_subua (0,1)
> nokia_2680s_ver1 (1,2)
Please check this output against your sample input; to me there seem to
be inconsistencies. If you think that data is okay, please explain how
the value nokia_2680s_ver1_sub2a (1,0) has been calculated, for
in the input data there are *two* entries with a zero, not one.
>
>> That's absolutely NOT necessary as far as I can tell from our postings.
>
> I don't want to abuse any of your time (in fact, I am grateful that you
> and other knowledgeable people here spent so much time to look at my
> problem).
> Having said this, if you really want, I can send you the script and the
> file for you to look at off-line.
>
>
>>> I read what you and Ed suggest, but I don't see an answer on how to
>>> identify the
>>> range of items I am interested into at each subloop, and just loop on
>>> that range.
>>
>> and I, at least, don't understand why our various postings haven't
>> provided the
>> answer you're looking for.
>
> because you have provided a way to emulate multidimensional arrays and
> lists, but not a way to loop on a subset of such multi-dimensional arrays.
Please tell me; do you want some specific *iteration* (which is impossible
to define in awk without complex container data types), or a concrete
*solution*? In the latter case please consider that there are certainly
solutions for your problem that don't require the sort of iteration that
you somehow seem to be focussed on.
Janis
>
> Thanks
>
> Luca
>
|
|
0
|
|
|
|
Reply
|
Janis
|
9/10/2010 9:42:05 PM
|
|
On 10/09/10 21:27, Kenny McCormack wrote:
> [...]
> Imagine an array with 500,000 elements (i.e., subscripts). Imagine that
> 25,000 of them (i.e., 25,000 of the subscripts) start with
> "root_token" (or whatever). We want to quickly (emphasis on quickly
> and/or "efficiently") pull out those 25,000 entries (without - and this
> is the key - having to iterate through all 500,000).
The point is that you have to read in all data anyway, and while reading
the data you do the necessary computation and, if necessary, store as
much of the data as necessary. You *cannot avoid* reading all the data.
The difference is just that, once the data is in memory, you can iterate
as many times as you like, and complex data structures support that; we
all know that. But in the OP's case it really seems unnecessary, as far
as I can see from what he has posted until now.
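As a minimal sketch of doing the computation while reading: the totals per root can be kept up to date record by record, so nothing has to be searched afterwards. The sample lines and the names tot[], seen[] and order[] are invented for this illustration, and the rule that a root is the first three "_"-separated parts of the ID is an assumption about the data.

```shell
out=$(printf "%s\n" \
    "nokia_2680s_ver1 1" \
    "nokia_2680s_ver1_sub2a 0" \
    "nokia_2680s_ver2 0" \
    "nokia_2680s_ver1 0" \
    "nokia_2680s_ver2_subua 1" |
awk '
# Assumption for this sketch: the root is the first three "_"-separated parts.
match($1, /^[^_]*_[^_]*_[^_]*/) {
    root = substr($1, RSTART, RLENGTH)
    # remember each root once, in order of first appearance
    if (!(root in seen)) { seen[root]; order[++n] = root }
    # running totals, updated as each record is read
    tot[root, $2]++
}
END {
    for (i = 1; i <= n; i++)
        printf "%s => (%d,%d)\n", order[i], tot[order[i], 0], tot[order[i], 1]
}')
printf "%s\n" "$out"
```

Each record is touched once, and the END loop visits one entry per root rather than one per stored element.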
Janis
PS: Since you're focussed on multi-dimensional arrays let me add another
point. I began programming with languages that have much more complex data
types than those, and "downgrading" to a language that supports just arrays
is a real pain as long as you don't see ways to handle that. Support for
a single data structure like multi-dimensional arrays is far from
covering all basic demands in many application areas. To be consistent,
many more features would have to be built into awk; but then we might get
into a situation similar to what can be observed with shells, especially ksh,
where they started plugging type systems and other features into the (shell)
language, a language many decades old, whose design concept was bulky
and doesn't really get better by that approach.
> [...]
|
|
0
|
|
|
|
Reply
|
Janis
|
9/10/2010 10:12:58 PM
|
|
"luca" <luca_remove@alice.it> wrote in message
news:nbuio.239037$813.213586@tornado.fastwebnet.it...
> On 10/09/2010 19.04, Ed Morton wrote:
>> input:
>>
>> nokia_2680s_ver1 1
>> nokia_2680s_ver1 0
>> nokia_2680s_ver1 1
>> nokia_2680s_ver1_sub2a 0
>> nokia_2680s_ver1_subua 1
>> nokia_2680s_ver1_sub2a 0
>
> sure
>
> nokia_2680s_ver1 => (3,3)
> nokia_2680s_ver1_sub2a (1,0)
> nokia_2680s_ver1_subua (0,1)
> nokia_2680s_ver1 (1,2)
Well, given that input and output, here's one way to get something like it:
$2 == "0" { zeroes[$1]++; zero_tot++ }
$2 == "1" { ones[$1]++; one_tot++ }
END {
    # find root name (an ID ending in "_ver" plus digits)
    # - won't fail on example input, but is conceivable that this can
    for ( variant in zeroes ) {
        if ( variant ~ /_ver[0-9]+$/ ) {
            root = variant
            break
        }
    }
    # print root name and totals for all variants
    print root, "=> (" zero_tot "," one_tot ")"
    # print all (n,m) and (n,0) entries
    for ( variant in zeroes ) {
        if ( variant in ones )
            print "\t", variant, "(" zeroes[variant] "," ones[variant] ")"
        else
            print "\t", variant, "(" zeroes[variant] ",0)"
    }
    # print all (0,n) entries
    for ( variant in ones ) {
        if ( !(variant in zeroes) )
            print "\t", variant, "(0, " ones[variant] ")"
    }
}
- Anton Treuenfels
|
|
0
|
|
|
|
Reply
|
Anton
|
9/10/2010 11:05:05 PM
|
|
On 10/09/2010 23.13, Ed Morton wrote:
> On 9/10/2010 4:07 PM, Kenny McCormack wrote:
>> In article<i6e68b$dat$1@news.eternal-september.org>,
>> Ed Morton<mortonspam@gmail.com> wrote:
>> ...
>>> for (i in arr) { # Iterate through every element of the array...
>>> if (i ~ /^root_token/) {
>>> print i, arr[i]
>>> }
>> ...
>>> without having to iterate through every element of the array.
>>
>> But you did.
>>
>
> I know I did, but I didn't have to. I just want to know if I understood the
> problem.
I think you understood the problem. You say you don't have to iterate, but you
did not show me how you can do it without iterating. The way I read your code,
you are evaluating the regexp 500k times for each iteration.
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/10/2010 11:39:15 PM
|
|
"Anton Treuenfels" <teamtempest@yahoo.com> wrote in message
news:suednYksULUwJBfRnZ2dnUVZ_tGdnZ2d@earthlink.com...
But wait, there can be different root names in your input. Can they all be
distinguished from non-root names by correct use of a single pattern?
# rootPattern is not defined in this snippet; the caller has to supply it
# (for example with awk -v rootPattern=...)
$9 == "0" { zeroes[$1]++ }
$9 == "1" { ones[$1]++ }
END {
    # find root names - multiple times, perhaps (don't care)
    for ( variant in zeroes ) {
        if ( variant ~ rootPattern )
            root[ variant ] = ".T."
    }
    for ( variant in ones ) {
        if ( variant ~ rootPattern )
            root[ variant ] = ".T."
    }
    # big loop
    for ( variant in root ) {
        # get total of variant and sub-variants
        zero_tot = 0
        for ( i in zeroes ) {
            if ( index(i, variant) == 1 )
                zero_tot += zeroes[ i ]
        }
        one_tot = 0
        for ( i in ones ) {
            if ( index(i, variant) == 1 )
                one_tot += ones[ i ]
        }
        # print root name and totals
        print variant, "=> (" zero_tot "," one_tot ")"
        # print all (n,m) and (n,0) entries
        for ( i in zeroes ) {
            if ( index(i, variant) == 1 ) {
                if ( i in ones )
                    print "\t", i, "(" zeroes[i] "," ones[i] ")"
                else
                    print "\t", i, "(" zeroes[i] ",0)"
            }
        }
        # print all (0,n) entries
        for ( i in ones ) {
            if ( index(i, variant) == 1 ) {
                if ( !(i in zeroes) )
                    print "\t", i, "(0, " ones[i] ")"
            }
        }
    }
}
With any luck, the total number of elements in "zeroes" and "ones" will be
substantially less than half a million, so it shouldn't be such a pain to
iterate over them multiple times.
- Anton Treuenfels
|
|
0
|
|
|
|
Reply
|
Anton
|
9/10/2010 11:42:33 PM
|
|
On 10/09/2010 23.42, Janis Papanagnou wrote:
>>> nokia_2680s_ver1 1
>>> nokia_2680s_ver1 0
>>> nokia_2680s_ver1 1
>>> nokia_2680s_ver1_sub2a 0
>>> nokia_2680s_ver1_subua 1
>>> nokia_2680s_ver1_sub2a 0
>>
>> sure
>>
>> nokia_2680s_ver1 => (3,3)
>> nokia_2680s_ver1_sub2a (1,0)
>> nokia_2680s_ver1_subua (0,1)
>> nokia_2680s_ver1 (1,2)
>
> Please check this output against your sample input; to me there seem to
> be inconsistencies. If you think that data is okay, please explain how
> the value nokia_2680s_ver1_sub2a (1,0) has been calculated, for
> in the input data there's *two* entries with a zero, not one.
you are absolutely right. My bad. That should be:
nokia_2680s_ver1_sub2a (2,0)
>> because you have provided a way to emulate multidimensional arrays and
>> lists, but not a way to loop on a subset of such multi-dimensional arrays.
>
> Please tell me; do you want some specific *iteration* (which is impossible
> to define in awk without complex container data types), or a concrete
> *solution*. In the latter case please consider that there are certainly
> solutions for your problem that don't require the sort of iteration that
> you somehow seem to be focussed on.
I had already solved my problem before I posted. I was just perplexed that the
solution was so inelegant, and it seemed strange that there was no
better way to do it. Some webpages show how to emulate
multi-dimensional arrays, but they did not cut it for me (I couldn't iterate
efficiently).
So I figured that asking here would be a good idea.
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/10/2010 11:43:58 PM
|
|
On 11/09/2010 1.05, Anton Treuenfels wrote:
>
>
> Well, given that input and output, here's one way to get something like it::
>
> $2 == "0" { zeroes[$1]++; zero_tot++ }
> $2 == "1" { ones[$1]++; one_tot++ }
actually zero_tot will also need to be an array because I have multiple roots.
Anyway, my question boils down to whether multidimensional arrays can be
emulated efficiently for the purpose of iterating on an arbitrary sub-array.
I think that the discussion has clarified that the answer is no.
thank you
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/10/2010 11:49:29 PM
|
|
On 11/09/10 01:43, luca wrote:
> On 10/09/2010 23.42, Janis Papanagnou wrote:
>
>>>> nokia_2680s_ver1 1
>>>> nokia_2680s_ver1 0
>>>> nokia_2680s_ver1 1
>>>> nokia_2680s_ver1_sub2a 0
>>>> nokia_2680s_ver1_subua 1
>>>> nokia_2680s_ver1_sub2a 0
>>>
>>> sure
>>>
>>> nokia_2680s_ver1 => (3,3)
>>> nokia_2680s_ver1_sub2a (1,0)
>>> nokia_2680s_ver1_subua (0,1)
>>> nokia_2680s_ver1 (1,2)
>>
>> Please check this output against your sample input; to me there seem to
>> be inconsistencies. If you think that data is okay, please explain how
>> the value nokia_2680s_ver1_sub2a (1,0) has been calculated, for
>> in the input data there's *two* entries with a zero, not one.
>
> you are absolutely right. My bad. That should be:
>
> nokia_2680s_ver1_sub2a (2,0)
Okay. So the program I posted upthread produces exactly what you asked
for. Here it is again for your convenience...
# a function to encapsulate the root extraction from ID
function root(id)
{
    match(id,/[^_]*_[^_]*_[^_]*/)
    return substr(id,RSTART,RLENGTH)
}
# count in two arrays, one for the "subs" and one for the "roots";
# this is done for every record, since you want the "roots" to be
# counted for both, individually and in the summary heading, and
# also memorize the "subs" IDs in si[]
{ s[$1,$NF]++ ; r[root($1),$NF]++ ; si[$1] }
# the "root" IDs, entries that do not have three "_" in their ID, will
# be memorized separately in ri[]
$1 !~ /_.*_.*_/ { ri[$1] }
# finally print out the accumulated values by iterating over the
# memorized IDs, use of "+0" to force formatting an uninitialized
# array element as "0"
END {
    for (idx in ri) {
        print idx, r[idx,0]+0, r[idx,1]+0
        for (subidx in si) {
            print " ", subidx, s[subidx,0]+0, s[subidx,1]+0
        }
    }
}
With this input data...
nokia_2680s_ver1 1
nokia_2680s_ver1 0
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1_subua 1
nokia_2680s_ver1_sub2a 0
...the program will produce this output...
nokia_2680s_ver1 3 3
nokia_2680s_ver1_sub2a 2 0
nokia_2680s_ver1_subua 0 1
nokia_2680s_ver1 1 2
...where you expected...
nokia_2680s_ver1 => (3,3)
nokia_2680s_ver1_sub2a (2,0)
nokia_2680s_ver1_subua (0,1)
nokia_2680s_ver1 (1,2)
You see, it's exactly the same data without any superfluous "iteration"
or any complex data structures.
The only thing you have to do is to add some filling symbols in the
print statements (brackets, a comma, and the arrow), or use printf
for simplicity, something like...
printf "%s\t => (%d,%d)\n", idx, r[idx,0]+0, r[idx,1]+0
and resp.
printf " %s\t(%d,%d)\n", subidx, s[subidx,0]+0, s[subidx,1]+0
Easy, isn't it?
And it doesn't even need to store all the data lines, because only
the IDs and counts are memorized. Efficient, isn't it?
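To illustrate the "+0" remark and the printf variant, here is a small sketch; the array cnt[] and its single entry are made up for the example. A never-assigned element prints as an empty string with print, while adding 0 or formatting with %d turns it into 0.

```shell
out=$(awk '
BEGIN {
    # only the "1" count was ever seen for this hypothetical ID
    cnt["a", 1] = 2
    # the unset element prints as an empty string
    print "a", cnt["a", 0], cnt["a", 1]
    # adding 0 forces a numeric value
    print "a", cnt["a", 0] + 0, cnt["a", 1] + 0
    # %d formats the unset element as 0 as well
    printf "%s\t(%d,%d)\n", "a", cnt["a", 0], cnt["a", 1]
}')
printf "%s\n" "$out"
```

The first line of output has an empty middle field, the second reads "a 0 2", and the third shows the bracketed form.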
>
>
>>> because you have provided a way to emulate multidimensional arrays and
>>> lists, but not a way to loop on a subset of such multi-dimensional
>>> arrays.
>>
>> Please tell me; do you want some specific *iteration* (which is
>> impossible
>> to define in awk without complex container data types), or a concrete
>> *solution*. In the latter case please consider that there are certainly
>> solutions for your problem that don't require the sort of iteration that
>> you somehow seem to be focussed on.
>
> I had solved my problem since even before I posted. I was just perplexed
> about the fact that it was very unelegant and it seemed strage that
> there was no better way to do it.
Elegance lies in the eye of the beholder. Your first attempt clearly
showed that you had not yet understood how to use awk appropriately.
If you haven't noticed awk's elegance you might want to consider using
another language which has (and supports) the desired complexity?
> Some webpages exist that show how to
> emulate multi-dimensional arrays, but they did not cut it for me
> (couldn't iterate efficiently).
>
> So I figured that asking here would be a good idea.
There are reasons why awk is not designed as a full blown programming
language with all complex data types. Because it doesn't have those
data types doesn't mean that you cannot solve your problems with awk.
(Suppose awk had those multi-dimensional arrays; some other
guy would be asking for classes, multisets, multimaps, balanced trees,
etc.)
Good luck with your further efforts.
Janis
>
> Luca
>
>
|
|
0
|
|
|
|
Reply
|
Janis
|
9/11/2010 12:31:35 AM
|
|
Oh, I can't resist. One more variant.
"Anton Treuenfels" <teamtempest@yahoo.com> wrote in message
news:C9OdnS5PaLDoXxfRnZ2dnUVZ_tidnZ2d@earthlink.com...
>
> "Anton Treuenfels" <teamtempest@yahoo.com> wrote in message
> news:suednYksULUwJBfRnZ2dnUVZ_tGdnZ2d@earthlink.com...
>
> But wait, there can be different root names in your input. Can they all be
> distinguished from non-root names by correct use of a single pattern?
>
> $9 == "0" { zeroes[$1]++ }
> $9 == "1" { ones[$1]++ }
>
> END {
>
> # find root names - multiple times, perhaps (don't care)
>
> for ( variant in zeroes ) {
> if ( variant ~ rootPattern )
> root[ variant ] = ".T."
> }
>
> for ( variant in ones ) {
> if ( variant ~ rootPattern )
> root[ variant ] = ".T."
> }
>
> # big loop
>
> for ( variant in root ) {
>
> # get total of variant and sub-variants
>
> zero_tot = 0
> for ( i in zeroes ) {
> if ( index(i, variant) == 1 )
> zero_tot += zeroes[ i ]
> }
>
> one_tot = 0
> for ( i in ones ) {
> if ( index(i, variant) == 1 )
> one_tot += ones[ i ]
> }
>
> # print root name and totals
>
> print variant, "=> (" zero_tot "," one_tot ")"
>
> # print all (n,m) and (n,0) entries
>
> for ( i in zeroes ) {
> if ( index(i, variant) == 1 ) {
if ( i in ones ) {
> print "\t", i, "(" zeroes[i] "," ones[i] ")"
delete ones[i]
}
> else
> print "\t", i, "(" zeroes[i] ",0)"
delete zeroes[ i ]
> }
> }
>
> # print all (0,n) entries
>
> for ( i in ones ) {
> if ( index(i, variant) == 1 ) {
> print "\t", i, "(0, " ones[i] ")"
delete ones[ i ]
> }
> }
> }
>
> With any luck, the total number of elements in "zeroes" and "ones" will be
> substantially less than half a million, so it shouldn't be such a pain to
> iterate over them multiple times.
And the number of elements will go down as each root and sub-variant count is
eliminated once it isn't needed any more. Which should make each iteration
of "big loop" faster than the previous one. Enough to make a noticeable
difference? I dunno. How many root names + sub-variants are there?
> - Anton Treuenfels
>
|
|
0
|
|
|
|
Reply
|
Anton
|
9/11/2010 12:33:08 AM
|
|
On 10/09/10 23:34, Janis Papanagnou wrote:
> On 10/09/10 21:01, Ed Morton wrote:
>>
>> It's probably just me needing a coffee, though.....
>
> No, Ed, no more coffee necessary.
I have to correct that statement; probably both of us (and the OP
as well) need more coffee. Rethinking the whole thread, the
OP's sample data, after first having been incorrect, may in the end
not have described his data sufficiently; with more IDs
we will certainly need another, more complex approach. Sigh.
Janis
|
|
0
|
|
|
|
Reply
|
Janis
|
9/11/2010 12:49:28 AM
|
|
In article <Ckzio.239199$813.89074@tornado.fastwebnet.it>,
luca <luca_remove@alice.it> wrote:
>On 10/09/2010 23.13, Ed Morton wrote:
>> On 9/10/2010 4:07 PM, Kenny McCormack wrote:
>>> In article<i6e68b$dat$1@news.eternal-september.org>,
>>> Ed Morton<mortonspam@gmail.com> wrote:
>>> ...
>>>> for (i in arr) { # Iterate through every element of the array...
>>>> if (i ~ /^root_token/) {
>>>> print i, arr[i]
>>>> }
>>> ...
>>>> without having to iterate through every element of the array.
>>>
>>> But you did.
>>>
>>
>> I know I did, but I didn't have to. I just want to know if I understood the
>> problem.
>
>I think you understood the problem. You say you don't have to iterate, but you
>did not show me how you can do it without iterating. The way I read your code,
>you are evaluating the regexp 500k times for each iteration.
>
>Luca
>
I honestly think that I am the only one here besides you who understands
what you are saying. Everyone else is fixated, as newsgroup posters
usually are, on solving some specific problem - as if they were taking a
standardized test and needed a good score to get into a good college.
Every so-called "solution" posted has involved iterating over the entire
array - and you have made it clear (to me, if to no one else) that that
is precisely what you are trying to avoid.
--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.
- John Kenneth Galbraith -
|
|
0
|
|
|
|
Reply
|
gazelle
|
9/11/2010 1:28:50 AM
|
|
In article <cuzio.239202$813.3754@tornado.fastwebnet.it>,
luca <luca_remove@alice.it> wrote:
>On 11/09/2010 1.05, Anton Treuenfels wrote:
>>
>>
>> Well, given that input and output, here's one way to get something like it::
>>
>> $2 == "0" { zeroes[$1]++; zero_tot++ }
>> $2 == "1" { ones[$1]++; one_tot++ }
>
>
>actually zero_tot will also need to be an array because I have multiple roots.
>
>Anyway, my question boils down to whether multidimensional arrays can be
>emulated efficiently for the purpose of iterating on an arbitrary sub-array.
>
>I think that the discussion has clarified that the answer is no.
>
>thank you
>
>Luca
Quite so. Within the confines of "standard" AWK (or GAWK), the answer
is a clear-cut "No".
--
One of the best lines I've heard lately:
Obama could cure cancer tomorrow, and the Republicans would be
complaining that he had ruined the pharmaceutical business.
(Heard on Stephanie Miller = but the sad thing is that there is an awful lot
of direct truth in it. We've constructed an economy in which eliminating
cancer would be a horrible disaster. There are many other such examples.)
gazelle | 9/11/2010 1:30:42 AM
On 11/09/10 02:49, Janis Papanagnou wrote:
> On 10/09/10 23:34, Janis Papanagnou wrote:
>> On 10/09/10 21:01, Ed Morton wrote:
>>>
>>> It's probably just me needing a coffee, though.....
>>
>> No, Ed, no more coffee necessary.
>
> I have to correct that statement; probably both of us (and the OP
> as well) need more coffee. Rethinking about the whole thread, the
> OP's sample data, after first having been incorrect, finally might
> not have been describing his data sufficiently; with more IDs
> we will certainly need another, more complex approach. Sigh.
Here's a slightly more complex approach (based on code I posted before)
that also does the book-keeping for the sub-array indices...
match ($1, /[^_]*_[^_]*_[^_]*/) {
root = substr($1,RSTART,RLENGTH)
subs = substr($1,RSTART+RLENGTH)
if (!subs) {
ri[root]
}
else if (!($1 in si)) {
si[$1]
ids[root] = ids[root]" "$1
}
s[$1,$NF]++ ; r[root,$NF]++
}
END {
for (root in ids) {
print root, r[root,0]+0, r[root,1]+0
n = split(ids[root],subids)
for (i=1; i<=n; i++)
print " ", subids[i], s[subids[i],0]+0, s[subids[i],1]+0
}
}
For the subsequent (shuffled) test data (with more than one root)...
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver1 1
nokia_2680s_ver1_sub2a 0
nokia_2680s_ver2_subua 0
nokia_2680s_ver2 1
nokia_2680s_ver2 0
nokia_2680s_ver1 0
nokia_2680s_ver2 1
nokia_2680s_ver1_subua 1
nokia_2680s_ver2_sub2a 1
nokia_2680s_ver2_sub2a 1
....it produces the following output...
nokia_2680s_ver1 3 3
nokia_2680s_ver1_sub2a 2 0
nokia_2680s_ver1_subua 0 1
nokia_2680s_ver2 2 4
nokia_2680s_ver2_subua 1 0
nokia_2680s_ver2_sub2a 0 2
....which seems to be what was asked - but who knows.
It's late here in Central Europe and being tired I may have missed
something. Drop me a note if "it works".
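One caveat with the END block above: the order in which for (root in ids) visits the roots is unspecified, and a root that never appears with a sub-ID gets no entry in ids[]. A possible variant, sketched here with a hypothetical order[] list and a made-up nokia_2680s_ver3 record, remembers the roots in first-seen order instead.

```shell
out=$(printf '%s\n' \
    'nokia_2680s_ver2_subua 0' \
    'nokia_2680s_ver1 1' \
    'nokia_2680s_ver2 1' \
    'nokia_2680s_ver1_sub2a 0' \
    'nokia_2680s_ver3 1' |
awk '
match($1, /[^_]*_[^_]*_[^_]*/) {
    root = substr($1, RSTART, RLENGTH)
    subs = substr($1, RSTART + RLENGTH)
    if (!(root in seen)) {          # first time this root shows up
        seen[root]
        order[++nroots] = root      # hypothetical first-seen list of roots
    }
    if (subs != "" && !($1 in si)) {
        si[$1]
        ids[root] = ids[root] " " $1
    }
    s[$1, $NF]++ ; r[root, $NF]++
}
END {
    for (k = 1; k <= nroots; k++) { # numeric loop, so the order is fixed
        root = order[k]
        print root, r[root, 0] + 0, r[root, 1] + 0
        n = split(ids[root], subids)
        for (i = 1; i <= n; i++)
            print "  ", subids[i], s[subids[i], 0] + 0, s[subids[i], 1] + 0
    }
}')
printf '%s\n' "$out"
```

With this input the roots come out as ver2, ver1, ver3, and ver3 is printed even though it has no sub-IDs.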
Janis
Janis | 9/11/2010 2:05:41 AM
In article <i6eo5m$e0a$1@news.m-online.net>,
Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
....
> for (root in ids) {
Stop right there!
You guys simply don't get it. Any "solution" that includes a line such
as the above, which iterates through the array, is a non-starter.
--
(This discussion group is about C, ...)
Wrong. It is only OCCASIONALLY a discussion group
about C; mostly, like most "discussion" groups, it is
off-topic Rorsharch [sic] revelations of the childhood
traumas of the participants...
gazelle | 9/11/2010 2:13:39 AM
On 11/09/10 04:13, Kenny McCormack wrote:
> In article <i6eo5m$e0a$1@news.m-online.net>,
> Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
> ...
>> for (root in ids) {
>
> Stop right there!
>
> You guys simply don't get it. Any "solution" that includes a line such
> as the above, which iterates through the array, is a non-starter.
>
Maybe I don't get it, Kenny. But if I assume I have a 2-dimensional array
r1 s11 s12 s13 ...
r2 s21 s22 s23 ...
r3 s31 s32 s33 ...
...
rN SN1 sN2 sN3 ...
wouldn't "your" solution also have to iterate through r1, r2, .., RN ?
And for each rI iterate through sI1, sI2, sI3,... to print the values?
Janis
Janis | 9/11/2010 2:24:11 AM
On Sep 10, 7:13 pm, gaze...@shell.xmission.com (Kenny McCormack)
wrote:
> In article <i6eo5m$e0...@news.m-online.net>,
> Janis Papanagnou <janis_papanag...@hotmail.com> wrote:
> ...
>
> >    for (root in ids) {
>
> Stop right there!
>
> You guys simply don't get it. Any "solution" that includes a line such
> as the above, which iterates through the array, is a non-starter.
>
> --
> (This discussion group is about C, ...)
>
> Wrong. It is only OCCASIONALLY a discussion group
> about C; mostly, like most "discussion" groups, it is
> off-topic Rorsharch [sic] revelations of the childhood
> traumas of the participants...
Here is a data structure I have had to handle a number of times:
There are a number of root items r1, r2, ..., rn.
For each root item r, there are a number of branch items r_b1,
r_b2, ..., r_bk where k can depend on r.
To store them, I use an array ar which, thanks to gawk's flexibility I
use in both 1 and 2 dimensional form.
For each root item r, ar[r] contains the number of branch items of r,
and the branch items are stored in ar[r, i] for i = 1 to ar[r] (or 0
to ar[r]-1).
The items are stored by the following routine:
# ar = array, r = root, b = branch for the root
function storem(ar, r, b )
{
if ( r in ar )
{
ar[r]++;
}
else
{
ar[r] = 1;
}
ar[r, ar[r]] = b;
}
To go through the branch items for a root item r, do
if ( r in ar )
{
for ( i=1; i<=ar[r]; i++ )
{
b = ar[r, i];
# process the branch item b
}
}
else
{
# r is not a root item
}
If you want to go through all the branch items, replace the "if" above
with "for" and remove the "else" clause.
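A minimal runnable sketch of the scheme described above, wrapped in a shell command so it can be pasted into a terminal. The sample roots and branch items are made up for illustration; only storem() and the ar[r] / ar[r, i] layout come from the description.

```shell
out=$(awk '
# ar: array, r: root, b: branch item for that root
function storem(ar, r, b) {
    ar[r]++                # an unset element counts as 0, so this covers the first insert too
    ar[r, ar[r]] = b       # branch items live at ar[r,1] .. ar[r,ar[r]]
}
BEGIN {
    # made-up sample data
    storem(ar, "a", "the"); storem(ar, "a", "quick"); storem(ar, "a", "brown")
    storem(ar, "z", "fox")

    r = "a"
    if (r in ar)
        for (i = 1; i <= ar[r]; i++)   # visits only the items of root r
            print r, i, ar[r, i]
}')
printf '%s\n' "$out"
```

This prints the three branch items stored for "a" and never touches the entry for "z". One caveat when walking all roots: for (r in ar) also visits the composite r,i keys, so keeping the counts in a separate array may be easier in that case.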
mjc | 9/11/2010 2:32:07 AM
On 9/10/2010 9:13 PM, Kenny McCormack wrote:
> In article<i6eo5m$e0a$1@news.m-online.net>,
> Janis Papanagnou<janis_papanagnou@hotmail.com> wrote:
> ...
>> for (root in ids) {
>
> Stop right there!
>
> You guys simply don't get it. Any "solution" that includes a line such
> as the above, which iterates through the array, is a non-starter.
>
OK, look - I for one am very interested in understanding whatever it is we're
discussing. As far as I can tell, Janis has several times provided solutions to
the OP's specific problem which are apparently being rejected "just because" they
aren't the type of solution the OP and apparently Kenny think would be more
appropriate.
Kenny - would you mind starting a new thread with some specific sample input and
expected output that would clearly demonstrate what the issue is so us simple
folks can get a better understanding of it?
Ed.
Ed | 9/11/2010 2:57:59 AM
In article <i6ep8b$ebs$1@news.m-online.net>,
Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>On 11/09/10 04:13, Kenny McCormack wrote:
>> In article <i6eo5m$e0a$1@news.m-online.net>,
>> Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>> ...
>>> for (root in ids) {
>>
>> Stop right there!
>>
>> You guys simply don't get it. Any "solution" that includes a line such
>> as the above, which iterates through the array, is a non-starter.
>>
>
>Maybe I don't get it, Kenny. But if I assume I have a 2-dimensional array
>
> r1 s11 s12 s13 ...
> r2 s21 s22 s23 ...
> r3 s31 s32 s33 ...
> ...
> rN SN1 sN2 sN3 ...
>
>wouldn't "your" solution also have to iterate through r1, r2, .., RN ?
>And for each rI iterate through sI1, sI2, sI3,... to print the values?
>
>Janis
No. And I'm glad you asked, because it gives me the opportunity to make
this explicit - how you would do it in TAWK.
Suppose we have a collection of "roots":
root1 root2 root3 ... (assume there are about 20 of these)
and, for each "root", there are a large number of "tails":
root1tail1 root1tail2 root1tail3 ...
(assume there are about 25000 of these, for each root)
But note that there is no assumption that the collection of "tails" is
the same for each root. In fact, each "root" will almost certainly have
its own set of tails - the tails that make sense for that root.
Then, finally, for each root/tail combination, there is an associated
data value.
Well, now obviously, in a "standard" AWK, what you end up with, even if
you fake it by using the kludgey SUBSEP nonsense, is a single-dimensional
array with 500,000 elements. But in TAWK, we have arrays of arrays
(which, a good nitpicker could argue, is not "real" multi-dimensional
arrays - but is, rather, arrays of arrays. I think we had this
discussion a while back. But anyway, suffice to say that when I use the
term "real" or "true" "multi-dimensional" arrays, I mean TAWK's "arrays
of arrays"). So, in TAWK, we can write:
A["root1"]["root1tail1"] = "A value"
and so on and on. And what we end up with is an array A whose subscripts are
"root1", "root2", ... And then (ta da!) when we want to know all the
tails that are associated with, say, "root17", we simply iterate through
the array A["root17"], like this:
for (i in A["root17"])
which results in iterating over an array of 25,000 elements, not 500,000
elements. Trust me - the internal implementation *is* efficient; this
is *not* HLL sleight-of-hand or "syntactic sugar".
And, quoting Sam Rothstein, that's that!
--
> No, I haven't, that's why I'm asking questions. If you won't help me,
> why don't you just go find your lost manhood elsewhere.
CLC in a nutshell.
gazelle | 9/11/2010 3:07:00 AM
In article <i6er7q$mmm$1@news.eternal-september.org>,
Ed Morton <mortonspam@gmail.com> wrote:
....
>Kenny - would you mind starting a new thread with some specific sample
>input and expected output that would clearly demonstrate what the issue
>is so us simple folks can get a better understanding of it?
As I keep telling you, this is not a college entrance exam test.
--
Just for a change of pace, this sig is *not* an obscure reference to
comp.lang.c...
gazelle | 9/11/2010 3:08:43 AM
On 9/10/2010 10:07 PM, Kenny McCormack wrote:
> In article<i6ep8b$ebs$1@news.m-online.net>,
> Janis Papanagnou<janis_papanagnou@hotmail.com> wrote:
>> On 11/09/10 04:13, Kenny McCormack wrote:
>>> In article<i6eo5m$e0a$1@news.m-online.net>,
>>> Janis Papanagnou<janis_papanagnou@hotmail.com> wrote:
>>> ...
>>>> for (root in ids) {
>>>
>>> Stop right there!
>>>
>>> You guys simply don't get it. Any "solution" that includes a line such
>>> as the above, which iterates through the array, is a non-starter.
>>>
>>
>> Maybe I don't get it, Kenny. But if I assume I have a 2-dimensional array
>>
>> r1 s11 s12 s13 ...
>> r2 s21 s22 s23 ...
>> r3 s31 s32 s33 ...
>> ...
>> rN SN1 sN2 sN3 ...
>>
>> wouldn't "your" solution also have to iterate through r1, r2, .., RN ?
>> And for each rI iterate through sI1, sI2, sI3,... to print the values?
>>
>> Janis
>
> No. And I'm glad you asked, because it gives me the opportunity to make
> this explicit - how you would do it in TAWK.
>
> Suppose we have a collection of "roots":
>
> root1 root2 root3 ... (assume there are about 20 of these)
>
> and, for each "root", there are a large number of "tails":
>
> root1tail1 root1tail2 root1tail3 ...
> (assume there are about 25000 of these, for each root)
>
> But note that there is no assumption that the collection of "tails" is
> the same for each root. In fact, each "root" will almost certainly have
> its own set of tails - the tails that make sense for that root.
>
> Then, finally, for each root/tail combination, there is an associated
> data value.
>
> Well, now obviously, in a "standard" AWK, what you end up with, even if
> you fake it by using the kludgey SUBSEP nonsense, is a single-dimensional
> array with 500,000 elements. But in TAWK, we have arrays of arrays
> (which, a good nitpicker could argue, is not "real" multi-dimensional
> arrays - but is, rather, arrays of arrays. I think we had this
> discussion a while back. But anyway, suffice to say that when I use the
> term "real" or "true" "multi-dimensional" arrays, I mean TAWK's "arrays
> of arrays"). So, in TAWK, we can write:
>
> A["root1"]["root1tail1"] = "A value"
As opposed to, in other awks (one possibility):
A["root1","root1tail1"] = "A value"
A["root1"] = A["root1"] SUBSEP "root1tail1"
>
> and so on and on. And what we end up with is an array A whose subscripts are
> "root1", "root2", ... And then (ta da!) when we want to know all the
> tails that are associated with, say, "root17", we simply iterate through
> the array A["root17"], like this:
>
> for (i in A["root17"])
As opposed to, in other awks (again, one possibility given the above):
nt=split(A["root17"],tl,SUBSEP)
for (nr=2; nr<=nt; nr++) {
i = tl[nr]
}
>
> which results in iterating over an array of 25,000 elements, not 500,000
> elements.
Ditto in other awks.
> Trust me - the internal implementation *is* efficient; this
> is *not* HLL sleight-of-hand or "syntactic sugar".
>
> And, quoting Sam Rothstein, that's that!
>
Yes, the way you describe tawk working is a couple of lines briefer, probably
uses a bit less memory and is probably slightly more efficient than the ways you
could do it with other awks but it just doesn't seem like it's more than a
trivial difference.
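Putting the two fragments above together, a self-contained sketch for a POSIX awk might look like this. The helper name put() and the sample tails are hypothetical; the membership test keeps a tail from being listed twice when its value is overwritten.

```shell
out=$(awk '
# Hypothetical helper: store a value and remember the tail under its root.
function put(A, root, tail, val) {
    if (!((root, tail) in A))
        A[root] = A[root] SUBSEP tail     # per-root list of tails
    A[root, tail] = val
}
BEGIN {
    put(A, "root1",  "tail1", "x")
    put(A, "root17", "tail1", "A value")
    put(A, "root17", "tail2", "Another value")
    put(A, "root17", "tail1", "A newer value")   # overwrite, not listed twice

    nt = split(A["root17"], tl, SUBSEP)
    for (nr = 2; nr <= nt; nr++)          # tl[1] is empty: the list starts with SUBSEP
        print tl[nr], "->", A["root17", tl[nr]]
}')
printf '%s\n' "$out"
```

Only the two tails recorded for root17 are visited; the entry for root1 is never looked at.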
Ed.
Ed | 9/11/2010 4:03:23 AM
On 9/10/2010 6:39 PM, luca wrote:
> On 10/09/2010 23.13, Ed Morton wrote:
[reconstructing my original]
> So, given this type of array you want to be able to produce the output below:
>
> $ cat tst.awk
> BEGIN {
> arr["a"]=10
> arr["root_token_1"]=20
> arr["b"]=30
> arr["root_token_2"]=40
> arr["c"]=50
> arr["root_token_3"]=60
> arr["d"]=70
> arr["root_token_4"]=70
>
> for (i in arr) {
> if (i ~ /^root_token/) {
> print i, arr[i]
> }
> }
> }
>
> $ awk -f tst.awk
> root_token_1 20
> root_token_2 40
> root_token_3 60
> root_token_4 70
>
> without having to iterate through every element of the array. Is that all it is?
>
> Ed.
>
> I think you understood the problem. You say you don't have to iterate, but you
> did not show me how you can do it without iterating. The way I read your code,
> you are evaluating the regexp 500k times for each iteration.
Here's one way to do it without iterating over the whole array:
$ cat tst.awk
BEGIN {
arr["a"]=10
arr["root_token_1"]=20
arr["b"]=30
arr["root_token_2"]=40
arr["c"]=50
arr["root_token_3"]=60
arr["d"]=70
arr["root_token_4"]=70
arr["root_token"] = SUBSEP 1 SUBSEP 2 SUBSEP 3 SUBSEP 4
nt = split(arr["root_token"],tl,SUBSEP)
for (nr=2; nr<=nt; nr++) {
i = "root_token_" tl[nr]
print i, arr[i]
}
}
$ awk -f tst.awk
root_token_1 20
root_token_2 40
root_token_3 60
root_token_4 70
i.e. just save the mapping of roots to tails as you populate the array, then
look that up later to let you iterate over just the tails for the given root.
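As a sketch of the "save the mapping as you populate" idea when the data comes from input rather than from BEGIN: the three-column input layout and the separate tails[] array are assumptions made for this example (the code above keeps the list in arr["root_token"] itself).

```shell
out=$(printf '%s\n' \
    'root_token 1 20' \
    'other 1 10' \
    'root_token 2 40' \
    'other 2 30' \
    'root_token 3 60' |
awk '
{   # $1 = root, $2 = tail, $3 = value (hypothetical input layout)
    key = $1 "_" $2
    if (!(key in arr))
        tails[$1] = tails[$1] SUBSEP $2   # remember each tail once per root
    arr[key] = $3
}
END {
    nt = split(tails["root_token"], tl, SUBSEP)
    for (nr = 2; nr <= nt; nr++) {        # element 1 is the empty leading field
        i = "root_token_" tl[nr]
        print i, arr[i]
    }
}')
printf '%s\n' "$out"
```

The END block walks three entries for root_token and skips the two stored for the other root.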
Regards,
Ed.
Ed | 9/11/2010 4:21:39 AM
On 11/09/10 05:07, Kenny McCormack wrote:
> In article <i6ep8b$ebs$1@news.m-online.net>,
> Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>> On 11/09/10 04:13, Kenny McCormack wrote:
>>> In article <i6eo5m$e0a$1@news.m-online.net>,
>>> Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
>>> ...
>>>> for (root in ids) {
>>>
>>> Stop right there!
>>>
>>> You guys simply don't get it. Any "solution" that includes a line such
>>> as the above, which iterates through the array, is a non-starter.
>>>
>>
>> Maybe I don't get it, Kenny. But if I assume I have a 2-dimensional array
>>
>> r1 s11 s12 s13 ...
>> r2 s21 s22 s23 ...
>> r3 s31 s32 s33 ...
>> ...
>> rN SN1 sN2 sN3 ...
>>
>> wouldn't "your" solution also have to iterate through r1, r2, .., RN ?
>> And for each rI iterate through sI1, sI2, sI3,... to print the values?
>>
>> Janis
>
> No. And I'm glad you asked, because it gives me the opportunity to make
> this explicit - how you would do it in TAWK.
Thanks for explaining. (Just keep in mind that the multi-dimensional
arrays and how to access elements is well known to me when I comment
below.)
>
> Suppose we have a collection of "roots":
>
> root1 root2 root3 ... (assume there are about 20 of these)
Yes. In my description above it was r1, r2, ..., rN.
>
> and, for each "root", there are a large number of "tails":
>
> root1tail1 root1tail2 root1tail3 ...
> (assume there are about 25000 of these, for each root)
Yes. In my description above it was s11 s12 s13 ... for root r1.
>
> But note that there is no assumption that the collection of "tails" is
> the same for each root. In fact, each "root" will almost certainly have
> its own set of tails - the tails that make sense for that root.
Yes. That assumption was true also for my above description; all lines
ended in "..." (as opposed to s1K, s2K, ..., sNK).
>
> Then, finally, for each root/tail combination, there is an associated
> data value.
Yes. In my proposal they were carried in r[] and s[]; separate arrays,
granted, and probably not as elegant as if you'd have a language with
more sophisticated data structures available, no one doubts that.
>
> Well, now obviously, in a "standard" AWK, what you end up with, even if
> you fake it by using the kludgey SUBSEP nonsense, is a single-dimensional
> array with 500,000 elements.
We agree that SUBSEP is a concept that is not comparable with support
for complex data structures. But you have efficient access to those
elements (a hash table, as I've been told), and you only need memory for
those array elements that are actually stored (as opposed to some
languages that support more data structures, but where, depending on
the actual language, there's sometimes the whole memory space reserved
for an M x N array; though I assume this may not be the case in tawk).
> But in TAWK, we have arrays of arrays
> (which, a good nitpicker could argue, is not "real" multi-dimensional
> arrays - but is, rather, arrays of arrays. I think we had this
> discussion a while back. But anyway, suffice to say that when I use the
> term "real" or "true" "multi-dimensional" arrays, I mean TAWK's "arrays
> of arrays"). So, in TAWK, we can write:
>
> A["root1"]["root1tail1"] = "A value"
>
> and so on and on. And what we end up with is an array A whose subscripts are
> "root1", "root2", ... And then (ta da!) when we want to know all the
> tails that are associated with, say, "root17", we simply iterate through
> the array A["root17"], like this:
>
> for (i in A["root17"])
Yes. We also agree that language support for multi-dimensional arrays or
other complex data structures makes writing respective code much easier.
(You certainly know whether it's as simple in tawk to implement solutions
that rely on OO polymorphism; I doubt it's simple, if it's unsupported.
Then we would have to emulate that concept in a more or less bulky way.
Emulating multi-dimensional arrays is also not preferred, but apparently
less bulky than, say polymorphism.)
>
> which results in iterating over an array of 25,000 elements, not 500,000
> elements.
But that's what I also do in my last proposal; I iterate over the 25,000
elements that each root has, because the OP wanted the summary counts and
the sub-root counts for *each* root. If the question had been to iterate
over just one specific root, we would need other code, but the access
mechanism wouldn't change: instead of looping over all roots, access the
single root element and iterate over just its subs. (Instead of
for (root in ids), use ids["rX"], with "rX" in the respective places for
any concrete root "rX".)
> Trust me - the internal implementation *is* efficient; this
> is *not* HLL sleight-of-hand or "syntactic sugar".
I don't think anyone doubts that.
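To make the single-root case mentioned above concrete, here is a sketch of the same book-keeping with the END block restricted to one root. The variable name want and the shortened sample data are assumptions for illustration.

```shell
out=$(printf '%s\n' \
    'nokia_2680s_ver1 1' \
    'nokia_2680s_ver1_sub2a 0' \
    'nokia_2680s_ver2_subua 0' \
    'nokia_2680s_ver2 1' \
    'nokia_2680s_ver1_subua 1' \
    'nokia_2680s_ver2_sub2a 1' |
awk -v want=nokia_2680s_ver2 '
match($1, /[^_]*_[^_]*_[^_]*/) {
    root = substr($1, RSTART, RLENGTH)
    subs = substr($1, RSTART + RLENGTH)
    if (subs != "" && !($1 in si)) {
        si[$1]
        ids[root] = ids[root] " " $1
    }
    s[$1, $NF]++ ; r[root, $NF]++
}
END {
    # No loop over the roots: go straight to the one that was asked for.
    print want, r[want, 0] + 0, r[want, 1] + 0
    n = split(ids[want], subids)
    for (i = 1; i <= n; i++)
        print "  ", subids[i], s[subids[i], 0] + 0, s[subids[i], 1] + 0
}')
printf '%s\n' "$out"
```

The END block touches one root entry and its two sub-IDs, regardless of how many other roots were read.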
Janis
>
> And, quoting Sam Rothstein, that's that!
>
Janis | 9/11/2010 10:02:56 AM
On 11/09/10 06:03, Ed Morton wrote:
> On 9/10/2010 10:07 PM, Kenny McCormack wrote:
>> In article<i6ep8b$ebs$1@news.m-online.net>,
>>
>> [...] And then (ta da!) when we want to know all the
>> tails that are associated with, say, "root17", we simply iterate through
>> the array A["root17"], like this:
>>
>> for (i in A["root17"])
>
> As opposed to, in other awks (again, one possibility given the above):
>
> nt=split(A["root17"],tl,SUBSEP)
> for (nr=2; nr<=nt; nr++) {
> i = tl[nr]
> }
>
>>
>> which results in iterating over an array of 25,000 elements, not 500,000
>> elements.
IMO, all true what you have said.
I was using that split() approach in my example. Though, to be fair, we
should explicitly mention that it is a bit less efficient - we both know
that but others may care -; where a real sophisticated multi-dimensional
array implementation has a complexity of N x M the above solution would
make it N x 2M; it's still the same complexity class, though.
Janis
> [...]
Janis | 9/11/2010 10:17:40 AM
On 9/11/2010 5:17 AM, Janis Papanagnou wrote:
> On 11/09/10 06:03, Ed Morton wrote:
>> On 9/10/2010 10:07 PM, Kenny McCormack wrote:
>>> In article<i6ep8b$ebs$1@news.m-online.net>,
>>>
>>> [...] And then (ta da!) when we want to know all the
>>> tails that are associated with, say, "root17", we simply iterate through
>>> the array A["root17"], like this:
>>>
>>> for (i in A["root17"])
>>
>> As opposed to, in other awks (again, one possibility given the above):
>>
>> nt=split(A["root17"],tl,SUBSEP)
>> for (nr=2; nr<=nt; nr++) {
>> i = tl[nr]
>> }
>>
>>>
>>> which results in iterating over an array of 25,000 elements, not 500,000
>>> elements.
>
> IMO, all true what you have said.
>
> I was using that split() approach in my example. Though, to be fair, we
> should explicitly mention that it is a bit less efficient
I did mention that at the end of the posting. I also mentioned it probably uses
a bit more memory and a couple more lines of code.
Ed.
> - we both know
> that but others may care -; where a real sophisticated multi-dimensional
> array implementation has a complexity of N x M the above solution would
> make it N x 2M; it's still the same complexity class, though.
>
> Janis
>
>> [...]
Ed | 9/11/2010 1:11:39 PM
gee, this is getting complicated :)
On 11/09/2010 2.31, Janis Papanagnou wrote:
> # count in two arrays, one for the "subs" and one for the "roots";
> # this is done for every record, since you want the "roots" to be
> # counted for both, individually and in the summary heading, and
> # also memorize the "subs" IDs in si[]
> { s[$1,$NF]++ ; r[root($1),$NF]++ ; si[$1] }
>
> # the "root" IDs, entries that have not three "_" in their ID, will
> # be memorized separately in ri[]
> $1 !~ /_.*_.*_/ { ri[$1] }
I think you are making assumptions about the syntax of the IDs based on the data
sample which are not generally valid. The only thing that matters is that there
is a "_ver1" constant token. There is no assumption on the actual number of
underscores.
>
> # finally print out the accumulated values by iterating over the
> # memorized IDs, use of "+0" to force formatting an uninitialized
> # array element as "0"
> END {
> for (idx in ri) {
> print idx, r[idx,0]+0, r[idx,1]+0
> for (subidx in si) {
> print " ", subidx, s[subidx,0]+0, s[subidx,1]+0
> }
> }
> }
OK, I am not able to understand 100% what you are doing, but I think I
understand the basic concept. You are not iterating over 500k si[] elements.
You sort of pre-collapsed identical IDs ahead of time and looped on those (thus
greatly improving performance).
If I understand correctly, that's smart, but makes me wonder if, with all of
this complexity, the game is worth the candle.
I mean, this stuff is way more complex than the Perl I was afraid to have to
re-learn again :)
> Elegance lies in the eye of the beholder. Your first attempt clearly
> showed that you did not yet understand how to use awk appropriately.
My first 20 or so chars posted in this thread were "Hi, Awk beginner here."
No secret there.
> There are reasons why awk is not designed as a full blown programming
> language with all complex data types. Because it doesn't have those
> data types doesn't mean that you cannot solve your problems with awk.
> (Suppose awk had those multi-dimensional arrays; some other
> guy would be asking for classes, multisets, multimaps, balanced trees,
> etc.)
I understand, but still something tells me that multidimensional arrays and
arrays of arrays are a glaring omission given everything else that Awk is good at.
> Good luck with your further efforts.
>
> Janis
Thank you Janis. I am really grateful that you spent so much time to look at my
problem and I learned a lot in the process.
Luca
luca | 9/11/2010 5:37:12 PM
On 11/09/10 19:37, luca wrote:
>
> gee, this is getting complicated :)
Mainly it's getting long rather than complicated.
>
> On 11/09/2010 2.31, Janis Papanagnou wrote:
>
>> # count in two arrays, one for the "subs" and one for the "roots";
>> # this is done for every record, since you want the "roots" to be
>> # counted for both, individually and in the summary heading, and
>> # also memorize the "subs" IDs in si[]
>> { s[$1,$NF]++ ; r[root($1),$NF]++ ; si[$1] }
>>
>> # the "root" IDs, entries that have not three "_" in their ID, will
>> # be memorized separately in ri[]
>> $1 !~ /_.*_.*_/ { ri[$1] }
>
>
> I think you are making assumptions about the syntax of the IDs based on
> the data sample which are not generally valid. The only thing that
> matters is that there is a "_ver1" constant token. There is no
> assumption on the actual number of underscores.
Pattern matching has the nice property that you can easily adjust it
to your needs without sacrificing the subsequent logic.
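As an illustration of adjusting the pattern: if a root is taken to end at the first "_ver<digits>" token, regardless of how many underscores precede it (an assumption about the real IDs, as are the sample IDs below), the split into root and subs could be sketched like this.

```shell
out=$(printf '%s\n' \
    'nokia_2680s_ver1_sub2a' \
    'sony_ericsson_w_810_ver2' \
    'lg_ver10_sub_x_y' |
awk '
# Assumption: a root ends at the first "_ver<digits>" token, whatever precedes it.
match($1, /_ver[0-9]+/) {
    root = substr($1, 1, RSTART + RLENGTH - 1)   # everything up to and including the token
    subs = substr($1, RSTART + RLENGTH)          # whatever follows, possibly empty
    print root, (subs == "" ? "-" : subs)
}')
printf '%s\n' "$out"
```

The rest of the logic only ever sees root and subs, so it would not need to change when the pattern does.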
>
>>
>> [snip outdated code]
>
> OK, I am not able to understand 100% what you are doing, but I think I
> understand the basic concept. You are not iterating over 500k si[]
> elements.
The above code was meant to give you a solution for your sample data,
which unfortunately didn't accurately describe the problem. Forget
that code; I posted a version adapted to your needs
about one and a half hours later.
> You sort of pre-collapsed identical IDs ahead of time and looped on
> those (thus greatly improving performance).
Not quite. Maybe I'd better make another posting following up your
original posting, where I post the code and comment on it, along with
other recent observations.
>
> If I understand correctly, that's smart, but makes me wonder if, with
> all of this complexity, the game is worth the candle.
Wait for the commented version and decide then. Certainly, since awk
does not support directly what you want, it's not as elegant.
Janis
> [...]
Janis | 9/11/2010 7:32:45 PM
On 06/09/10 20:14, luca wrote:
>
> Hi, Awk beginner here. I found this article about how to emulate arrays
> of arrays in Awk
>
> http://www.billposer.org/Linguistics/Computation/Miscnotes/Lists.html
>
> unfortunately, it does not cut it for me.
>
> What I need is:
>
> array1["a"] -> array2["a_1"], array2["a_2"],array2["a_3"],array2["a_4"],
> :
> array1["z"] -> array2["z_1"], array2["z_2"],array2["z_3"],array2["z_4"],
>
> and so far so good:
>
> the problem is when I need to loop on part of array2[] (say all "a_*").
>
> How do I achieve that?
>
> I have actually worked the problem around momentarily by rescanning the
> full big matrix and filtering out what I do not need with:
>
> if (index(idxsub,idx) != 0)
>
> but it goes without saying that this is slow as hell for large amounts
> of data.
>
> Ideas?
>
> Thanks
>
> Luca
>
First a comment in advance; if it is true, as Kenny seems to assume, that
you don't want to iterate over all roots, then Ed's very first reply in
this thread already provided one way to approach that goal.
I understand that, as an awk beginner, it might not have clarified the issue
enough for you.
For one complete solution here's my code with line numbers to comment below.
1 match ($1, /[^_]*_[^_]*_[^_]*/) {
2 root = substr($1,RSTART,RLENGTH)
3 subs = substr($1,RSTART+RLENGTH)
4
5 if (!subs) {
6 ri[root]
7 }
8 else if (!($1 in si)) {
9 si[$1]
10 ids[root] = ids[root]" "$1
11 }
12
13 s[$1,$NF]++ ; r[root,$NF]++
14 }
15
16 END {
17 for (root in ids) {
18 print root, r[root,0]+0, r[root,1]+0
19
20 n = split(ids[root],subids)
21 for (i=1; i<=n; i++)
22 print " ", subids[i], s[subids[i],0]+0, s[subids[i],1]+0
23 }
24 }
Lines 1-14 are executed for every data record; all data is accumulated and
book-keeping information is built as necessary.
Line 1: This is a pattern that should be defined in a way that it matches
every root component in any of the IDs. Adjust to your needs.
Line 2, 3: The match() function provides through those predefined variables
the means to separate the root part from the subs part. Save those values.
Line 5: If the subs part is not present we're in a line with a root entry;
I saved the root ID name in an array ri[], but this is unused in this final
version and can actually be omitted; the !subs case will do nothing.[*]
Line 8: Otherwise we're in a subs case; and only if there wasn't already
another equal subs present (check against si[]) we memorise it in si[].
Line 9: Memorize the subs ID in si[] for the check done in Line 8. You only
want to memorize the same subs ID of a root once in Line 10.
Line 10: For the current root, append the new subs ID name to that root's
entry in ids[], a space-separated list of subs IDs indexed by root ID. This is
a book-keeping structure to access only the root-specific data and to not
iterate over subs of other roots.
Line 13: In s[] we're counting the subs (or the roots, which you say should
carry their own value as well), and in r[] we're adding all root-parts, also
the ones in subs, to the root entry (you said you want the sum of all subs
and the root associated with the respective root).
Lines 16-24 do the evaluation and formatted printing of the accumulated
data.
Line 17: We iterate through all root IDs that are stored in ids[].
Line 18: We print the root ID and the accumulated values for that ID for 0 and
for 1.
Line 20: Now we want to access the subs IDs associated to the current root; we
split the subs IDs into an array subids[] and memorize the amount of subs IDs,
n, which are associated to the current root.
Line 21: We iterate over the subs IDs for the current root.
Line 22: We print the subs ID and the accumulated values for that ID for 0 and
for 1.
Here the effective overhead of the emulation of multi-dimensional arrays can
be found in Line 10 and in Line 20.[**] It's not too much code overhead, but
it certainly requires some thinking that we have to do if we don't have the
desired feature in awk available.
Janis
[*]
Line 6: This is a remnant of a previous version. An empty statement would
suffice here, as explained.
[**] Compare those two commands again with Ed's very first posting in this
thread and you can see his point.
Janis | 9/11/2010 8:11:01 PM
On 11/09/2010 22.11, Janis Papanagnou wrote:
> On 06/09/10 20:14, luca wrote:
Thank you, Janis. Very appreciated.
Luca
luca | 9/13/2010 4:23:56 PM
On 11/09/2010 3.28, Kenny McCormack wrote:
>
> I honestly think that I am the only one here besides you who understands
> what you are saying. Everyone else is fixated, as newsgroup posters
> usually are, on solving some specific problem - as if they were taking a
> standardized test and needed a good score to get into a good college.
>
> Every so-called "solution" posted has involved iterating over the entire
> array - and you have made it clear (to me, if to no one else) that that
> is precisely what you are trying to avoid.
thank you, Kenny. It's good to know that I am not totally insane ;)
Luca
luca | 9/14/2010 9:17:17 AM
On 9/14/2010 4:17 AM, luca wrote:
> On 11/09/2010 3.28, Kenny McCormack wrote:
>>
>> I honestly think that I am the only one here besides you who understands
>> what you are saying. Everyone else is fixated, as newsgroup posters
>> usually are, on solving some specific problem - as if they were taking a
>> standardized test and needed a good score to get into a good college.
>>
>> Every so-called "solution" posted has involved iterating over the entire
>> array - and you have made it clear (to me, if to no one else) that that
>> is precisely what you are trying to avoid.
>
> thank you, Kenny. It's good to know that I am not totally insane ;)
>
Judging your sanity on whether or not you agree with Kenny might make for an
interesting legal defense :-).
Sane or not, though, that last statement of Kenny's is wrong. Several solutions
posted only involved iterating over a small subset of the array similar to the
iterations required if you had a true multi-dimensional array.
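To put rough numbers on the difference, here is a scaled-down sketch (20 roots with 50 tails each instead of 25,000; all names are made up) that counts how many elements each approach visits.

```shell
out=$(awk '
BEGIN {
    nroots = 20; ntails = 50            # scaled-down stand-ins for 20 x 25000
    for (r = 1; r <= nroots; r++)
        for (t = 1; t <= ntails; t++) {
            key = "root" r "_tail" t
            arr[key] = r * t
            idx["root" r] = idx["root" r] SUBSEP key   # per-root key list
        }

    # Full scan: every element is visited and tested.
    for (key in arr) {
        scanned++
        if (index(key, "root17_") == 1) hits++
    }

    # Indexed walk: only the keys recorded for root17 are visited.
    n = split(idx["root17"], keys, SUBSEP)
    for (i = 2; i <= n; i++) {          # element 1 is the empty leading field
        walked++
        if (keys[i] in arr) found++
    }
    print scanned, hits, walked, found
}')
printf '%s\n' "$out"
```

The full scan visits all 1000 elements to find 50; the indexed walk visits only those 50.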
Ed.
Ed | 9/14/2010 1:02:34 PM
In article <i6nrpd$dio$1@news.eternal-september.org>,
Ed Morton <mortonspam@gmail.com> wrote:
....
>Judging your sanity on whether or not you agree with Kenny might make for an
>interesting legal defense :-).
There you go with the insults again.
>Sane or not, though, that last statement of Kenny's is wrong. Several
>solutions posted only involved iterating over a small subset of the
>array similar to the iterations required if you had a true
>multi-dimensional array.
At the cost of a lot of weird ugly code.
See my other post about the cost of this "only what's in the standard,
no matter how kludgy" attitude.
Believe me, having arrays of arrays is very useful and just because you
can kludge your way out of this particular problem is not a good
argument against having them.
--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.
- John Kenneth Galbraith -
|
|
0
|
|
|
|
Reply
|
gazelle
|
9/14/2010 1:20:53 PM
|
|
In article <v4Hjo.240945$813.82761@tornado.fastwebnet.it>,
luca <luca_remove@alice.it> wrote:
>On 11/09/2010 3.28, Kenny McCormack wrote:
>>
>> I honestly think that I am the only one here besides you who understands
>> what you are saying. Everyone else is fixated, as newsgroup posters
>> usually are, on solving some specific problem - as if they were taking a
>> standardized test and needed a good score to get into a good college.
>>
>> Every so-called "solution" posted has involved iterating over the entire
>> array - and you have made it clear (to me, if to no one else) that that
>> is precisely what you are trying to avoid.
>
>thank you, Kenny. It's good to know that I am not totally insane ;)
Thanks. Others will take issue with you for saying this, but you are
spot on. It is the others who are loopy.
--
"The anti-regulation business ethos is based on the charmingly naive notion
that people will not do unspeakable things for money." - Dana Carpender
Quoted by Paul Ciszek (pciszek at panix dot com). But what I want to know
is why is this diet/low-carb food author making pithy political/economic
statements?
Nevertheless, the above quote is dead-on, because, the thing is - business
in one breath tells us they don't need to be regulated (which is to say:
that they can morally self-regulate), then in the next breath tells us that
corporations are amoral entities which have no obligations to anyone except
their officers and shareholders, then in the next breath they tell us they
don't need to be regulated (that they can morally self-regulate) ...
|
|
0
|
|
|
|
Reply
|
gazelle
|
9/14/2010 1:21:50 PM
|
|
On 9/14/2010 8:20 AM, Kenny McCormack wrote:
> In article<i6nrpd$dio$1@news.eternal-september.org>,
> Ed Morton<mortonspam@gmail.com> wrote:
> ...
>> Judging your sanity on whether or not you agree with Kenny might make for an
>> interesting legal defense :-).
>
> There you go with the insults again.
Oh c'mon - I even put a smiley face on it!
Ed.
>
>> Sane or not, though, that last statement of Kenny's is wrong. Several
>> solutions posted only involved iterating over a small subset of the
>> array similar to the iterations required if you had a true
>> multi-dimensional array.
>
> At the cost of a lot of weird ugly code.
>
> See my other post about the cost of this "only what's in the standard,
> no matter how kludgy" attitude.
>
> Believe me, having arrays of arrays is very useful and just because you
> can kludge your way out of this particular problem is not a good
> argument against having them.
>
|
|
0
|
|
|
|
Reply
|
Ed
|
9/14/2010 1:29:42 PM
|
|
On 9/14/2010 8:29 AM, Ed Morton wrote:
> On 9/14/2010 8:20 AM, Kenny McCormack wrote:
>> In article<i6nrpd$dio$1@news.eternal-september.org>,
>> Ed Morton<mortonspam@gmail.com> wrote:
>> ...
>>> Judging your sanity on whether or not you agree with Kenny might make for an
>>> interesting legal defense :-).
>>
>> There you go with the insults again.
>
> Oh c'mon - I even put a smiley face on it!
>
To be clear: I respect your opinions even when I disagree with them, enjoy the
discussions we have, and learn a lot from your posts. Sorry if my comment above
was offensive, it wasn't intended to be.
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
9/14/2010 1:40:42 PM
|
|
On 14/09/2010 15.20, Kenny McCormack wrote:
>> Judging your sanity on whether or not you agree with Kenny might make for an
>> interesting legal defense :-).
>
> There you go with the insults again.
people, you all have been very kind in looking at my problem. The last thing I
want is to be the cause of a row among you guys.
If Ed and Jamis were not so passionate about Awk and defending its merits, they
would not be here using so much time on it and pushing the envelope.
On the other hand, I think Kenny has a point here. One cannot argue that arrays
of arrays are not a glaring omission from the language, unless there is a
religious component to it. There are people out there that need to get the job
done and they need the shortest path to the goal.
Thank you
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/14/2010 1:49:45 PM
|
|
On 9/14/2010 8:20 AM, Kenny McCormack wrote:
<snip>
> See my other post about the cost of this "only what's in the standard,
> no matter how kludgy" attitude.
I don't think anything in my (or Janis') posts suggested that the basis of
the argument was that we should only follow what's in the standard. There have been
many threads over the years where I've suggested and/or supported non-standard
enhancements such as the extra argument to split to hold an array of field
separators. Sometimes someone suggests something non-standard and I agree with
it, sometimes I don't. In this particular discussion, although I can see
advantages to true multi-dimensional arrays, IMHO it just doesn't seem
worthwhile changing the language to accommodate them. I believe that having true
2-D arrays would make accessing elements of those arrays more efficient than
today's pseudo 2-D arrays, but for all I know maybe there's an internal cost to
implementing 2-D arrays that'd impact efficiency when accessing 1-D arrays - we
haven't even talked about that, but given the far more prevalent occurrences of
1-D arrays I'd think that might be a showstopper if it's true.
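For reference, the "pseudo 2-D arrays" mentioned above are what POSIX awk already provides: a[i,j] is stored under one flat key, the subscripts joined with SUBSEP. A minimal sketch of why listing one "row" then means visiting every key (the function name pseudo2d_demo and the sample data are made up for illustration, and the awk program is wrapped in a shell function only so it can be run as-is):

```shell
# Hypothetical demo: count how many keys are visited to find row "a"
# of a pseudo 2-D array in POSIX awk.
pseudo2d_demo() {
  awk 'BEGIN {
    # a[i,j] is really a[i SUBSEP j]: one flat associative array.
    a["a",1] = "the";  a["a",2] = "quick"
    a["z",1] = "lazy"; a["z",2] = "dog"

    visited = 0; matched = 0
    for (key in a) {               # there is no way to ask for row "a" only
      visited++
      split(key, part, SUBSEP)     # recover the two subscripts
      if (part[1] == "a") matched++
    }
    print "visited=" visited " matched=" matched

    # A membership test on a known pair needs no scan at all.
    if (("z", 2) in a) {
      print "z,2 present"
    } else {
      print "z,2 missing"
    }
  }'
}
pseudo2d_demo
```

With four entries the loop visits all four to find the two that belong to "a", which is the cost being discussed; looking up a single known pair stays a plain hash lookup.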
> Believe me, having arrays of arrays is very useful and just because you
> can kludge your way out of this particular problem is not a good
> argument against having them.
OK, but the status quo is that we don't have them and so far we haven't seen a
good argument FOR having them. If we did have them in the language already then
I don't think we've seen any argument strong enough to remove them either. There
just doesn't seem to be much incentive either way based on the discussions so far.
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
9/14/2010 2:03:28 PM
|
|
On 9/14/2010 8:49 AM, luca wrote:
> On 14/09/2010 15.20, Kenny McCormack wrote:
>
>>> Judging your sanity on whether or not you agree with Kenny might make for an
>>> interesting legal defense :-).
>>
>> There you go with the insults again.
>
> people, you all have been very kind in looking at my problem. The last thing I
> want is to be the cause of a row among you guys.
>
> If Ed and Jamis were not so passionate about Awk and defending its merits, they
> would not be here using so much time on it and pushing the envelope.
>
> On the other hand, I think Kenny has a point here. One cannot argue that arrays
> of arrays are not a glaring omission from the language, unless there is a
> religious component to it.
You could say that about complex data structures too. Or how about enums? The
point is that there are lots of things in other languages that, if you're used to
them, you might think are glaring omissions from awk. After using awk for a while,
though, you grow to appreciate its simplicity and can easily work around the
"glaring omissions" on those rare occasions when you'd like to have the "missing"
language constructs. It's not that I or anyone else feels a strong need to defend
the awk standards in not supporting 2-D arrays. It's just not that big a deal to
have them or not, so we wanted to show you how to easily get by without them. If
there WAS a compelling reason to have them, I for one really wanted to understand
what it was for my own benefit, but I haven't seen one so far. There have been
other discussions in this NG about possible language enhancements that then led
to changes in gawk at least, so this might've been one of them if some argument
had been presented about what notable value true 2-D arrays would add.
Ed.
> There are people out there that need to get the job
> done and they need the shortest path to the goal.
>
> Thank you
>
> Luca
|
|
0
|
|
|
|
Reply
|
Ed
|
9/14/2010 2:19:30 PM
|
|
In article <i6o09j$ecu$1@news.eternal-september.org>,
Ed Morton <mortonspam@gmail.com> wrote:
....
>one so far. There's been other discussions in this NG about possible language
>enhancements that have then led to changes in gawk at least so this might've
>been one of them if some argument had been presented about what notable value
>true 2-D arrays would add.
I don't think anyone seriously argues that it wouldn't be useful - in
fact, *very* useful. The problem is that it is seriously non-trivial to
implement. I.e., the cost-benefit burden hasn't yet been met.
If it were easy (cheap) to implement, it'd have been done by now (in GAWK).
--
> No, I haven't, that's why I'm asking questions. If you won't help me,
> why don't you just go find your lost manhood elsewhere.
CLC in a nutshell.
|
|
0
|
|
|
|
Reply
|
gazelle
|
9/14/2010 2:29:40 PM
|
|
luca wrote:
> On 14/09/2010 15.20, Kenny McCormack wrote:
>
>>> Judging your sanity on whether or not you agree with Kenny might make
>>> for an
>>> interesting legal defense :-).
>>
>> There you go with the insults again.
>
> people, you all have been very kind in looking at my problem. The last
> thing I want is to be the cause of a row among you guys.
>
> If Ed and Jamis were not so passionate about Awk and defending its
> merits, they would not be here using so much time on it and pushing the
> envelope.
Now, is that misspelling of my name an insult? :-)
>
> On the other hand, I think Kenny has a point here. One cannot argue that
> arrays of arrays are not a glaring omission from the language, unless
> there is a religious component to it. There are people out there that
> need to get the job done and they need the shortest path to the goal.
(If that's your conclusion then half of what I wrote hasn't been
considered at all. Anyway...)[*]
Janis
>
> Thank you
>
> Luca
[*] Without OO and polymorphism, e.g., awk has "glaring omissions"!
Restricting presumed "glaring omissions" to multi-dimensional arrays
seems to me to be a very limited view on programming languages in
general and awk specifically. We can enumerate hundreds of features,
many significant, that are not in awk but would "get the job done"
much better. Personally I am missing structs/records as elementary
data structure. And hierarchical data structures. But then I wouldn't
use awk.
|
|
0
|
|
|
|
Reply
|
Janis
|
9/14/2010 2:32:39 PM
|
|
In article <i6o11j$utp$1@speranza.aioe.org>,
Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
> Personally I am missing structs/records as elementary data structure. And
> hierarchical data structures. But then I wouldn't use awk.
I have simulated trees in awk with little problem, and the code was
fairly straightforward. I still occasionally use the script where I
build a tree structure.
:-)
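For anyone wondering what that can look like, here is a minimal sketch of the general idea. It is not the script mentioned above: it simply assumes each node keeps a numbered list of its children in flat arrays, and the names link, walk, nkids and kid are invented for this example. The awk program is wrapped in a shell function only so it can be run as-is.

```shell
# Hypothetical sketch of a tree held in flat awk arrays.
tree_demo() {
  awk '
  # link(): append child to the numbered child list of parent.
  function link(parent, child) {
    nkids[parent]++
    kid[parent, nkids[parent]] = child
  }
  # walk(): depth-first print, indenting two spaces per level.
  # i and pad are locals, so each recursive call has its own copy.
  function walk(node, depth,    i, pad) {
    pad = ""
    for (i = 0; i < depth; i++) pad = pad "  "
    print pad node
    for (i = 1; i <= nkids[node] + 0; i++)   # + 0: leaves have no count yet
      walk(kid[node, i], depth + 1)
  }
  BEGIN {
    link("root", "usr"); link("root", "etc")
    link("usr", "bin");  link("usr", "lib")
    walk("root", 0)
  }'
}
tree_demo
```

Visiting a node's children costs only as many steps as it has children, because the child list is indexed by number rather than found by scanning all keys.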
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
|
|
0
|
|
|
|
Reply
|
arnold
|
9/14/2010 8:45:26 PM
|
|
On 14/09/2010 16.32, Janis Papanagnou wrote:
>>
>> If Ed and Jamis were not so passionate about Awk and defending its merits,
>> they would not be here using so much time on it and pushing the envelope.
>
> Now, is that misspelling of my name an insult? :-)
misspelling. Honest.
> [*] Without OO and polymorphism, e.g., awk has "glaring omissions"!
>
> Restricting presumed "glaring omissions" to multi-dimensional arrays
> seems to me to be a very limited view on programming languages in
> general and awk specifically. We can enumerate hundreds of features,
> many significant, that are not in awk but would "get the job done"
> much better. Personally I am missing structs/records as elementary
> data structure. And hierarchical data structures. But then I wouldn't
> use awk.
OK. You are right. I wonder why Gawk has not evolved to be awk on steroids then.
Fully able to run Awk, but with extra features.
Thanks
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/15/2010 7:50:57 AM
|
|
On 14/09/2010 22.45, Aharon Robbins wrote:
>
> I have simulated trees in awk with little problem, and the code was
> fairly straightforward. I still occasionally use the script where I
> build a tree structure.
is this a script you can post?
Luca
|
|
0
|
|
|
|
Reply
|
luca
|
9/15/2010 7:51:39 AM
|
|
On Sep 10, 5:50 pm, gaze...@shell.xmission.com (Kenny McCormack)
wrote:
> In article <f9tio.239013$813.230...@tornado.fastwebnet.it>, luca <luca_rem...@alice.it> wrote:
>
> ...
>
> >I want to be able to loop on all elements of the form:
>
> >array[root_token_*]
>
> >the array may contain 500k entries, while given a "root", the "root_token_*"
> >subarray is made of 1 to 10 entries.
>
> >I have managed to implement a script that does the job (so Turing
> >completeness is not the problem here :), but only at the price of
> >looping on the whole 500k-items array and filtering out entries which
> >do not match the "root_token_*" pattern (a rare occurrence). Again,
> >this works, but it is very very inefficient when processing large
> >amounts of data.
>
> Right. And that's the tragedy. There is no way to do that in standard
> AWK or GAWK, since they don't have true (or in any sense real)
> multi-dimensional arrays. Don't blame them - it's not a feature present
> in any traditional AWK, nor is it required by any "standard".
I'm not sure that this is a multidimensional array (or array of
arrays :)) feature. E.g.
I think of multidimensional arrays as working like this:
a["hello"][3] = "foo"
for(i in a["hello"])
...;
In this case it seems that the OP wants:
for(i in a[/hello.*/])
....;
or something. This implies that there could be multiple views on the
same data. I suppose one could do it with a multidimensional array of
references.
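A rough sketch of that "array of references" idea in plain POSIX awk, assuming the keys have the OP's root_token shape. The helper add() and the arrays data, n and ref are invented names for this example, and the awk program is wrapped in a shell function only so it can be run as-is:

```shell
# Hypothetical sketch: per root, keep a counter and a numbered list of that
# root's keys, so a loop touches only the matching entries.
subarray_demo() {
  awk '
  # add(): store value under "root_token" and record the key under its root.
  function add(root, token, value,    key) {
    key = root "_" token
    if (!(key in data)) {
      n[root]++                    # number of keys known for this root
      ref[root, n[root]] = key     # i-th key belonging to this root
    }
    data[key] = value
  }
  BEGIN {
    add("a", 1, "the");   add("a", 2, "quick")
    add("a", 3, "brown"); add("a", 4, "fox")
    add("z", 1, "lazy");  add("z", 2, "dog")

    # Loop over the "a_*" view only: n["a"] steps, however big data[] is.
    for (i = 1; i <= n["a"]; i++)
      print ref["a", i] "=" data[ref["a", i]]
  }'
}
subarray_demo
```

The price is the extra bookkeeping at insert time and a second array of keys. A different view (say, by token instead of by root) would need its own ref-style index.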
-Ed
|
|
0
|
|
|
|
Reply
|
Edward
|
9/15/2010 10:57:18 AM
|
|
In article <dW_jo.75$%a.57@tornado.fastwebnet.it>,
luca <luca_remove@alice.it> wrote:
>On 14/09/2010 22.45, Aharon Robbins wrote:
>>
>> I have simulated trees in awk with little problem, and the code was
>> fairly straightforward. I still occasionally use the script where I
>> build a tree structure.
>
>
>is this a script you can post?
>
>Luca
See the prepinfo program in the GNU texinfo distribution. It's a bit
big to post (a few hundred lines). It replaced a C program that was
four times bigger and didn't do as much.
Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
|
|
0
|
|
|
|
Reply
|
arnold
|
9/16/2010 6:38:21 AM
|
|
In article <yV_jo.73$%a.25@tornado.fastwebnet.it>,
luca <luca_remove@alice.it> wrote:
>OK. You are right. I wonder why Gawk has not evolved to be awk on
>steroids then.
Because I don't think that's necessarily "the right thing" to do.
At some point, such a language ceases to be awk. To paraphrase
Dennis Ritchie, if you want perl, you know where to get it. :-)
Somebody did do an awk on steroids a few years back, it was called
"hawk". I don't have a URL for it though.
Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
|
|
0
|
|
|
|
Reply
|
arnold
|
9/16/2010 7:35:10 AM
|
|
On 16/09/2010 9:35, Aharon Robbins wrote:
> In article<yV_jo.73$%a.25@tornado.fastwebnet.it>,
> luca<luca_remove@alice.it> wrote:
>> OK. You are right. I wonder why Gawk has not evolved to be awk on
>> steroids then.
>
> Because I don't think that's necessarily "the right thing" to do.
> At some point, such a language ceases to be awk. To paraphrase
> Dennis Ritchie, if you want perl, you know where to get it. :-)
>
> Somebody did do an awk on steroids a few years back, it was called
> "hawk". I don't have a URL for it though.
Well, the "AWK FAQ" mentions:
> ftwalk / hawk
>
> > a language that attempts to scale awk principles up to
> > a level competitive with Perl, Python, etc. Run as ftwalk,
> > it does a file tree walk (think of find+awk). Run as hawk,
> > it runs awk scripts (not quite compatibly).
>
> <http://www.tomhull.com/ocston/projects/hawk.html>
> <http://ftwalk.sourceforge.net/>
Hope this helps.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
|
|
0
|
|
|
|
Reply
|
Manuel
|
9/16/2010 11:01:46 AM
|
|
|
72 Replies
317 Views