changing a field without recompiling the record

Someone recently asked how to do this (all spaces in "file" are tabs):

$ cat file
aaa     bbb     cXcXc   ddd
$ awk 'BEGIN{FS=OFS="\t"}gsub(/X/,"+",$3)' file
aaa     bbb     c+c+c   ddd

i.e. change all occurrences of a character in a field without changing
the spacing between fields. The above only "works" because the field
separator is a single specific character, so we can set OFS to the same
character: the record still gets recompiled, but the field spacing looks
the same afterwards. If FS is a space character (or any RE), however, we
can't use that trick (all spaces in "file" are chains of blank chars):

$ cat file
aaa        bbb         cXcXc              ddd
$ awk 'BEGIN{FS=OFS=" "}gsub(/X/,"+",$3)' file
aaa bbb c+c+c ddd
$ awk 'BEGIN{FS=OFS=" +"}gsub(/X/,"+",$3)' file
aaa +bbb +c+c+c +ddd

This problem of unwanted recompilation of the record comes up so
often that I wonder if anyone has a suggestion for a simple way to work
around it in general. Unless you use GNU awk's gensub(), the best I can
think of is to carve up the record into the part before the 3rd field,
the 3rd field itself, and the part after the 3rd field, then stick it all
back together after manipulating the 3rd field, but any
implementation I can imagine for that is very lengthy considering how
simple the problem sounds. Any suggestions?

      Ed.
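
For reference, the "carve up the record" idea described above can be sketched in plain awk as follows. This is only a minimal sketch, not code from the thread: it assumes default whitespace splitting (blanks and tabs), and the names edit_third_field and edit_field are made up for illustration.

```shell
# Hypothetical helper: change X to + in field 3 only, keeping the original spacing.
edit_third_field() {
    awk '
    # Split rec into head (everything before field n), the field itself, and tail.
    function edit_field(rec, n,    head, tail, fld, i) {
        head = ""
        tail = rec
        for (i = 1; i <= n; i++) {
            if (!match(tail, /[^ \t]+/))       # fewer than n fields: leave record alone
                return rec
            if (i < n) {                        # move field i and its leading blanks to head
                head = head substr(tail, 1, RSTART + RLENGTH - 1)
                tail = substr(tail, RSTART + RLENGTH)
            }
        }
        head = head substr(tail, 1, RSTART - 1) # blanks in front of field n
        fld  = substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART + RLENGTH)
        gsub(/X/, "+", fld)                     # the actual edit
        return head fld tail
    }
    { print edit_field($0, 3) }
    '
}

printf 'aaa        bbb         cXcXc              ddd\n' | edit_third_field
```

Because $0 is never assigned to field by field, awk never rebuilds the record with OFS. The cost is that the sketch hard-codes the separator class instead of honouring an arbitrary FS.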

0
Reply mortonspam (827) 11/25/2008 10:53:19 PM

Ed Morton wrote:
[snip]
> This problem of unwanted recompilation of the record comes up so
> often, I wonder if anyone has a suggestion on a simple way to work
> around it in general. Unless you use GNU awks gensub(), the best I can
> think of is to carve up the record into the part before the 3rd field,
> then the 3rd field, then the part after the 3rd field and stick it all
> back together afterward manipulating the 3rd field, but any
> implementation I can imagine for that is very lengthy considering how
> simple the problem sounds. Any suggestions?
> 
>       Ed.
> 

Hi,

I don't have a simple solution, but my awk function does the necessary
splitting and stores both the fields and the separator strings between them:

---------8<------------------------
# function factor(s, a, re):
# factors string s by regex re in a[1] a[2] ... a[2*n+1]
# where a[2*i] are instances of re for 1 <= i <= n
# re in { "", " " } is handled specially like FS == re
# in gawk resp. awk
function factor(s, a, re,    i, j)
{
   for (i in a)
     delete a[i]
   i = 1
   if (re == "") {
     for (j = 1; j < length(s); ++j) {
       a[i++] = substr(s,j,1)
       a[i++] = ""
     }
     a[i] = substr(s,j,1)
   } else if (re == " ") {
     re = "[ \t]+"
     if (match(s, re) == 1)
       s = substr(s, RSTART + RLENGTH)
     while (match(s, re)) {
       a[i++] = substr(s, 1, RSTART - 1)
       a[i++] = substr(s, RSTART, RLENGTH)
       s = substr(s, RSTART + RLENGTH)
     }
     if (s == "")
       delete a[i-1]
     else
       a[i] = s
   } else {
     while (match(s, re)) {
       a[i++] = substr(s, 1, RSTART - 1)
       a[i++] = substr(s, RSTART, RLENGTH)
       s = substr(s, RSTART + RLENGTH)
     }
     a[i] = s
   }
}
---------8<------------------------

Perhaps the function is useful for you.

--
Steffen
0
Reply Steffen 11/26/2008 6:47:42 AM


On Tuesday 25 November 2008 23:53, Ed Morton wrote:

> Someone recently asked how to do this (all spaces in "file" are tabs):
> 
> $ cat file
> aaa     bbb     cXcXc   ddd
> $ awk 'BEGIN{FS=OFS="\t"}gsub(/X/,"+",$3)' file
> aaa     bbb     c+c+c   ddd
> 
> i.e. change all ocurrences of a character in a field without changing
> the spacing between fields. The above only "works" because the field
> separator is a specific character and we can set the OFS to the same
> character so that though the record gets recompiled it looks the same
> after as before wrt field spacing. If the FS had been a space
> character (or any RE), however, we can't use that same trick 

This is because there is no built-in way to store the separators (i.e. something
that does for FS what RT does for RS in GNU awk). It would be nice to have, say,
an array where the strings that matched FS are stored.
Alternatively, it could be useful to know (again, in an array) the offsets
inside $0 where the various fields start, and the rest could be worked out
from that. In the absence of these facilities, when FS is a regex it is
difficult if not impossible to restore the same separators in the output
line.
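
For readers on a newer GNU awk: gawk 4.0 and later accept an optional fourth argument to split() that receives the separator strings, which is essentially the array wished for above. A minimal sketch, assuming such a gawk is installed as "gawk" (keep_spacing is a made-up name):

```shell
# Assumes gawk 4.0 or later is installed as "gawk".
keep_spacing() {
    gawk '{
        n = split($0, f, FS, sep)      # sep[i] is the text between f[i] and f[i+1]
        if (n >= 3)
            gsub(/X/, "+", f[3])       # edit the array copy, never $3 itself
        out = sep[0]                   # leading whitespace, if any
        for (i = 1; i <= n; i++)
            out = out f[i] sep[i]
        print out
    }'
}

printf 'aaa        bbb         cXcXc              ddd\n' | keep_spacing
```

This does not help with other awks, which is the situation the thread is about.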

> (all spaces in "file" are chains of blank chars):
> 
> $ cat file
> aaa        bbb         cXcXc              ddd
> $ awk 'BEGIN{FS=OFS=" "}gsub(/X/,"+",$3)' file
> aaa bbb c+c+c ddd
> $ awk 'BEGIN{FS=OFS=" +"}gsub(/X/,"+",$3)' file
> aaa +bbb +c+c+c +ddd

The following code seems to work in this case:

$ cat file
aaa        bbb         cXcXc              ddd
$ awk '{a=$3;gsub(/X/,"+",a);sub($3,a)}1' file
aaa        bbb         c+c+c              ddd

Of course this cannot be a general solution because it is highly dependent on
the actual input data and can fail in lots of cases. The most obvious is
when $1 or $2 is "cXcXc", closely followed by the case where $3
happens to contain a regular expression that matches somewhere earlier in the
record, which can however be seen as a variation of the former.

However, if you know your input data, that might be a workable solution in
some specific cases.
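
To make the first failure mode concrete, here is the same one-liner wrapped in a shell function (fragile_sub is a made-up name) and fed a record whose first field equals the third:

```shell
# The sub($3, a) trick: $3 is used as a regex against the whole record,
# so the first match wins, wherever it is.
fragile_sub() {
    awk '{ a = $3; gsub(/X/, "+", a); sub($3, a) } 1'
}

printf 'cXcXc  bbb  cXcXc  ddd\n' | fragile_sub
# prints: c+c+c  bbb  cXcXc  ddd   (field 1 changed, field 3 untouched)
```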
 
> This problem of unwanted recompilation of the record comes up so
> often, I wonder if anyone has a suggestion on a simple way to work
> around it in general. Unless you use GNU awks gensub(), the best I can
> think of is to carve up the record into the part before the 3rd field,
> then the 3rd field, then the part after the 3rd field and stick it all
> back together afterward manipulating the 3rd field, but any implementation
> I can imagine for that is very lengthy considering how simple the problem
> sounds. Any suggestions? 

I'd either use this technique or write a function along the lines of that
posted by Steffen. However, I'm not 100% sure that repeatedly iterating
over the line with match() and pulling out the various separators and
fields is guaranteed to produce exactly the same result as FS does when
splitting the line into fields. In particular, I suspect that there can be
issues when FS contains a regular expression that can match the empty
string. FS splitting ignores that (or at least it should, as emerged in a
thread here some time ago), but match() will match it.
For example, you cannot write such a function using a loop like

while(match($0,FS)){ ... }

or at least, you have to introduce additional logic to work around match()
matching the empty string. Furthermore, you have to provide additional code
to handle the case where FS has the default value, where leading and
trailing space is removed, and runs of consecutive space characters are
treated as a single separator.
You'd probably end up writing code that duplicates the inner workings of FS
splitting, which I guess is not really short either.
So in the end, it looks like the before/after part technique is the safest.
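
The "additional logic" for empty matches could look roughly like the sketch below. It only lists the non-empty separators found in a string and is not a full FS emulation. The function name list_seps is hypothetical, and the sketch assumes an awk whose match() sets RLENGTH to 0 for an empty match, as POSIX specifies.

```shell
# usage: list_seps STRING REGEX
list_seps() {
    awk -v s="$1" -v re="$2" 'BEGIN {
        while (s != "" && match(s, re)) {
            if (RLENGTH == 0) {        # empty match: step over one character
                s = substr(s, 2)       # instead of looping on the same spot
                continue
            }
            print substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

list_seps aXbXc 'X*'
```

For the string "aXbXc" and the regex "X*" this prints X twice and terminates. A bare while (match(...)) loop would spin on the empty match in front of the "a".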
0
Reply pk 11/26/2008 11:02:04 AM

On Wednesday 26 November 2008 07:47, Steffen Schuler wrote:

> I have indeed no simple solution but my awk-function does the necessary
> splitting with storing of the splitted and intermediate strings:
> 
> ---------8<------------------------
> function factor(s, a, re,    i, j)
> {
[snip]
>    } else {
>      while (match(s, re)) {
>        a[i++] = substr(s, 1, RSTART - 1)
>        a[i++] = substr(s, RSTART, RLENGTH)
>        s = substr(s, RSTART + RLENGTH)
>      }
>      a[i] = s
>    }
> }

Note that the above function will loop forever with, say, a string
like "aXbXc" and a RE like "X*".

0
Reply pk 11/26/2008 11:50:20 AM

Hi Ed,

I don't know whether my approach solves your problem or not.
This is how I handled it myself:
=============================================
gawk '
  BEGIN {
    FS = OFS = ""
  }
  {
    for (i = 1; i <= NF; i++) {
      if ($i == "X") {
        $i = "+"
      }
      printf "%s", $i
    }
  }
' FILE_IN > FILE_OUT
===================================
FILE_IN is the input file, which contains tabs:
aaa     bbb     cXcXc   ddd

After running the script, FILE_OUT has the following content:
aaa     bbb     c+c+c   ddd

Regards
Mohsen
0
Reply mohsen 11/26/2008 3:25:46 PM

On Nov 26, 9:25 am, mohsen <mohsen.ow...@biospectra.ch> wrote:
> Hi Ed,
>
> I don't know that my way is a solution for your problem or not.
[snip]

Mohsen - no, that'd make the change in every field rather than just
the field I was interested in and I'm looking for a general solution
for any change to a field, not just changing individual characters.

Thanks though (and thanks also to Steffen and pk). Still feels like
there should be a concise solution out there somewhere....

    Ed.
0
Reply Ed 11/26/2008 3:40:26 PM

In article <d1ee33f5-f72c-4bae-b45b-1c9573fd7d10@f13g2000yqj.googlegroups.com>,
mohsen  <mohsen.owzar@biospectra.ch> wrote:
>Hi Ed,
>
>I don't know that my way is a solution for your problem or not.
>I tried to help myself in this way as below!!
>=============================================
>gawk '
>  BEGIN {
>    FS = OFS = ""
>  }
>  {
>    for (i = 1; i <= NF; i++) {
>      if ($i == "X") {
>        $i = "+"
>      }
>      printf "%s", $i
>    }
>  }
>' FILE_IN > FILE_OUT

If you are going to go that route, you might as well just do:

{
    l = length($0)
    for (i = 1; i <= l; i++) {
        if ((c = substr($0,i,1)) == "X")
            c = "+"
        printf "%s",c
    }
    print ""
}

0
Reply gazelle 11/26/2008 3:41:54 PM

pk wrote:
> On Wednesday 26 November 2008 07:47, Steffen Schuler wrote:
[snip]
> Note that the above function will loop forever with, say, a string
> like "aXbXc" and a RE like "X*".
> 

Now fixed. But factor() doesn't work in gawk 3.1.5 because of a problem 
with match(s,"") and RLENGTH:

--------------8<-----------------------------------------
# function factor(s, a, re):
# factors string s by regex re into a[1] a[2] ... a[2*n+1],
# where a[2*i] are the substrings matched by re for 1 <= i <= n
# re in { "", " " } is handled specially, like FS == re
# in gawk and awk respectively
# returns the number of elements 2*n+1 (including the
# intermediate strings matched by re)
# remark: match(s, "") delivers a wrong RLENGTH in gawk 3.1.5
#         (fixed in CVS gawk),
#         therefore factor() doesn't work in gawk 3.1.5
function factor(s, a, re,    i, j)
{
   for (i in a)
     delete a[i]
   i = 1
   a[1] = ""
   if (re == " ") {
     re = "[ \t]+"
     if (match(s, re) == 1)
       s = substr(s, RSTART + RLENGTH)
     while (match(s, re)) {
       a[i++] = substr(s, 1, RSTART - 1)
       a[i++] = substr(s, RSTART, RLENGTH)
       s = substr(s, RSTART + RLENGTH)
     }
     if (i > 2 && s == "") {
       delete a[--i]
       --i
     } else
       a[i] = s
   } else {
     while (a[i] s != "" && match(s, re)) {
       a[i] = a[i] substr(s, 1, RSTART - 1)
       a[++i] = substr(s, RSTART, RLENGTH)
       ++i
       if (RSTART + RLENGTH == 1) {
         a[i] = substr(s,1,1)
         s = substr(s, 2)
       } else
         s = substr(s, RSTART + RLENGTH)
     }
     a[i] = a[i] s
   }
   return i
}
--------------8<----------------------------------------

-- 
Steffen
0
Reply Steffen 11/26/2008 6:50:55 PM

On Nov 26, 9:41 am, gaze...@shell.xmission.com (Kenny McCormack)
wrote:
[snip]
> If you are going to go that route, you might as well just do:
>
> {
> l = length($0)
> for (i = 1; i <= l; i++) {
>     if ((c = substr($0,i,1)) == "X")
>         c = "+"
>     printf "%s",c
>     }
> print ""
> }

If you're going to go that route you might as well do:

gsub(/X/,"+")

  Ed.
0
Reply Ed 11/26/2008 8:59:00 PM

In article <52b7f03b-ec82-438d-8570-0989fb68dbd1@w24g2000prd.googlegroups.com>,
Ed Morton  <mortonspam@gmail.com> wrote:
....
>If you're going to go that route you might as well do:
>
>gsub(/X/,"+")

That's true, too.  But you know what they say: baby steps...

0
Reply gazelle 11/26/2008 9:06:02 PM

On Nov 25, 5:53 pm, Ed Morton <mortons...@gmail.com> wrote:
[snip]
> This problem of unwanted recompilation of the record comes up so
> often, I wonder if anyone has a suggestion on a simple way to work
> around it in general.
[snip]


Or you can store all the space characters in an array before
the substitution and print them at the end, something along the
lines of the following:

$ cat file
aaa        bYbYb         cXcXc                ddd
$ awk '{ split($0,a,/[^ ]+/); gsub(/Y/,"-",$2); gsub(/X/,"+",$3)
         for (i=1;i<=NF;i++) printf "%s%s", $i, a[i+1]; print "" }' file
aaa        b-b-b         c+c+c                ddd

I haven't tested it with more complex requirements, but for the
given example it seems to work fine.
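
A slightly hardened variant of the same idea is sketched below. It is an illustration only, still tied to default whitespace splitting, and the names sub_keep_gaps and gap are made up. Compared with the one-liner it also keeps leading blanks, treats tabs as separators, and skips records with fewer than three fields.

```shell
sub_keep_gaps() {
    awk '{
        split($0, gap, /[^ \t]+/)      # gap[1] = leading blanks, gap[i+1] follows field i
        if (NF >= 3)
            gsub(/X/, "+", $3)         # rebuilds $0, but gap[] still holds the spacing
        out = gap[1]
        for (i = 1; i <= NF; i++)
            out = out $i gap[i + 1]
        print out
    }'
}

printf '  aaa   bYbYb  cXcXc    ddd\n' | sub_keep_gaps
```

The fields are the separators of the split() call here, so the pieces that land in gap[] are exactly the runs of blanks and tabs.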

0
Reply mbsans1 11/27/2008 12:51:51 AM

On Nov 26, 6:51 pm, mbsa...@gmail.com wrote:
[snip]
> Or you can store all the space characters in an array before
> the substitution, and print them in the end, something in the
> lines of the following:
[snip]
> I haven't tested it with more complex requirements, but for the
> given example seems to work fine.

The main problem with that is that in general it's not the "non-space
characters" you'd need to store, it's the "non-FS regexps" so you'd
need some way to negate a regexp in the split() call. You'd also have
to specially handle the case where FS is a single blank char since in
that case $1 would be the first non-blank after any leading blanks
whereas with any other FS, $1 would be an empty field if the record
started with the FS:

$ echo "##a###b" | awk -F'#+' '{print $1,$2}'
 a
$ echo "  a   b" | awk -F' +' '{print $1,$2}'
 a
$ echo "  a   b" | awk -F' ' '{print $1,$2}'
a b

    Ed.
0
Reply Ed 11/27/2008 6:02:00 AM

Steffen Schuler wrote:

> Now fixed. But factor() doesn't work in gawk 3.1.5 because of a problem 
> with match(s,"") and RLENGTH:
> 
[snip]

I have followed the thread with great interest; thanks, all, for your remarks.
I hate to play the part of the nitpicker, but please allow me to make two
comments about the above code.

1) The function factor() looks to me like a way to decompose a line into
matching and not matching parts, given a regular expression. If you think of
re as FS, it seems to mimic the behavior of awk when splitting the line into
fields using FS, with the difference that the parts of the line that match
FS are returned too (correct me if I'm wrong, of course).

I don't understand the results if factor() is called with s="abc" and
re="f*". It returns 9, and the resulting array has these elements
(surrounded by --> <--):

--><--
--><--
-->a<--
--><--
-->b<--
--><--
-->c<--
--><--
--><--

If re is allowed to match the empty string (as it seems), I'd expect the
array to contain this instead:

--><--
-->a<--
--><--
-->b<--
--><--
-->c<--
--><--

or this, if you follow the rule that each line must begin and end with a
field (not a separator):

-->a<--
--><--
-->b<--
--><--
-->c<--

Alternatively, one might argue that the factor() function should behave
exactly like awk when splitting on FS, and thus empty string matching is not
allowed (see
http://groups.google.com/group/comp.lang.awk/browse_thread/thread/7a8846069c606ba7#
for more details). In that case, given the same string and re, one would
expect the array to contain

-->abc<--

Note that I'm not saying that the current behavior of the function is wrong;
I'm just trying to understand why it works that way, and if that behavior is
intentional.


2) I recently discovered this minor problem in gawk (and mawk, and busybox awk):

http://lists.gnu.org/archive/html/bug-gnu-utils/2008-11/msg00039.html

It looks like your function has the same behavior if given a RE that
includes an anchored element, for example given s="XXX" and re="^X", it returns

--><--
-->X<--
--><--
-->X<--
--><--
-->X<--
--><--

instead of

--><--
-->X<--
-->XX<--

As explained in the thread I linked, the reason is that since you consume a
part of the string each time, at the next iteration ^X will match again,
even if the "^X" it's matching is not the real "^X", but it's just an X at
the beginning of the remainder of the string.
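
The effect is easy to reproduce with a small counting loop. The sketch below is purely illustrative: count_matches and its "guarded" mode are made-up names, and the guard is a crude heuristic that only recognises a regex whose first character is "^" (it would be wrong for something like "^X|Y").

```shell
# usage: count_matches STRING REGEX MODE   (MODE is "naive" or "guarded")
count_matches() {
    awk -v s="$1" -v re="$2" -v mode="$3" 'BEGIN {
        n = 0
        while (s != "" && match(s, re) && RLENGTH > 0) {
            n++
            s = substr(s, RSTART + RLENGTH)   # consume: "^" now matches the remainder
            # Heuristic guard: an anchored regex can only match once per string.
            if (mode == "guarded" && substr(re, 1, 1) == "^")
                break
        }
        print n
    }'
}

count_matches XXX '^X' naive      # prints 3
count_matches XXX '^X' guarded    # prints 1
```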

If you've read everything so far, thanks for your patience!

-- 
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
0
Reply Dave 11/27/2008 5:07:18 PM

On Nov 26, 12:50 pm, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
[snip]
> function factor(s, a, re,    i, j)
> {
>    for (i in a)
>      delete a[i]
>    i = 1
>    a[1] = ""
>    if (re == " ") {
>      re = "[ \t]+"

I haven't thought through the code at all (yet), but at this line
ITYM:

    re = "[[:space:]]+"

That'll account for, for example, the case where the RS is something
other than a newline and so a newline would be one of the characters
treated as part of the RS.

      Ed.
0
Reply Ed 11/27/2008 6:27:10 PM

On Nov 27, 12:27 pm, Ed Morton <mortons...@gmail.com> wrote:
[snip]
> I haven't thought through the code at all (yet), but at this line
> ITYM:
>
>     re = "[[:space:]]+"
>
> That'll account for, for example, the case where the RS is something
> other than a newline and so a newline would be one of the characters
> treated as part of the RS.

I mean "FS".



Reply Ed 11/27/2008 6:28:57 PM

Dave B wrote:
> Steffen Schuler wrote:
[snip]
>> function factor(s, a, re,    i, j)
>> {
>>    for (i in a)
>>      delete a[i]
>>    i = 1
>>    a[1] = ""
>>    if (re == " ") {
>>      re = "[ \t]+"
>>      if (match(s, re) == 1)
>>        s = substr(s, RSTART + RLENGTH)
>>      while (match(s, re)) {
>>        a[i++] = substr(s, 1, RSTART - 1)
>>        a[i++] = substr(s, RSTART, RLENGTH)
>>        s = substr(s, RSTART + RLENGTH)
>>      }
>>      if (i > 2 && s == "") {
>>        delete a[--i]
>>        --i
>>      } else
>>        a[i] = s
>>    } else {
>>      while (a[i] s != "" && match(s, re)) {
>>        a[i] = a[i] substr(s, 1, RSTART - 1)
>>        a[++i] = substr(s, RSTART, RLENGTH)
>>        ++i
>>        if (RSTART + RLENGTH == 1) {
>>          a[i] = substr(s,1,1)
>>          s = substr(s, 2)
>>        } else
>>          s = substr(s, RSTART + RLENGTH)
>>      }
>>      a[i] = a[i] s
>>    }
>>    return i
>> }
>> --------------8<----------------------------------------
> 
> I have followed the thread with great interest, thanks all for your remarks.
> I hate to play the part of the nitpicker, but please allow me to do two
> comments about the above code.
> 
> 1) The function factor() looks to me like a way to decompose a line into
> matching and not matching parts, given a regular expression. If you think of
> re as FS, it seems to mimic the behavior of awk when splitting the line into
> fields using FS, with the difference that the parts of the line that match
> FS are returned too (correct me if I'm wrong, of course).

That's correct.

>
> I don't understand the results if factor() is called with s="abc" and
> re="f*". It returns 9, and the resulting array has these elements
> (surrounded by --> <--):
> 
> --><--
> --><--
> -->a<--
> --><--
> -->b<--
> --><--
> -->c<--
> --><--
> --><--

The regular expression re matches the empty string before "a" in "abc". 
That splits off an empty field, followed by the empty string matched by 
re, then "a", then the empty string matched by re again, and so on.

The behavior at the end of the string "abc" is entirely analogous: 
the string "c" sits between two empty strings matched by re, so we get 
the sequence "", "c", "", "". The last empty string is the field following 
the empty string matched by re after the "c".

In my opinion, the behavior is completely correct in this case.

> 
> If re is allowed to match the empty string (as it seems), I'd expect the
> array to contain this instead:
> 
> --><--
> -->a<--
> --><--
> -->b<--
> --><--
> -->c<--
> --><--
> 
> or this, if you follow the rule that each line must begin and end with a
> field (not a separator):
> 
> -->a<--
> --><--
> -->b<--
> --><--
> -->c<--
> 
> Alternatively, one might argue that the factor() function should behave
> exactly like awk when splitting on FS, and thus empty string matching is not
> allowed (see
> http://groups.google.com/group/comp.lang.awk/browse_thread/thread/7a8846069c606ba7#
> for more details). In that case, given the same string and re, one would
> expect the array to contain
> 
> -->abc<--
> 
> Note that I'm not saying that the current behavior of the function is wrong;
> I'm just trying to understand why it works that way, and if that behavior is
> intentional.
> 
> 
> 2) I recently discovered this minor problem in gawk (and mawk, and busybox awk):
> 
> http://lists.gnu.org/archive/html/bug-gnu-utils/2008-11/msg00039.html
> 
> It looks like your function has the same behavior if given a RE that
> includes an anchored element, for example given s="XXX" and re="^X", it returns
> 
> --><--
> -->X<--
> --><--
> -->X<--
> --><--
> -->X<--
> --><--
> 
> instead of
> 
> --><--
> -->X<--
> -->XX<--
> 
> As explained in the thread I linked, the reason is that since you consume a
> part of the string each time, at the next iteration ^X will match again,
> even if the "^X" it's matching is not the real "^X", but it's just an X at
> the beginning of the remainder of the string.
> 
> If you've read everything so far, thanks for your patience!
> 

This could indeed be a minor problem. But I assumed that no string 
anchors should be used in re.
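
To make the anchor problem concrete, here is a minimal sketch. The helper
name count_by_consuming is made up for illustration and is not part of
factor(); it only counts matches by consuming the string the way the loop
in factor() does, so a leading ^ gets re-evaluated against each remainder:

```shell
# Hypothetical helper: count the matches of regex $2 in string $1 by
# repeatedly cutting off everything up to the end of each match.
count_by_consuming() {
  awk -v s="$1" -v re="$2" 'BEGIN {
    n = 0
    # RLENGTH > 0 guards against looping forever on empty matches
    while (s != "" && match(s, re) && RLENGTH > 0) {
      n++
      s = substr(s, RSTART + RLENGTH)   # the remainder becomes the new string
    }
    print n
  }'
}

count_by_consuming "XXX" "^X"   # prints 3, though only the first X is at the real start
count_by_consuming "XXX" "X"    # prints 3 as well
```

Both calls give the same count, which is exactly the effect described
above: once the string has been consumed, "^" no longer refers to the
start of the original string.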

-- 
Steffen
Reply Steffen 11/27/2008 10:35:21 PM

Ed Morton wrote:
> On Nov 27, 12:27 pm, Ed Morton <mortons...@gmail.com> wrote:
>> On Nov 26, 12:50 pm, Steffen Schuler <schuler.stef...@googlemail.com>
>> wrote:
>>
>>
>>
>>
>>
>>> pk wrote:
>>>> On Wednesday 26 November 2008 07:47, Steffen Schuler wrote:
>>>>> I have indeed no simple solution but my awk-function does the necessary
>>>>> splitting with storing of the splitted and intermediate strings:
>>>>> ---------8<------------------------
>>>>> # function factor(s, a, re):
>>>>> # factors string s by regex re in a[1] a[2] ... a[2*n+1]
>>>>> # where a[2*i] are instances of re for 1 <= i <= n
>>>>> # re in { "", " " } is handled specially like FS == re
>>>>> # in gawk resp. awk
>>>>> function factor(s, a, re,    i, j)
>>>>> {
>>>>>    for (i in a)
>>>>>      delete a[i]
>>>>>    i = 1
>>>>>    if (re == "") {
>>>>>      for (j = 1; j < length(s); ++j) {
>>>>>        a[i++] = substr(s,j,1)
>>>>>        a[i++] = ""
>>>>>      }
>>>>>      a[i] = substr(s,j,1)
>>>>>    } else if (re == " ") {
>>>>>      re = "[ \t]+"
>>>>>      if (match(s, re) == 1)
>>>>>        s = substr(s, RSTART + RLENGTH)
>>>>>      while (match(s, re)) {
>>>>>        a[i++] = substr(s, 1, RSTART - 1)
>>>>>        a[i++] = substr(s, RSTART, RLENGTH)
>>>>>        s = substr(s, RSTART + RLENGTH)
>>>>>      }
>>>>>      if (s == "")
>>>>>        delete a[i-1]
>>>>>      else
>>>>>        a[i] = s
>>>>>    } else {
>>>>>      while (match(s, re)) {
>>>>>        a[i++] = substr(s, 1, RSTART - 1)
>>>>>        a[i++] = substr(s, RSTART, RLENGTH)
>>>>>        s = substr(s, RSTART + RLENGTH)
>>>>>      }
>>>>>      a[i] = s
>>>>>    }
>>>>> }
>>>> Note that the above function will loop forever with, say, a string
>>>> like "aXbXc" and a RE like "X*".
>>> Now fixed. But factor() doesn't work in gawk 3.1.5 because of a problem
>>> with match(s,"") and RLENGTH:
>>> --------------8<-----------------------------------------
>>> # function factor(s, a, re):
>>> # factors string s by regex re in a[1] a[2] ... a[2*n+1]
>>> # where a[2*i] are instances of re for 1 <= i <= n
>>> # re in { "", " " } is handled specially like FS == re
>>> # in gawk resp. awk
>>> # returns the number of fields 2*n+1 (inclusively the
>>> # intermediate strings matched by re)
>>> # remark: match(s, "") delivers a wrong RLENGTH in gawk 3.1.5
>>> #         fixed in CVS gawk
>>> #         therefore factor() doesn't work in gawk 3.1.5
>>> function factor(s, a, re,    i, j)
>>> {
>>>    for (i in a)
>>>      delete a[i]
>>>    i = 1
>>>    a[1] = ""
>>>    if (re == " ") {
>>>      re = "[ \t]+"
>> I haven't thought through the code at all (yet), but at this line
>> ITYM:
>>
>>     re = "[[:space:]]+"

The POSIX awk manpage specifies that " " should match blanks where 
blanks mean " " or "\t".

Cited from POSIX awk manpage contained in Debian GNU/Linux 4.0 as 
package "manpages-posix":

[...]
An  extended regular expression can be used to separate fields by using
the -F ERE option or by assigning a string containing the expression to
the built-in variable FS. The default value of the FS variable shall be
a single <space>. The following describes FS behavior:

    1. If FS is a null string, the behavior is unspecified.

    2. If FS is a single character:

       a. If FS is <space>, skip leading and  trailing  <blank>s;  fields
          shall be delimited by sets of one or more <blank>s.

       b. Otherwise,  if  FS  is  any  other character c, fields shall be
          delimited by each single occurrence of c.
[...]
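
The two single-character cases in the citation can be checked quickly
with any POSIX-like awk. This is only a sketch; the sample strings are
made up for the purpose:

```shell
# Case 2a, FS is a single space (the default): leading and trailing blanks
# are skipped and runs of spaces/tabs act as one delimiter.
printf '  aaa   bbb\tccc  \n' | awk '{ print NF }'     # prints 3

# Case 2b, FS is some other single character: every single occurrence
# delimits, so adjacent or trailing separators produce empty fields.
printf 'aaa::bbb:\n' | awk -F: '{ print NF }'          # prints 4
```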

>>
>> That'll account for, for example, the case where the RS is something
>> other than a newline and so a newline would be one of the characters
>> treated as part of the RS.
> 
> I mean "FS".
> 
> 
> 


-- 
Steffen
Reply Steffen 11/27/2008 10:44:24 PM

On Nov 27, 4:44 pm, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
> Ed Morton wrote:
> > On Nov 27, 12:27 pm, Ed Morton <mortons...@gmail.com> wrote:
> >> On Nov 26, 12:50 pm, Steffen Schuler <schuler.stef...@googlemail.com>
> >> wrote:
>
> >>> pk wrote:
> >>>> On Wednesday 26 November 2008 07:47, Steffen Schuler wrote:
> >>>>> I have indeed no simple solution but my awk-function does the necessary
> >>>>> splitting with storing of the splitted and intermediate strings:
> >>>>> ---------8<------------------------
> >>>>> # function factor(s, a, re):
> >>>>> # factors string s by regex re in a[1] a[2] ... a[2*n+1]
> >>>>> # where a[2*i] are instances of re for 1 <= i <= n
> >>>>> # re in { "", " " } is handled specially like FS == re
> >>>>> # in gawk resp. awk
> >>>>> function factor(s, a, re,    i, j)
> >>>>> {
> >>>>>    for (i in a)
> >>>>>      delete a[i]
> >>>>>    i = 1
> >>>>>    if (re == "") {
> >>>>>      for (j = 1; j < length(s); ++j) {
> >>>>>        a[i++] = substr(s,j,1)
> >>>>>        a[i++] = ""
> >>>>>      }
> >>>>>      a[i] = substr(s,j,1)
> >>>>>    } else if (re == " ") {
> >>>>>      re = "[ \t]+"
> >>>>>      if (match(s, re) == 1)
> >>>>>        s = substr(s, RSTART + RLENGTH)
> >>>>>      while (match(s, re)) {
> >>>>>        a[i++] = substr(s, 1, RSTART - 1)
> >>>>>        a[i++] = substr(s, RSTART, RLENGTH)
> >>>>>        s = substr(s, RSTART + RLENGTH)
> >>>>>      }
> >>>>>      if (s == "")
> >>>>>        delete a[i-1]
> >>>>>      else
> >>>>>        a[i] = s
> >>>>>    } else {
> >>>>>      while (match(s, re)) {
> >>>>>        a[i++] = substr(s, 1, RSTART - 1)
> >>>>>        a[i++] = substr(s, RSTART, RLENGTH)
> >>>>>        s = substr(s, RSTART + RLENGTH)
> >>>>>      }
> >>>>>      a[i] = s
> >>>>>    }
> >>>>> }
> >>>> Note that the above function will loop forever with, say, a string
> >>>> like "aXbXc" and a RE like "X*".
> >>> Now fixed. But factor() doesn't work in gawk 3.1.5 because of a problem
> >>> with match(s,"") and RLENGTH:
> >>> --------------8<-----------------------------------------
> >>> # function factor(s, a, re):
> >>> # factors string s by regex re in a[1] a[2] ... a[2*n+1]
> >>> # where a[2*i] are instances of re for 1 <= i <= n
> >>> # re in { "", " " } is handled specially like FS == re
> >>> # in gawk resp. awk
> >>> # returns the number of fields 2*n+1 (inclusively the
> >>> # intermediate strings matched by re)
> >>> # remark: match(s, "") delivers a wrong RLENGTH in gawk 3.1.5
> >>> #         fixed in CVS gawk
> >>> #         therefore factor() doesn't work in gawk 3.1.5
> >>> function factor(s, a, re,    i, j)
> >>> {
> >>>    for (i in a)
> >>>      delete a[i]
> >>>    i = 1
> >>>    a[1] = ""
> >>>    if (re == " ") {
> >>>      re = "[ \t]+"
> >> I haven't thought through the code at all (yet), but at this line
> >> ITYM:
>
> >>     re = "[[:space:]]+"
>
> The POSIX awk manpage specifies that " " should match blanks where
> blanks mean " " or "\t".

Hmm, I hadn't noticed that about POSIX before. I prefer the normal
gawk behavior:

$ printf "abc \n defX" | gawk -v RS='X' '{printf "$1=<%s>,$2=<%s>\n",$1,$2}'
$1=<abc>,$2=<def>
$ printf "abc \n defX" | gawk --posix -v RS='X' '{printf "$1=<%s>,$2=<%s>\n",$1,$2}'
$1=<abc>,$2=<
>

but maybe that's just because that's what I'm used to.

   Ed.
Reply Ed 11/28/2008 3:42:26 AM

In article <ggja13$rf0$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>On Tuesday 25 November 2008 23:53, Ed Morton wrote:
>This is because there is no built in way to store FSs (ie, something that
>does for FS what RT does for RS in GNU awk). It would be nice to have, say,
>an array or something, where the characters that matched FS are stored.
>Alternatively, it could be useful to know (again, in an array) the offsets
>inside $0 where the various fields start, and the rest could be worked out
>from that. In the absence of these facilities, when FS is a regex it is
>difficult if not impossible to restore the same separators in the output
>line.

Not quite.  There is no ability to do this for the record and fields.
Saving this information for every record would be very expensive, and
I felt that it was not right to pay this price unconditionally when
it's not needed most of the time.

HOWEVER, you can get everything you need to do this kind of thing by
passing an array as the third argument to match(). See the gawk doc.
(This is a gawk extension.)

Arnold
-- 
Aharon (Arnold) Robbins 				arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
Reply arnold 11/28/2008 9:40:12 AM

On Friday 28 November 2008 10:40, Aharon Robbins wrote:

> In article <ggja13$rf0$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>>On Tuesday 25 November 2008 23:53, Ed Morton wrote:
>>This is because there is no built in way to store FSs (ie, something that
>>does for FS what RT does for RS in GNU awk). It would be nice to have,
>>say, an array or something, where the characters that matched FS are
>>stored. Alternatively, it could be useful to know (again, in an array) the
>>offsets inside $0 where the various fields start, and the rest could be
>>worked out from that. In the absence of these facilities, when FS is a
>>regex it is difficult if not impossible to restore the same separators in
>>the output line.
> 
> Not quite.  There is no ability to do this for the record and fields.
> Saving this information for every record would be very expensive, and
> I felt that it was not right to pay this price unconditionally when
> it's not needed most of the time.

Well, I don't know the inner workings of the algorithm that parses the line
into fields according to FS, but I suppose it involves some kind of loop
over the line, each time looking if FS matches and if so, take what's
before the match as field and what matched as separator (roughly at least).
If it's so (which of course could be wrong), all the information needed is
already available as part of the process, and all that would be needed is
assigning those offsets into a special array, so, say OFF[1]...OFF[NF]
would contain the offsets in the line where $1...$NF start. After that, one
could easily pull out the parts of $0 that matched FS based on OFF[n] and
length($n). Or awk could directly assign the parts that matched FS to a
special array, like eg SEP[1]...SEP[NF-1].

I agree that most of the time that information is not needed, and it would
be useful only when FS is a RE; nonetheless, I still think that having it
could be a nice feature, most certainly not enabled by default, but perhaps
selectable using a switch.
 
> HOWEVER, you can get everything you need to do this kind of thing by
> passing an array as the third argument to match(). See the gawk doc.
> (This is a gawk extension.)

Fair enough. However, from what I understand, match() does NOT exactly mimic
the behavior of FS splitting, since it will match the empty string. For
example (sorry to always bring up the same corner test case),
match("aXbXc",/X*/,a) does not do what an FS of "X*" would do to the same
line.

Anyway, even with regular cases, I cannot get it to work:

awk 'BEGIN{
  match("abab",/(b)/,a)
  print a[1,"start"],a[1,"length"]
  print a[2,"start"],a[2,"length"]}'

I expect the above program to print

2 1
4 1

which it doesn't. (Note that it's entirely possible that I completely
misunderstood how the third argument to match() works. In that case, please
accept my apologies).

Reply pk 11/28/2008 11:12:48 AM

Steffen Schuler wrote:
 >
 > --------------8<-----------------------------------------
 > # function factor(s, a, re):
 > # factors string s by regex re in a[1] a[2] ... a[2*n+1]
 > # where a[2*i] are instances of re for 1 <= i <= n
 > # re in { "", " " } is handled specially like FS == re
 > # in gawk resp. awk
 > # returns the number of fields 2*n+1 (inclusively the
 > # intermediate strings matched by re)
 > # remark: match(s, "") delivers a wrong RLENGTH in gawk 3.1.5
 > #         fixed in CVS gawk
 > #         therefore factor() doesn't work in gawk 3.1.5
 > function factor(s, a, re,    i, j)
 > {
 >   for (i in a)
 >     delete a[i]
 >   i = 1
 >   a[1] = ""
 >   if (re == " ") {
 >     re = "[ \t]+"
 >     if (match(s, re) == 1)
 >       s = substr(s, RSTART + RLENGTH)
 >     while (match(s, re)) {
 >       a[i++] = substr(s, 1, RSTART - 1)
 >       a[i++] = substr(s, RSTART, RLENGTH)
 >       s = substr(s, RSTART + RLENGTH)
 >     }
 >     if (i > 2 && s == "") {
 >       delete a[--i]
 >       --i
 >     } else
 >       a[i] = s
 >   } else {
 >     while (a[i] s != "" && match(s, re)) {
 >       a[i] = a[i] substr(s, 1, RSTART - 1)
 >       a[++i] = substr(s, RSTART, RLENGTH)
 >       ++i
 >       if (RSTART + RLENGTH == 1) {
 >         a[i] = substr(s,1,1)
 >         s = substr(s, 2)
 >       } else
 >         s = substr(s, RSTART + RLENGTH)
 >     }
 >     a[i] = a[i] s
 >   }
 >   return i
 > }
 > --------------8<----------------------------------------

pk wrote:
> On Friday 28 November 2008 10:40, Aharon Robbins wrote:
[snip]
>> HOWEVER, you can get everything you need to do this kind of thing by
>> passing an array as the third argument to match(). See the gawk doc.
>> (This is a gawk extension.)
> 
> Fair enough, However, from what I understand, match() does NOT exactly mimic
> the behavior of FS splitting, since it will match the empty string. For
> example (sorry to always bring up the same corner test case),
> match("aXbXc",/X*/,a) does not do what an FS of "X*" would do to the same
> line.
> 
> Anyway, even with regular cases, I cannot get it to work:
> 
> awk 'BEGIN{
>   match("abab",/(b)/,a)
>   print a[1,"start"],a[1,"length"]
>   print a[2,"start"],a[2,"length"]}'
> 
> I expect the above program to print
> 
> 2 1
> 4 1
> 
> which it doesn't. (Note that it's entirely possible that I completely
> misunderstood how the third argument to match() works. In that case, please
> accept my apologies).
> 

Hi,

`match(s, re, a)´ stores in `a´ each instance of a match of a 
regex-subexpression of `re´ in parentheses at a _single_ match.
`a[i]´ (`a[i, "start"]´, `a[i,"length"]´) contains the matched substring 
by the i-th regex subexpression of `re´ in parentheses where i is 
counted by the number of the i-th _left_ parenthesis.

For example:

$ cat match_script.awk
BEGIN {
   match("abab", /((b)(.))/, a)
   for (i = 1; i in a; ++i)
     printf "a[%d]: %s\n", i, a[i]
}
$ awk -f match_script.awk
a[1]: ba
a[2]: b
a[3]: a
$

IMHO there is no possibility to use the third argument of match() to do 
the task solved with the function factor() which could not be solved as 
easily with the first two arguments of match().

I would be glad about Arnold giving us an example how he wants to solve 
the task with the third argument of match().

Thank you very much for any answers.

-- 
Steffen
Reply Steffen 11/28/2008 2:11:07 PM

On Friday 28 November 2008 15:11, Steffen Schuler wrote:

>> Anyway, even with regular cases, I cannot get it to work:
>> 
>> awk 'BEGIN{
>>   match("abab",/(b)/,a)
>>   print a[1,"start"],a[1,"length"]
>>   print a[2,"start"],a[2,"length"]}'
>> 
>> I expect the above program to print
>> 
>> 2 1
>> 4 1
>> 
>> which it doesn't. (Note that it's entirely possible that I completely
>> misunderstood how the third argument to match() works. In that case,
>> please accept my apologies).
>> 
> 
> Hi,
> 
> `match(s, re, a)´ stores in `a´ each instance of a match of a
> regex-subexpression of `re´ in parentheses at a _single_ match.
> `a[i]´ (`a[i, "start"]´, `a[i,"length"]´) contains the matched substring
> by the i-th regex subexpression of `re´ in parentheses where i is
> counted by the number of the i-th _left_ parenthesis.
> 
> For example:
> 
> $ cat match_script.awk
> BEGIN {
>    match("abab", /((b)(.))/, a)
>    for (i = 1; i in a; ++i)
>      printf "a[%d]: %s\n", i, a[i]
> }
> $ awk -f match_script.awk
> a[1]: ba
> a[2]: b
> a[3]: a
> $

(I think a[0] also would hold "ba" in that case, if I read the docs
correctly)

Ah ok, so I misunderstood that. match() only matches the first occurrence of
the regex in the string, and the purpose of the array is to store
subexpressions *within that match* only. I was clearly thinking of the way
m//g works in Perl when I wrote that. Thanks for clearing that up.

As an extension, it would be nice to have match() be sensitive to the current
value of RSTART (or other special variable), ie set RSTART=10 and match()
tries to match the string starting from position 10. So you could loop over
the string with match(), but without being forced to consume it to find
subsequent matches (which is currently the only method I guess).
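
Until something like that exists, the idea can be approximated in plain
awk. The following is only a sketch: match_from() is a made-up name, not
a built-in, and it still works on a substring internally, so an anchored
RE such as "^b" would match at the given start position rather than only
at the real start of the string:

```shell
match_from_demo() {
  awk '
  # Hypothetical helper: like match(), but the search begins at position
  # start. RSTART is adjusted so it is relative to the whole string s.
  function match_from(s, re, start) {
    if (match(substr(s, start), re)) {
      RSTART += start - 1
      return RSTART
    }
    return 0
  }
  BEGIN {
    s = "abab"
    pos = 1
    while (pos <= length(s) && match_from(s, "b", pos)) {
      print RSTART, RLENGTH
      # always advance, even if the RE matched the empty string
      pos = RSTART + (RLENGTH > 0 ? RLENGTH : 1)
    }
  }'
}

match_from_demo
# prints:
# 2 1
# 4 1
```

The caller keeps the original string intact and only moves pos forward,
which gives the "2 1" / "4 1" result that the earlier match("abab",/(b)/,a)
attempt was hoping for.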

> IMHO there is no possibility to use the third argument of match() to do
> the task solved with the function factor() which could not be solved as
> easily with the first two arguments of match().
> 
> I would be glad about Arnold giving us an example how he wants to solve
> the task with the third argument of match().
> 
> Thank you very much for any answers.

+1

Reply pk 11/28/2008 2:26:47 PM

pk wrote:
> [...]
> As an extension, it would be nice to have match() be sensible to the current
> value of RSTART (or other special variable), ie set RSTART=10 and match()
> tries to match the string starting from position 10.

And, optionally, to consider also RLENGTH to continue searching from
RSTART+RLENGTH. Both variants would have been helpful, mainly to tidy
up code, in several applications.

One must be careful, though, with such implicit initializations.

> So you could loop over
> the string with match(), but without being forced to consume it to find
> subsequent matches (which is currently the only method I guess).

Yes, that would be a fine feature. Maybe in a function with another
name; I wouldn't want behavioural changes of that kind in match() by
command line options.

Janis
Reply Janis 11/28/2008 2:44:16 PM

Ed Morton <mortonspam@gmail.com> writes:
> Someone recently asked how to do this (all spaces in "file" are tabs):
>
> $ cat file
> aaa     bbb     cXcXc   ddd
> $ awk 'BEGIN{FS=OFS="\t"}gsub(/X/,"+",$3)' file
> aaa     bbb     c+c+c   ddd
>
> i.e. change all ocurrences of a character in a field without changing
> the spacing between fields. The above only "works" because the field
> separator is a specific character and we can set the OFS to the same
> character so that though the record gets recompiled it looks the same
> after as before wrt field spacing. If the FS had been a space
> character (or any RE), however, we can't use that same trick (all
> spaces in "file" are chains of blank chars):
>
> $ cat file
> aaa        bbb         cXcXc              ddd
> $ awk 'BEGIN{FS=OFS=" "}gsub(/X/,"+",$3)' file
> aaa bbb c+c+c ddd
> $ awk 'BEGIN{FS=OFS=" +"}gsub(/X/,"+",$3)' file
> aaa +bbb +c+c+c +ddd
>
> This problem of unwanted recompilation of the record comes up so
> often, I wonder if anyone has a suggestion on a simple way to work
> around it in general. Unless you use GNU awks gensub(), the best I can
> think of is to carve up the record into the part before the 3rd field,
> then the 3rd field, then the part after the 3rd field and stick it all
> back together afterward manipulating the 3rd field, but any
> implementation I can imagine for that is very lengthy considering how
> simple the problem sounds. Any suggestions?

As others have shown a general solution in awk is quite hard.  However,
I suspect the user usually has some pre-knowledge of how the field
format should look:

$ cat file
aaa        bbb         cXcXc              ddd
$ awk '{gsub(/X/,"+",$3)
        printf("%-11s%-12s%-19s%s\n", $1, $2, $3,$4)}' file
aaa        bbb         c+c+c              ddd

As with several other problems posted here, I think it is often easier
and more resilient to use awk's parsing ability to decode the information
in the file, and use printf to create the desired output file format,
even when one's first inclination is to just play with the structure
of the input data.

-- 
Barry Fishman
Reply Barry 11/28/2008 3:14:34 PM

On Friday 28 November 2008 16:14, Barry Fishman wrote:

> As others have shown a general solution in awk is quite hard.  However,
> I suspect the user usually has some pre-knowledge of how the field
> format should look:
> 
> $ cat file
> aaa        bbb         cXcXc              ddd
> $ awk '{gsub(/X/,"+",$3)
>         printf("%-11s%-12s%-19s%s\n", $1, $2, $3,$4)}' file
> aaa        bbb         c+c+c              ddd
> 
> As with several other problems posted here, I think it is often easier
> and more resilient to use awk's parsing ability to decode the information
> in the file, and use printf to create the desired output file format,
> even when ones first inclination is to just play with the structure
> of the input data.

Even if the user has some knowledge of their data, there's currently no easy
and reliable way to rebuild the output record if FS is a full regular
expression. Example (purposely contrived, but not too far from some real
cases I've seen):

$ cat file
aXaa1786bXbb44338cXcXc194768ddd

# we want to get as output
# aXaa1786bXbb44338c+c+c194768ddd
# ie, change X to + in $3 only, keeping the rest intact

$ awk -F '[0-9]+' '{gsub(/X/,"+",$3)
                    printf ...?  # what to put here? }' file

Reply pk 11/28/2008 4:39:08 PM

pk <pk@pk.invalid> writes:
> Even if the user has some knowledge of its data, there's currently no easy
> and reliable way to rebuild the output record if FS is a full regular
> expression. Example (purposely contrived, but not too far from some real
> cases I've seen):
>
> $ cat file
> aXaa1786bXbb44338cXcXc194768ddd
>
> # we want to get as output
> # aXaa1786bXbb44338c+c+c194768ddd
> # ie, change X to + in $3 only, keeping the rest intact
>
> $ awk -F '[0-9]+' '{gsub(/X/,"+",$3)
>                     printf ...?  # what to put here? }' file

I was just suggesting another approach that works well with many
problems.  I didn't claim it worked for every problem.

Awk does what it does well very quickly and simply, but it isn't a
general-purpose language.  It is, however, powerful enough that you can do
things in a more C-like way:

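awk '{ line    = $0
       prev_ch = 9        # pretend a digit precedes the line so field 1 gets counted
       field   = 0
       # <= (not <) so that the last character of the line is examined too
       for (pos = 1; pos <= length(line); pos++) {
          ch = substr(line, pos, 1)
          # a non-digit directly after a digit starts a new field
          if ((index("0123456789", prev_ch) > 0) &&
              (index("0123456789", ch) == 0)) {
             field++
          }
          if ((field == 3) && (ch == "X")) {
             line = substr(line, 1, pos - 1) "+" substr(line, pos + 1)
          }
          prev_ch = ch
       }
       print line
     }' file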

There are times one misses things like regular expressions with
subsequence matches.

-- 
Barry Fishman
Reply Barry 11/28/2008 10:05:19 PM

In article <ggojdg$th7$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>On Friday 28 November 2008 10:40, Aharon Robbins wrote:
>
>> In article <ggja13$rf0$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>>>On Tuesday 25 November 2008 23:53, Ed Morton wrote:
>>>This is because there is no built in way to store FSs (ie, something that
>>>does for FS what RT does for RS in GNU awk). It would be nice to have,
>>>say, an array or something, where the characters that matched FS are
>>>stored. Alternatively, it could be useful to know (again, in an array) the
>>>offsets inside $0 where the various fields start, and the rest could be
>>>worked out from that. In the absence of these facilities, when FS is a
>>>regex it is difficult if not impossible to restore the same separators in
>>>the output line.
>> 
>> Not quite.  There is no ability to do this for the record and fields.
>> Saving this information for every record would be very expensive, and
>> I felt that it was not right to pay this price unconditionally when
>> it's not needed most of the time.
>
>Well, I don't know the inner workings of the algorithm that parses the line
>into fields according to FS, but I suppose it involves some kind of loop
>over the line, each time looking if FS matches and if so, take what's
>before the match as field and what matched as separator (roughly at least).
>If it's so (which of course could be wrong), all the information needed is
>already available as part of the process, and all that would be needed is
>assigning those offsets into a special array, so, say OFF[1]...OFF[NF]
>would contain the offsets in the line where $1...$NF start. After that, one
>could easily pull out the parts of $0 that matched FS based on OFF[n] and
>length($n). Or awk could directly assign the parts that matched FS to a
>special array, like eg SEP[1]...SEP[NF-1].

"all that would be needed..." :-)

You'd be surprised how expensive these little things are, and how they
add up in CPU time when you have large numbers of records.

When gawk first got RS as a regexp and the RT variable, I did the simple
thing and cleared and then set RT on each record. This turned out to be
very expensive, especially as most of the time the value was the same:
"\n".  When I made the code a little smarter to only set RT if it changed,
I/O speed improved considerably.

Furthermore, for efficiency, gawk does not parse the record as soon as
it reads it. Instead, it only parses the record up to the largest field
that is accessed, and only parses the record fully when it's needed
(such as for the value of NF or $NF).

This works well, but the code that manages it isn't simple.

Setting these arrays unconditionally would require fully parsing the
record every time, and setting them only when referenced would introduce a
lot more complexity than I really want to deal with or that I really feel
is necessary. Particularly for something that would not be used often.

>> HOWEVER, you can get everything you need to do this kind of thing by
>> passing an array as the third argument to match(). See the gawk doc.
>> (This is a gawk extension.)
>
>Fair enough, However, from what I understand, match() does NOT exactly mimic
>the behavior of FS splitting, since it will match the empty string. For
>example (sorry to always bring up the same corner test case),
>match("aXbXc",/X*/,a) does not do what an FS of "X*" would do to the same
>line.
>
>Anyway, even with regular cases, I cannot get it to work:

Right. I wasn't thinking. You have to use match and a loop to go through
the full string to pull out the bits of info you need.

Still, I think it would be better to code this into a function than to
add more complexity to gawk. Gawk already has features that I thought
were cool that in retrospect don't get used much and should not have been
added, or that seemed simple but took many years to get the semantics
right.
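
A match-and-loop function of that kind might look roughly like the
sketch below. The names change_field3 and split_keep are made up for
illustration. It assumes the separator RE never matches the empty string
and contains no anchors, and it does not reproduce the special handling
of the default FS (skipping leading and trailing blanks):

```shell
# Hypothetical wrapper: replace X by + in field 3 only, with the
# separator regex given as $1, leaving the original separators untouched.
change_field3() {
  awk -v fs="$1" '
  # Splits s on the regex re. Fields go to fld[1..n], the text that
  # matched re goes to sep[1..n-1]. Returns n.
  function split_keep(s, fld, sep, re,    n) {
    n = 0
    while (match(s, re) && RLENGTH > 0) {
      fld[++n] = substr(s, 1, RSTART - 1)
      sep[n]   = substr(s, RSTART, RLENGTH)
      s = substr(s, RSTART + RLENGTH)
    }
    fld[++n] = s
    return n
  }
  {
    n = split_keep($0, fld, sep, fs)
    gsub(/X/, "+", fld[3])      # edit the copy; $3 is never assigned,
                                # so awk does not recompile the record
    out = ""
    for (i = 1; i < n; i++)
      out = out fld[i] sep[i]
    print out fld[n]
  }'
}

printf 'aXaa1786bXbb44338cXcXc194768ddd\n' | change_field3 '[0-9]+'
# prints aXaa1786bXbb44338c+c+c194768ddd
```

The same wrapper called with ' +' as the separator keeps chains of
blanks as they were. As far as I know, later gawk releases (4.0 and up)
also accept a fourth "seps" array argument to split() that stores the
separators directly, but the sketch above does not rely on it.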

Arnold
-- 
Aharon (Arnold) Robbins 				arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
Reply arnold 11/29/2008 6:49:54 PM

On Nov 26, 12:50 pm, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
> pk wrote:
> > On Wednesday 26 November 2008 07:47, Steffen Schuler wrote:
>
> >> I have indeed no simple solution but my awk-function does the necessary
> >> splitting with storing of the splitted and intermediate strings:
>
> >> ---------8<------------------------
> >> # function factor(s, a, re):
> >> # factors string s by regex re in a[1] a[2] ... a[2*n+1]
> >> # where a[2*i] are instances of re for 1 <= i <= n
> >> # re in { "", " " } is handled specially like FS == re
> >> # in gawk resp. awk
> >> function factor(s, a, re,    i, j)
> >> {
> >>    for (i in a)
> >>      delete a[i]
> >>    i = 1
> >>    if (re == "") {
> >>      for (j = 1; j < length(s); ++j) {
> >>        a[i++] = substr(s,j,1)
> >>        a[i++] = ""
> >>      }
> >>      a[i] = substr(s,j,1)
> >>    } else if (re == " ") {
> >>      re = "[ \t]+"
> >>      if (match(s, re) == 1)
> >>        s = substr(s, RSTART + RLENGTH)
> >>      while (match(s, re)) {
> >>        a[i++] = substr(s, 1, RSTART - 1)
> >>        a[i++] = substr(s, RSTART, RLENGTH)
> >>        s = substr(s, RSTART + RLENGTH)
> >>      }
> >>      if (s == "")
> >>        delete a[i-1]
> >>      else
> >>        a[i] = s
> >>    } else {
> >>      while (match(s, re)) {
> >>        a[i++] = substr(s, 1, RSTART - 1)
> >>        a[i++] = substr(s, RSTART, RLENGTH)
> >>        s = substr(s, RSTART + RLENGTH)
> >>      }
> >>      a[i] = s
> >>    }
> >> }
>
> > Note that the above function will loop forever with, say, a string
> > like "aXbXc" and a RE like "X*".
>
> Now fixed. But factor() doesn't work in gawk 3.1.5 because of a problem
> with match(s,"") and RLENGTH:
>
> --------------8<-----------------------------------------
> # function factor(s, a, re):
> # factors string s by regex re in a[1] a[2] ... a[2*n+1]
> # where a[2*i] are instances of re for 1 <= i <= n
> # re in { "", " " } is handled specially like FS == re
> # in gawk resp. awk
> # returns the number of fields 2*n+1 (inclusively the
> # intermediate strings matched by re)
> # remark: match(s, "") delivers a wrong RLENGTH in gawk 3.1.5
> #         fixed in CVS gawk
> #         therefore factor() doesn't work in gawk 3.1.5
> function factor(s, a, re,    i, j)
> {
>    for (i in a)
>      delete a[i]
>    i = 1
>    a[1] = ""
>    if (re == " ") {
>      re = "[ \t]+"
>      if (match(s, re) == 1)
>        s = substr(s, RSTART + RLENGTH)
>      while (match(s, re)) {
>        a[i++] = substr(s, 1, RSTART - 1)
>        a[i++] = substr(s, RSTART, RLENGTH)
>        s = substr(s, RSTART + RLENGTH)
>      }
>      if (i > 2 && s == "") {
>        delete a[--i]
>        --i
>      } else
>        a[i] = s
>    } else {
>      while (a[i] s != "" && match(s, re)) {
>        a[i] = a[i] substr(s, 1, RSTART - 1)
>        a[++i] = substr(s, RSTART, RLENGTH)
>        ++i
>        if (RSTART + RLENGTH == 1) {
>          a[i] = substr(s,1,1)
>          s = substr(s, 2)
>        } else
>          s = substr(s, RSTART + RLENGTH)
>      }
>      a[i] = a[i] s
>    }
>    return i
> }
>
> --------------8<----------------------------------------
>
> --
> Steffen

I think this will work and looks a bit simpler than the above "factor()"
function:

function factor()
{
        rec = $0
        if (FS==" ") {
                # Fields start after a leading FS and end before a
                # trailing FS:
                # " a " => NF=1, $1="a"
                fs = "[[:space:]]+"     # or fs = "[ \t]+" for POSIX
                fnr = (match(rec,"^"fs) ? 0 : 1)
        } else {
                # Fields start before a leading FS and end after a
                # trailing FS:
                # " a " => NF=3, $1="", $2="a", $3=""
                fs = FS
                fnr = 1
        }
        fts[fnr] = flds[fnr] = ""

        while (rec != "") {
                if (match(rec,"^"fs)) {
                        fts[++fnr] = substr(rec,RSTART,RLENGTH)
                        rec = substr(rec,RSTART+RLENGTH)
                }
                if (match(rec,fs)) {
                        flds[fnr] = substr(rec,1,RSTART-1)
                        rec = substr(rec,RSTART)
                } else {
                        flds[fnr] = rec
                        rec = ""
                }
        }
}
function recompile()
{
        rec = ""
        for (i=1;i<=fnr;i++) {
                rec = rec fts[i] $i
        }
        $0 = rec
}
{ print; factor(); gsub(/X/,"+",$3); recompile(); print }

I expect there are some holes - anyone see them? Also, is there any way
to tell if we're using a POSIX awk or not so we can set "fs"
appropriately when FS is " "?

      Ed.
Reply Ed 11/29/2008 8:03:55 PM

Aharon Robbins wrote:
> [...]
> Still, I think it would be better to code this into a function than to
> add more complexity to gawk. Gawk already has features that I thought
> were cool that in retrospect don't get used much and should not have been
> added, or that seemed simple but took many years to get the semantics
> right.

Would you mind telling us more about those "wrong" features and
problems?

Janis
Reply Janis 11/29/2008 10:06:18 PM

On Nov 29, 2:03 pm, Ed Morton <mortons...@gmail.com> wrote:
> [...]

Sorry to reply to myself, but I discovered a "hole" when FS is empty,
and I could further simplify the code, so here's the new version, which
I think is as simple as it can get and will work for all values of FS:

function rstrSeps(      rec,i)
{
        rec = ""
        for (i=1;i<=NumSeps;i++)
                rec = rec Seps[i] $i
        $0 = rec
}
function saveSeps(      rec,fs)
{
        rec = $0
        split("",Seps)  # portable "delete"
        if (FS !~ /./) { # " a " => NF=3, $1=" ", $2="a", $3=" "
                NumSeps = NF            # all NULL strings
                return
        } else if (FS == " ") { # " a " => NF=1, $1="a"
                fs = "[[:space:]]+"     # POSIX: "[ \t]+"
                NumSeps = (rec ~ "^"fs ? 0 : 1)
        } else { # " a " => NF=3, $1="", $2="a", $3=""
                fs = FS
                NumSeps = 1
        }

        while (rec != "") {
                if (match(rec,"^"fs)) {
                        Seps[++NumSeps] = substr(rec,RSTART,RLENGTH)
                        rec = substr(rec,RSTART+RLENGTH)
                }
                rec = (match(rec,fs) ? substr(rec,RSTART) : "")
        }
}
{ print "<"$0">"; saveSeps(); gsub(/X/,"+",$3); rstrSeps(); print "<"$0">"}

Still looking for a way to tell POSIX awk from GNU awk, though, to set
"fs" appropriately if FS is " ".

       Ed
Reply Ed 11/30/2008 2:19:30 PM

Ed Morton wrote:

> Sorry to reply to myself, but I discovered a "hole" when FS is empty,
> and I could further simplify the code so here's the new version which
> I think is a simple as it can get and will work for all values of FS:
> 
> function rstrSeps(      rec,i)
> {
>         rec = ""
>         for (i=1;i<=NumSeps;i++)
>                 rec = rec Seps[i] $i
>         $0 = rec
> }
> function saveSeps(      rec,fs)
> {
>         rec = $0
>         split("",Seps)  # portable "delete"
>         if (FS !~ /./) { # " a " => NF=3, $1=" ", $2="a", $3=" "
>                 NumSeps = NF            # all NULL strings
>                 return
>         } else if (FS == " ") { # " a " => NF=1, $1="a"
>                 fs = "[[:space:]]+"     # POSIX: "[ \t]+"
>                 NumSeps = (rec ~ "^"fs ? 0 : 1)
>         } else { # " a " => NF=3, $1="", $2="a", $3=""
>                 fs = FS
>                 NumSeps = 1
>         }
> 
>         while (rec != "") {
>                 if (match(rec,"^"fs)) {
>                         Seps[++NumSeps] = substr(rec,RSTART,RLENGTH)
>                         rec = substr(rec,RSTART+RLENGTH)
>                 }
>                 rec = (match(rec,fs) ? substr(rec,RSTART) : "")
>         }
> }
> { print "<"$0">"; saveSeps(); gsub(/X/,"+",$3); rstrSeps(); print
> "<"$0">"}

From my tests, it seems to hang if given a RE that can match the empty
string (eg, rec="aXc" and FS="X*").

Another thing that I can see is that, since you artificially add a "^" to
the RE when trying to match, that can lead to unwanted results if the RE
itself already contains an anchored part and/or an alternation, eg "^X" will
become "^^X" (that probably works with many awks, but gives an error in bell
labs' awk), and "X|Y" will become "^X|Y", which is quite different. Since
the RE can be an arbitrarily complex expression, to prevent the added ^ from
changing its meaning, you could put it between () when trying to match, eg
match(rec,"^("fs")"). Note that this can produce things like ^(^X) which,
while probably accepted by all awks, don't produce expected results with
bell labs awk (see eg echo 'XYZ' | bell_awk '/^(^X)/' prints nothing, while
gawk and mawk work).
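
To make the difference concrete, here is a minimal sketch. The string "aYb", the separator "X|Y" and the wrapper name anchor_demo are all made up for illustration; nothing here comes from the code posted upthread. It prints what match() returns with the bare "^" prefix and with the parenthesised form:

```sh
# Hypothetical wrapper: compare "^" fs with "^(" fs ")" for an alternation.
anchor_demo() {
    awk 'BEGIN {
        fs = "X|Y"                         # illustrative separator regex
        s  = "aYb"                         # illustrative record
        bare    = match(s, "^" fs)         # regex is "^X|Y": Y may match anywhere
        grouped = match(s, "^(" fs ")")    # regex is "^(X|Y)": anchored as intended
        print bare, grouped
    }'
}
anchor_demo
```

This should print "2 0": the bare form finds the unanchored "Y" at position 2, while the grouped form correctly reports that no separator starts the string.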

Another obvious nitpick is that any algorithm that consumes the string is
subject to false matches under certain circumstances if the regular
expression used for factoring contains an anchored RE (see the link to the
GNU list I posted upthread). For example, as it is currently, your function
produces incorrect results if rec="YaKYbcd" and FS="^Y|K", or with rec="XXX"
and FS="^X" (for example).
However, given the way match() currently works in awk, I don't see any easy
way around that, short of manually "parsing" the RE itself...

(btw, that's why I agree with pk when he says that since awk already factors
lines, it could make the information it gathers during that operation
available to user programs, rather than forcing users to write code that
duplicates that functionality, but that's another matter, and I don't want
to be polemic in any way, of course)

Anyway, using anchored expressions in the RE should be a relatively
exceptional case, and could imho be treated as such (eg, ignored).
The first issue I reported above seems to be more serious.

Another approach that does not completely factor the record but instead
tries just to separate what's before the field from what follows it is
something along this (for example, assuming we are interested in the third
field):

bregex="^"$1"("FS")"$2"("FS")"   # can also be done with a for loop
match($0,bregex)
before=substr($0,RSTART,RLENGTH)
after=substr($0,RSTART+RLENGTH+length($3))
# some operation that changes $3...
$0 = before $3 after

but it has the big weakness that, since it uses the contents of the fields
in a computed regex, care must be taken to ensure everything is properly
escaped, which opens a whole new can of worms.

> Still looking for a way to tell POSIX awk from GNU awk, though, to set
> "fs" appropriately if FS is " ".

I'd assume a POSIX awk, but that's of course just my personal opinion.

-- 
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
Reply Dave 11/30/2008 3:49:44 PM

In article <ggsecr$sbn$1@svr7.m-online.net>,
Janis Papanagnou  <janis_papanagnou@hotmail.com> wrote:
>Aharon Robbins wrote:
>> [...]
>> Still, I think it would be better to code this into a function than to
>> add more complexity to gawk. Gawk already has features that I thought
>> were cool that in retrospect don't get used much and should not have been
>> added, or that seemed simple but took many years to get the semantics
>> right.
>
>Would you mind telling us more about those "wrong" features and
>problems?

- IGNORECASE took a good number of years to get the semantics right.

- The /dev/pid, /dev/user, and so on special files; I don't think anyone
  uses them; they were even broken for a while and no-one noticed (they'll
  come out of gawk for the next major release). (/dev/std{in,out,err} are
  worthwhile; /dev/fd/N probably not.)

- Bringing gettext functionality out to the awk level; I have never heard
  of anyone who uses it. (Having gawk itself internationalized at the
  C code level is very much worthwhile.)

Those are the major ones I can give without having to think about it;
there are probably more.

Some other extensions are worth their weight in gold: most notably RS as
a regex and RT.

Arnold
-- 
Aharon (Arnold) Robbins 				arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
Reply arnold 11/30/2008 9:45:16 PM

Ed Morton wrote:
[...]
> function rstrSeps(      rec,i)
> {
>         rec = ""
>         for (i=1;i<=NumSeps;i++)
>                 rec = rec Seps[i] $i
>         $0 = rec
> }
> function saveSeps(      rec,fs)
> {
>         rec = $0
>         split("",Seps)  # portable "delete"
>         if (FS !~ /./) { # " a " => NF=3, $1=" ", $2="a", $3=" "
>                 NumSeps = NF            # all NULL strings
>                 return
>         } else if (FS == " ") { # " a " => NF=1, $1="a"
>                 fs = "[[:space:]]+"     # POSIX: "[ \t]+"
>                 NumSeps = (rec ~ "^"fs ? 0 : 1)
                   # solves anchor issue
                   NumSeps = match(rec, fs) != 1
>         } else { # " a " => NF=3, $1="", $2="a", $3=""
>                 fs = FS
>                 NumSeps = 1
>         }
> 
>         while (rec != "") {
>                 if (match(rec,"^"fs)) {
                   # solves anchor issue
                   if (match(rec, fs) == 1) {
>                         Seps[++NumSeps] = substr(rec,RSTART,RLENGTH)
>                         rec = substr(rec,RSTART+RLENGTH)
                           # solves nontermination issue when fs
                           # matches empty string (*)
                           rec = substr(rec, max(RSTART+RLENGTH,2))
>                 }
>                 rec = (match(rec,fs) ? substr(rec,RSTART) : "")
>         }
> }
# needed above in (*)
function max(a, b) { return a > b ? a : b }
> { print "<"$0">"; saveSeps(); gsub(/X/,"+",$3); rstrSeps(); print
> "<"$0">"}
[...]

Ed, thank you very much for this beautiful code.
(Above I inserted some small fixes for minor problems, detected by
Dave.)

-- 
Steffen
Reply Steffen 12/1/2008 12:25:50 AM

"Aharon Robbins" <arnold@skeeve.com> wrote in message
news:ggs2sh$915$1@news.netvision.net.il...
> In article <ggojdg$th7$1@aioe.org>, pk  <pk@pk.invalid> wrote:

> Setting these arrays unconditionally would require fully parsing the
> record every time, and setting them only when referenced would introduce a
> lot more complexity than I really want to deal with or that I really feel
> is necessary. Particularly for something that would not be used often.

Actually I have programs that could use this information on every input
record that isn't discarded as blank or comment. Mainly I want to discard
leading fields while preserving all trailing fields, including whatever
separates them.

If there are zero or one trailing fields, the problem is easy. More than one
introduces the "unknown separator" problem (hence I cannot simply iterate
over the fields) and the "first field to preserve might match all or part of
a previous field to discard" problem (hence I cannot simply index() my way
to that field).

If setting the field starts for every record is too expensive, and automatically
setting them when referenced is too complex, perhaps a third way would be to
create a built-in function that returns an array or sets an input array? Or
perhaps a function like FIELDSTART(n), where 'n' is the number of the field
whose starting position is requested?

With the latter I could do this: substr($0, fieldstart(n)) - which would
accomplish what I, at any rate, would like to do. Even if, hmm, internally
the function code worked very much like what I do now, I imagine it would be
faster, and certainly my user code would be neater-looking!
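
For what it's worth, something like the requested function can be approximated at user level. The following is only a sketch under stated assumptions: fieldstart() and keep_from_field are hypothetical names, the separator regex must never match the empty string and must not be anchored, and only the default FS (" ") and plain regex separators are handled. It walks the separators rather than searching for the field's text, so a field that repeats an earlier one is not a problem:

```sh
# Hypothetical wrapper: print each input line starting at field N,
# keeping the original spacing between the remaining fields.
keep_from_field() {
    awk -v n="$1" '
    # Hypothetical helper: 1-based offset in s at which field n starts,
    # or 0 if s has fewer than n fields.
    function fieldstart(s, n, fs,    dflt, pos, rest, i) {
        pos = 0                               # characters consumed so far
        if (fs == " ") {                      # default whitespace splitting
            dflt = 1
            fs = "[ \t]+"
            if (match(s, fs) == 1)            # leading blanks belong to no field
                pos = RLENGTH
        }
        rest = substr(s, pos + 1)
        for (i = 1; i < n; i++) {
            if (!match(rest, fs))             # ran out of separators
                return 0
            pos += RSTART + RLENGTH - 1       # skip field i and its separator
            rest = substr(rest, RSTART + RLENGTH)
        }
        if (dflt && rest == "")               # only trailing blanks were left
            return 0
        return pos + 1
    }
    {
        p = fieldstart($0, n, FS)
        print (p ? substr($0, p) : "")
    }'
}
printf 'x  x   x    y\n' | keep_from_field 3
```

With the sample line, the third field is another "x", which is exactly the case where index() would find the wrong one; the sketch should print "x    y" with the four blanks intact.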

- Anton Treuenfels


Reply Anton 12/1/2008 5:24:23 AM

On Nov 30, 6:25 pm, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
> [...]
> Ed, thank you very much for this beautiful code.
> (Above I inserted some small fixes for minor problems, detected by
> Dave.)

Thanks to Dave for finding the holes and Steffen for filling them. Now
that we don't have a nontermination issue for empty strings, we don't
need to special-case FS == "" so we can simplify a bit more, and I
figured out a way to test for POSIX awk for setting "fs" when FS == " ":

function rstrSeps(      rec,i)
{
        rec = ""
        for (i=1;i<=NumSeps;i++)
                rec = rec Seps[i] $i
        $0 = rec
}
function saveSeps(      rec,fs,t)
{
        rec = $0
        split("",Seps)  # portable "delete"
        if (FS == " ") {
                fs = (split("\n",t) == 1 ? "[ \t]+" : "[[:space:]]+")
                NumSeps = (match(rec, fs) != 1)
        } else {
                fs = FS
                NumSeps = 1
        }

        while (rec != "") {
                if (match(rec,fs) == 1) {
                        Seps[++NumSeps] = substr(rec,RSTART,RLENGTH)
                        rec = substr(rec,(RLENGTH?RLENGTH+1:2))
                }
                rec = (match(rec,fs) ? substr(rec,RSTART) : "")
        }
}
{ print "<"$0">"; saveSeps(); gsub(/X/,"+",$3); rstrSeps(); print "<"$0">"}

So now it's absolutely perfect and has zero issues ;-).

I did find something in gawk 3.1.6 on cygwin that's got me scratching
my head though:

1a) gawk --posix 'BEGIN{print split("a\nb",a)}'
1
1b) gawk 'BEGIN{print split("a\nb",a)}'
2

2a) gawk --posix 'BEGIN{print split("\n",a,/[[:space:]]+/)}'
2
2b) gawk 'BEGIN{print split("\n",a,/[[:space:]]+/)}'
2

3a) gawk --posix 'BEGIN{print split("\n",a)}'
1
3b) gawk 'BEGIN{print split("\n",a)}'
0

All the results seem to make sense except "3b". I don't understand why
it returns 0 instead of 2 like "1b" and "2b" do.

     Ed.
Reply Ed 12/1/2008 8:14:44 AM

On Saturday 29 November 2008 19:49, Aharon Robbins wrote:

>>Well, I don't know the inner workings of the algorithm that parses the
>>line into fields according to FS, but I suppose it involves some kind of
>>loop over the line, each time looking if FS matches and if so, take what's
>>before the match as field and what matched as separator (roughly at
>>least). If it's so (which of course could be wrong), all the information
>>needed is already available as part of the process, and all that would be
>>needed is assigning those offsets into a special array, so, say
>>OFF[1]...OFF[NF] would contain the offsets in the line where $1...$NF
>>start. After that, one could easily pull out the parts of $0 that matched
>>FS based on OFF[n] and length($n). Or awk could directly assign the parts
>>that matched FS to a special array, like eg SEP[1]...SEP[NF-1].
> 
> "all that would be needed..." :-)
> 
> You'd be surprised how expensive these little things are, and how they
> add up in CPU time when you have large numbers of records.
> 
> When gawk first got RS as a regexp and the RT variable, I did the simple
> thing and cleared and then set RT on each record. This turned out to be
> very expensive, especially as most of the time the value was the same:
> "\n".  When I made the code a little smarter to only set RT if it changed,
> I/O speed improved considerably.

RT is a really cool feature, I use it all the time. Many many thanks for
introducing it.

> Furthermore, for efficiency, gawk does not parse the record as soon as
> it reads it. Instead, it only parses the record up to the largest field
> that is accessed, and only parses the record fully when it's needed
> (such as for the value of NF or $NF).
> 
> This works well, but the code that manages it isn't simple.
> 
> Setting these arrays unconditionally would require fully parsing the
> record every time, and setting them only when referenced would introduce a
> lot more complexity than I really want to deal with or that I really feel
> is necessary. Particularly for something that would not be used often.

Ok, I admit my remark was probably naive, but I didn't want to sound *that*
naive :-)
First of all, I never thought of, or suggested, setting the arrays
unconditionally for each record. As I see it, this would be an optional
feature, probably neither requested nor needed 95% of the time. That means
that 95% of the time nothing changes for the end user (or even 100%, if he
doesn't want the feature).
And even if the user wants it (with a command line switch for example), of
course it can be implemented in the most optimized way possible. Above you
said:

"for efficiency, gawk does not parse the record as soon as it reads it.
Instead, it only parses the record up to the largest field that is
accessed, and only parses the record fully when it's needed (such as for
the value of NF or $NF)."

Good. My bet is that if the user does not need to access a field, or does
not need to know NF, he does not need to access that special array either.
So parsing the record and filling the array could be done at the same time,
i.e. only when absolutely necessary.

However, since I don't know the actual inner workings of gawk, I have to
agree with you that even with all possible optimizations implementing the
feature could still be too complex or expensive if compared with its actual
real life usefulness and user demand.

Thanks for your answers!

Reply pk 12/1/2008 8:50:44 AM

Aharon Robbins wrote:
> In article <ggojdg$th7$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>> On Friday 28 November 2008 10:40, Aharon Robbins wrote:
>>
>>> ...
>>> Not quite.  There is no ability to do this for the record and fields.
>>> Saving this information for every record would be very expensive, and
>>> I felt that it was not right to pay this price unconditionally when
>>> it's not needed most of the time.
>> Well, I don't know the inner workings of the algorithm that parses the line
>> into fields according to FS, but I suppose it involves some kind of loop
>> over the line, each time looking if FS matches and if so, take what's
>> before the match as field and what matched as separator (roughly at least).
>> If it's so (which of course could be wrong), all the information needed is
>> already available as part of the process, and all that would be needed is
>> assigning those offsets into a special array, so, say OFF[1]...OFF[NF]
>> would contain the offsets in the line where $1...$NF start. After that, one
>> could easily pull out the parts of $0 that matched FS based on OFF[n] and
>> length($n). Or awk could directly assign the parts that matched FS to a
>> special array, like eg SEP[1]...SEP[NF-1].
> 
> "all that would be needed..." :-)
> 
> You'd be surprised how expensive these little things are, and how they
> add up in CPU time when you have large numbers of records.
> 
> When gawk first got RS as a regexp and the RT variable, I did the simple
> thing and cleared and then set RT on each record. This turned out to be
> very expensive, especially as most of the time the value was the same:
> "\n".  When I made the code a little smarter to only set RT if it changed,
> I/O speed improved considerably.
> 
> Furthermore, for efficiency, gawk does not parse the record as soon as
> it reads it. Instead, it only parses the record up to the largest field
> that is accessed, and only parses the record fully when it's needed
> (such as for the value of NF or $NF).
> 
> This works well, but the code that manages it isn't simple.
> 
> Setting these arrays unconditionally would require fully parsing the
> record every time, and setting them only when referenced would introduce a
> lot more complexity than I really want to deal with or that I really feel
> is necessary. Particularly for something that would not be used often.

Perhaps a solution would be to add an optional fourth argument to 
split(). If not given it will behave exactly as now, and if given it 
will store the actual separators:

    split(string, array [, fieldsep [, separray]])

The user can use it to reparse the input record, if needed:

    split($0, fields, FS, seps)
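
Until something like that exists, the idea can be approximated in portable awk. What follows is only a sketch under stated assumptions: split_seps() is a hypothetical helper (not part of any awk), it assumes the separator regexp never matches the empty string, and the shell wrapper edit_field3 is made up for the demonstration. It walks the record with match(), keeps every separator it finds, edits the 3rd field, and glues the pieces back together:

```sh
# edit_field3: replace every X with + in the 3rd field while keeping the
# original runs of blanks between fields.
edit_field3() {
  awk '
  # Hypothetical helper, not part of any awk: like split(s, flds, re), but it
  # also records each separator actually matched. seps[i] is the text found
  # between flds[i] and flds[i+1]. Assumes re never matches the empty string.
  function split_seps(s, flds, seps, re,    n) {
    n = 0
    while (match(s, re)) {
      flds[++n] = substr(s, 1, RSTART - 1)
      seps[n]   = substr(s, RSTART, RLENGTH)
      s = substr(s, RSTART + RLENGTH)
    }
    flds[++n] = s
    return n
  }
  {
    n = split_seps($0, f, sep, "[ \t]+")
    gsub(/X/, "+", f[3])
    out = ""
    for (i = 1; i <= n; i++)
      out = out f[i] (i < n ? sep[i] : "")
    print out
  }'
}

printf 'aaa        bbb         cXcXc              ddd\n' | edit_field3
```

Run on the blank-separated sample line, this prints the line with c+c+c and every run of blanks unchanged. Unlike default FS splitting, a leading run of blanks yields an empty first field here.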

Just my 2c.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
0
Reply Manuel 12/1/2008 9:06:23 AM

On Monday 1 December 2008 10:06, Manuel Collado wrote:

> Perhaps a solution would be to add an optional fourth argument to
> split(). If not given it will behave exactly as now, and if given it
> will store the actual separators:
> 
>     split(string, array [, fieldsep [, separray]])
> 
> The user can use it to reparse the input record, if needed:
> 
>     split($0, fields, FS, seps)

Sounds like a really good idea, although it would probably need an option to
make it behave exactly like FS splitting (it can differ when the separator
contains a regex that can match the empty string).
However, definitely +1 for me.

0
Reply pk 12/1/2008 9:30:00 AM

On Monday 1 December 2008 10:30, pk wrote:

> On Monday 1 December 2008 10:06, Manuel Collado wrote:
> 
>> Perhaps a solution would be to add an optional fourth argument to
>> split(). If not given it will behave exactly as now, and if given it
>> will store the actual separators:
>> 
>>     split(string, array [, fieldsep [, separray]])
>> 
>> The user can use it to reparse the input record, if needed:
>> 
>>     split($0, fields, FS, seps)
> 
> Sounds like a really good idea, although it would probably need an option
> to make it behave exactly like FS splitting (it can differ when the
> separator contains a regex that can match the empty string).

No, wrong. It seems that the only case where split() splits on an empty
string is when the regex passed is explicitly "" or //.

> However, definitely +1 for me.

Even more so.

0
Reply pk 12/1/2008 9:40:06 AM

In article <gh0ah7$neu$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>On Monday 1 December 2008 10:06, Manuel Collado wrote:
>
>> Perhaps a solution would be to add an optional fourth argument to
>> split(). If not given it will behave exactly as now, and if given it
>> will store the actual separators:
>> 
>>     split(string, array [, fieldsep [, separray]])
>> 
>> The user can use it to reparse the input record, if needed:
>> 
>>     split($0, fields, FS, seps)
>
>Sounds like a really good idea, although it would probably need an option to
>make it behave exactly like FS splitting (it can differ when the separator
>contains a regex that can match the empty string).
>However, definitely +1 for me.

I'd be willing to consider including such a feature if someone were
to contribute the patch.  (Hint, hint.)
-- 
Aharon (Arnold) Robbins 				arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
0
Reply arnold 12/1/2008 8:32:44 PM

On Dec 1, 2:14 am, Ed Morton <mortons...@gmail.com> wrote:
> I did find something in gawk 3.1.6 on cygwin that's got me scratching
> my head though:
>
> 1a) gawk --posix 'BEGIN{print split("a\nb",a)}'
> 1
> 1b) gawk 'BEGIN{print split("a\nb",a)}'
> 2
>
> 2a) gawk --posix 'BEGIN{print split("\n",a,/[[:space:]]+/)}'
> 2
> 2b) gawk 'BEGIN{print split("\n",a,/[[:space:]]+/)}'
> 2
>
> 3a) gawk --posix 'BEGIN{print split("\n",a)}'
> 1
> 3b) gawk 'BEGIN{print split("\n",a)}'
> 0
>
> All the results seem to make sense except "3b". I don't understand why
> it returns 0 instead of 2 like "1b" and "2b" do.
>
>      Ed.

Never mind - that does make sense since the "\n" in 3b would be part
of the leading spaces that'd get ignored when setting fields using the
default FS, just like this POSIX equivalent:

gawk --posix 'BEGIN{print split("\t",a)}'
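
The same thing can be checked against ordinary record parsing. This is a small sketch (count_fields is just a name made up for the demo) that prints the split() count next to NF for the same string, using default field splitting in both cases:

```sh
# count_fields: print "split()-count NF" for the string given as $1,
# both obtained with default field splitting.
count_fields() {
  printf '%s\n' "$1" | awk '{ print split($0, a), NF }'
}

count_fields "$(printf '\t')"   # a lone tab: prints 0 0
count_fields "  a   b  "        # leading/trailing blanks ignored: prints 2 2
```

A string made only of blanks yields zero fields either way, so split() and NF agree.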

    Ed.
0
Reply Ed 12/1/2008 11:13:51 PM

Aharon Robbins wrote:
> In article <gh0ah7$neu$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>> On Monday 1 December 2008 10:06, Manuel Collado wrote:
>>
>>> Perhaps a solution would be to add an optional fourth argument to
>>> split(). If not given it will behave exactly as now, and if given it
>>> will store the actual separators:
>>>
>>>     split(string, array [, fieldsep [, separray]])
>>>
>>> The user can use it to reparse the input record, if needed:
>>>
>>>     split($0, fields, FS, seps)
>> Sounds like a really good idea, although it would probably need an option to
>> make it behave exactly like FS splitting (it can differ when the separator
>> contains a regex that can match the empty string).
>> However, definitely +1 for me.
> 
> I'd be willing to consider including such a feature if someone were
> to contribute the patch.  (Hint, hint.)

Hi Arnold, hello awk-users,

Attached is a patch implementing Manuel's suggestion, made against
gawk-devel (current CVS version).

-- 
Steffen

[Attachment: patch-split-extension]

diff -u -r -N gawk-devel/awkgram.y gawk-devel-ext/awkgram.y
--- gawk-devel/awkgram.y	2008-11-16 20:23:56.000000000 +0100
+++ gawk-devel-ext/awkgram.y	2008-12-02 07:34:42.000000000 +0100
@@ -1162,7 +1162,7 @@
 {"seek",	Node_builtin,	 LEX_BUILTIN,	GAWKX|A(2),	do_seek},
 #endif
 {"sin",		Node_builtin,	 LEX_BUILTIN,	NOT_OLD|A(1),	do_sin},
-{"split",	Node_builtin,	 LEX_BUILTIN,	A(2)|A(3),	do_split},
+{"split",	Node_builtin,	 LEX_BUILTIN,	A(2)|A(3)|A(4),	do_split},
 {"sprintf",	Node_builtin,	 LEX_BUILTIN,	0,		do_sprintf},
 {"sqrt",	Node_builtin,	 LEX_BUILTIN,	A(1),		do_sqrt},
 {"srand",	Node_builtin,	 LEX_BUILTIN,	NOT_OLD|A(0)|A(1), do_srand},
diff -u -r -N gawk-devel/ChangeLog gawk-devel-ext/ChangeLog
--- gawk-devel/ChangeLog	2008-11-16 21:05:33.000000000 +0100
+++ gawk-devel-ext/ChangeLog	2008-12-02 07:55:55.000000000 +0100
@@ -1,3 +1,11 @@
+Tue Dec  2 07:42:00 2008  Steffen Schuler    <schuler.steffen@googlemail.com>
+
+        * field.c (*_parse_field): Add `sep_arr´ argument and fill it.
+	* field.c (do_split): Add handling of fourth argument of awk 
+	builtin	`split´.
+	* field.c (get_field): Extend parse_field by default argument.
+	* awkgram.y (tokentab): Extend `split´ entry with fourth argument.
+
 Sat Oct 27 22:43:50 2007  Arnold D. Robbins  <arnold@skeeve.com>
 
 	* re.c (resetup): Add RE_INVALID_INTERVAL_ORD to syntax bits if
diff -u -r -N gawk-devel/doc/gawk.1 gawk-devel-ext/doc/gawk.1
--- gawk-devel/doc/gawk.1	2008-11-16 21:05:45.000000000 +0100
+++ gawk-devel-ext/doc/gawk.1	2008-12-02 08:04:27.000000000 +0100
@@ -2472,11 +2472,13 @@
 provide the starting index in the string and length
 respectively, of each matching substring.
 .TP
-\fBsplit(\fIs\fB, \fIa \fR[\fB, \fIr\fR]\fB)\fR
+\fBsplit(\fIs\fB, \fIa \fR[\fB, \fIr\fR [\fB, \fIseps\fR] ]\fB)\fR
 Splits the string
 .I s
 into the array
 .I a
+and the separators array
+.I seps
 on the regular expression
 .IR r ,
 and returns the number of fields.  If
diff -u -r -N gawk-devel/doc/gawk.texi gawk-devel-ext/doc/gawk.texi
--- gawk-devel/doc/gawk.texi	2008-11-16 21:05:45.000000000 +0100
+++ gawk-devel-ext/doc/gawk.texi	2008-12-02 08:14:16.000000000 +0100
@@ -12692,10 +12692,11 @@
 (@pxref{Options}),
 using a third argument is a fatal error.
 
-@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
+@item split(@var{string}, @var{array} @r{[}, @var{fieldsep} @r{[}, @var{seps} @r{]} @r{]})
 @cindex @code{split} function
 This function divides @var{string} into pieces separated by @var{fieldsep}
-and stores the pieces in @var{array}.  The first piece is stored in
+and stores the pieces in @var{array} and the separator strings in 
+@var{seps} array.  The first piece is stored in
 @code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
 forth.  The string value of the third argument, @var{fieldsep}, is
 a regexp describing where to split @var{string} (much as @code{FS} can
@@ -12707,7 +12708,7 @@
 manner similar to the way input lines are split into fields.  For example:
 
 @example
-split("cul-de-sac", a, "-")
+split("cul-de-sac", a, "-", seps)
 @end example
 
 @noindent
@@ -12721,6 +12722,13 @@
 a[3] = "sac"
 @end example
 
+and sets the contents of the array @code{seps} as follows:
+
+@example
+seps[1] = "-"
+seps[2] = "-"
+@end example
+
 @noindent
 The value returned by this call to @code{split} is three.
 
diff -u -r -N gawk-devel/field.c gawk-devel-ext/field.c
--- gawk-devel/field.c	2008-11-16 20:21:15.000000000 +0100
+++ gawk-devel-ext/field.c	2008-12-02 07:34:42.000000000 +0100
@@ -39,20 +39,20 @@
 typedef void (* Setfunc) P((long, char *, long, NODE *));
 
 static long (*parse_field) P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static void rebuild_record P((void));
 static long re_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static long def_parse_field P((long, char **, int, NODE *,
-			      Regexp *, Setfunc, NODE *));
+			      Regexp *, Setfunc, NODE *, NODE *));
 static long posix_def_parse_field P((long, char **, int, NODE *,
-			      Regexp *, Setfunc, NODE *));
+			      Regexp *, Setfunc, NODE *, NODE *));
 static long null_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static long sc_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static long fw_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static void set_element P((long num, char * str, long len, NODE *arr));
 static void grow_fields_arr P((long num));
 static void set_field P((long num, char *str, long len, NODE *dummy));
@@ -364,12 +364,14 @@
 	NODE *fs ATTRIBUTE_UNUSED,
 	Regexp *rp,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
 	register char *field;
 	register char *end = scan + len;
+	char *sep;
 #ifdef MBS_SUPPORT
 	size_t mbclen = 0;
 	mbstate_t mbs;
@@ -382,9 +384,14 @@
 	if (len == 0)
 		return nf;
 
-	if (RS_is_null && default_FS)
+	if (RS_is_null && default_FS) {
+		sep = scan;
 		while (scan < end && (*scan == ' ' || *scan == '\t' || *scan == '\n'))
 			scan++;
+		if (sep_arr != NULL) 
+			set_element(nf, sep, (long)(scan - sep), sep_arr);
+	}
+
 	field = scan;
 	while (scan < end
 	       && research(rp, scan, 0, (end - scan), RE_NEED_START) != -1
@@ -404,6 +411,8 @@
 			scan++;
 			if (scan == end) {
 				(*set)(++nf, field, (long)(scan - field), n);
+				if (sep_arr != NULL) 
+	        			set_element(nf, field, 0L, sep_arr);
 				up_to = nf;
 				break;
 			}
@@ -411,6 +420,9 @@
 		}
 		(*set)(++nf, field,
 		       (long)(scan + RESTART(rp, scan) - field), n);
+		if (sep_arr != NULL) 
+	    		set_element(nf, scan + RESTART(rp, scan), 
+           			(long) (REEND(rp, scan) - RESTART(rp, scan)), sep_arr);
 		scan += REEND(rp, scan);
 		field = scan;
 		if (scan == end)	/* FS at end of record */
@@ -439,13 +451,15 @@
 	NODE *fs,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
 	register char *field;
 	register char *end = scan + len;
 	char sav;
+	char *sep;
 
 	if (up_to == UNLIMITED)
 		nf = 0;
@@ -455,6 +469,8 @@
 	/*
 	 * Nasty special case. If FS set to "", return whole record
 	 * as first field. This is not worth a separate function.
+	 *
+	 * Steffen Schuler: Is this case still needed?
 	 */
 	if (fs->stlen == 0) {
 		(*set)(++nf, *buf, len, n);
@@ -471,9 +487,12 @@
 		/*
 		 * special case:  fs is single space, strip leading whitespace 
 		 */
+ 		sep = scan;
 		while (scan < end && (*scan == ' ' || *scan == '\t' || *scan == '\n'))
 			scan++;
-		if (scan >= end)
+		if (sep_arr != NULL)
+			set_element(nf, sep, (long) (scan - sep), sep_arr);
+		if (scan >= end) 
 			break;
 		field = scan;
 		while (*scan != ' ' && *scan != '\t' && *scan != '\n')
@@ -506,13 +525,15 @@
 	NODE *fs,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
 	register char *field;
 	register char *end = scan + len;
 	char sav;
+	char *sep;
 
 	if (up_to == UNLIMITED)
 		nf = 0;
@@ -522,6 +543,8 @@
 	/*
 	 * Nasty special case. If FS set to "", return whole record
 	 * as first field. This is not worth a separate function.
+	 *
+	 * Steffen Schuler: Is this case still needed?
 	 */
 	if (fs->stlen == 0) {
 		(*set)(++nf, *buf, len, n);
@@ -538,8 +561,11 @@
 		/*
 		 * special case:  fs is single space, strip leading whitespace 
 		 */
+		sep = scan;
 		while (scan < end && (*scan == ' ' || *scan == '\t'))
 			scan++;
+		if (sep_arr != NULL)
+			set_element(nf, sep, (long) (scan - sep), sep_arr);
 		if (scan >= end)
 			break;
 		field = scan;
@@ -570,7 +596,8 @@
 	NODE *fs ATTRIBUTE_UNUSED,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
@@ -592,13 +619,18 @@
 				/* We treat it as a singlebyte character.  */
 				mbclen = 1;
 			}
+			if (nf > 0 && sep_arr != NULL)
+        			set_element(nf, scan, 0L, sep_arr);
 			(*set)(++nf, scan, mbclen, n);
 			scan += mbclen;
 		}
 	} else
 #endif
-	for (; nf < up_to && scan < end; scan++)
+	for (; nf < up_to && scan < end; scan++) {
+		if (nf > 0 && sep_arr != NULL)
+			set_element(nf, scan, 0L, sep_arr);
 		(*set)(++nf, scan, 1L, n);
+	}
 
 	*buf = scan;
 	return nf;
@@ -618,7 +650,8 @@
 	NODE *fs,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register char fschar;
@@ -626,6 +659,7 @@
 	register char *field;
 	register char *end = scan + len;
 	char sav;
+	char *sep;
 #ifdef MBS_SUPPORT
 	size_t mbclen = 0;
 	mbstate_t mbs;
@@ -668,6 +702,8 @@
 		(*set)(++nf, field, (long)(scan - field), n);
 		if (scan == end)
 			break;
+		if (sep_arr != NULL)
+			set_element(nf, scan, 1L, sep_arr);
 		scan++;
 		if (scan == end) {	/* FS at end of record */
 			(*set)(++nf, field, 0L, n);
@@ -695,7 +731,8 @@
 	NODE *fs ATTRIBUTE_UNUSED,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
@@ -737,6 +774,7 @@
 		    			fields_arr[0]->stlen -
 					(parse_extent - fields_arr[0]->stptr),
 		    			save_FS, FS_regexp, set_field,
+					(NODE *) NULL,
 					(NODE *) NULL);
 				parse_high_water = NF;
 			}
@@ -764,7 +802,7 @@
 			parse_extent = fields_arr[0]->stptr;
 		parse_high_water = (*parse_field)(requested, &parse_extent,
 		     fields_arr[0]->stlen - (parse_extent - fields_arr[0]->stptr),
-		     save_FS, FS_regexp, set_field, (NODE *) NULL);
+		     save_FS, FS_regexp, set_field, (NODE *) NULL, (NODE *) NULL);
 
 		/*
 		 * if we reached the end of the record, set NF to the number of
@@ -810,10 +848,10 @@
 NODE *
 do_split(NODE *tree)
 {
-	NODE *src, *arr, *sep, *fs, *src2, *fs2, *tmp;
+	NODE *src, *arr, *sep, *fs, *sep_arr, *src2, *fs2, *tmp;
 	char *s;
 	long (*parseit) P((long, char **, int, NODE *,
-			 Regexp *, Setfunc, NODE *));
+			 Regexp *, Setfunc, NODE *, NODE *));
 	Regexp *rp = NULL;
 
 	src = force_string(tree_eval(tree->lnode));
@@ -824,6 +862,26 @@
 
 	sep = tree->rnode->rnode->lnode;
 
+	if (tree->rnode->rnode->rnode == NULL)
+		sep_arr = NULL;
+	else {
+		static short warned = FALSE;
+
+		if (do_traditional || do_posix) {
+			fatal(_("split: fourth argument is a gawk extension"));
+		}
+
+		sep_arr = get_param(tree->rnode->rnode->rnode->lnode);
+
+		if (sep_arr->type != Node_var_array)
+			fatal(_("split: optional fourth argument is not an array"));
+
+		if (do_lint && !warned) {
+			warned = TRUE;
+			lintwarn(_("split: fourth arg is a gawk extension"));
+		}
+	}
+
 	if (src->stlen == 0) {
 		/*
 		 * Skip the work if first arg is the null string.
@@ -835,9 +893,13 @@
 		if ((sep->re_flags & (FS_DFLT|CONST)) == 0)
 			free_temp(tree_eval(sep->re_exp));
 		/*
-		 * And now we can safely turn off the array.
+		 * And now we can safely turn off the arrays.
 		 */
 		assoc_clear(arr);
+
+		if (sep_arr != NULL)
+			assoc_clear(sep_arr);
+
 		return tmp_number((AWKNUM) 0);
 	}
 
@@ -884,9 +946,12 @@
 
 	assoc_clear(arr);
 
+	if (sep_arr != NULL)
+		assoc_clear(sep_arr);
+
 	s = src2->stptr;
 	tmp = tmp_number((AWKNUM) (*parseit)(UNLIMITED, &s, (int) src2->stlen,
-					     fs2, rp, set_element, arr));
+					     fs2, rp, set_element, arr, sep_arr));
 	unref(src2);
 	unref(fs2);
 	return tmp;

0
Reply Steffen 12/2/2008 7:55:45 AM

Steffen Schuler wrote:
> Aharon Robbins wrote:
>> In article <gh0ah7$neu$1@aioe.org>, pk  <pk@pk.invalid> wrote:
>>> On Monday 1 December 2008 10:06, Manuel Collado wrote:
>>>
>>>> Perhaps a solution would be to add an optional fourth argument to
>>>> split(). If not given it will behave exactly as now, and if given it
>>>> will store the actual separators:
>>>>
>>>>     split(string, array [, fieldsep [, separray]])
>>>> ...
> Hi Arnold, hello awk-users,
> 
> attached a patch for the suggestion of Manuel for gawk-devel (current 
> CVS version)

Just to clarify the interface, the intended semantics should be:

    split( string, afield, regexp, asep )

    asep[0] = optional leading separator, when regexp = " "
    asep[n] = actual separator after afield[n]

This way, the following code should rebuild an exact copy of the 
original string:

  function join( afield, asep,          k, string ) {
    string = asep[0]
    for (k=1; k in afield; k++) {
       string = string afield[k] asep[k]
    }
    return string
  }
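
As a quick illustration of those semantics, here is a sketch with the two arrays filled by hand, the way the proposed four-argument split() is assumed to fill them for the string "  a b  " and regexp " ". Nothing here is an existing interface; join_demo is just a wrapper name for the example:

```sh
# join_demo: rebuild a string after editing one field.
join_demo() {
  awk '
  # The join() from the post above, unchanged.
  function join(afield, asep,    k, string) {
    string = asep[0]
    for (k = 1; k in afield; k++)
      string = string afield[k] asep[k]
    return string
  }
  BEGIN {
    # Filled by hand, as split("  a b  ", afield, " ", asep) is assumed to do.
    asep[0] = "  "; afield[1] = "a"
    asep[1] = " ";  afield[2] = "b"
    asep[2] = "  "
    afield[2] = toupper(afield[2])        # edit one field only
    printf "[%s]\n", join(afield, asep)   # brackets make the blanks visible
  }'
}

join_demo
```

This prints [  a B  ], i.e. the leading, inner and trailing blanks all come back exactly as they were.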

Regards.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
0
Reply Manuel 12/2/2008 8:49:25 AM

Manuel Collado wrote:

> Just to clarify the interface, the intended semantics should be:
> 
>     split( string, afield, regexp, asep )
> 
>     asep[0] = optional leading separator, when regexp = " "
>     asep[n] = actual separator after afield[n]
> 
> This way, the following code should rebuild an exact copy of the 
> original string:
> 
>   function join( afield, asep,          k, string ) {
>     string = asep[0]
>     for (k=1; k in afield; k++) {
>        string = string afield[k] asep[k]
>     }
>     return string
>   }

Uhm, shouldn't asep[n] be optional too? It seems it would only be used if
regexp = " ", just like asep[0]. Otherwise, the following should hold IIUC:

length(asep) = length(afield) - 1

-- 
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
0
Reply Dave 12/2/2008 9:19:39 AM

Dave B wrote:
> Manuel Collado wrote:
> 
>> Just to clarify the interface, the intended semantics should be:
>>
>>     split( string, afield, regexp, asep )
>>
>>     asep[0] = optional leading separator, when regexp = " "
>>     asep[n] = actual separator after afield[n]
>>
>> This way, the following code should rebuild an exact copy of the 
>> original string:
>>
>>   function join( afield, asep,          k, string ) {
>>     string = asep[0]
>>     for (k=1; k in afield; k++) {
>>        string = string afield[k] asep[k]
>>     }
>>     return string
>>   }
> 
> Uhm, shouldn't asep[n] be optional too? It seems it would only be used if
> regexp = " ", just like asep[0]. Otherwise, the following should hold IIUC:
> 
> length(asep) = length(afield) - 1
> 

Yes. asep[0] and asep[last] need not be generated for regexp != " ". Or 
even in this special case if there are no actual leading/trailing 
separators. In any case, if the user references those items he/she will 
get null strings if they do not exist.

In fact, my (simple, naive) reference join code will create those extra 
separators if they have not been generated by split.
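
That side effect is plain awk behaviour: merely referencing an array element creates it with a null value. A minimal sketch (asep and bsep are just example names, and ref_side_effect is a made-up wrapper):

```sh
# ref_side_effect: show that reading an array element creates it.
ref_side_effect() {
  awk 'BEGIN {
    before = (0 in asep)     # 0: asep[0] does not exist yet
    x = asep[0]              # merely reading it ...
    after = (0 in asep)      # 1: ... has created it as a null string
    print before, after
    if (0 in bsep)           # testing with "in" first creates nothing
      y = bsep[0]
    later = (0 in bsep)      # still 0
    print later
  }'
}

ref_side_effect
```

It prints "0 1" and then "0", so a join() that must not touch its arguments would have to guard each reference with "in".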
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
0
Reply m 12/2/2008 10:29:29 AM

On Dec 2, 4:29 am, m.coll...@domain.invalid wrote:
> Dave B wrote:
>
> > Manuel Collado wrote:
>
> >> Just to clarify the interface, the intended semantics should be:
>
> >>     split( string, afield, regexp, asep )
>
> >>     asep[0] = optional leading separator, when regexp = " "
> >>     asep[n] = actual separator after afield[n]
>
> >> This way, the following code should rebuild an exact copy of the
> >> original string:
>
> >>   function join( afield, asep,          k, string ) {
> >>     string = asep[0]
> >>     for (k=1; k in afield; k++) {
> >>        string = string afield[k] asep[k]
> >>     }
> >>     return string
> >>   }
>
> > Uhm, shouldn't asep[n] be optional too? It seems it would only be used if
> > regexp = " ", just like asep[0]. Otherwise, the following should hold IIUC:
>
> > length(asep) = length(afield) - 1
>
> Yes. asep[0] and asep[last] need not be generated for regexp != " ". Or
> even in this special case if there are no actual leading/trailing
> separators. In any case, if the user references those items he/she will
> get null strings if they do not exist.
>
> In fact, my (simple, naive) reference join code will create those extra
> separators if they have not been generated by split.
> --
> Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

I'm not sure about populating asep[0]. Normally we say that awk arrays
start at 1, so is it OK to break with tradition on this, or should
asep[] go from 1 to NF+1 and contain the separators before each field
(and yes, there is no NF+1th field) instead of 0 to NF and contain the
separators after each field (and yes, there is no 0th field)?

Also, having thought about it a bit more now, I'm sure we don't want
the join() function to modify the arrays it's being asked to operate
on and what if the modification someone wanted to do on the record was
to delete the 3rd field instead of modifying it? Ideally we'd be able
to do:

   split($0,aflds,FS,aseps)
   delete aflds[3]
   $0=join(aflds,aseps)

so we'd need to write the join() as:

  function join( afield, asep,          k, string, max ) {
    for (k in afield) {
       # array indices are strings, so force a numeric comparison
       max = (k+0 > max ? k+0 : max)
    }
    string = (0 in asep ? asep[0] : "")
    for (k=1; k<=max; k++) {
       if ((k in afield) && (k in asep)) {
          string = string afield[k] asep[k]
       }
    }
    return string
  }

That's pretty non-obvious so I think we should provide that as a
primitive when we provide the split() enhancement.

Arnold - would that be OK if someone provides it?

Regards,

   Ed.






0
Reply Ed 12/2/2008 2:13:47 PM

On Dec 2, 8:13 am, Ed Morton <mortons...@gmail.com> wrote:
> On Dec 2, 4:29 am, m.coll...@domain.invalid wrote:
>
> > Dave B wrote:
>
> > > Manuel Collado wrote:
>
> > >> Just to clarify the interface, the intended semantics should be:
>
> > >>     split( string, afield, regexp, asep )
>
> > >>     asep[0] = optional leading separator, when regexp = " "
> > >>     asep[n] = actual separator after afield[n]
>
> > >> This way, the following code should rebuild an exact copy of the
> > >> original string:
>
> > >>   function join( afield, asep,          k, string ) {
> > >>     string = asep[0]
> > >>     for (k=1; k in afield; k++) {
> > >>        string = string afield[k] asep[k]
> > >>     }
> > >>     return string
> > >>   }
>
> > > Uhm, shouldn't asep[n] be optional too? It seems it would only be used if
> > > regexp = " ", just like asep[0]. Otherwise, the following should hold IIUC:
>
> > > length(asep) = length(afield) - 1
>
> > Yes. asep[0] and asep[last] need not be generated for regexp != " ". Or
> > even in this special case if there are no actual leading/trailing
> > separators. In any case, if the user references those items he/she will
> > get null strings if they do not exist.
>
> > In fact, my (simple, naive) reference join code will create those extra
> > separators if they have not been generated by split.
> > --
> > Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
>
> I'm not sure about populating asep[0]. Normally we say that awk arrays
> start at 1, so is it OK to break with tradition on this, or should
> asep[] go from 1 to NF+1 and contain the separators before each field
> (and yes, there is no NF+1th field) instead of 0 to NF and contain the
> separators after each field (and yes, there is no 0th field)?
>
> Also, having thought about it a bit more now, I'm sure we don't want
> the join() function to modify the arrays it's being asked to operate
> on and what if the modification someone wanted to do on the record was
> to delete the 3rd field instead of modifying it? Ideally we'd be able
> to do:
>
>    split($0,aflds,FS,aseps)
>    delete aflds[3]
>    $0=join(aflds,aseps)
>
> so we'd need to write the join() as:
>
>   function join( afield, asep,          k, string, max ) {
>     for (k in afield) {
>        # array indices are strings, so force a numeric comparison
>        max = (k+0 > max ? k+0 : max)
>     }
>     string = (0 in asep ? asep[0] : "")
>     for (k=1; k<=max; k++) {
>        if ((k in afield) && (k in asep)) {
>           string = string afield[k] asep[k]
>        }
>     }
>     return string
>   }
>
> That's pretty non-obvious so I think we should provide that as a
> primitive when we provide the split() enhancement.
>
> Arnold - would that be OK if someone provides it?
>
> Regards,
>
>    Ed.

The more I think about this, the more I think the separator array
should start at 1 instead of zero because:

a) all other awk arrays start at 1
b) it seems more common to have leading blanks than trailing blanks
c) the first separator in a record is the 1st separator, not the 0th
separator
d) the separator array index should match the field number (somehow)
so the separator for $1 should be sep[1] and so there can't be a
sep[0] since $0 has a specific meaning separate from the semantics for
identifying fields

So, the join function should be something like:

  function join( afield, asep,          k, string, max ) {
    for (k in afield) {
       # array indices are strings, so force a numeric comparison
       max = (k+0 > max ? k+0 : max)
    }
    string = ""
    for (k=1; k<=max; k++) {
       if ((k in afield) && (k in asep)) {
          string = string asep[k] afield[k]
       }
    }
    string = string (k in asep ? asep[k] : "")
    return string
  }

but it raises the question of what to do with this:

   $0="a b c"
   split($0,aflds,FS,aseps)
   delete aflds[3]
   $0=join(aflds,aseps)
   # now $0 is "a b " instead of "a b"

Maybe rather than trying to guess what to do with trailing (or leading
if we'd started aseps at zero) separators the "join()" should be
precise in performing a real "join" operation and only combine the
arrays where the field numbers match, in which case it wouldn't
attempt to add back trailing spaces since they'd be in aseps[NF+1] and
there would be no corresponding aflds[NF+1]. If anyone wanted to add
them back they could always trivially do it manually:

  function join( afront, aback,          k, string, max ) {
    for (k in afront) {
       # array indices are strings, so force a numeric comparison
       max = (k+0 > max ? k+0 : max)
    }
    for (k in aback) {
       max = (k+0 > max ? k+0 : max)
    }
    string = ""
    for (k=1; k<=max; k++) {
       if ((k in afront) && (k in aback)) {
          string = string afront[k] aback[k]
       }
    }
    return string
  }

   $0="a b c "
   split($0,aflds,FS,aseps)
   delete aflds[3]
      $0=join(aseps,aflds)
      # now $0 is "a b"
   or:
      $0=join(aseps,aflds) aflds[NF+1]
      # now $0 is "a b "

Note that the "join()" is now general for any arrays with a numeric
index (we could even further generalise it using asort() or asorti()
for non-numeric indices), and we'd invoke it with the separator array
as the first arg since the separators come before the fields.
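
To make that concrete, here is a sketch with the arrays filled by hand, as a 1-based four-argument split() is assumed to fill them for "a b c " (aseps[k] holding whatever precedes aflds[k], with aseps[1] present even when it is null). The wrapper name join_seps_first is made up for the demo, and the trailing separator index is written out as 4 rather than NF+1:

```sh
# join_seps_first: delete a field, then join separators and fields pairwise.
join_seps_first() {
  awk '
  # A join() along the lines of the one above: only indices present in
  # both arrays contribute. Indices are forced to numbers before comparing.
  function join(afront, aback,    k, string, max) {
    for (k in afront) max = (k+0 > max ? k+0 : max)
    for (k in aback)  max = (k+0 > max ? k+0 : max)
    string = ""
    for (k = 1; k <= max; k++)
      if ((k in afront) && (k in aback))
        string = string afront[k] aback[k]
    return string
  }
  BEGIN {
    aseps[1] = "";  aflds[1] = "a"
    aseps[2] = " "; aflds[2] = "b"
    aseps[3] = " "; aflds[3] = "c"
    aseps[4] = " "                         # trailing blank, no matching field
    delete aflds[3]
    printf "[%s]\n", join(aseps, aflds)            # trailing blank dropped
    printf "[%s]\n", join(aseps, aflds) aseps[4]   # trailing blank added back
  }'
}

join_seps_first
```

It prints [a b] and then [a b ]: the deleted field takes its preceding separator with it, and the trailing blank only comes back when it is appended by hand.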

Regards,

    Ed.
0
Reply Ed 12/2/2008 3:22:21 PM

On Dec 2, 9:22 am, Ed Morton <mortons...@gmail.com> wrote:
<snip>
Sorry:
>       $0=join(aseps,aflds) aflds[NF+1]
should of course be:
        $0=join(aseps,aflds) aseps[NF+1]
0
Reply Ed 12/2/2008 3:33:56 PM

Ed Morton wrote:
> 
> The more I think about this, the more I think the separator array
> should start at 1 instead of zero because:
> 
> a) all other awk arrays start at 1
> b) it seems more common to have leading blanks than trailing blanks
> c) the first separator in a record is the 1st separator, not the 0th
> separator
> d) the separator array index should match the field number (somehow)
> so the separator for $1 should be sep[1] and so there can't be a
> sep[0] since $0 has a specific meaning separate from the semantics for
> identifying fields

As Arnold pointed out, it is important to clarify the desired behaviour 
before adding something new to gawk.

The fact is that a string composed of n fields requires:
a) (n-1) separators in the regular case (when separators exactly match
    the given regexp), and
b) up to (n+1) separators in the special case (of regexp = " ")

So having sep[1]..sep[n+1] or else sep[0]..sep[n] is of no special 
importance, IMHO. The first option is possibly more appropriate for the 
first case (a), while the second option better matches the second case (b).

Starting from 1 means that the i-th separator precedes the i-th field, 
and starting from 0 means that the separator follows the field. Perhaps 
the preference is just a matter of taste.

Also note that if numbering starts at 1, sep[1] may not exist, so the 
first actual separator is sep[2]. I.e. the sep array really doesn't 
always start at 1.

Regards.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
0
Reply Manuel 12/2/2008 4:23:30 PM

The discussion here is interesting and points out some issues.

It is also a great example of how feature and language design
are almost never as simple as they first appear!

I will think about what to provide if / when I integrate the submitted
patch.

I am not really looking to add a join() function right now. There are
too many ways to do it and whatever thing I might pick won't work for
somebody somewhere.  The lesson to learn (as seen in the early Unix
systems) is to provide the basic features that can only be done in the
interpreter and to let the user use those as building blocks to build the
correct solution for her needs. (<--- Yow, politically correct, even! :-)

Thanks,

Arnold

In article <gh3nl8$9b6$1@heraldo.rediris.es>,
Manuel Collado  <m.collado@lml.ls.fi.upm.es> wrote:
>Ed Morton wrote:
.....
-- 
Aharon (Arnold) Robbins 				arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
0
Reply arnold 12/2/2008 5:48:57 PM

On Dec 2, 10:23 am, Manuel Collado <m.coll...@lml.ls.fi.upm.es> wrote:
> Ed Morton wrote:
>
>
>
> > The more I think about this, the more I think the separator array
> > should start at 1 instead of zero because:
>
> > a) all other awk arrays start at 1
> > b) it seems more common to have leading blanks than trailing blanks
> > c) the first separator in a record is the 1st separator, not the 0th
> > separator
> > d) the separator array index should match the field number (somehow)
> > so the separator for $1 should be sep[1] and so there can't be a sep
> > [0] since $0 has a specific meaning separate from the semantics for
> > identifying fields
>
> As Arnold pointed out, it is important to clarify the desired behaviour
> before adding something new to gawk.

Absolutely. We clearly have a couple of options and once it's in
place....

> The fact is that a string composed of n fields requires:
> a) (n-1) separators in the regular case (when separators exactly match
>     the given regexp), and
> b) up to (n+1) separators in the special case (of regexp = " ")
>
> So having sep[1]..sep[n+1] or else sep[0]..sep[n] is of no special
> importance, IMHO.

Then can we agree on sep[1]..sep[n+1] for the reasons I gave above?

> The first option is possibly more appropriate for the
> first case (a), while the second option better matches the second case (b).
>
> Starting from 1 means that the i-th separator precedes the i-th field,
> and starting from 0 means that the separator follows the field. Perhaps
> the preference is just a matter of taste.
>
> Also note that if numbering starts at 1, sep[1] may not exist, so the
> first actual separator is sep[2]. I.e. the sep array really doesn't
> always starts at 1.

I think the sep array should always contain whatever separator
precedes the field of the same number, so if there are any fields,
then there is a first field so sep[1] WILL be populated, even if it's
with the NULL string. That will ensure that when we come to join()
them later that the join will succeed because sep[1] and fld[1] are
both populated. If we don't populate sep[] that way, then the
subsequent join() becomes less clear and simple.

    Ed.
Reply Ed 12/2/2008 6:00:58 PM

Manuel Collado wrote:
 >
[...]
 > As Arnold pointed out, it is important to clarify the desired behaviour
 > before adding something new to gawk.
[...]

Hi Arnold, hi awk-users,

My first patch from today had two errors (one big and one minor) for 
fs = " ". They are fixed in the attached patch. The semantics of seps 
in split(s, a, fs, seps) according to the attached patch are:

1.) fs == " ": if there is leading whitespace, it is collected in
     seps[0]; if there is trailing whitespace and the highest index of a
     is n, then seps[n] contains this whitespace.
     If there is no leading or trailing whitespace, then 0 or n,
     respectively, is not an index of seps.
     s = seps[0] a[1] seps[1] a[2] ... seps[n-1] a[n] seps[n]
     where seps[0] and seps[n] are replaced with "" if seps does not
     contain index 0 or n, respectively.

2.) in all other cases: if the highest index of a is n, then
     s = a[1] seps[1] a[2] seps[2] ... a[n-1] seps[n-1] a[n]
     and seps contains only the indices 1 through n-1

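The two numbered cases can be emulated in portable awk, which makes the
proposed indexing easy to try without a patched gawk. The following is only
a sketch under stated assumptions: split_seps and join_seps are hypothetical
names (they are not functions from the patch), fs is always treated as a
regexp unless it is " ", and the fs == "" case is not handled.

```shell
# Sketch only: emulates the proposed seps[] indexing in portable awk.
result=$(printf '%s\n' '  aaa   bbb  cXcXc    ddd ' | awk '
# Split s into a[1..n]; store separators in seps[] as in cases 1 and 2.
function split_seps(s, a, seps, fs,    n) {
    split("", a); split("", seps)          # portable way to clear arrays
    n = 0
    if (s == "") return 0
    if (fs == " ") {
        if (match(s, /^[ \t\n]+/)) {       # leading whitespace goes to seps[0]
            seps[0] = substr(s, 1, RLENGTH)
            s = substr(s, RLENGTH + 1)
        }
        while (s != "") {
            match(s, /^[^ \t\n]+/)         # s starts with a field here
            a[++n] = substr(s, 1, RLENGTH)
            s = substr(s, RLENGTH + 1)
            if (match(s, /^[ \t\n]+/)) {   # whitespace after field n
                seps[n] = substr(s, 1, RLENGTH)
                s = substr(s, RLENGTH + 1)
            }
        }
        return n
    }
    # Regexp case; the RLENGTH test guards against empty matches.
    while (match(s, fs) && RLENGTH > 0) {
        a[++n] = substr(s, 1, RSTART - 1)
        seps[n] = substr(s, RSTART, RLENGTH)
        s = substr(s, RSTART + RLENGTH)
    }
    a[++n] = s
    return n
}
# Rebuild the string; missing seps[0] or seps[n] count as "".
function join_seps(a, seps, n,    i, out) {
    out = (0 in seps) ? seps[0] : ""
    for (i = 1; i <= n; i++)
        out = out a[i] ((i in seps) ? seps[i] : "")
    return out
}
{   # case 1: fs == " ", change field 3 and keep the spacing
    n = split_seps($0, f, seps, " ")
    gsub(/X/, "+", f[3])
    print join_seps(f, seps, n)
}
END {   # case 2: a regexp separator
    n = split_seps("a, b,c", f, seps, ", *")
    f[2] = "B"
    print join_seps(f, seps, n)
}')
printf '%s\n' "$result"
```

The first output line is the input with only the X characters changed, and
the second is "a, B,c": in both cases the original separators survive the
edit of a single field.
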
In my opinion, we should start seps at index 0 in the case of 
fs == " ", because the semantics are simpler and more consistent.

Besides, ARGV also contains the index 0.

The patch implements these semantics.

I tested the patch with Prof. Collado's join function and with my own 
gawk-scripts. I hope there are no more bugs in the patch.

Arnold and Prof. Collado, you're totally correct that the desired 
behavior should be clarified before adding something new to gawk. I'm 
sorry for being too fast with my first patch.

-- 
Steffen

[Attachment: patch-split-extension]

diff -u -r -N gawk-devel/awkgram.y gawk-devel-ext/awkgram.y
--- gawk-devel/awkgram.y	2008-11-16 20:23:56.000000000 +0100
+++ gawk-devel-ext/awkgram.y	2008-12-02 07:34:42.000000000 +0100
@@ -1162,7 +1162,7 @@
 {"seek",	Node_builtin,	 LEX_BUILTIN,	GAWKX|A(2),	do_seek},
 #endif
 {"sin",		Node_builtin,	 LEX_BUILTIN,	NOT_OLD|A(1),	do_sin},
-{"split",	Node_builtin,	 LEX_BUILTIN,	A(2)|A(3),	do_split},
+{"split",	Node_builtin,	 LEX_BUILTIN,	A(2)|A(3)|A(4),	do_split},
 {"sprintf",	Node_builtin,	 LEX_BUILTIN,	0,		do_sprintf},
 {"sqrt",	Node_builtin,	 LEX_BUILTIN,	A(1),		do_sqrt},
 {"srand",	Node_builtin,	 LEX_BUILTIN,	NOT_OLD|A(0)|A(1), do_srand},
diff -u -r -N gawk-devel/ChangeLog gawk-devel-ext/ChangeLog
--- gawk-devel/ChangeLog	2008-11-16 21:05:33.000000000 +0100
+++ gawk-devel-ext/ChangeLog	2008-12-02 21:57:43.000000000 +0100
@@ -1,3 +1,11 @@
+Tue Dec  2 07:42:00 2008  Steffen Schuler    <schuler.steffen@googlemail.com>
+
+	* field.c (*_parse_field): Add `sep_arr´ argument and fill it.
+	* field.c (do_split): Add handling of fourth argument of awk 
+	builtin	`split´.
+	* field.c (get_field): Extend parse_field by default argument.
+	* awkgram.y (tokentab): Extend `split´ entry with fourth argument.
+
 Sat Oct 27 22:43:50 2007  Arnold D. Robbins  <arnold@skeeve.com>
 
 	* re.c (resetup): Add RE_INVALID_INTERVAL_ORD to syntax bits if
diff -u -r -N gawk-devel/doc/gawk.1 gawk-devel-ext/doc/gawk.1
--- gawk-devel/doc/gawk.1	2008-11-16 21:05:45.000000000 +0100
+++ gawk-devel-ext/doc/gawk.1	2008-12-02 08:04:27.000000000 +0100
@@ -2472,11 +2472,13 @@
 provide the starting index in the string and length
 respectively, of each matching substring.
 .TP
-\fBsplit(\fIs\fB, \fIa \fR[\fB, \fIr\fR]\fB)\fR
+\fBsplit(\fIs\fB, \fIa \fR[\fB, \fIr\fR [\fB, \fIseps\fR] ]\fB)\fR
 Splits the string
 .I s
 into the array
 .I a
+and the separators array
+.I seps
 on the regular expression
 .IR r ,
 and returns the number of fields.  If
diff -u -r -N gawk-devel/doc/gawk.texi gawk-devel-ext/doc/gawk.texi
--- gawk-devel/doc/gawk.texi	2008-11-16 21:05:45.000000000 +0100
+++ gawk-devel-ext/doc/gawk.texi	2008-12-02 08:14:16.000000000 +0100
@@ -12692,10 +12692,11 @@
 (@pxref{Options}),
 using a third argument is a fatal error.
 
-@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
+@item split(@var{string}, @var{array} @r{[}, @var{fieldsep} @r{[}, @var{seps} @r{]} @r{]})
 @cindex @code{split} function
 This function divides @var{string} into pieces separated by @var{fieldsep}
-and stores the pieces in @var{array}.  The first piece is stored in
+and stores the pieces in @var{array} and the separator strings in 
+@var{seps} array.  The first piece is stored in
 @code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
 forth.  The string value of the third argument, @var{fieldsep}, is
 a regexp describing where to split @var{string} (much as @code{FS} can
@@ -12707,7 +12708,7 @@
 manner similar to the way input lines are split into fields.  For example:
 
 @example
-split("cul-de-sac", a, "-")
+split("cul-de-sac", a, "-", seps)
 @end example
 
 @noindent
@@ -12721,6 +12722,13 @@
 a[3] = "sac"
 @end example
 
+and sets the contents of the array @code{seps} as follows:
+
+@example
+seps[1] = "-"
+seps[2] = "-"
+@end example
+
 @noindent
 The value returned by this call to @code{split} is three.
 
diff -u -r -N gawk-devel/field.c gawk-devel-ext/field.c
--- gawk-devel/field.c	2008-11-16 20:21:15.000000000 +0100
+++ gawk-devel-ext/field.c	2008-12-02 22:14:05.000000000 +0100
@@ -39,20 +39,20 @@
 typedef void (* Setfunc) P((long, char *, long, NODE *));
 
 static long (*parse_field) P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static void rebuild_record P((void));
 static long re_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static long def_parse_field P((long, char **, int, NODE *,
-			      Regexp *, Setfunc, NODE *));
+			      Regexp *, Setfunc, NODE *, NODE *));
 static long posix_def_parse_field P((long, char **, int, NODE *,
-			      Regexp *, Setfunc, NODE *));
+			      Regexp *, Setfunc, NODE *, NODE *));
 static long null_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static long sc_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static long fw_parse_field P((long, char **, int, NODE *,
-			     Regexp *, Setfunc, NODE *));
+			     Regexp *, Setfunc, NODE *, NODE *));
 static void set_element P((long num, char * str, long len, NODE *arr));
 static void grow_fields_arr P((long num));
 static void set_field P((long num, char *str, long len, NODE *dummy));
@@ -364,12 +364,14 @@
 	NODE *fs ATTRIBUTE_UNUSED,
 	Regexp *rp,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
 	register char *field;
 	register char *end = scan + len;
+	char *sep;
 #ifdef MBS_SUPPORT
 	size_t mbclen = 0;
 	mbstate_t mbs;
@@ -382,9 +384,14 @@
 	if (len == 0)
 		return nf;
 
-	if (RS_is_null && default_FS)
+	if (RS_is_null && default_FS) {
+		sep = scan;
 		while (scan < end && (*scan == ' ' || *scan == '\t' || *scan == '\n'))
 			scan++;
+		if (sep_arr != NULL) 
+			set_element(nf, sep, (long)(scan - sep), sep_arr);
+	}
+
 	field = scan;
 	while (scan < end
 	       && research(rp, scan, 0, (end - scan), RE_NEED_START) != -1
@@ -404,6 +411,8 @@
 			scan++;
 			if (scan == end) {
 				(*set)(++nf, field, (long)(scan - field), n);
+				if (sep_arr != NULL) 
+	        			set_element(nf, field, 0L, sep_arr);
 				up_to = nf;
 				break;
 			}
@@ -411,6 +420,9 @@
 		}
 		(*set)(++nf, field,
 		       (long)(scan + RESTART(rp, scan) - field), n);
+		if (sep_arr != NULL) 
+	    		set_element(nf, scan + RESTART(rp, scan), 
+           			(long) (REEND(rp, scan) - RESTART(rp, scan)), sep_arr);
 		scan += REEND(rp, scan);
 		field = scan;
 		if (scan == end)	/* FS at end of record */
@@ -439,13 +451,15 @@
 	NODE *fs,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
 	register char *field;
 	register char *end = scan + len;
 	char sav;
+	char *sep;
 
 	if (up_to == UNLIMITED)
 		nf = 0;
@@ -455,6 +469,8 @@
 	/*
 	 * Nasty special case. If FS set to "", return whole record
 	 * as first field. This is not worth a separate function.
+	 *
+	 * Steffen Schuler: Is this case still needed?
 	 */
 	if (fs->stlen == 0) {
 		(*set)(++nf, *buf, len, n);
@@ -467,13 +483,16 @@
 	/* because it will be destroyed now: */
 
 	*end = ' ';	/* sentinel character */
-	for (; nf < up_to; scan++) {
+	while (nf < up_to) {
 		/*
 		 * special case:  fs is single space, strip leading whitespace 
 		 */
+ 		sep = scan;
 		while (scan < end && (*scan == ' ' || *scan == '\t' || *scan == '\n'))
 			scan++;
-		if (scan >= end)
+		if (sep_arr != NULL && scan > sep)
+			set_element(nf, sep, (long) (scan - sep), sep_arr);
+		if (scan >= end) 
 			break;
 		field = scan;
 		while (*scan != ' ' && *scan != '\t' && *scan != '\n')
@@ -506,13 +525,15 @@
 	NODE *fs,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
 	register char *field;
 	register char *end = scan + len;
 	char sav;
+	char *sep;
 
 	if (up_to == UNLIMITED)
 		nf = 0;
@@ -522,6 +543,8 @@
 	/*
 	 * Nasty special case. If FS set to "", return whole record
 	 * as first field. This is not worth a separate function.
+	 *
+	 * Steffen Schuler: Is this case still needed?
 	 */
 	if (fs->stlen == 0) {
 		(*set)(++nf, *buf, len, n);
@@ -538,8 +561,11 @@
 		/*
 		 * special case:  fs is single space, strip leading whitespace 
 		 */
+		sep = scan;
 		while (scan < end && (*scan == ' ' || *scan == '\t'))
 			scan++;
+		if (sep_arr != NULL)
+			set_element(nf, sep, (long) (scan - sep), sep_arr);
 		if (scan >= end)
 			break;
 		field = scan;
@@ -570,7 +596,8 @@
 	NODE *fs ATTRIBUTE_UNUSED,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
@@ -592,13 +619,18 @@
 				/* We treat it as a singlebyte character.  */
 				mbclen = 1;
 			}
+			if (nf > 0 && sep_arr != NULL)
+        			set_element(nf, scan, 0L, sep_arr);
 			(*set)(++nf, scan, mbclen, n);
 			scan += mbclen;
 		}
 	} else
 #endif
-	for (; nf < up_to && scan < end; scan++)
+	for (; nf < up_to && scan < end; scan++) {
+		if (nf > 0 && sep_arr != NULL)
+			set_element(nf, scan, 0L, sep_arr);
 		(*set)(++nf, scan, 1L, n);
+	}
 
 	*buf = scan;
 	return nf;
@@ -618,7 +650,8 @@
 	NODE *fs,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register char fschar;
@@ -626,6 +659,7 @@
 	register char *field;
 	register char *end = scan + len;
 	char sav;
+	char *sep;
 #ifdef MBS_SUPPORT
 	size_t mbclen = 0;
 	mbstate_t mbs;
@@ -668,6 +702,8 @@
 		(*set)(++nf, field, (long)(scan - field), n);
 		if (scan == end)
 			break;
+		if (sep_arr != NULL)
+			set_element(nf, scan, 1L, sep_arr);
 		scan++;
 		if (scan == end) {	/* FS at end of record */
 			(*set)(++nf, field, 0L, n);
@@ -695,7 +731,8 @@
 	NODE *fs ATTRIBUTE_UNUSED,
 	Regexp *rp ATTRIBUTE_UNUSED,
 	Setfunc set,	/* routine to set the value of the parsed field */
-	NODE *n)
+	NODE *n,
+	NODE *sep_arr)
 {
 	register char *scan = *buf;
 	register long nf = parse_high_water;
@@ -737,6 +774,7 @@
 		    			fields_arr[0]->stlen -
 					(parse_extent - fields_arr[0]->stptr),
 		    			save_FS, FS_regexp, set_field,
+					(NODE *) NULL,
 					(NODE *) NULL);
 				parse_high_water = NF;
 			}
@@ -764,7 +802,7 @@
 			parse_extent = fields_arr[0]->stptr;
 		parse_high_water = (*parse_field)(requested, &parse_extent,
 		     fields_arr[0]->stlen - (parse_extent - fields_arr[0]->stptr),
-		     save_FS, FS_regexp, set_field, (NODE *) NULL);
+		     save_FS, FS_regexp, set_field, (NODE *) NULL, (NODE *) NULL);
 
 		/*
 		 * if we reached the end of the record, set NF to the number of
@@ -810,10 +848,10 @@
 NODE *
 do_split(NODE *tree)
 {
-	NODE *src, *arr, *sep, *fs, *src2, *fs2, *tmp;
+	NODE *src, *arr, *sep, *fs, *sep_arr, *src2, *fs2, *tmp;
 	char *s;
 	long (*parseit) P((long, char **, int, NODE *,
-			 Regexp *, Setfunc, NODE *));
+			 Regexp *, Setfunc, NODE *, NODE *));
 	Regexp *rp = NULL;
 
 	src = force_string(tree_eval(tree->lnode));
@@ -824,6 +862,26 @@
 
 	sep = tree->rnode->rnode->lnode;
 
+	if (tree->rnode->rnode->rnode == NULL)
+		sep_arr = NULL;
+	else {
+		static short warned = FALSE;
+
+		if (do_traditional || do_posix) {
+			fatal(_("split: fourth argument is a gawk extension"));
+		}
+
+		sep_arr = get_param(tree->rnode->rnode->rnode->lnode);
+
+		if (sep_arr->type != Node_var_array)
+			fatal(_("split: optional fourth argument is not an array"));
+
+		if (do_lint && !warned) {
+			warned = TRUE;
+			lintwarn(_("split: fourth arg is a gawk extension"));
+		}
+	}
+
 	if (src->stlen == 0) {
 		/*
 		 * Skip the work if first arg is the null string.
@@ -835,9 +893,13 @@
 		if ((sep->re_flags & (FS_DFLT|CONST)) == 0)
 			free_temp(tree_eval(sep->re_exp));
 		/*
-		 * And now we can safely turn off the array.
+		 * And now we can safely turn off the arrays.
 		 */
 		assoc_clear(arr);
+
+		if (sep_arr != NULL)
+			assoc_clear(sep_arr);
+
 		return tmp_number((AWKNUM) 0);
 	}
 
@@ -884,9 +946,12 @@
 
 	assoc_clear(arr);
 
+	if (sep_arr != NULL)
+		assoc_clear(sep_arr);
+
 	s = src2->stptr;
 	tmp = tmp_number((AWKNUM) (*parseit)(UNLIMITED, &s, (int) src2->stlen,
-					     fs2, rp, set_element, arr));
+					     fs2, rp, set_element, arr, sep_arr));
 	unref(src2);
 	unref(fs2);
 	return tmp;

Reply Steffen 12/2/2008 10:04:59 PM

On Dec 2, 4:04 pm, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
> Manuel Collado wrote:
>
>  >
> [...]
>  > As Arnold pointed out, it is important to clarify the desired behaviour
>  > before adding something new to gawk.
> [...]
>
> Hi Arnold, hi awk-users,
>
> I had two errors (one big and one minor) in my first patch from today
> for fs = " ". In the now attached patch these errors are fixed. The
> semantics for seps in split(s, a, fs, seps) according to the attached
> patch is:
>
> 1.) fs == " ": if there is beginning whitespace, this is collected in
>      seps[0]; if there is trailing whitespace and the highest index of a
>      is n, then seps[n] contains this whitespace.
>      if there is no beginning or trailing whitespace, then 0 resp. n
>      aren't contained in seps as indices.
>      s = seps[0] a[1] seps[1] a[2] ... seps[n-1] a[n] seps[n]
>      where seps[0] and seps[n] are replaced with "" if seps does not
>      contain 0 resp. n.
>
> 2.) in all other cases: if the highest index of a is n, then
>      s = a[1] seps[1] a[2] seps[2] ... a[n-1] seps[n-1] a[n]
>      and seps contains only the indices 1 until n-1
>
> My opinion is, we should use the starting 0 index of seps in the case of
> fs == " ", because of the simpler and more consistent semantics.
>
> Besides ARGV contains also the index 0.

I'm not trying to be snide, and I may just be missing the crucial
point but I just don't see how starting at zero is simpler or has more
consistent semantics.

Consistency: Yes, ARGV has a zero index but that's just the name of
the program which is very different from all of the actual arguments
that start at index 1, and $0 is very different from all of the actual
fields which start at $1, so having the array of field separators
start at zero with an actual field separator just like all the other
field separators is totally inconsistent with all other awk arrays
including those two. If you said that the field separator array
started at zero but the zero-th element contained all of the field
separators concatenated or the name of the function you called to
split the record, THEN I could see how it'd be consistent with $0 or
ARGV[0]. So what is starting the separator array at zero consistent
with?

Simplicity: there is no difference as far as I can tell in modifying
or deleting then recombining the fields with the separators between
1->NF+1 or 0->NF. So, what is simpler about starting at 0 vs starting
at 1?

It seems to me like starting at zero is the same simplicity as
starting at 1 but is less consistent with all other awk arrays. I also
think (and this is just based on me trying to recall what I've seen
over the years so I could be totally wrong) that we much more
frequently see meaningful leading white space than trailing white
space and people are used to using 1->NF to loop through fields so
starting the separators at 1 makes this loop:

   for (i=1;i<=NF;i++)
      printf "%s%s",seps[i],$i

much more likely to be what is actually wanted rather than this:

   for (i=0;i<=NF;i++)
      printf "%s%s",$i,seps[i]

Note the deliberate bug introduced above by forgetting the serious
semantic inconsistency between the relationship of seps[1]<->$1 and
that of seps[0]<->$0.

   Ed.
Reply Ed 12/3/2008 1:19:26 AM
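
Ed's convention, where seps[i] holds whatever precedes field i and is always
populated (possibly with the null string), can be sketched as below. This is
an illustrative emulation in portable awk, assuming blank and tab separators
only; the arrays fld and seps are built by hand here because no awk
discussed so far provides them.

```shell
# Sketch only: seps[i] precedes field i; seps[n+1] is the trailing part.
rebuilt=$(printf '%s\n' '  a  b c ' | awk '{
    rest = $0; n = 0
    while (match(rest, /[^ \t]+/)) {
        seps[++n] = substr(rest, 1, RSTART - 1)   # may be the null string
        fld[n] = substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
    }
    seps[n + 1] = rest                            # trailing whitespace, if any

    out = ""
    for (i = 1; i <= n; i++)                      # the familiar 1..n loop
        out = out seps[i] fld[i]
    print out seps[n + 1]                         # one extra step for the tail
}')
printf '[%s]\n' "$rebuilt"    # [  a  b c ]
```

The 1..n loop alone reproduces the leading whitespace and every field; only
the trailing whitespace needs the extra seps[n+1] term.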

Ed Morton wrote:
> On Dec 2, 4:04 pm, Steffen Schuler <schuler.stef...@googlemail.com>
> wrote:
>> [...]  The
>> semantics for seps in split(s, a, fs, seps) according to the attached
>> patch is:
>>
>> 1.) fs == " ": if there is beginning whitespace, this is collected in
>>      seps[0]; if there is trailing whitespace and the highest index of a
>>      is n, then seps[n] contains this whitespace.
>>      if there is no beginning or trailing whitespace, then 0 resp. n
>>      aren't contained in seps as indices.
>>      s = seps[0] a[1] seps[1] a[2] ... seps[n-1] a[n] seps[n]
>>      where seps[0] and seps[n] are replaced with "" if seps does not
>>      contain 0 resp. n.
>>
>> 2.) in all other cases: if the highest index of a is n, then
>>      s = a[1] seps[1] a[2] seps[2] ... a[n-1] seps[n-1] a[n]
>>      and seps contains only the indices 1 until n-1
>>
>> My opinion is, we should use the starting 0 index of seps in the case of
>> fs == " ", because of the simpler and more consistent semantics.
>>
>> Besides ARGV contains also the index 0.
> 
> I'm not trying to be snide, and I may just be missing the crucial
> point but I just don't see how starting at zero is simpler or has more
> consistent semantics.
> 
> Consistency: Yes, ARGV has a zero index but that's just the name of
> the program which is very different from all of the actual arguments
> that start at index 1, and $0 is very different from all of the actual
> fields which start at $1, so having the array of field separators
> start at zero with an actual field separator just like all the other
> field separators is totally inconsistent with all other awk arrays
> including those two. If you said that the field separator array
> started at zero but the zero-th element contained all of the field
> separators concatenated or the name of the function you called to
> split the record, THEN I could see how it'd be consistent with $0 or
> ARGV[0]. So what is starting the separator array at zero consistent
> with?

Let me assume that if fs != " ", then seps[1] is the separator string
between a[1] and a[2] (provided fs matches a substring of s).

By consistency I mean that one wants the same semantics if
fs == " ". That is, seps[1] is the whitespace string (the separator
string) between a[1] and a[2]. Then the whitespace string before
a[1] (if it exists) must come before seps[1], that is, it must have an
index one smaller than seps[1]; hence it has the index 0.

Therefore, if one wants consistency, the leading whitespace (if it exists 
and fs == " ") must be assigned to seps[0].

Consistency here means the same semantics of seps[i] relative to
a[1], ..., a[n] for the cases fs == " " and fs != " ".

You would also have consistency if you started seps at 2 when
fs != " " and at 1 when fs == " ".

But this is less natural than Prof. Collado's semantics, because
starting a sequence at 0 or 1 is more natural than starting a sequence
at 1 or 2.

If you start seps at 1 when fs == " ", you also get the strange
semantics that when a[n] (resp. $NF in the record case) is the last
member of the a-array, then seps[n+2] is the last member of the seps
array when there is trailing whitespace. But you would want the
last index of seps to be at most 1 greater than the last index of a.
I think you see what I mean by consistent and natural.

> 
> Simplicity: there is no difference as far as I can tell in modifying
> or deleting then recombining the fields with the separators between
> 1->NF+1 or 0->NF. So, what is simpler about starting at 0 vs starting
> at 1?

If you consider the simple join-function provided by Prof. Collado
for the case seps starts with index 0 when fs == " " and your more
complicated join-function for the case seps starts with index 1 when
fs == " ", then you see that the "seps[0]"-case provides more simplicity
than the "seps[1]"-case.

But I can imagine a join-function in the "seps[1]"-case which is nearly
as simple as Prof. Collado's join-function.

> 
> It seems to me like starting at zero is the same simplicity as
> starting at 1 but is less consistent with all other awk arrays. I also
> think (and this is just based on me trying to recall what I've seen
> over the years so I could be totally wrong) that we much more
> frequently see meaningful leading white space than trailing white
> space and people are used to using 1->NF to loop through fields so
> starting the separators at 1 makes this loop:
> 
>    for (i=1;i<=NF;i++)
>       printf "%s%s",seps[i],$i
> 
> much more likely to be what is actually wanted rather than this:
> 
>    for (i=0;i<=NF;i++)
>       printf "%s%s",$i,seps[i]
> 
> Note the deliberate bug introduced above by forgetting the serious
> semantic inconsistency between the relationship of seps[1]<->$1 and
> that of seps[0]<->$0.
> 

$0 is a special field. I believe that if we define the semantics clearly,
awk users will not make the mistake of associating $0 with seps[0], just
as people don't make the mistake of associating ARGV[0] with $0.

Another point in favor of the "seps[0]" semantics is that two people
assumed it independently: me, with the gawk patch, and Prof. Collado,
with his semantics. Only one person (Ed) assumed the "seps[1]" semantics.

>    Ed.

-- 
Steffen
Reply Steffen 12/3/2008 5:32:42 AM

Steffen Schuler wrote:
> [...]
> 
> For the "seps[0]"-semantics speaks also that two people assumed the same
> "seps[0]"-semantics independently. These two people are me with the
> gawk-patch and Prof. Collado with his semantics but only one person (Ed) 
> with the "seps[1]"-semantics.

Those types of assumptions based on a handful of people are dangerous, to
say the least. Clearly, I can see Ed's point, and I can also see the
reasons for your preference. I abstained only because I am still undecided
which arguments weigh more in minimizing "inconsistency". Rather, I think
the special case of handling a space outweighs any "index consistency"
question. And, generally, if we could get rid of such special handling, the
consistency question ([0] vs. [1]) could be answered more easily, I guess.

Janis
Reply Janis 12/3/2008 6:56:43 AM

Ed Morton wrote:
> On Dec 2, 4:04 pm, Steffen Schuler <schuler.stef...@googlemail.com>
> wrote:
>> Manuel Collado wrote:
>> [...]
>>  > As Arnold pointed out, it is important to clarify the desired behaviour
>>  > before adding something new to gawk.
>> [...]
>> ... The
>> semantics for seps in split(s, a, fs, seps) according to the attached
>> patch is:
>>
>> 1.) fs == " ": if there is beginning whitespace, this is collected in
>>      seps[0]; if there is trailing whitespace and the highest index of a
>>      is n, then seps[n] contains this whitespace.
>>      if there is no beginning or trailing whitespace, then 0 resp. n
>>      aren't contained in seps as indices.
>>      s = seps[0] a[1] seps[1] a[2] ... seps[n-1] a[n] seps[n]
>>      where seps[0] and seps[n] are replaced with "" if seps does not
>>      contain 0 resp. n.
>>
>> 2.) in all other cases: if the highest index of a is n, then
>>      s = a[1] seps[1] a[2] seps[2] ... a[n-1] seps[n-1] a[n]
>>      and seps contains only the indices 1 until n-1
>>
>> My opinion is, we should use the starting 0 index of seps in the case of
>> fs == " ", because of the simpler and more consistent semantics.

The fact is that there is a third case:

3) fs = "". The string is split into individual characters. For 
efficiency, the sep array should not be generated (it is not needed to 
rebuild the original string).

> 
> I'm not trying to be snide, and I may just be missing the crucial
> point but I just don't see how starting at zero is simpler or has more
> consistent semantics.
> 
> Consistency: Yes, ARGV has a zero index but that's just the name of
> the program which is very different from all of the actual arguments
> that start at index 1, and $0 is very different from all of the actual
> fields which start at $1, so having the array of field separators
> start at zero with an actual field separator just like all the other
> field separators is totally inconsistent with all other awk arrays
> including those two. If you said that the field separator array
> started at zero but the zero-th element contained all of the field
> separators concatenated or the name of the function you called to
> split the record, THEN I could see how it'd be consistent with $0 or
> ARGV[0]. So what is starting the separator array at zero consistent
> with?
> 
> Simplicity: there is no difference as far as I can tell in modifying
> or deleting then recombining the fields with the separators between
> 1->NF+1 or 0->NF. So, what is simpler about starting at 0 vs starting
> at 1?
> 
> It seems to me like starting at zero is the same simplicity as
> starting at 1 but is less consistent with all other awk arrays. I also
> think (and this is just based on me trying to recall what I've seen
> over the years so I could be totally wrong) that we much more
> frequently see meaningful leading white space than trailing white
> space and people are used to using 1->NF to loop through fields so
> starting the separators at 1 makes this loop:

Perhaps not a crucial point, but something worth considering: splitting 
a string into pieces can be seen from the two sides of the mirror. The 
splitting regexp can be seen as a specification for the separator, or 
for the fields themselves. I remember a request for this feature 
some time ago.

So it would be very convenient to always generate a sep array that 
contains exactly the actual matches of the splitting regexp. This is 
consistent with an empty set of separators in the third special case of 
a null splitting regexp, for which there are no real separators. And the 
sep array will never contain a null value.

This is against forging a null starting separator in some cases:

> I think the sep array should always contain whatever separator
> precedes the field of the same number, so if there are any fields,
> then there is a first field so sep[1] WILL be populated, even if it's
> with the NULL string. That will ensure that when we come to join()
> them later that the join will succeed because sep[1] and fld[1] are
> both populated. If we don't populate sep[] that way, then the
> subsequent join() becomes less clear and simple.

because in that case the sep array cannot be seen as a set of fields, if 
the splitting regexp is intended to match the fields and not the separators.

So my vote is for using sep[0] for the optional leading space (in fact 
not a separator) in the special " " splitting regexp.

Regards.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
Reply m 12/3/2008 11:58:57 AM

On Wednesday 3 December 2008 02:19, Ed Morton wrote:

> It seems to me like starting at zero is the same simplicity as
> starting at 1 but is less consistent with all other awk arrays. I also
> think (and this is just based on me trying to recall what I've seen
> over the years so I could be totally wrong) that we much more
> frequently see meaningful leading white space than trailing white
> space and people are used to using 1->NF to loop through fields so
> starting the separators at 1 makes this loop:
> 
>    for (i=1;i<=NF;i++)
>       printf "%s%s",seps[i],$i
> 
> much more likely to be what is actually wanted rather than this:
> 
>    for (i=0;i<=NF;i++)
>       printf "%s%s",$i,seps[i]

My 2 cents:

You don't need either. If, as Manuel said, we have a sep[] array where
sep[0] and sep[NF] are optional, you can *always* rebuild a line using this
idiom:

line=sep[0]   # possibly empty
for (i=1;i<=NF;i++)
  line=line $i sep[i]   # sep[NF] might be empty, but 
                        # concatenating an empty string does nothing

and this regardless of whether the regex used for splitting is "", " ", or
another regex. At the same time, the sep[] array and its elements (if any)
are preserved, since merely referencing a non-existing array element does
not create it, so you can still check whether sep[0], sep[NF] or any
element in between are defined, or do any other check.


Reply pk 12/3/2008 12:31:18 PM

On Dec 3, 6:31=A0am, pk <p...@pk.invalid> wrote:
> On Wednesday 3 December 2008 02:19, Ed Morton wrote:
>
> > It seems to me like starting at zero is the same simplicity as
> > starting at 1 but is less consistent with all other awk arrays. I also
> > think (and this is just based on me trying to recall what I've seen
> > over the years so I could be totally wrong) that we much more
> > frequently see meaningful leading white space than trailing white
> > space and people are used to using 1->NF to loop through fields so
> > starting the separators at 1 makes this loop:
>
> >    for (i=1;i<=NF;i++)
> >       printf "%s%s",seps[i],$i
>
> > much more likely to be what is actually wanted rather than this:
>
> >    for (i=0;i<=NF;i++)
> >       printf "%s%s",$i,seps[i]
>
> My 2 cents:
>
> You don't need either. If, as Manuel said, we have a sep[] array where
> sep[0] and sep[NF] are optional, you can *always* rebuild a line using this
> idiom:
>
> line=sep[0]   # possibly empty
> for (i=1;i<=NF;i++)
>   line=line $i sep[i]   # sep[NF] might be empty, but
>                         # concatenating an empty string does nothing

No, if you don't want to modify sep[] then you need to make that:

line=(0 in sep ? sep[0] : "")
for (i=1;i<=NF;i++)
  line=line $i (i in sep ? sep[i] : "")

> and this regardless of whether the regex used for splitting is "", " ", or
> another regex. At the same time, the sep[] array and its elements (if any)
> are preserved, since merely referencing a non-existing array element does
> not create it,

Yes, it does create it. Your code above creates sep[0] and sep[NF].
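
A minimal sketch of that difference, under assumptions that are not from the thread: fld[] and sep[] are filled by hand instead of by the proposed split(), and count() and count_after_rebuild are hypothetical names used only for illustration. It rebuilds the same two-field line both ways and prints the line followed by the number of elements left in sep[].

```shell
# Hypothetical wrapper: $1 is "guarded" or "unguarded".
count_after_rebuild() {
    awk -v mode="$1" '
    # count() is an illustrative helper, not part of any awk.
    function count(a,   k, n) { for (k in a) n++; return n + 0 }
    BEGIN {
        # Two fields with one separator between them.
        # sep[0] and sep[2] deliberately do not exist.
        nf = 2; fld[1] = "a"; fld[2] = "b"; sep[1] = " "
        if (mode == "guarded") {
            # Test membership first, so nothing is created.
            line = (0 in sep ? sep[0] : "")
            for (i = 1; i <= nf; i++)
                line = line fld[i] (i in sep ? sep[i] : "")
        } else {
            # Plain references: sep[0] and sep[2] spring into existence.
            line = sep[0]
            for (i = 1; i <= nf; i++)
                line = line fld[i] sep[i]
        }
        print line ":" count(sep)
    }'
}

count_after_rebuild unguarded   # prints "a b:3"
count_after_rebuild guarded     # prints "a b:1"
```

Both variants produce the same line; only the unguarded one leaves two extra empty elements behind in sep[].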

   Ed.

> so you can still check whether sep[0], sep[NF] or any
> element in between are defined, or do any other check.

0
Reply Ed 12/3/2008 1:05:48 PM

Aharon Robbins wrote:
> The discussion here is interesting and points out some issues.
> 
> It is also a great example of how feature and language design
> are almost never as simple as they first appear!
> 
> I will think about what to provide if / when I integrate the submitted
> patch.
> 
> I am not really looking to add a join() function right now. There are
> too many ways to do it and whatever thing I might pick won't work for
> somebody somewhere.  The lesson to learn (as seen in the early Unix
> systems) is to provide the basic features that can only be done in the
> interpreter and to let the user use those as building blocks to build the
> correct solution for her needs. (<--- Yow, politically correct, even! :-)
> 
> Thanks,
> 
> Arnold
> 
> In article <gh3nl8$9b6$1@heraldo.rediris.es>,
> Manuel Collado  <m.collado@lml.ls.fi.upm.es> wrote:
>> Ed Morton wrote:
> ....

Hi Arnold, hi awk-users,

To test the extended split functionality with either of the two
start-index semantics under consideration for the fourth argument, the
following awk script is useful.

The join2 function in particular is general, and remains useful when one
or more fields or separators have been deleted. It works for both
semantics under consideration (see below).

-----------------8<----------------------------------------
{
   split($0, field, FS, seps)

   # when indices in seps start minimally with 0
   joined = join2(field, seps)

   # when indices in seps start minimally with 1
   # joined = join2(seps, field)
}

$0 != joined {
   ++errcnt
   fmt = "%3d:%d:ERROR:%s:\n"
   printf fmt, NR, NF, $0
   printf fmt, NR, NF, joined
   mini = minidx(seps)
   maxi = maxidx(seps)
   for (i = mini; i <= maxi; ++i)
     if (i in seps)
       printf "%d:-->%s<--\n", i, seps[i]
}

END {
   printf "total %d errors\n", errcnt
   exit(errcnt > 0)
}

# some general and portable functions on arrays with (integer) indices;
# arrays may have any max or min (integer) index and
# may have holes in the range of indices
# each function leaves its input arrays unchanged and modifies only
# its output array
# the emphasis was on clean design, not on brevity or efficiency

function join2(a, b,   c)
{
   zip(a, b, c)
   return join(c)
}

function join(a,   mini, maxi, i, s)
{
   mini = minidx(a) + 0
   maxi = maxidx(a) + 0
   for (i = mini; i <= maxi; ++i)
     if (i in a)
       s = s a[i]
   return s
}

function zip(a, b, c,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   for (i = mini; i <= maxi; ++i) {
     if (i in a)
       c[i] = a[i]
     if (i in b)
       c[i] = c[i] b[i]
   }
}

function maxidx2(a, b)
{
   if (emptyarr(a))
     return maxidx(b)
   else if (emptyarr(b))
     return maxidx(a)
   else
     return max(maxidx(a), maxidx(b))
}

function minidx2(a, b)
{
   if (emptyarr(a))
     return minidx(b)
   else if (emptyarr(b))
     return minidx(a)
   else
     return min(minidx(a), minidx(b))
}

function maxidx(a,   i, m)
{
   if (emptyarr(a))
     m = ""
   else {
     m = choose(a) + 0
     for (i in a) {
       i += 0
       if (i > m)
         m = i
     }
   }
   return m
}

function minidx(a,   i, m)
{
   if (emptyarr(a))
     m = ""
   else {
     m = choose(a) + 0
     for (i in a) {
       i += 0
       if (i < m)
         m = i
     }
   }
   return m
}

function choose(a,   m)
{
   for (m in a)
     return m
   return ""
}

function emptyarr(a,  k)
{
   for (k in a)
     return 0
   return 1
}

# portable delete
function del(a)
{
   split("", a)
}

function max(x, y)
{
   return x > y ? x : y
}

function min(x, y)
{
   return x < y ? x : y
}
-----------------8<----------------------------------------

-- 
Steffen
0
Reply Steffen 12/3/2008 1:14:28 PM

On Dec 2, 11:32 pm, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
> Ed Morton wrote:
> > On Dec 2, 4:04 pm, Steffen Schuler <schuler.stef...@googlemail.com>
> > wrote:
> >> [...]  The
> >> semantics for seps in split(s, a, fs, seps) according to the attached
> >> patch is:
>
> >> 1.) fs == " ": if there is beginning whitespace, this is collected in
> >>      seps[0]; if there is trailing whitespace and the highest index of a
> >>      is n, then seps[n] contains this whitespace.
> >>      if there is no beginning or trailing whitespace, then 0 resp. n
> >>      aren't contained in seps as indices.
> >>      s = seps[0] a[1] seps[1] a[2] ... seps[n-1] a[n] seps[n]
> >>      where seps[0] and seps[n] are replaced with "" if seps does not
> >>      contain 0 resp. n.
>
> >> 2.) in all other cases: if the highest index of a is n, then
> >>      s = a[1] seps[1] a[2] seps[2] ... a[n-1] seps[n-1] a[n]
> >>      and seps contains only the indices 1 until n-1
>
> >> My opinion is, we should use the starting 0 index of seps in the case of
> >> fs == " ", because of the simpler and more consistent semantics.
>
> >> Besides ARGV contains also the index 0.
>
> > I'm not trying to be snide, and I may just be missing the crucial
> > point but I just don't see how starting at zero is simpler or has more
> > consistent semantics.
>
> > Consistency: Yes, ARGV has a zero index but that's just the name of
> > the program which is very different from all of the actual arguments
> > that start at index 1, and $0 is very different from all of the actual
> > fields which start at $1, so having the array of field separators
> > start at zero with an actual field separator just like all the other
> > field separators is totally inconsistent with all other awk arrays
> > including those two. If you said that the field separator array
> > started at zero but the zero-th element contained all of the field
> > separators concatenated or the name of the function you called to
> > split the record, THEN I could see how it'd be consistent with $0 or
> > ARGV[0]. So what is starting the separator array at zero consistent
> > with?
>
> Let me assume that if fs != " ", then seps[1] is the separator string
> between a[1] and a[2] (if fs matches a substring of s).

Why make a special case out of fs == " " when you can just state that
seps[N] is the separator string before a[N] for all separators?

> With consistency I mean, that one wants the same semantics, if
> fs == " ". That is seps[1] is the whitespace string (the separator
> string) between a[1] and a[2]. Then the whitespace string before
> a[1] (if existing) must come before seps[1] that is must have an
> index one smaller than seps[1], hence it has the index 0.

Or - seps[1] is the whitespace string before a[1]. Period.

> Therefore if one wants consistency the starting whitespace (if existing
> and fs == " ") must be assigned to seps[0].

No - seps[1] is consistently the whitespace before a[1].

> Consistency means here the same semantics of seps[i] relative to
> a[1], ..., a[n] for the cases fs == " " and fs != " ".

Right - seps[i] is always the space before a[i] and seps[i+1] is
always the space after a[i].

> You would also have consistency when you start seps with 2 when
> fs != " " and start seps with 1 when fs == " ".

No, then you're special-casing fs == " ".

> But this is more unnatural than Prof. Collado's semantics, because
> starting a sequence with 0 or 1 is more natural than starting a sequence
> with 1 or 2.

Starting an array with 1 is more natural for awk than starting with 0
or 2.

> If you start seps with 1 when fs == " ", you have yet the strange
> semantics that when a[n] (resp. $NF in the record case) is the last
> member of the a-array, then seps[n+2] is the last member of the seps
> array when there is trailing whitespace. But you would wish that the
> last index of seps is maximally by 1 greater than the last index of a.
> I think you see what I mean with consistent and natural.

Starting at 1, seps[n+1] is consistently the separator after the n-th
field.

> > Simplicity: there is no difference as far as I can tell in modifying
> > or deleting then recombining the fields with the separators between
> > 1->NF+1 or 0->NF. So, what is simpler about starting at 0 vs starting at
> > 1?
>
> If you consider the simple join-function provided by Prof. Collado
> for the case seps starts with index 0 when fs == " " and your more
> complicated join-function for the case seps starts with index 1 when
> fs == " ", then you see that the "seps[0]"-case provides more simplicity
> than the "seps[1]"-case.

My function was more complicated because it implemented a general join
of 2 arrays on a key field (the index), not because the indices
started at 1.

> But I can imagine a join-function in the "seps[1]"-case which is nearly
> as simple as Prof. Collado's join-function.

It could be exactly as simple if we wanted it to be exactly as
specialised.

>
> > It seems to me like starting at zero is the same simplicity as
> > starting at 1 but is less consistent with all other awk arrays. I also
> > think (and this is just based on me trying to recall what I've seen
> > over the years so I could be totally wrong) that we much more
> > frequently see meaningful leading white space than trailing white
> > space and people are used to using 1->NF to loop through fields so
> > starting the separators at 1 makes this loop:
>
> >    for (i=1;i<=NF;i++)
> >       printf "%s%s",seps[i],$i
>
> > much more likely to be what is actually wanted rather than this:
>
> >    for (i=0;i<=NF;i++)
> >       printf "%s%s",$i,seps[i]
>
> > Note the deliberate bug introduced above by forgetting the serious
> > semantic inconsistency between the relationship of seps[1]<->$1 and
> > that of seps[0]<->$0.
>
> $0 is a special field. I believe if we define the semantics clearly,
> awk-users will not make the error to associate $0 with seps[0] as people
> don't make the error to associate ARGV[0] with $0.

My point was that we can't say that it's OK to start the seps array at
zero because ARGV[0] and $0 exist because those are "special" and
their contents are dissimilar from the rest of the ARGV[N] and $N
arrays.

> For the "seps[0]"-semantics speaks also that two people assumed the same
> "seps[0]"-semantics independently. These two people are me with the
> gawk-patch and Prof. Collado with his semantics but only one person (Ed)
> with the "seps[1]"-semantics.

I don't think that's statistically significant. What's important is
what would the average uninformed user expect - an array that
sometimes starts at zero and sometimes starts at 1 and is generated
differently for different values of FS, or an array that always starts
at 1 like every other awk array and has the same semantics (sep[n] =
the FS before field[n], sep[n+1] = the FS after field[n]) regardless
of the value of FS?

   Ed
>
> >    Ed.
>
> --
> Steffen

0
Reply Ed 12/3/2008 1:23:35 PM

m.collado@domain.invalid wrote:
 > [...]
> The fact is that there is a third case:
> 
> 3) fs = "". The string is split into individual characters. For 
> efficiency, the sep array should not be generated (no need for it to 
> rebuild the original string).
> 

It is correct that for fs == "" the separator array doesn't need to be
generated. But in this case one needs only to call split with three
arguments instead of four. Hence you lose no efficiency although in my
patch the separator array is generated for fs == "" for simpler generic
usage of the fourth parameter of split.

 > [...]

-- 
Steffen
0
Reply Steffen 12/3/2008 1:30:52 PM

On Dec 3, 5:58 am, m.coll...@domain.invalid wrote:
> Ed Morton wrote:
>
> > On Dec 2, 4:04 pm, Steffen Schuler <schuler.stef...@googlemail.com>
> > wrote:
> >> Manuel Collado wrote:
> >> [...]
> >>  > As Arnold pointed out, it is important to clarify the desired behaviour
> >>  > before adding something new to gawk.
> >> [...]
> >> ... The
> >> semantics for seps in split(s, a, fs, seps) according to the attached
> >> patch is:
>
> >> 1.) fs == " ": if there is beginning whitespace, this is collected in
> >>      seps[0]; if there is trailing whitespace and the highest index of a
> >>      is n, then seps[n] contains this whitespace.
> >>      if there is no beginning or trailing whitespace, then 0 resp. n
> >>      aren't contained in seps as indices.
> >>      s = seps[0] a[1] seps[1] a[2] ... seps[n-1] a[n] seps[n]
> >>      where seps[0] and seps[n] are replaced with "" if seps does not
> >>      contain 0 resp. n.
>
> >> 2.) in all other cases: if the highest index of a is n, then
> >>      s = a[1] seps[1] a[2] seps[2] ... a[n-1] seps[n-1] a[n]
> >>      and seps contains only the indices 1 until n-1
>
> >> My opinion is, we should use the starting 0 index of seps in the case of
> >> fs == " ", because of the simpler and more consistent semantics.
>
> The fact is that there is a third case:
>
> 3) fs = "". The string is split into individual characters. For
> efficiency, the sep array should not be generated (no need for it to
> rebuild the original string).

There's no need to special-case this or any other value of "fs". Just
generate the array of NULL strings. If you don't then you can't write/
use a truly general-purpose join() function in this case.

> > I'm not trying to be snide, and I may just be missing the crucial
> > point but I just don't see how starting at zero is simpler or has more
> > consistent semantics.
>
> > Consistency: Yes, ARGV has a zero index but that's just the name of
> > the program which is very different from all of the actual arguments
> > that start at index 1, and $0 is very different from all of the actual
> > fields which start at $1, so having the array of field separators
> > start at zero with an actual field separator just like all the other
> > field separators is totally inconsistent with all other awk arrays
> > including those two. If you said that the field separator array
> > started at zero but the zero-th element contained all of the field
> > separators concatenated or the name of the function you called to
> > split the record, THEN I could see how it'd be consistent with $0 or
> > ARGV[0]. So what is starting the separator array at zero consistent
> > with?
>
> > Simplicity: there is no difference as far as I can tell in modifying
> > or deleting then recombining the fields with the separators between
> > 1->NF+1 or 0->NF. So, what is simpler about starting at 0 vs starting at
> > 1?
>
> > It seems to me like starting at zero is the same simplicity as
> > starting at 1 but is less consistent with all other awk arrays. I also
> > think (and this is just based on me trying to recall what I've seen
> > over the years so I could be totally wrong) that we much more
> > frequently see meaningful leading white space than trailing white
> > space and people are used to using 1->NF to loop through fields so
> > starting the separators at 1 makes this loop:
>
> Perhaps not a crucial point, but something worth considering: splitting
> a string into pieces can be seen from the two sides of the mirror. The
> splitting regexp can be seen as a specification for the separator, or
> for the fields themselves. I remember a request for this feature
> sometime ago.

Yes, I could see the value in doing that when you have a record that
contains some data strings that you can express as an RE and you don't
care about the rest of the text between the REs. Obviously we handle
that today by just adding separators before/after the REs using gsub()
or gensub() so it's not a big deal to handle today, is pretty rare
anyway and anyone who wants to use the modified split() to do it can
still do that with minimal additional code.

> So it should be very convenient to always generate a sep array that
> exactly contains the actual matches of the splitting regexp. This is
> consistent with an empty set of separators in the third special case of
> a null splitting regexp, for which there are no real separators. And the
> sep array will never contain a null value.
>
> This is against forging a null starting separator in some cases:

I'm not suggesting forging anything, just having an array that always
has these semantics:

   sep[n] = the separator before field n
   sep[n+1] = the separator after field n

Sometimes there's no separator before the 1st field or after the last
field so we consider that the NULL string and in that case the sep[]
value is that NULL string.

> > I think the sep array should always contain whatever separator
> > precedes the field of the same number, so if there are any fields,
> > then there is a first field so sep[1] WILL be populated, even if it's
> > with the NULL string. That will ensure that when we come to join()
> > them later that the join will succeed because sep[1] and fld[1] are
> > both populated. If we don't populate sep[] that way, then the
> > subsequent join() becomes less clear and simple.
>
> because in that case the sep array cannot be seen as a set of fields, if
> the splitting regexp is intended to match the fields and not the separators.

Right - the sep array is NOT a set of fields, it's a set of field
separators. That's clearly the intent since the 3rd argument of
split():

    split(string, array [, fieldsep])

is called "fieldsep", not "regexp" and it's treated as a "fieldsep"
rather than a "regexp" wrt having the value " " and if that argument
is missing it takes the value of FS.

    Ed.

> So my vote is for using sep[0] for the optional leading space (in fact
> not a separator) in the special " " splitting regexp.
>
> Regards.
> --
> Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

0
Reply Ed 12/3/2008 1:47:57 PM

pk wrote:

> another regex. At the same time, the sep[] array and its elements (if any)
> are preserved, since merely referencing a non-existing array element does
> not create it, so you can still check whether sep[0], sep[NF] or any
> element in between are defined, or do any other check.

No, see this example:

$ awk 'BEGIN{x=a["foo"]; if("foo" in a)print "OK"}'
OK

Maybe you were thinking of fields, where merely referencing a field does not
create it?

$ echo 'a b c' | awk  '{x=$10;print NF}'
3

-- 
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
0
Reply Dave 12/3/2008 1:58:45 PM

Ed Morton wrote:
> On Dec 2, 11:32 pm, Steffen Schuler <schuler.stef...@googlemail.com>
> wrote:
>> Ed Morton wrote:
> [...]  
>> If you start seps with 1 when fs == " ", you have yet the strange
>> semantics that when a[n] (resp. $NF in the record case) is the last
>> member of the a-array, then seps[n+2] is the last member of the seps
>> array when there is trailing whitespace. But you would wish that the
>> last index of seps is  maximally by 1 greater than the last index of a.
>> I think you see what I mean with consistent and natural.
> 
> Starting at 1, seps[n+1] is consistently the separator after the n-th
> field.
>
 > [...]

In this case I made indeed an error.

-- 
Steffen
0
Reply Steffen 12/3/2008 1:58:55 PM

On Dec 3, 7:14 am, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
<snip>
> function join2(a, b,   c)
> {
>    zip(a, b, c)
>    return join(c)
> }
<snip>
> function zip(a, b, c,   mini, maxi, i)
> {
>    del(c)
>    mini = minidx2(a, b) + 0
>    maxi = maxidx2(a, b) + 0
>    for (i = mini; i <= maxi; ++i) {
>      if (i in a)
>        c[i] = a[i]
>      if (i in b)
>        c[i] = c[i] b[i]
>    }
> }
<snip>

The semantics of join2() are wrong because within zip() you include
values that have an index in either array rather than just the values
that have an index in both arrays. zip() should be:

function zip(a, b, c,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   for (i = mini; i <= maxi; ++i) {
     if ((i in a) && (i in b))
       c[i] = a[i] b[i]
   }
}

Regards,

   Ed.
0
Reply Ed 12/3/2008 2:03:40 PM

Ed Morton wrote:
> On Dec 3, 7:14 am, Steffen Schuler <schuler.stef...@googlemail.com>
> wrote:
> <snip>
>> function join2(a, b,   c)
>> {
>>    zip(a, b, c)
>>    return join(c)
>>
>> }
> <snip>
>> function zip(a, b, c,   mini, maxi, i)
>> {
>>    del(c)
>>    mini = minidx2(a, b) + 0
>>    maxi = maxidx2(a, b) + 0
>>    for (i = mini; i <= maxi; ++i) {
>>      if (i in a)
>>        c[i] = a[i]
>>      if (i in b)
>>        c[i] = c[i] b[i]
>>    }
>>
>> }
> <snip>
> 
> The semantics of join2() are wrong because within zip() you include
> values that have an index in either array rather than just the values
> that have an index in both arrays. zip() should be:

Thanks Ed for correcting me. But zip should indeed concatenate values 
that have an index in either array or both. It doesn't have the same 
semantics as zip in Ruby or in whatever language you refer to.

> 
> function zip(a, b, c,   mini, maxi, i)
> {
>    del(c)
>    mini = minidx2(a, b) + 0
>    maxi = maxidx2(a, b) + 0
>    for (i = mini; i <= maxi; ++i) {
>      if ((i in a) && (i in b))
>        c[i] = a[i] b[i]
>    }
> }
> 
> Regards,
> 
>    Ed.


-- 
Steffen
0
Reply Steffen 12/3/2008 2:19:26 PM

On Dec 3, 8:19 am, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
> Ed Morton wrote:
> > On Dec 3, 7:14 am, Steffen Schuler <schuler.stef...@googlemail.com>
> > wrote:
> > <snip>
> >> function join2(a, b,   c)
> >> {
> >>    zip(a, b, c)
> >>    return join(c)
> >> }
> > <snip>
> >> function zip(a, b, c,   mini, maxi, i)
> >> {
> >>    del(c)
> >>    mini = minidx2(a, b) + 0
> >>    maxi = maxidx2(a, b) + 0
> >>    for (i = mini; i <= maxi; ++i) {
> >>      if (i in a)
> >>        c[i] = a[i]
> >>      if (i in b)
> >>        c[i] = c[i] b[i]
> >>    }
> >> }
> > <snip>
>
> > The semantics of join2() are wrong because within zip() you include
> > values that have an index in either array rather than just the values
> > that have an index in both arrays. zip() should be:
>
> Thanks Ed for correcting me. But zip should indeed concatenate values
> that have an index in either array or both. It doesn't have the same
> semantics as zip in Ruby or in whatever language you refer to.

Steffen - I'm not familiar with Ruby and I'm not referring to a
language, I'm referring to the inner join semantics of databases (and
UNIX tools). Look:

$ cat file1
a b c
d e
$ cat file2
a x y
k g
$ join file1 file2
a b c x y
$

Only the lines that match on the key field in both files get included
in output, not those that only exist in one file. For our join(), only
the array elements that match on the key indices in both arrays should
get included in the output, not those that only exist in one array.
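
A minimal sketch of the two semantics on awk arrays, assuming hand-filled arrays a[] and b[] with integer indices 1 to 3 (the data and the join_demo wrapper are hypothetical, not from the thread). The "inner" string keeps only indices present in both arrays; the "either" string mirrors the original zip() and keeps indices present in at least one.

```shell
# Hypothetical demo wrapper, for illustration only.
join_demo() {
    awk 'BEGIN {
        a[1] = "A1"; a[2] = "A2"      # index 3 missing from a
        b[2] = "B2"; b[3] = "B3"      # index 1 missing from b
        for (i = 1; i <= 3; i++) {
            # inner join: index must exist in both arrays
            if ((i in a) && (i in b))
                inner = inner a[i] b[i]
            # union: index exists in at least one array
            if ((i in a) || (i in b))
                either = either (i in a ? a[i] : "") (i in b ? b[i] : "")
        }
        print "inner:" inner
        print "either:" either
    }'
}

join_demo
# prints:
# inner:A2B2
# either:A1A2B2B3
```

The "in" tests keep the loop from creating a[3] or b[1] as a side effect of the comparison.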

Regards,

     Ed.
>
>
> > function zip(a, b, c,   mini, maxi, i)
> > {
> >    del(c)
> >    mini = minidx2(a, b) + 0
> >    maxi = maxidx2(a, b) + 0
> >    for (i = mini; i <= maxi; ++i) {
> >      if ((i in a) && (i in b))
> >        c[i] = a[i] b[i]
> >    }
> > }
>
> > Regards,
>
> >    Ed.


0
Reply Ed 12/3/2008 2:30:24 PM

On Wednesday 3 December 2008 14:58, Dave B wrote:

> pk wrote:
> 
>> another regex. At the same time, the sep[] array and its elements (if
>> any) are preserved, since merely referencing a non-existing array element
>> does not create it, so you can still check whether sep[0], sep[NF] or any
>> element in between are defined, or do any other check.
> 
> No, see this example:
> 
> $ awk 'BEGIN{x=a["foo"]; if("foo" in a)print "OK"}'
> OK
> 
> Maybe you were thinking of fields, where merely referencing a field does
> not create it?
> 
> $ echo 'a b c' | awk  '{x=$10;print NF}'
> 3

Doh! That's what I get for posting while speaking on the phone. And I even
use the bare a[$0] in scripts to create the element...
Apologies for the gross mistake, and thanks to you and Ed for correcting me.

0
Reply pk 12/3/2008 2:45:11 PM

Since this is now the longest thread I remember seeing in this NG,
I thought it might be good to summarise where we are and what options
we're discussing.

We have decided to add a 4th argument to split() in GNU awk so it will
now be:

     N = split(string, fieldsArray [, fieldSep [, sepsArray]])

where fieldsArray will be an array of fields as today and sepsArray
will be an array of separators associated with those fields. split()
will, as today, return the number of elements stored in fieldsArray (N
above).

So far so good. Now we're debating how to populate sepsArray and there
are 2 proposals on the table:

1) start at 0 if fieldSep is " " and there are leading separators,
otherwise start at 1:

    if (fieldSep is " " ) {
        if (there is leading separator) {
            sepsArray starts at 0
            sepsArray[0] = the leading separator
        } else {
            sepsArray starts at 1
            sepsArray[0] does not exist
        }
        if (there is trailing separator) {
            sepsArray ends at N
            sepsArray[N] = the trailing separator
        } else {
            sepsArray ends at N-1
            sepsArray[N] does not exist
        }
        if (there are leading && trailing separators) {
            length(sepsArray) = N+1
        } else if (there are leading || trailing separators) {
            length(sepsArray) = N
        } else {
            length(sepsArray) = N-1
        }
    } else {
        sepsArray starts at 1
        sepsArray ends at N-1
        length(sepsArray) = N-1
    }
    for (fld=2;fld<N;fld++) {   # note: only always true for fields 2->N-1.
        sepsArray[fld-1] = the separator before field fld
        sepsArray[fld] = the separator after field fld
    }


2) always start at one:

    sepsArray starts at 1
    sepsArray ends at N+1
    if ((fieldSep == " ") && (there is leading separator)) {
        sepsArray[1] = the leading separator
    } else {
        sepsArray[1] = ""
    }
    if ((fieldSep == " ") && (there is trailing separator)) {
        sepsArray[N+1] = the trailing separator
    } else {
        sepsArray[N+1] = ""
    }
    length(sepsArray) = N+1
    for (fld=1;fld<=N;fld++) {    # note: always true for all fields 1->N.
        sepsArray[fld] = the separator before field fld
        sepsArray[fld+1] = the separator after field fld
    }
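
A minimal sketch of proposal 2 in portable awk, limited to the default whitespace splitting (fieldSep " ") and using only match() and substr(). The names split_ws and keep_spacing are hypothetical; this is not the gawk patch under discussion, only an emulation of the "always start at one" layout applied to the problem that started the thread.

```shell
# Hypothetical wrapper: replace every X in field 3, keep all spacing.
keep_spacing() {
    awk '
    # split_ws is an illustrative helper, not a built-in.
    # seps[i] is the blank run before flds[i]; seps[n+1] is the trailing run.
    # Either of those may be the empty string.
    function split_ws(s, flds, seps,    n) {
        split("", flds); split("", seps)    # portable way to clear arrays
        n = 0
        while (match(s, /[^ \t]+/)) {
            n++
            seps[n] = substr(s, 1, RSTART - 1)
            flds[n] = substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
        }
        seps[n + 1] = s
        return n
    }
    {
        n = split_ws($0, f, sep)
        if (n >= 3)
            gsub(/X/, "+", f[3])
        line = ""
        for (i = 1; i <= n; i++)
            line = line sep[i] f[i]
        print line sep[n + 1]
    }'
}

printf 'aaa        bbb         cXcXc              ddd\n' | keep_spacing
# prints: aaa        bbb         c+c+c              ddd
```

Because every element from sep[1] to sep[n+1] is always populated, the rebuild loop needs no "in" guards and no special case for leading or trailing blanks.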

Regards,

    Ed.
0
Reply Ed 12/3/2008 3:23:11 PM

pk wrote:
> On Wednesday 3 December 2008 14:58, Dave B wrote:
> 
>> pk wrote:
>>
>>> another regex. At the same time, the sep[] array and its elements (if
>>> any) are preserved, since merely referencing a non-existing array element
>>> does not create it, so you can still check whether sep[0], sep[NF] or any
>>> element in between are defined, or do any other check.
>> No, see this example:
>>
>> $ awk 'BEGIN{x=a["foo"]; if("foo" in a)print "OK"}'
>> OK
>>
>> Maybe you were thinking of fields, where merely referencing a field does
>> not create it?
>>
>> $ echo 'a b c' | awk  '{x=$10;print NF}'
>> 3
> 
> Doh! That's what I get for posting while speaking on the phone. And I even
> use the bare a[$0] in scripts to create the element...
> Apologies for the gross mistake, and thanks to you and Ed for correcting me.
> 

No. Merely having a["foo"] in a script doesn't create it. It is the 
actual evaluation of a["foo"] that creates it. See

$ gawk 'BEGIN{if ("foo" in a) x=a["foo"]; \
               if("foo" in a)print "CREATED";
               else print "NOT CREATED"}'
NOT CREATED

-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
0
Reply Manuel 12/3/2008 3:24:22 PM

Ed Morton wrote:
> On Dec 3, 8:19 am, Steffen Schuler <schuler.stef...@googlemail.com>
> wrote:
>> Ed Morton wrote:
>>> On Dec 3, 7:14 am, Steffen Schuler <schuler.stef...@googlemail.com>
>>> wrote:
>>> <snip>
>>>> function join2(a, b,   c)
>>>> {
>>>>    zip(a, b, c)
>>>>    return join(c)
>>>> }
>>> <snip>
>>>> function zip(a, b, c,   mini, maxi, i)
>>>> {
>>>>    del(c)
>>>>    mini = minidx2(a, b) + 0
>>>>    maxi = maxidx2(a, b) + 0
>>>>    for (i = mini; i <= maxi; ++i) {
>>>>      if (i in a)
>>>>        c[i] = a[i]
>>>>      if (i in b)
>>>>        c[i] = c[i] b[i]
>>>>    }
>>>> }
>>> <snip>
>>> The semantics of join2() are wrong because within zip() you include
>>> values that have an index in either array rather than just the values
>>> that have an index in both arrays. zip() should be:
>> Thanks Ed for correcting me. But zip should indeed concatenate values
>> that have an index in either array or both. It doesn't have the same
>> semantics as zip in Ruby or in whatever language you refer to.
> 
> Steffen - I'm not familiar with Ruby and I'm not referring to a
> language, I'm referring to the inner join semantics of databases (and
> UNIX tools). Look:
> 
> $ cat file1
> a b c
> d e
> $ cat file2
> a x y
> k g
> $ join file1 file2
> a b c x y
> $
> 
> Only the lines that match on the key field in both files get included
> in output, not those that only exist in one file. For our join(), only
> the array elements that match on the key indices in both arrays should
> get included in the output, not those that only exist in one array.
> 
> Regards,
> 
>      Ed.

Ed thank you very much for explaining this to me. I understand now that 
I gave my function the wrong name. Perhaps "zipcat" is a better name 
than "join2". With the name join2 I wanted to express that I concatenate 
the result of zip as the function name join is used in contexts of gawk, 
Perl, and Ruby, to concatenate the strings of a list resp. array. But 
this leads to serious misunderstandings. Sorry for that. I should have 
detected the flaw by myself.

>>
>>> function zip(a, b, c,   mini, maxi, i)
>>> {
>>>    del(c)
>>>    mini = minidx2(a, b) + 0
>>>    maxi = maxidx2(a, b) + 0
>>>    for (i = mini; i <= maxi; ++i) {
>>>      if ((i in a) && (i in b))
>>>        c[i] = a[i] b[i]
>>>    }
>>> }
>>> Regards,
>>>    Ed.
> 
> 


-- 
Steffen
Reply Steffen 12/3/2008 3:36:00 PM

On Wednesday 3 December 2008 16:24, Manuel Collado wrote:

> No. Merely having a["foo"] in a script doesn't create it. It is the
> actual evaluation of a["foo"] that creates it. See

Ok, of course that's what I meant. Thanks for making that explicit.
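A quick way to see the difference, as a minimal sketch (the wrapper name count_elems is made up for this illustration): an "in" test leaves the array empty, while evaluating a["foo"] inside an expression creates the element.

```shell
# count_elems is a hypothetical wrapper, used only for this illustration
count_elems() {
    awk 'BEGIN {
        if ("foo" in a) found = 1          # membership test: creates nothing
        before = 0; for (k in a) before++
        if (a["foo"] == "") empty = 1      # evaluating a["foo"] creates it
        after = 0; for (k in a) after++
        print before, after
    }'
}
count_elems    # prints: 0 1
```

This is why the zip()/join() functions in this thread test with (i in a) before touching a[i].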

Reply pk 12/3/2008 3:43:31 PM

Steffen Schuler wrote:
> Ed Morton wrote:
>> On Dec 3, 8:19 am, Steffen Schuler <schuler.stef...@googlemail.com>
>> wrote:
>>> Ed Morton wrote:
>>>> On Dec 3, 7:14 am, Steffen Schuler <schuler.stef...@googlemail.com>
>>>> wrote:
>>>> <snip>
>>>>> function join2(a, b,   c)
>>>>> {
>>>>>    zip(a, b, c)
>>>>>    return join(c)
>>>>> }
>>>> <snip>
>>>>> function zip(a, b, c,   mini, maxi, i)
>>>>> {
>>>>>    del(c)
>>>>>    mini = minidx2(a, b) + 0
>>>>>    maxi = maxidx2(a, b) + 0
>>>>>    for (i = mini; i <= maxi; ++i) {
>>>>>      if (i in a)
>>>>>        c[i] = a[i]
>>>>>      if (i in b)
>>>>>        c[i] = c[i] b[i]
>>>>>    }
>>>>> }
>>>> <snip>
>>>> The semantics of join2() are wrong because within zip() you include
>>>> values that have an index in either array rather than just the values
>>>> that have an index in both arrays. zip() should be:
>>> Thanks Ed for correcting me. But zip should indeed concatenate values
>>> that have an index in either array or both. It doesn't have the same
>>> semantics as zip in Ruby or in whatever language you refer to.
>>
>> Steffen - I'm not familiar with Ruby and I'm not referring to a
>> language, I'm referring to the inner join semantics of databases (and
>> UNIX tools). Look:
>>
>> $ cat file1
>> a b c
>> d e
>> $ cat file2
>> a x y
>> k g
>> $ join file1 file2
>> a b c x y
>> $
>>
>> Only the lines that match on the key field in both files get included
>> in output, not those that only exist in one file. For our join(), only
>> the array elements that match on the key indices in both arrays should
>> get included in the output, not those that only exist in one array.
>>
>> Regards,
>>
>>      Ed.
> 
> Ed thank you very much for explaining this to me. I understand now that 
> I gave my function the wrong name. Perhaps "zipcat" is a better name 
> than "join2". With the name join2 I wanted to express that I concatenate 
> the result of zip as the function name join is used in contexts of gawk, 
> Perl, and Ruby, to concatenate the strings of a list resp. array. But 
> this leads to serious misunderstandings. Sorry for that. I should have 
> detected the flaw by myself.
> 

Renamed code:

---------------------------8<-----------------------------------
{
   split($0, field, FS, seps)

   # when indices in seps start minimally with 0
   zc = zipcat(field, seps)

   # when indices in seps start minimally with 1
   # zc = zipcat(seps, field)
}

$0 != zc {
   ++errcnt
   fmt = "%3d:%d:ERROR:%s:\n"
   printf fmt, NR, NF, $0
   printf fmt, NR, NF, zc
   mini = minidx(seps)
   maxi = maxidx(seps)
   for (i = mini; i <= maxi; ++i)
     if (i in seps)
       printf "%d:-->%s<--\n", i, seps[i]
}

END {
   printf "total %d errors\n", errcnt
   exit(errcnt > 0)
}

# some general and portable functions on arrays with (integer) indices;
# arrays may have any max or min (integer) index and
# may have holes in the range of indices
# the functions leave their input arrays unchanged; only the output
# array is modified
# the thrust of development was on clean design not on smallness
# or efficiency

function zipcat(a, b,   c)
{
   zip(a, b, c)
   return cat(c)
}

function cat(a,   mini, maxi, i, s)
{
   mini = minidx(a) + 0
   maxi = maxidx(a) + 0
   for (i = mini; i <= maxi; ++i)
     if (i in a)
       s = s a[i]
   return s
}

function zip(a, b, c,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   for (i = mini; i <= maxi; ++i) {
     if (i in a)
       c[i] = a[i]
     if (i in b)
       c[i] = c[i] b[i]
   }
}

function maxidx2(a, b)
{
   if (emptyarr(a))
     return maxidx(b)
   else if (emptyarr(b))
     return maxidx(a)
   else
     return max(maxidx(a), maxidx(b))
}

function minidx2(a, b)
{
   if (emptyarr(a))
     return minidx(b)
   else if (emptyarr(b))
     return minidx(a)
   else
     return min(minidx(a), minidx(b))
}

function maxidx(a,   i, m)
{
   if (emptyarr(a))
     m = ""
   else {
     m = choose(a) + 0
     for (i in a) {
       i += 0
       if (i > m)
         m = i
     }
   }
   return m
}

function minidx(a,   i, m)
{
   if (emptyarr(a))
     m = ""
   else {
     m = choose(a) + 0
     for (i in a) {
       i += 0
       if (i < m)
         m = i
     }
   }
   return m
}

function choose(a,   m)
{
   for (m in a)
     return m
   return ""
}

function emptyarr(a,  k)
{
   for (k in a)
     return 0
   return 1
}

# portable delete
function del(a)
{
   split("", a)
}

function max(x, y)
{
   return x > y ? x : y
}

function min(x, y)
{
   return x < y ? x : y
}
---------------------------8<-----------------------------------

-- 
Steffen
Reply Steffen 12/3/2008 4:11:34 PM

On Dec 3, 10:11 am, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:
<snip>

Steffen - In case you care: thinking about it a bit more now, what
you're actually doing in zip() (and so zipcat()) is an outer join so
if you really wanted to have a function do join()s then you could
implement it with the interface:

join(a,b,c) = inner join of a and b into c
join(a,b,c,"i") = inner join (the default)
join(a,b,c,"o") = full outer join
join(a,b,c,"l") = left outer join
join(a,b,c,"r") = right outer join

So instead of zip() you could have (check it, but you get the idea!):

function join(a, b, c, type,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   if (type == "l") {
     for (i = mini; i <= maxi; ++i) {
       if (i in a) {
         c[i] = a[i]
         if (i in b)
           c[i] = c[i] b[i]
       }
     }
   } else if (type == "r") {
     for (i = mini; i <= maxi; ++i) {
       if (i in b) {
         if (i in a)
           c[i] = a[i]
         c[i] = c[i] b[i]
       }
     }
   } else if (type == "o") {
     for (i = mini; i <= maxi; ++i) {
       if (i in a)
         c[i] = a[i]
       if (i in b)
         c[i] = c[i] b[i]
     }
   } else { # type is "i" or NULL
     for (i = mini; i <= maxi; ++i) {
       if ((i in a) && (i in b))
         c[i] = a[i] b[i]
     }
   }
}

and instead of zipcat():

function joincat(a, b, type,   c)
{
   join(a, b, c, type)
   return cat(c)
}

and then you'd have a couple of very reusable functions with clear
names and well-known semantics.
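
To make the four join types concrete, here is a small self-contained sketch. It is not the function above: it is a simplified variant that walks the indices 1..max instead of calling minidx2()/maxidx2(), and the names join_demo, jt and max are made up for the illustration. Array a has indices 1 and 2, array b has indices 2 and 3.

```shell
# join_demo is a hypothetical wrapper; $1 is the join type (i, o, l or r)
join_demo() {
    awk -v jt="$1" '
    # simplified join: c[i] = a[i] b[i] for the indices the join type keeps
    function join(a, b, c, type, max,   i, ina, inb, keep) {
        split("", c)                       # portable way to empty c
        for (i = 1; i <= max; i++) {
            ina = (i in a); inb = (i in b)
            if (type == "o")      keep = ina || inb
            else if (type == "l") keep = ina
            else if (type == "r") keep = inb
            else                  keep = ina && inb   # "i" or unset
            if (keep)
                c[i] = (ina ? a[i] : "") (inb ? b[i] : "")
        }
    }
    BEGIN {
        a[1] = "A1"; a[2] = "A2"
        b[2] = "B2"; b[3] = "B3"
        join(a, b, c, jt, 3)
        out = ""
        for (i = 1; i <= 3; i++)
            if (i in c)
                out = out (out == "" ? "" : " ") i "=" c[i]
        print out
    }'
}
join_demo i    # prints: 2=A2B2
join_demo o    # prints: 1=A1 2=A2B2 3=B3
join_demo l    # prints: 1=A1 2=A2B2
join_demo r    # prints: 2=A2B2 3=B3
```

Only index 2 exists in both arrays, so it is the only one that survives the inner join.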

Regards,

    Ed.
Reply Ed 12/3/2008 4:39:26 PM

Ed Morton wrote:
> Since this is now the longest thread I remember seeing in this NG,
> I thought it might be good to summarise where we are and what options
> we're discussing.

Thanks, Ed, for doing it.

> 
> We have decided to add a 4th argument to split() in GNU awk so it will
> now be:
> 
>      N = split(string, fieldsArray [, fieldSep [, sepsArray]])
> 
> where fieldsArray will be an array of fields as today and sepsArray
> will be an array of separators associated with those fields. split()
> will, as today, return the number of elements stored in fieldsArray (N
> above).

Before converting the debate into a religious war (ever heard of the 
origin of the terms "bigendian/littleendian"?), please let me 
reformulate the following a bit.

> 
> So far so good. Now we're debating how to populate sepsArray and there
> are 2 proposals on the table:
> 
> 1) start at 0 if fieldSep is " " and there are leading separators,
> otherwise start at 1:
> ... 
> 
> 2) always start at one:
> ...

It is perhaps a matter of terminology.

1) The term "separator" suggests something that always lies between 
fields, so the i-th separator follows the i-th field. Something 
preceding the first field or following the last field is not a true 
separator. These are optional items, not counted (and numbered) as 
separators. Of course, they can be forced to exist, with null values, in 
order to avoid optional items.

2) If we think about something that always precedes and follows each 
field, we could use a more general term, like "delimiter". The i-th 
delimiter precedes the i-th field, and there is always an (i+1)-th 
delimiter after the i-th field. In that case the appropriate mnemonics 
could be:

     N = split(string, fieldsArray [, fieldDelim [, delimArray]])

The root of the debate is probably the twofold behaviour of split(). If 
the splitting pattern is " ", then it is handled as a "delimiter". Other 
splitting patterns are handled as "separators". So the appropriate name 
of the third argument should be a neutral one:

1)  N = split(string, fieldsArray [, pattern [, sepsArray]])
2)  N = split(string, fieldsArray [, pattern [, delimArray]])

It seems that we must choose one concept (and term) or the other, and 
use it consistently. Both options have their oddities, among others:

1) Starting indexes at 0 is weird in AWK.

2) Weird common case:

       split( "xaxbxcx", fieldsArray, "x", delimArray )

gives null (forged, not matched) first and last fields and also null 
(forged, not matched) first and last delimiters before/after them, 
respectively.

At this point, I'm not trying to promote any particular option, just 
trying to clarify things.

Regards.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
Reply Manuel 12/3/2008 4:54:43 PM

On Dec 3, 10:54 am, Manuel Collado <m.coll...@lml.ls.fi.upm.es> wrote:
> Ed Morton wrote:
>
> > Since this is now the longest thread I remember seeing in this NG,
> > I thought it might be good to summarise where we are and what options
> > we're discussing.
>
> Thanks, Ed, for doing it.
>
>
>
> > We have decided to add a 4th argument to split() in GNU awk so it will
> > now be:
>
> >      N = split(string, fieldsArray [, fieldSep [, sepsArray]])
>
> > where fieldsArray will be an array of fields as today and sepsArray
> > will be an array of separators associated with those fields. split()
> > will, as today, return the number of elements stored in fieldsArray (N
> > above).
>
> Before converting the debate into a religious war (ever heard of the
> origin of the terms "bigendian/littleendian"?), please let me
> reformulate the following a bit.
>
>
>
> > So far so good. Now we're debating how to populate sepsArray and there
> > are 2 proposals on the table:
>
> > 1) start at 0 if fieldSep is " " and there are leading separators,
> > otherwise start at 1:
> > ...
>
> > 2) always start at one:
> > ...
>
> It is perhaps a matter of terminology.

Yes, but I think the terminology in awk already exists so, right or
wrong, we should just try to follow it as best we can.

> 1) The term "separator" suggests something that always lies between
> fields, so the i-th separator follows the i-th field. Something
> preceding the first field or following the last field is not a true
> separator. These are optional items, not counted (and numbered) as
> separators. Of course, they can be forced to exist, with null values, in
> order to avoid optional items.

In general plain text terms, yes, but in awk "field separator"
specifically means more than that.
>
> 2) If we think about something that always precedes and follows each
> field, we could use a more general term, like "delimiter". The i-th
> delimiter precedes the i-th field, and there is always an (i+1)-th
> delimiter after the i-th field. In that case the appropriate mnemonics
> could be:
>
>      N = split(string, fieldsArray [, fieldDelim [, delimArray]])

I don't think "delimiter" generally means there HAS to be i+1
delimiters but I may be thinking more of a "terminator".

> The root of the debate is probably the twofold behaviour of split(). If
> the splitting pattern is " ", then it is handled as a "delimiter". Other
> splitting patterns are handled as "separators". So the appropriate name
> of the third argument should be a neutral one:
>
> 1)  N = split(string, fieldsArray [, pattern [, sepsArray]])
> 2)  N = split(string, fieldsArray [, pattern [, delimArray]])
>
> It seems that we must choose one concept (and term) or the other, and
> use it consistently.

I get where you're coming from, but in awk that pattern that's both a
"separator" and a "delimiter" is already defined to be a "field
separator" as shown by the names of the FS variable and the "fieldsep"
argument for split().

> Both options have their weirdness. Among others
>
> 1) Starting indexes at 0 is weird in AWK.
>
> 2) Weird common case:
>
>        split( "xaxbxcx", fieldsArray, "x", delimArray )
>
> gives null (forged, not matched) first and last fields and also null
> (forged, not matched) first and last delimiters before/after them,
> respectively.

Yes, that is weird. With the "start at 1" approach, I'd have expected
the result to be:

seps[1]=x; flds[1]=a
seps[2]=x; flds[2]=b
seps[3]=x; flds[3]=c
seps[4]=x;

but of course, unless "x" is actually " ", that doesn't populate
flds[] the way split() would today and since flds[1] will have a NULL
string today, seps[1] would have to have a corresponding NULL string.

> At this point, I'm not trying to promote any particular option. Just
> trying to clarify the things.

Thanks. I just want to make sure what we do is as simple, consistent,
obvious, and flexible as possible but, obviously, I have a strong
opinion on what that means. It is a shame that there's this difference
in what "field separator" means between FS being " " vs all other
values but I think we can avoid complicating how the seps[] array is
populated either way.

    Ed.

> Regards.
> --
> Manuel Collado -http://lml.ls.fi.upm.es/~mcollado

Reply Ed 12/3/2008 5:53:49 PM

On Dec 3, 10:54 am, Manuel Collado <m.coll...@lml.ls.fi.upm.es> wrote:
<snip>
> 2) Weird common case:
>
>        split( "xaxbxcx", fieldsArray, "x", delimArray )
>
> gives null (forged, not matched) first and last fields and also null
> (forged, not matched) first and last delimiters before/after them,
> respectively.

I think it's worth pointing out that, as Manuel mentions above, when
there is no actual field before the first separator, split() today
creates flds[1] and sets it to NULL which is exactly what I'm
proposing we do when there is no actual separator before the first
field - create seps[1] and set it to NULL. So that further consistency
with existing awk behavior is another reason to use the "always start
at 1" approach.
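
The existing behaviour referred to here is easy to check with any POSIX awk. The following is only a small illustration (the wrapper name show_split is made up): it prints the field count returned by split() followed by each field in brackets.

```shell
# show_split is a hypothetical wrapper: $1 = string, $2 = separator
show_split() {
    awk -v s="$1" -v fs="$2" 'BEGIN {
        n = split(s, flds, fs)
        out = n
        for (i = 1; i <= n; i++)
            out = out " [" flds[i] "]"
        print out
    }'
}
show_split "xaxbxcx" "x"    # prints: 5 [] [a] [b] [c] []
show_split "  a b  " " "    # prints: 2 [a] [b]
```

With "x" as the separator, the empty strings before the first "x" and after the last one become fields 1 and 5; with " " the leading and trailing blanks are simply discarded.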

     Ed.
Reply Ed 12/3/2008 6:25:01 PM

Ed Morton wrote:
> [...]
> Steffen - In case you care: thinking about it a bit more now, what
> you're actually doing in zip() (and so zipcat()) is an outer join so
> if you really wanted to have a function do join()s then you could
> implement it with the interface:
> 
> join(a,b,c) = inner join of a and b into c
> join(a,b,c,"i") = inner join (the default)
> join(a,b,c,"o") = full outer join
> join(a,b,c,"l") = left outer join
> join(a,b,c,"r") = right outer join
> 
> So instead of zip() you could have (check it, but you get the idea!):
> 
> function join(a, b, c, type,   mini, maxi, i)
> {
>    del(c)
>    mini = minidx2(a, b) + 0
>    maxi = maxidx2(a, b) + 0
>    if (type == "l") {
>      for (i = mini; i <= maxi; ++i) {
>        if (i in a) {
>          c[i] = a[i]
>          if (i in b)
>            c[i] = c[i] b[i]
>        }
>      }
>    } else if (type == "r") {
>      for (i = mini; i <= maxi; ++i) {
>        if (i in b) {
>          if (i in a)
>            c[i] = a[i]
>          c[i] = c[i] b[i]
>        }
>      }
>    } else if (type == "o") {
>      for (i = mini; i <= maxi; ++i) {
>        if (i in a)
>          c[i] = a[i]
>        if (i in b)
>          c[i] = c[i] b[i]
>      }
>    } else { # type is "i" or NULL
>      for (i = mini; i <= maxi; ++i) {
>        if ((i in a) && (i in b))
>          c[i] = a[i] b[i]
>      }
>    }
> }
> 
> and instead of zipcat():
> 
> function joincat(a, b, type,   c)
> {
>    join(a, b, c, type)
>    return cat(c)
> }
> 
> and then you'd have a couple of very reusable functions with clear
> names and well-known semantics.
> 
> Regards,
> 
>     Ed.

Cool, thank you very much, Ed. You helped me enormously.
Now my functions are more general than before and have consistent and
clear names. I generalized them a bit further and used divide and conquer.

Generalized code with new function names:

----------------------------8<---------------------------------
{
   split($0, field, FS, seps)

   # when indices in seps start minimally with 0
   jc = joincat(field, seps, "o")

   # when indices in seps start minimally with 1
   # jc = joincat(seps, field, "o")
}

$0 != jc {
   ++errcnt
   fmt = "%3d:%d:ERROR:%s:\n"
   printf fmt, NR, NF, $0
   printf fmt, NR, NF, jc
   mini = minidx(seps) + 0
   maxi = maxidx(seps) + 0
   for (i = mini; i <= maxi; ++i)
     if (i in seps)
       printf "%d:-->%s<--\n", i, seps[i]
}

END {
   printf "total %d errors\n", errcnt
   exit(errcnt > 0)
}

# Some general and portable functions on arrays with (integer) indices:
#
# Arrays may have any max or min (integer) index and
# may have holes in the range of indices.
#
# The functions leave their input arrays unchanged; only the output
# array is modified.
#
# The thrust of development was on clean design not on smallness
# or efficiency.
#
# With very good ideas for join, joincat, and del from Ed Morton.

function joincat(a, b, type, sep1, sep2,   c)
{
    join(a, b, c, type, sep1)
    return cat(c, sep2)
}

function cat(a, sep,   mini, maxi, i, s)
{
   mini = minidx(a) + 0
   maxi = maxidx(a) + 0
   for (i = mini; i <= maxi; ++i) {
     if (i > mini)
       s = s sep
     if (i in a)
       s = s a[i]
   }
   return s
}

# some kinds of joins as in relational database systems
# join(a,b,c) = inner join of a and b into c
# join(a,b,c,"i") = inner join (the default)
# join(a,b,c,"o") = full outer join
# join(a,b,c,"l") = left outer join
# join(a,b,c,"r") = right outer join

function join(a, b, c, type, sep)
{
   # outer join in the first if branch to make it faster
   if (type == "o")
     outer_join(a, b, c, sep)
   else if (type == "l")
     left_join(a, b, c, sep)
   else if (type == "r")
     right_join(a, b, c, sep)
   else   # if "i" or undefined ("")
     inner_join(a, b, c, sep)
}

function left_join(a, b, c, sep,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   for (i = mini; i <= maxi; ++i) {
     if (i in a) {
       c[i] = a[i] sep
       if (i in b)
         c[i] = c[i] b[i]
     }
   }
}

function right_join(a, b, c, sep,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   for (i = mini; i <= maxi; ++i) {
     if (i in b) {
       if (i in a)
         c[i] = a[i]
       c[i] = c[i] sep b[i]
     }
   }
}

function outer_join(a, b, c, sep,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   for (i = mini; i <= maxi; ++i) {
     if (i in a)
       c[i] = a[i]
     if ((i in a) || (i in b))
       c[i] = c[i] sep
     if (i in b)
       c[i] = c[i] b[i]
   }
}

function inner_join(a, b, c, sep,   mini, maxi, i)
{
   del(c)
   mini = minidx2(a, b) + 0
   maxi = maxidx2(a, b) + 0
   for (i = mini; i <= maxi; ++i) {
     if ((i in a) && (i in b))
       c[i] = a[i] sep b[i]
   }
}

function maxidx2(a, b)
{
   if (emptyarr(a))
     return maxidx(b)
   else if (emptyarr(b))
     return maxidx(a)
   else
     return max(maxidx(a), maxidx(b))
}

function minidx2(a, b)
{
   if (emptyarr(a))
     return minidx(b)
   else if (emptyarr(b))
     return minidx(a)
   else
     return min(minidx(a), minidx(b))
}

function maxidx(a,   i, m)
{
   if (emptyarr(a))
     m = ""
   else {
     m = choose(a) + 0
     for (i in a) {
       i += 0
       if (i > m)
         m = i
     }
   }
   return m
}

function minidx(a,   i, m)
{
   if (emptyarr(a))
     m = ""
   else {
     m = choose(a) + 0
     for (i in a) {
       i += 0
       if (i < m)
         m = i
     }
   }
   return m
}

function choose(a,   k)
{
   for (k in a)
     return k
   return ""
}

function emptyarr(a,  k)
{
   for (k in a)
     return 0
   return 1
}

# portable delete
function del(a)
{
   split("", a)
}

function max(x, y)
{
   return x > y ? x : y
}

function min(x, y)
{
   return x < y ? x : y
}
----------------------------8<---------------------------------

-- 
Steffen
Reply Steffen 12/3/2008 6:57:18 PM

On Dec 3, 9:23 am, Ed Morton <mortons...@gmail.com> wrote:
> Since this is now the longest thread I remember seeing in this NG,
> I thought it might be good to summarise where we are and what options
> we're discussing.
>
> We have decided to add a 4th argument to split() in GNU awk so it will
> now be:
>
>      N = split(string, fieldsArray [, fieldSep [, sepsArray]])
>
> where fieldsArray will be an array of fields as today and sepsArray
> will be an array of separators associated with those fields. split()
> will, as today, return the number of elements stored in fieldsArray (N
> above).
>
> So far so good. Now we're debating how to populate sepsArray and there
> are 2 proposals on the table:
>
> 1) start at 0 if fieldSep is " " and there are leading separators,
> otherwise start at 1:
>
>     if (fieldSep is " " ) {
>         if (there is leading separator) {
>             sepsArray starts at 0
>             sepsArray[0] = the leading separator
>         } else {
>             sepsArray starts at 1
>             sepsArray[0] does not exist
>         }
>         if (there is trailing separator) {
>             sepsArray ends at N
>             sepsArray[N] = the trailing separator
>         } else {
>             sepsArray ends at N-1
>             sepsArray[N] does not exist
>         }
>         if (there are leading && trailing separators) {
>             length(sepsArray) = N+1
>         } else if (there are leading || trailing separators) {
>             length(sepsArray) = N
>         } else {
>             length(sepsArray) = N-1
>         }
>     } else {
>         sepsArray starts at 1
>         sepsArray ends at N-1
>         length(sepsArray) = N-1
>     }
>     for (fld=2;fld<N;fld++) { #note: only always true for fields 2->N-1.
>
>         sepsArray[fld-1] = the separator before field fld
>         sepsArray[fld] = the separator after field fld
>     }
>
> 2) always start at one:
>
>     sepsArray starts at 1
>     sepsArray ends at N+1
>     if ((fieldSep == " ") && (there is leading separator)) {
>         sepsArray[1] = the leading separator
>     } else {
>         sepsArray[1] = ""
>     }
>     if ((fieldSep == " ") && (there is trailing separator)) {
>         sepsArray[N+1] = the trailing separator
>     } else {
>         sepsArray[N+1] = ""
>     }
>     length(sepsArray) = N+1
>     for (fld=1;fld<=N;fld++) { # note: always true for all fields 1->N.
>
>         sepsArray[fld] = the separator before field fld
>         sepsArray[fld+1] = the separator after field fld
>     }
>
> Regards,
>
>     Ed.

To try to bring this to closure, here's what I think are the pros and
cons of each approach:

1) start at 0 if fieldSep is " " and there are leading separators,
otherwise start at 1:
   pros:
      a) if we define a "separator" as the pattern that comes between
fields and not before the first field or after the last field (i.e.
differently from the way a "field separator" is currently defined in
awk), then the first separator in the record is always in seps[1] and
the last separator is always in seps[NF-1] (but seps[0] and/or
seps[NF] may also be populated with "delimiters" that matched the given
pattern).
   cons:
      a) the seps[] array can start at zero which is inconsistent with
flds[] or any other awk arrays (except ARGV[] and $N which both
special-case the zero-th element).
      b) the seps[] array can start at zero or one, and can end at
either NF-1 or NF.
      c) length(seps) can be either length(flds)-1 or length(flds) or
length(flds)+1 so you can't tell how many elements will be in seps
from the split() return value (the length of flds).
      d) inconsistency: If no leading separator is present, seps[0]
will not exist but if no leading field is present split() will create
flds[1] and populate it with a NULL string.
      e) there aren't always seps[] values before or after every
field, so some join operations on the 2 arrays won't work (e.g. you
can't just delete an entry from flds[] and do an inner join to
recompile $0)
      f) if we define a "separator" as the pattern that comes between
fields and in the special case of FS = " " may also come before the
first field and/or after the last field (i.e. the same way as a "field
separator" is currently defined in awk), then the first separator in
the record is sometimes in seps[0] and sometimes in seps[1] and the
last separator is sometimes in seps[NF-1] and sometimes in seps[NF].

2) always start at one:
   pros:
      a) the seps[] array starts at one like all other awk arrays
      b) the seps[] array always starts at 1 and ends at NF+1
      c) length(seps) is length(flds)+1 so you always know the length
of seps[] is the split() return value plus 1.
      d) if no leading separator is present, split() will create
seps[1] and populate it with a NULL string just like it does in the same
situation for flds[1]
      e) there are always seps[] values before and after every field
so any join will succeed.
      f) if we define a "separator" as the pattern that comes between
fields and in the special case of FS = " " may also come before the
first field and/or after the last field (i.e. the same way as a "field
separator" is currently defined in awk), then the first separator in
the record is always in seps[1] and the last separator is always in
seps[NF+1].
   cons:
      a) if we define a "separator" as the pattern that comes between
fields and not before the first field or after the last field (i.e.
differently from the way a "field separator" is currently defined in
awk), then the first separator in the record is always in seps[2] and
the last separator is always in seps[NF] (and seps[1] and seps[NF+1]
will also be populated with either "delimiters" that matched the given
pattern or NULL strings).
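
For anyone who wants to try the "always start at one" layout before split() grows its 4th argument, here is a rough emulation in portable awk. It is only a sketch under stated assumptions: it handles the default FS = " " case (blanks and tabs) and nothing else, and the names keep_spacing and wsplit are invented for the example. seps[1] is whatever precedes the first field, seps[n+1] whatever follows field n, with empty strings where nothing is there.

```shell
# keep_spacing is a hypothetical wrapper: replaces every X in field 3 with +
# while leaving the original spacing of each input line untouched
keep_spacing() {
    awk '
    # wsplit emulates option 2 for whitespace splitting only
    function wsplit(s, flds, seps,   n) {
        split("", flds); split("", seps)
        n = 0
        seps[1] = ""
        if (match(s, /^[ \t]+/)) {             # leading separator, if any
            seps[1] = substr(s, 1, RLENGTH)
            s = substr(s, RLENGTH + 1)
        }
        while (match(s, /[^ \t]+/)) {          # s now starts with a field
            flds[++n] = substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
            seps[n + 1] = ""
            if (match(s, /^[ \t]+/)) {         # separator after field n
                seps[n + 1] = substr(s, 1, RLENGTH)
                s = substr(s, RLENGTH + 1)
            }
        }
        return n
    }
    {
        n = wsplit($0, flds, seps)
        if (n >= 3)
            gsub(/X/, "+", flds[3])
        out = seps[1]
        for (i = 1; i <= n; i++)
            out = out flds[i] seps[i + 1]
        print out
    }'
}
printf 'aaa   bbb  cXcXc     ddd\n' | keep_spacing
```

Because seps[] always has n+1 entries, the record can be put back together with one plain loop and no special cases for leading or trailing blanks.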

Feel free to add anything....

     Ed.
Reply Ed 12/4/2008 3:53:13 PM

On Nov 28, 11:39 am, pk <p...@pk.invalid> wrote:
> On Friday 28 November 2008 16:14, Barry Fishman wrote:
>
> > As others have shown a general solution in awk is quite hard.  However,
> > I suspect the user usually has some pre-knowledge of how the field
> > format should look:
>

If I'm following this correctly, the issue revolves around the way FS
deals with a regular expression that can match a null string. As was
mentioned earlier, "Note that the above function will loop forever
with, say, a string like "aXbXc" and a RE like "X*"." What possible
sense does it make to use such an RE as a field separator? In any case,
I think the following works for all patterns by treating an RE that
matches null as if no field separator is specified. Perhaps Arnold can
explain the algorithm used for handling REs in FS.

$ cat chgChars.awk
# An RE that can match the null string will behave as if there are no
# field separators in the line.

BEGIN { regex = "[0-9]+"; find = "X"; replace = "+"; fld = 3 }
{
        # clear both arrays so nothing is left over from a longer record
        split("", flds); split("", splits)
        buf = $0; i = 0
        while(match(buf,regex))
                if(RLENGTH > 0) {
                        flds[++i] = substr(buf,1,RSTART - 1)
                        splits[i] = substr(buf,RSTART,RLENGTH)
                        buf = substr(buf,RSTART + RLENGTH)
                } else break
        flds[++i] = buf
        gsub(find,replace,flds[fld])
        $0 = ""
        for(j=1;j<=i;j++)
                $0 = $0 flds[j] splits[j]
        print $0
}


$ echo "aXaa1786bXbb44338cXcXc194768ddd" | awk -f chgChars.awk
aXaa1786bXbb44338c+c+c194768ddd

Regards to all,

Jim Hart



0
Reply xsltmaster 12/6/2008 4:57:07 AM

On Dec 5, 10:57 pm, xsltmas...@gmail.com wrote:
> On Nov 28, 11:39 am, pk <p...@pk.invalid> wrote:
>
> > On Friday 28 November 2008 16:14, Barry Fishman wrote:
>
> > > As others have shown a general solution in awk is quite hard.  However,
> > > I suspect the user usually has some pre-knowledge of how the field
> > > format should look:
>
> If I'm following this correctly, the issue revolves around the way FS
> deals with a regular expression that can match a null string.

That's one issue, but the main issues were how to deal with an FS that
is " " when there are leading/trailing spaces, and how to deal with an
FS that is precisely the null string "". Now that we've got all that
figured out, including how to deal with REs that CAN match the NULL
string, the issues are whether "separator" number N comes before or
after the "field" number N and should "separators" be considered to
always exist before/after every field even if they're NULL. See the
new thread (http://tinyurl.com/6x6wfs) for a summary.

    Ed.
0
Reply Ed 12/6/2008 3:44:48 PM

On Saturday 6 December 2008 05:57, xsltmaster@gmail.com wrote:

> On Nov 28, 11:39 am, pk <p...@pk.invalid> wrote:
>> On Friday 28 November 2008 16:14, Barry Fishman wrote:
>>
>> > As others have shown a general solution in awk is quite hard.  However,
>> > I suspect the user usually has some pre-knowledge of how the field
>> > format should look:
>>
> 
> If I'm following this correctly, the issue revolves around the way FS
> deals with a regular expression that can match a null string. As was
> mentioned earlier, "Note that the above function will loop forever
> with, say, a string
> like "aXbXc" and a RE like "X*"." What possible sense does it make to
> use such an RE as a field separator? 

Admittedly, not much imho. Nonetheless, FS can be an arbitrary regular
expression, so the case where it can match an empty string has to be
handled somehow.

> In any case, I think the following works for all patterns by treating an
> RE that matches null as if no field separator is specified. Perhaps Arnold
> can explain the algorithm used for handling REs in FS.

From what I understand, currently an empty match for FS is simply ignored,
and the search for a match continues at the following character.
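
That rule can be sketched in portable awk roughly as follows. This illustrates the behaviour described above and is not gawk's actual implementation. The function name split_keep and its four arguments (separator RE, field number, search RE, replacement) are hypothetical, and anchored REs such as "^" are not handled.

```shell
# Hypothetical helper: split_keep <sep-RE> <field-no> <find-RE> <replacement>
# Splits each record on the RE, ignores empty matches, edits one field and
# rejoins the record with the original separators.
split_keep() {
    awk -v re="$1" -v fld="$2" -v find="$3" -v repl="$4" '
    {
        rest = $0; cur = ""; n = 0
        split("", flds); split("", seps)    # clear arrays from the previous record
        while (rest != "" && match(rest, re)) {
            if (RLENGTH == 0) {
                # Empty match: ignore it and resume one character further on.
                cur  = cur substr(rest, 1, RSTART)
                rest = substr(rest, RSTART + 1)
                continue
            }
            flds[++n] = cur substr(rest, 1, RSTART - 1)
            seps[n]   = substr(rest, RSTART, RLENGTH)
            rest      = substr(rest, RSTART + RLENGTH)
            cur       = ""
        }
        flds[++n] = cur rest                # last field has no separator after it
        seps[n]   = ""
        if (fld <= n) gsub(find, repl, flds[fld])
        out = ""
        for (i = 1; i <= n; i++) out = out flds[i] seps[i]
        print out
    }'
}
```

With this, echo "aXbXc" | split_keep 'X*' 2 b + terminates and treats the record as the three fields a, b and c rather than looping forever. Note that awk processes backslash escapes in -v assignments, so an RE containing backslashes needs them doubled.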

0
Reply pk 12/6/2008 4:30:12 PM

Ed Morton said the following on 11/25/2008 4:53 PM:
> Someone recently asked how to do this (all spaces in "file" are tabs):
> 
> $ cat file
> aaa     bbb     cXcXc   ddd
> $ awk 'BEGIN{FS=OFS="\t"}gsub(/X/,"+",$3)' file
> aaa     bbb     c+c+c   ddd
> 
> i.e. change all ocurrences of a character in a field without changing
> the spacing between fields. The above only "works" because the field
> separator is a specific character and we can set the OFS to the same
> character so that though the record gets recompiled it looks the same
> after as before wrt field spacing. If the FS had been a space
> character (or any RE), however, we can't use that same trick (all
> spaces in "file" are chains of blank chars):
> 
> $ cat file
> aaa        bbb         cXcXc              ddd
> $ awk 'BEGIN{FS=OFS=" "}gsub(/X/,"+",$3)' file
> aaa bbb c+c+c ddd
> $ awk 'BEGIN{FS=OFS=" +"}gsub(/X/,"+",$3)' file
> aaa +bbb +c+c+c +ddd
> 
> This problem of unwanted recompilation of the record comes up so
> often, I wonder if anyone has a suggestion on a simple way to work
> around it in general. Unless you use GNU awks gensub(), the best I can
> think of is to carve up the record into the part before the 3rd field,
> then the 3rd field, then the part after the 3rd field and stick it all
> back together afterward manipulating the 3rd field, but any
> implementation I can imagine for that is very lengthy considering how
> simple the problem sounds. Any suggestions?
> 
>       Ed.
> 
For your example (or the trivial case where FS is the default, i.e. [[:space:]]),
the following split can fill an array with the separators:
n=split($0,seps,"[^[:space:]]+")
seps[1] is any leading separator(s) and seps[n] is any trailing
separator(s). Example:
$ cat file
aaa        bbb         cXcXc              ddd
$ awk '{n=split($0,seps,"[^[:space:]]+");
  for(i=1;i<=n; ++i)print i"="seps[i]"=";}' file
1==
2=        =
3=         =
4=              =
5==

also:
$ awk '{n=split($0,seps,"[^[:space:]]+");gsub(/X/,"+",$3);
  for(i=1;i<=n; ++i)printf "%s%s", seps[i], $i; print ""}' file
aaa        bbb         c+c+c              ddd

Maybe someone more familiar with regexps could generalize the split:
n=split($0,seps,"(not current FS/OFS)+")
(if such syntax actually exists in regexps). I haven't researched whether
"(not current FS/OFS)+" can be expressed for an arbitrary RE.
-- 
(^\pop/^)
I'm lost... I've gone to look for myself.
If I should return before I get back, keep me here.
--
0
Reply pop 12/9/2008 4:00:58 AM

On Dec 8, 10:00 pm, pop <p_...@hotmail.com> wrote:
> Ed Morton said the following on 11/25/2008 4:53 PM:
>
>
>
> > Someone recently asked how to do this (all spaces in "file" are tabs):
>
> > $ cat file
> > aaa     bbb     cXcXc   ddd
> > $ awk 'BEGIN{FS=OFS="\t"}gsub(/X/,"+",$3)' file
> > aaa     bbb     c+c+c   ddd
>
> > i.e. change all ocurrences of a character in a field without changing
> > the spacing between fields. The above only "works" because the field
> > separator is a specific character and we can set the OFS to the same
> > character so that though the record gets recompiled it looks the same
> > after as before wrt field spacing. If the FS had been a space
> > character (or any RE), however, we can't use that same trick (all
> > spaces in "file" are chains of blank chars):
>
> > $ cat file
> > aaa        bbb         cXcXc              ddd
> > $ awk 'BEGIN{FS=OFS=" "}gsub(/X/,"+",$3)' file
> > aaa bbb c+c+c ddd
> > $ awk 'BEGIN{FS=OFS=" +"}gsub(/X/,"+",$3)' file
> > aaa +bbb +c+c+c +ddd
>
> > This problem of unwanted recompilation of the record comes up so
> > often, I wonder if anyone has a suggestion on a simple way to work
> > around it in general. Unless you use GNU awks gensub(), the best I can
> > think of is to carve up the record into the part before the 3rd field,
> > then the 3rd field, then the part after the 3rd field and stick it all
> > back together afterward manipulating the 3rd field, but any
> > implementation I can imagine for that is very lengthy considering how
> > simple the problem sounds. Any suggestions?
>
> >       Ed.
>
> for your example (or the trivial example where FS is def: [[:space:]])
> the following splt can fill an array with the seperators:
> n=split($0,seps,"[^[:space:]]+")

That wouldn't work in a POSIX awk for FS=" " and RS!="\n", nor would
that general approach work for FS="" or FS=<some general regexp>.

     Ed.
0
Reply Ed 12/9/2008 1:50:35 PM

Ed Morton said the following on 12/9/2008 7:50 AM:
> On Dec 8, 10:00 pm, pop <p_...@hotmail.com> wrote:
>> for your example (or the trivial example where FS is def: [[:space:]])
>> the following splt can fill an array with the seperators:
>> n=split($0,seps,"[^[:space:]]+")
> 
> That wouldn't work in a POSIX awk for FS=" " and RS!="\n", nor would
> that general approach work for FS="" or FS=<some general regexp>.
> 
>      Ed.

That's the conclusion I thought would be true. I had only used this for 
the default conditions and never had to think of a general solution 
(which is complicated to say the least)

-- 
(^\pop/^)
I'm lost... I've gone to look for myself.
If I should return before I get back, keep me here.
--
0
Reply pop 12/9/2008 2:14:14 PM
