Removing duplicates from within sections of a file


Hi,

I have a file which contains:

SECTION1
A
B
C
A
A
D
SECTION2
A
B
C
B
D
SECTION3
E
E
C
A
E

I would like to remove duplicates from within each section of the file.

So the above example would become:

SECTION1
A
B
C
D
SECTION2
A
B
C
D
SECTION3
E
C
A

The values within each section can be matched with a single regular
expression.

Please can you help me with this.

Regards,
Jonny

Reply Jonny 2/13/2005 3:43:43 PM

Jonny wrote:
> I would like to remove duplicates from within each section of the file.

It's similar to the solution Ed already posted. But you need to
remember the data already seen in each section, as in

if ((sect, $0) in mem) next
else { mem[sect, $0] = 1; print ... }

To save memory, you can instead remember the data for just one
section at a time, clearing and reusing the array at every new
section.
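
A fuller sketch of the composite-key idea follows. It assumes that every section header starts with the literal text SECTION at the beginning of the line, which the thread never pins down, and the helper name dedupe_by_key is made up for illustration.

```shell
# Sketch only: assumes each header line starts with "SECTION".
# dedupe_by_key is a hypothetical helper name, not from the thread.
dedupe_by_key() {
  awk '
    /^SECTION/ { sect = $0; print; next }   # remember the current section, pass the header through
    (sect, $0) in mem { next }              # value already seen in this section: skip it
    { mem[sect, $0] = 1; print }            # first occurrence: record it and print it
  '
}

printf '%s\n' SECTION1 A B A SECTION2 A B B | dedupe_by_key
```

With this variant "mem" keeps growing for the whole file. Clearing one array at each header instead keeps only a single section in memory.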

Janis
Reply Janis 2/13/2005 4:02:28 PM



Jonny wrote:
> I would like to remove duplicates from within each section of the file.

Try something like this:

gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file

	Ed.
Reply Ed 2/13/2005 4:03:38 PM

Hello,

this one works (tested):

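/^SECTION/ {
  print $1      # print the section header
  delete s      # forget the values seen in the previous section
  next
}

{
  if ($1 in s)  # already printed in this section
    next
  print $1
  s[$1]++
}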
Reply ISO 2/13/2005 4:09:37 PM

Thanks for your replies.

They all work on the example I gave.  I will need to test them on some
larger files later to see if there are any performance differences.

Your help is appreciated.

Regards,
Jonny
Reply Jonny 2/13/2005 4:33:30 PM

> gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file


I am new to awk, and am trying to learn it.  This code looks really
elegant, but I am having trouble understanding it.  Would you be able to
tell me in pseudocode what it is doing?

I do know that it is testing to see if the current line contains "SECTION",
and if so it initializes an array called "a".  I assume "a" contains the
words that you're trying to uniq.  But after that I get a little lost.  And
I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
say "($0 in a)"

I'd be grateful if you could straighten me out.


Reply Buck 2/14/2005 2:21:10 PM

Buck Turgidson wrote:
>>gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file

> I am new to awk, and am trying to learn it.  This code looks really
> elegant, but I am having trouble understanding it.  Would you be able to
> tell me in pseudocode what it is doing?

Elegant is an attribute that I would only use
when a solution is also easy to understand.

> I do know that it is testing to see if the current line contains "SECTION",
> and if so it initializes an array called "a".  I assume "a" contains the
> words that you're trying to uniq.  But after that I get a little lost.  And
> I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
> say "($0 in a)"

The gawk man page explains the "in" operator,
but it is hard to find the word "in":

       The  special operator in may be used in an if or while statement to see
       if an array has an index consisting of a particular value.
              if (val in array)
                   print array[val]
       If the array has multiple subscripts, use (i, j) in array.
       The in construct may also be used in a for loop to iterate over all the
       elements of an array.
       An  element  may  be  deleted from an array using the delete statement.
       The delete statement may also be used to delete the entire contents  of
       an array, just by specifying the array name without a subscript.
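
The operations in that excerpt can be tried out in one small program. This is only a sketch: the array names and the helper name in_demo are invented here, and it assumes an awk that accepts "delete array" for a whole array (gawk, mawk and the BWK awk do).

```shell
# Sketch only: in_demo, seen and grid are made-up names for illustration.
in_demo() {
  awk 'BEGIN {
    seen["A"] = 1
    grid[1, 2] = "x"

    if ("A" in seen)     print "A is an index of seen"
    if (!("B" in seen))  print "B is not an index of seen"
    if ((1, 2) in grid)  print "(1,2) is an index of grid"

    n = 0
    for (k in seen) n++            # for-in walks the indices
    print "seen has " n " element(s)"

    delete seen["A"]               # remove one element
    if (!("A" in seen))  print "A was deleted"

    delete grid                    # remove every element
    n = 0
    for (k in grid) n++
    print "grid has " n " element(s)"
  }'
}

in_demo
```

Note that testing "B" with "in" does not add it to the array, which is why "seen" still reports a single element.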
Reply ISO 2/14/2005 2:53:13 PM


Buck Turgidson wrote:
>>gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file
> 
> 
> 
> I am new to awk, and am trying to learn it.  This code looks really
> elegant, but I am having trouble understanding it.  Would you be able to
> tell me in pseudocode what it is doing?
> 
> I do know that it is testing to see if the current line contains "SECTION",
> and if so it initializes an array called "a".  I assume "a" contains the
> words that you're trying to uniq.  But after that I get a little lost.  And
> I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
> say "($0 in a)"
> 
> I'd be grateful if you could straighten me out.

Some white-space and a couple of comments would probably help:

gawk '
/SECTION/ { delete a }   # If the current line contains the word
                         # "SECTION", delete (i.e. re-initialize)
                         # array "a" if it exists.

!($0 in a) {             # IF the string representing the current
                         #    input record is NOT already an
                         #    index of array "a" THEN
    print                # print the current record
    a[$0] = ""           # and add it as an index to "a", just
                         # so we can use the "in" operator later
                         # to test whether a record has already
                         # been read.
}
' file

You can read up on the "in" operator at 
http://www.gnu.org/software/gawk/manual/gawk.html#Reference-to-Elements.
It is absolutely crucial to know about the subtleties of "in" (e.g. it 
tests for but doesn't create an array index, unlike a["str"]=="") to do 
awk programming.

The PDF file of the whole gawk user guide is at 
http://www.gnu.org/software/gawk/manual/gawk.pdf and is well worth 
printing out and reading.

	Ed.
Reply Ed 2/14/2005 2:55:03 PM

Excellent.  Very helpful - thanks.


Reply Buck 2/14/2005 3:01:03 PM

Ed Morton wrote:

> The PDF file of the whole gawk user guide is at 
> http://www.gnu.org/software/gawk/manual/gawk.pdf and is well worth 
> printing out and reading.

Well, such large manuals are sometimes
easier to handle in this format:

http://www.oreilly.com/catalog/awkprog3/index.html
Reply ISO 2/14/2005 3:04:41 PM

Buck Turgidson <jc_va@hotmail.com> wrote:
>> gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file
> 
> 
> I am new to awk, and am trying to learn it.  This code looks really
> elegant, but I am having trouble understanding it.  Would you be able to
> tell me in pseudocode what it is doing?
> 
> I do know that it is testing to see if the current line contains "SECTION",
> and if so it initializes an array called "a".  I assume "a" contains the
> words that you're trying to uniq.  But after that I get a little lost.  And
> I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
> say "($0 in a)"

Look under Arrays.

"$0 in a" is true if $0 is an index of array a.
So the script does this:
For every line containing SECTION, it deletes the stored array.
For every line that is not already an index of the array, it prints
that line, then adds it to the array.

So, given 
SECTION
a
b
c
c
SECTION
c
d
e
a

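It will print
SECTION
a
b
c
SECTION
c
d
e
a

(The SECTION lines are printed too: the array has just been emptied, so
the header itself is not yet an index.)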
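
A quick way to check this walkthrough is to feed that sample to the one-liner. The sketch below uses plain awk instead of gawk, on the assumption that your awk accepts "delete a" for a whole array, and wraps the command in a made-up helper named dedupe_sections.

```shell
# Sketch only: dedupe_sections is a hypothetical wrapper around the
# one-liner posted earlier in the thread.
dedupe_sections() {
  awk '/SECTION/{delete a} !($0 in a){print; a[$0]=""}'
}

printf '%s\n' SECTION a b c c SECTION c d e a | dedupe_sections
```

Both SECTION lines come through, the repeated "c" in the first section is dropped, and "c" and "a" appear again in the second section because the array was cleared at the header.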
Reply Ian 2/14/2005 3:09:35 PM


Jürgen Kahrs wrote:
> Ed Morton wrote:
> 
>> The PDF file of the whole gawk user guide is at 
>> http://www.gnu.org/software/gawk/manual/gawk.pdf and is well worth 
>> printing out and reading.
> 
> 
> Well, such large manuals are sometimes
> easier to handle in this format:
> 
> http://www.oreilly.com/catalog/awkprog3/index.html

Cute. The book's 3 years out of date though wrt the current version of 
gawk and the on-line document.

	Ed
Reply Ed 2/14/2005 3:28:42 PM

In article <fN2dnfL4Sso2Xo3fRVn-jQ@comcast.com>,
Ed Morton  <morton@lsupcaemnt.com> wrote:
>Jürgen Kahrs wrote:
>> Ed Morton wrote:
>> 
>>> The PDF file of the whole gawk user guide is at 
>>> http://www.gnu.org/software/gawk/manual/gawk.pdf and is well worth 
>>> printing out and reading.
>> 
>> Well, such large manuals are sometimes
>> easier to handle in this format:
>> 
>> http://www.oreilly.com/catalog/awkprog3/index.html
>
>Cute. The book's 3 years out of date though wrt the current version of 
>gawk and the on-line document.
>
>	Ed

It's not that out-of-date.  Very very little has been added to the language
since that book was published.  It's worth getting (a) because it's in
a nicer format, and (b) because it puts a few $$ in my pocket.

Thanks,

Arnold
-- 
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.	arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381	Fax: +1 206 350 8765
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL
Reply arnold 2/14/2005 3:50:11 PM

"Ed Morton" <morton@lsupcaemnt.com> wrote in message
news:EaudnRltt4RVJo3fRVn-1A@comcast.com...
>
>
> Buck Turgidson wrote:
> >>gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file
> >
> >

>
> You can read up on the "in" operator at
> http://www.gnu.org/software/gawk/manual/gawk.html#Reference-to-Elements.
> It is absolutely crucial to know about the subtleties of "in" (e.g. it
> tests for but doesn't create an array index, unlike a["str"]=="") to do
> awk programming.



Did you mean to use the double equals above in your comment (unlike
a["str"]=="")), or is that just a typo?


Reply Buck 2/14/2005 4:06:04 PM


Aharon Robbins wrote:
> In article <fN2dnfL4Sso2Xo3fRVn-jQ@comcast.com>,
> Ed Morton  <morton@lsupcaemnt.com> wrote:
> 
>>Jürgen Kahrs wrote:
>>
>>>Ed Morton wrote:
>>>
>>>
>>>>The PDF file of the whole gawk user guide is at 
>>>>http://www.gnu.org/software/gawk/manual/gawk.pdf and is well worth 
>>>>printing out and reading.
>>>
>>>Well, such large manuals are sometimes
>>>easier to handle in this format:
>>>
>>>http://www.oreilly.com/catalog/awkprog3/index.html
>>
>>Cute. The book's 3 years out of date though wrt the current version of 
>>gawk and the on-line document.
>>
>>	Ed
> 
> 
> It's not that out-of-date. 

The version Jurgen referred to is dated May 2001, whereas the on-line 
one is June 2004 which is just over 3 years. Is there a newer release of 
the book?

>  Very very little has been added to the language
> since that book was published.  It's worth getting (a) because it's in
> a nicer format, and (b) because it puts a few $$ in my pocket.

Do you have a small document on-line somewhere that just describes the
updates since the book was published? I'd absolutely recommend people
buy the book if that document of deltas was also available, as I
appreciate all the work that you've put into the tool and its
documentation.

	Ed.
Reply Ed 2/14/2005 4:09:12 PM

In article <dsu6e2-lh4.ln1@turf.turgidson.com>,
Buck Turgidson <jc_va@hotmail.com> wrote:
....
>> It is absolutely crucial to know about the subtleties of "in" (e.g. it
>> tests for but doesn't create an array index, unlike a["str"]=="") to do
>> awk programming.
>
>
>
>Did you mean to use the double equals above in your comment (unlike
>a["str"]=="")), or is that just a typo?

Yes, he does.  No, it is not a typo.

Reply gazelle 2/14/2005 4:13:53 PM

In article <4210c8b3$1@news.012.net.il>,
Aharon Robbins <arnold@skeeve.com> wrote:
....
>It's not that out-of-date.  Very very little has been added to the language
>since that book was published.  It's worth getting (a) because it's in
>a nicer format, and (b) because it puts a few $$ in my pocket.

(I assume the book we are talking about is EAP.  If not, please disregard
these comments)

It's a good book.  There are obscurities of the language contained in the
book that you will never, ever get from reading documentation in electronic
form (it just being a fact of life that whenever you read documentation in
electronic form, you are always in "scan" mode).

This fact (that physical documentation on paper allows you to absorb things
that you can never get from browsing it in electronic form), is far more
important than any incremental changes that have occurred over the last
3 years.

In fact, besides "switch", what is there?

Reply gazelle 2/14/2005 4:18:22 PM


Buck Turgidson wrote:

> "Ed Morton" <morton@lsupcaemnt.com> wrote in message
> news:EaudnRltt4RVJo3fRVn-1A@comcast.com...
> 
>>
>>Buck Turgidson wrote:
>>
>>>>gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file
>>>
>>>
> 
>>You can read up on the "in" operator at
>>http://www.gnu.org/software/gawk/manual/gawk.html#Reference-to-Elements.
>>It is absolutely crucial to know about the subtleties of "in" (e.g. it
>>tests for but doesn't create an array index, unlike a["str"]=="") to do
>>awk programming.
> 
> 
> 
> 
> Did you mean to use the double equals above in your comment (unlike
> a["str"]=="")), or is that just a typo?
> 

Yes, I did mean to use them. This:

	if ("str" in a) print

will only print if a["str"] exists, whereas this:

	if (a["str"] != "") print

will only print if a["str"] holds something other than an empty
string, and it will additionally add an entry to "a" indexed by
"str", e.g.:

PS1> gawk 'BEGIN{if ("str" in a) print; for (i in a) print i; exit}'
PS1> gawk 'BEGIN{if (a["str"] != "") print; for (i in a) print i; exit}'
str

Regards,

	Ed.
Reply Ed 2/14/2005 4:22:31 PM

> >
> >Did you mean to use the double equals above in your comment (unlike
> >a["str"]=="")), or is that just a typo?
>
> Yes, he does.  No, it is not a typo.
>

I asked because it differs from his original code.  I thought he was
referencing the `a[$0]=""` part below where he uses the assignment operator.


gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file


Reply Buck 2/14/2005 4:24:40 PM

Thanks for your patience in educating me.


Reply Buck 2/14/2005 4:30:10 PM

In article <9vv6e2-gj4.ln1@turf.turgidson.com>,
Buck Turgidson <jc_va@hotmail.com> wrote:
>> >
>> >Did you mean to use the double equals above in your comment (unlike
>> >a["str"]=="")), or is that just a typo?
>>
>> Yes, he does.  No, it is not a typo.
>>
>
>I asked because it differs from his original code.  I thought he was
>referencing the `a[$0]=""` part below where he uses the assignment operator.
>
>
>gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file

Note, incidentally, that the more idiomatic way to express this is:

	a[$0]++

(invoking {print} via the default action)

Reply gazelle 2/14/2005 4:30:30 PM

Ed Morton wrote:

> Do you have a small document on-line somewhere that just describes the 
> updates since the book was published? I'd absolutely recommend people 
> buy the book if that document of deltas was also available as I 
> appreciate all the work that you've put into the tool and its
> documentation.

There is no official delta-doc for the book AFAIK.
But the NEWS file from the source distribution might
interest you. I have appended the relevant
part.

Changes from 3.1.4 to 3.1.5
---------------------------

1. The random() suite has been updated to a current FreeBSD version, which
   works on systems with > 32-bit ints.

2. A new option, `--exec' has been added. It's like -f but ends option
   processing.  It also disables `x=y' variable assignments, but not -v.
   It's needed mainly for CGI scripts, so that source code can't be
   passed in as part of the URL.

3. dfa.[ch] have been synced with GNU grep development.  This also fixes
   multiple regex matching problems in multibyte locales.

4. Updated to Automake 1.9.4.

5. Updated to Bison 2.0.

6. The getopt* and regex* files were synchronized with current GLIBC CVS.
   See the ChangeLog for the versions and minor edits made.

7. `configure --disable-nls' now disables just gawk's own translations.
   Gawk continues to work with the locale's numeric formatting.  This
   includes a bug fix in handling the printf ' flag (e.g., %'d).

8. Gawk is now multibyte aware.  This means that index(), length(),
   substr() and match() all work in terms of characters, not bytes.

9. Gawk is now smarter about parsing numeric constants in corner cases.

10. Not closing open redirections no longer causes gawk to exit non-zero.

xx. Various minor bugs fixed. See the ChangeLog for the details.

Changes from 3.1.3 to 3.1.4
---------------------------

1. Gawk now supports the POSIX %F format, falling back to %f if the local
   system printf doesn't handle it.

2. Gawk now supports the ' flag in printf. E.g., %'d in a locale with thousands
   separators includes the thousands separator in the value, e.g. 12,345.

   This has one problem; the ' flag is next to impossible to use on the
   command line, without major quoting games.  Oh well, TANSTAAFL.

3. The dfa code has been reinstated; the performance degradation was
   just too awful.  Sigh.  (For fun, use `export GAWK_NO_DFA=1' to
   see the difference.)

4. The special case `x = x y' is now recognized in the grammar, and gawk
   now uses `realloc' to append the new value to the end of the existing
   one.  This can speed up the common case of appending onto a string.

5. The dfa code was upgraded with most of the fixes from grep 2.5.1, and
   the regex code was upgraded with GLIBC as of mid-January 2004.  The regex
   code is faster than it was, but still not as fast as the dfa code, so
   the dfa code stays in.  The getopt code was also synced to current GLIBC.

6. Support code upgraded to Automake 1.8.5, Autoconf 2.59, and gettext 0.14.1.

7. When --posix is in effect, sub/gsub now follow the 2001 POSIX behavior.
   Yippee.  This is even documented in the manual.

8. Gawk will now recover children that have died (input pipelines, two-way
   pipes), upon detecting EOF from them, thus avoiding filling
   up the process table.  Open file descriptors are not recovered
   (unfortunately), since that could break awk semantics.  See the
   ChangeLog and the source code for the details.

9. Handling of numbers like `0,1' in non-American locales ought to
   work correctly now.

10. IGNORECASE is now locale-aware for characters with values above 128.
    The dfa matcher is now used for IGNORECASE matches too.

11. Dynamic function loading is better. The documentation has been improved
    and some new APIs for use by dynamic functions have been added.

12. Gawk now has a fighting chance of working on older systems,
    a la SunOS 4.1.x.

13. Issues with multibyte support on HP-UX are now resolved. `configure' now
    disables such support there, since it's not up to what gawk needs.

14. There are now even more tests in the test suite.

15. Various bugs fixed; see ChangeLog for the details.

Changes from 3.1.2 to 3.1.3
---------------------------

1. Gawk now follows POSIX in handling of local numeric formats for
   input, output and number/string conversions.

2. Multibyte detection improved.  See README_d/README.multibyte for more
   info about multibyte locales.

3. Handling of `close' made more POSIX-compliant for POSIXLY_CORRECT,
   see the documentation.

4. The record reading code was redone, again.  This time it's much
   better. Really!

5. For RS = "\n" and RS = "", gawk now only sets RT when it has changed.
   This provides considerable performance improvement.

6. `match' now sets all the subscripts in the third argument array
   correctly, even if not all subexpressions matched.

7. Updated to Automake 1.7.5.  configure.in renamed configure.ac.

8. C-style switch statements are available, but must be enabled at
   compile time via `configure --enable-switch'.  For 3.2 they'll be
   enabled by default. Thanks to Michael Benzinger for the initial
   code.

9. %c now always prints no more than one character, whatever
   precision is provided.

10. strtonum(<number>) now works again.

11. Gawk is now much better about scalar/array typing of global
    uninitialized variables passed as parameters. Once the parameter
    is then used one way or the other, the global var's type is
    adjusted accordingly.  Thanks to Stepan Kasal for the original
    (considerable) changes.

12. Dynamic function loading under Windows32 should now be possible. See
    README_d/README.pcdynamic. Thanks to Patrick T.J. McPhee for the changes.

13. Updated to gettext 0.12.1.

14. Gawk now follows historical practice and POSIX for the return
    value of `rand': It's now  0 <= N < 1.

Changes from 3.1.1 to 3.1.2
---------------------------

1. Loops of the form:

	for (iggy in foo)
		next

   no longer leak memory.

2. gawk -v FIELDWIDTHS="..." now sets PROCINFO["FS"] correctly.

3. All builtin operations and functions should now fully evaluate their
   arguments so that side effects take place correctly.

4. Fixed a logic bug in gsub/gensub for matches to null strings that occurred
   later in the string after a nonnull match.

5. getgroups code now works on Ultrix again.

6. Completely new version of the full GNU regex engine now in place.

7. Argument parsing and variable assignment has been cleaned up.

8. An I/O bug on HP-UX has been documented and worked around. See
   README_d/README.hpux.

9. awklib/grcat should now compile correctly.

10. Updated to automake 1.7.3, autoconf 2.57 and gettext 0.11.5 ; thanks to
    Paul Eggert for the initial automake and autoconf work.

11. As a result of #6, removed the use of the dfa code from GNU grep.

12. It is now possible to use ptys for |& two-way pipes instead of
    pipes.  The basic plumbing for this was provided by Paolo Bonzini.
    To make this happen:

        command = "unix command etc"
        PROCINFO[command, "pty"] = 1

        print ... |& command
        command |& getline stuff

    In other words, set the element in PROCINFO *before* opening the
    two-way pipe, and then gawk will use ptys instead of pipes.

    On systems without ptys or where all the ptys are in use, gawk
    will fall back to using plain pipes.

13. Fixed a regex matching across buffer boundaries bug, with a
    heuristic.  See io.c:rsre_get_a_record.

14. Profiling no longer dumps core if there are extension functions in place.

15. Grammar and scanner cleaned up, courtesy of Stepan Kasal, to hopefully
    once and for all fix the `/=' operator vs. `/=.../' regex ambiguity.
    Lots of other grammar simplifications applied, as well.

16. BINMODE should work now on more Windows ports.

17. Updated to bison 1.875.  Includes fix to bisonfix.sed script.

18. The NODE structure is now 20% (8 bytes) smaller (on x86, anyway), which
    should help conserve memory.

19. Builds not in the source directory should work again.

20. Arrays now use 2 NODE's per element instead of three. Combined with
    #18, (on the x86) this reduces the overhead from 120 bytes per element
    to just 64 bytes: almost a 50% improvement.

21. Programs that make heavy use of changing IGNORECASE should now be
    much faster, particularly if using a regular expression for FS or RS.
    IGNORECASE now correctly affects RS regex record splitting, as well.

22. IGNORECASE no longer affects single-character field splitting (FS = "c"),
    or single-character record splitting (RS = "c").

    This cleans up some weird behavior, and makes gawk better match the
    documentation, which says it only affects regex-based field splitting
    and record splitting.

    The documentation on this was improved, too.

23. The framework in test/ has been simplified, making it much easier to
    add new tests while keeping the size of Makefile.am reasonable. Thanks
    for this to Stepan Kasal.

24. --lint=invalid causes lint warnings only about stuff that's actually
    invalid.  This needs additional work.

25. More translations.

26. The `get_a_record' routine has been revamped (currently by splitting it
    into three variants).  This should improve long-term maintainability.

27. `match' now adds more entries to 3rd array arg:
	match("the big dog", /([a-z]+) ([a-z]+) ([a-z]+)/, data)
    fills in variables:
    	data[1, "start"], data[1, "length"], and so on.

28. New `asorti' function with same interface as `asort', but sorts indices
    instead of values.  

29. Documentation updated to FDL 1.2.

30. New `configure' option --disable-lint at compile time disables lint
    checking.  With GCC dead-code-elimination, cuts almost 200K off the
    executable size on GNU/Linux x86.  Presumably speeds up runtime.

    Using this will cause some of the tests in the test suite to fail.
    This option may be removed at a later date.

31. Various minor cleanups, see the ChangeLog for details.

Changes from 3.1.0 to 3.1.1
---------------------------

1. Six new translations.

2. Having more than 4 different values for OFMT and/or CONVFMT now works.

3. The handling of dynamic regexes is now more sane, esp. w.r.t.
   the profiling code.  The profiling code has been fixed in several
   places.

4. The return value of index("", "") is now 1.

5. Gawk should no longer close fd 0 in child processes.

6. Fixed test for strtod semantics and regenerated configure.

7. Gawk can now be built with byacc; an accidental bison dependency was
   removed.

8. `yyerror' will no longer dump core on long source lines.

9. Gawk now correctly queries getgroups(2) to figure out how many groups
   the process has.

10. New configure option to force use of included strftime, e.g. on
    Solaris systems. See `./configure --help' for the details. Replaced
    the included strftime.c with the one from textutils.

11. OS/2 port has been updated.

12. Multi-byte character support has been added, courtesy of IBM Japan.

13. The `for (iggy in foo) delete foo[iggy]' -> `delete foo' optimisation
    now works.

14. Upgraded to gettext 0.11.2 and automake 1.5.

15. Full gettext compatibility (new dcngettext function).

16. The O'Reilly copyedits and indexing changes for the documentation have
    been folded into the texinfo version of the manuals.

17. A humongously long value for the AWKPATH environment variable will no
    longer dump core.

18. Configuration / Installation issues have been straightened out in
    Makefile.am.

Reply ISO 2/14/2005 5:20:38 PM


Kenny McCormack wrote:

> In article <9vv6e2-gj4.ln1@turf.turgidson.com>,
> Buck Turgidson <jc_va@hotmail.com> wrote:
> 
>>>>Did you mean to use the double equals above in your comment (unlike
>>>>a["str"]=="")), or is that just a typo?
>>>
>>>Yes, he does.  No, it is not a typo.
>>>
>>
>>I asked because it differs from his original code.  I thought he was
>>referencing the `a[$0]=""` part below where he uses the assignment operator.
>>
>>
>>gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file
> 
> 
> Note, incidentally, that the more idiomatic way to express this is:
> 
> 	a[$0]++
> 
> (invoking {print} via the default action)

Yup, you're right. This is the right way to do it:

gawk '/SECTION/{delete a}!a[$0]++' file
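
Why this works: a[$0]++ yields the old count, which is zero the first time a line is seen, so !a[$0]++ is true only for a first occurrence, and a pattern with no action prints the line. The sketch below runs it on the sample from the original post. It uses plain awk on the assumption that "delete a" is supported, and the helper name dedupe_idiom is invented for illustration.

```shell
# Sketch only: dedupe_idiom is a hypothetical wrapper around the
# shortened one-liner above.
dedupe_idiom() {
  awk '/SECTION/ { delete a }  !a[$0]++'
}

printf '%s\n' SECTION1 A B C A A D SECTION2 A B C B D SECTION3 E E C A E |
  dedupe_idiom
```

The result is the listing the original poster asked for: each section keeps the first occurrence of every value, in input order.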

Thanks for catching it,

	Ed.
Reply Ed 2/14/2005 5:26:15 PM

In article <7oedncumGuO0UI3fRVn-2w@comcast.com>,
Ed Morton  <morton@lsupcaemnt.com> wrote:
>>>Cute. The book's 3 years out of date though wrt the current version of 
>>>gawk and the on-line document.
>>>
>>>	Ed
>> 
>> It's not that out-of-date. 
>
>The version Jurgen referred to is dated May 2001, whereas the on-line 
>one is June 2004 which is just over 3 years. Is there a newer release of 
>the book?

No.  It's not necessary right now.

>  Very very little has been added to the language
>> since that book was published.  It's worth getting (a) because it's in
>> a nicer format, and (b) because it puts a few $$ in my pocket.
>
>Do you have a small document on-line somewhere that just describes the 
>updates since the book was published? I'd absolutely recommend people 
>buy the book if that document of deltas was also available as I 
>appreciate all the work that you've put into the tool and its
>documentation.

There's the configure time code for switch (which I'd forgotten about),
and I think asorti().  As you saw in the NEWS file, it's mostly bug
fixes / performance tuning / POSIX compliance.  Hmm, the ' flag to
printf showed up in 3.1.4.  I'd be surprised if there are more than 5
in-the-language changes, and they're all very small things.

Arnold
Reply arnold 2/15/2005 9:36:56 AM
