Selecting records whose key occurs more than a certain number of times


Dear all,

I have a file like this:

65895 135.55 02/02/2011
65895 198.25 02/02/2011
65895 118.35 02/02/2011
76446 177.25 03/02/2011
88958 115.95 04/02/2011
88958 198.85 02/02/2011
88958 136.75 04/02/2011
36987 115.95 03/02/2011
36987 115.05 08/02/2011

I want to list the records that share the same field 1 value, but only
when that value occurs more than twice.

In the given example I want to list the first three records, skip the
fourth, list the next three, and skip the last two. Please give me some
ideas. Thank you.


Reply visitnag (50) 3/19/2011 12:39:10 AM

In article <04e0686b-3da3-449e-9eed-3e609bdc82b6@w9g2000prg.googlegroups.com>,
nag  <visitnag@gmail.com> wrote:
>Dear all,
>
>I have a file like,
>
>65895 135.55 02/02/2011
>65895 198.25 02/02/2011
>65895 118.35 02/02/2011
>76446 177.25 03/02/2011
>88958 115.95 04/02/2011
>88958 198.85 02/02/2011
>88958 136.75 04/02/2011
>36987 115.95 03/02/2011
>36987 115.05 08/02/2011
>
>I want to list out the records which are having same field 1 value and
>repeated more than twice.
>
>In the given example I want to list out the first three records and
>skip the fourth one and next three records and skip the last two. Give
>me some idea. Thank you.

Something like:

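# Save every line in input order and count how often each field 1 value occurs.
{ x[++nr] = $0; y[$1]++ }
END {
    # Replay the saved lines, printing those whose key occurred more than twice.
    for (i = 1; i <= nr; i++) {
        split(x[i], T)
        if (y[T[1]] > 2)
            print x[i]
    }
}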
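
The same count-then-print idea can also be sketched without holding the whole file in memory, by letting awk read the file twice. This is a different technique from the program above (the common NR == FNR two-pass idiom), it needs a regular file rather than a pipe, and the temporary file and the array name "count" are assumptions made only for this illustration:

```shell
# Sample data from the original post, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
65895 135.55 02/02/2011
65895 198.25 02/02/2011
65895 118.35 02/02/2011
76446 177.25 03/02/2011
88958 115.95 04/02/2011
88958 198.85 02/02/2011
88958 136.75 04/02/2011
36987 115.95 03/02/2011
36987 115.05 08/02/2011
EOF

# First read (NR == FNR): only count each field 1 value.
# Second read: print the lines whose key was counted more than twice.
out=$(awk 'NR == FNR { count[$1]++; next } count[$1] > 2' "$sample" "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

On the sample data this should print the three 65895 lines followed by the three 88958 lines, in input order.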

-- 
They say compassion is a virtue, but I don't have the time!

    - David Byrne -

Reply gazelle 3/19/2011 1:39:26 AM


On 3/18/2011 7:39 PM, nag wrote:
> Dear all,
>
> I have a file like,
>
> 65895 135.55 02/02/2011
> 65895 198.25 02/02/2011
> 65895 118.35 02/02/2011
> 76446 177.25 03/02/2011
> 88958 115.95 04/02/2011
> 88958 198.85 02/02/2011
> 88958 136.75 04/02/2011
> 36987 115.95 03/02/2011
> 36987 115.05 08/02/2011
>
> I want to list out the records which are having same field 1 value and
> repeated more than twice.
>
> In the given example I want to list out the first three records and
> skip the fourth one and next three records and skip the last two. Give
> me some idea. Thank you.
>
>

untested:

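awk '
# Append each line to the block for its key (separated by ORS) and count it.
{ arr[$1] = arr[$1] sep[$1] $0; cnt[$1]++; sep[$1] = ORS }
# Print the blocks whose key occurred more than twice.
# Note: "for (key in arr)" visits the keys in an unspecified order.
END { for (key in arr) if (cnt[key] > 2) print arr[key] }
' file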

    Ed.
Reply Ed 3/19/2011 2:13:19 AM

On Fri, 18 Mar 2011 17:39:10 -0700 (PDT)
nag <visitnag@gmail.com> wrote:

> Dear all,
> 
> I have a file like,
> 
> 65895 135.55 02/02/2011
> 65895 198.25 02/02/2011
> 65895 118.35 02/02/2011
> 76446 177.25 03/02/2011
> 88958 115.95 04/02/2011
> 88958 198.85 02/02/2011
> 88958 136.75 04/02/2011
> 36987 115.95 03/02/2011
> 36987 115.05 08/02/2011
> 
> I want to list out the records which are having same field 1 value and
> repeated more than twice.
> 
> In the given example I want to list out the first three records and
> skip the fourth one and next three records and skip the last two. Give
> me some idea. Thank you.

awk '$1 in a {if(a[$1])print a[$1];a[$1]="";print} 
!($1 in a){a[$1]=$0}' file 


Reply pk 3/20/2011 11:02:59 AM

In article <im4o08$51a$1@speranza.aioe.org>, pk  <pk@pk.invalid> wrote:
>On Fri, 18 Mar 2011 17:39:10 -0700 (PDT)
>nag <visitnag@gmail.com> wrote:
>
>> Dear all,
>> 
>> I have a file like,
>> 
>> 65895 135.55 02/02/2011
>> 65895 198.25 02/02/2011
>> 65895 118.35 02/02/2011
>> 76446 177.25 03/02/2011
>> 88958 115.95 04/02/2011
>> 88958 198.85 02/02/2011
>> 88958 136.75 04/02/2011
>> 36987 115.95 03/02/2011
>> 36987 115.05 08/02/2011
>> 
>> I want to list out the records which are having same field 1 value and
>> repeated more than twice.
>> 
>> In the given example I want to list out the first three records and
>> skip the fourth one and next three records and skip the last two. Give
>> me some idea. Thank you.
>
>awk '$1 in a {if(a[$1])print a[$1];a[$1]="";print} 
>!($1 in a){a[$1]=$0}' file 
>
>

Care to explain?  That looks a little hinky (which is not to say it is
wrong, but just hard to parse).  It would probably help both me and the OP
if you did some explainin'.

In terms of results, I think it is not to spec, because the spec says "more
than twice" and your program displayed the 36987 lines (where count == 2).

-- 
But the Bush apologists hope that you won't remember all that. And they
also have a theory, which I've been hearing more and more - namely,
that President Obama, though not yet in office or even elected, caused the
2008 slump. You see, people were worried in advance about his future
policies, and that's what caused the economy to tank. Seriously.

    (Paul Krugman - Addicted to Bush)

Reply gazelle 3/20/2011 11:47:25 AM

On Sun, 20 Mar 2011 11:47:25 +0000 (UTC)
gazelle@shell.xmission.com (Kenny McCormack) wrote:

> In article <im4o08$51a$1@speranza.aioe.org>, pk  <pk@pk.invalid> wrote:
> >On Fri, 18 Mar 2011 17:39:10 -0700 (PDT)
> >nag <visitnag@gmail.com> wrote:
> >
> >> Dear all,
> >> 
> >> I have a file like,
> >> 
> >> 65895 135.55 02/02/2011
> >> 65895 198.25 02/02/2011
> >> 65895 118.35 02/02/2011
> >> 76446 177.25 03/02/2011
> >> 88958 115.95 04/02/2011
> >> 88958 198.85 02/02/2011
> >> 88958 136.75 04/02/2011
> >> 36987 115.95 03/02/2011
> >> 36987 115.05 08/02/2011
> >> 
> >> I want to list out the records which are having same field 1 value and
> >> repeated more than twice.
> >> 
> >> In the given example I want to list out the first three records and
> >> skip the fourth one and next three records and skip the last two. Give
> >> me some idea. Thank you.
> >
> >awk '$1 in a {if(a[$1])print a[$1];a[$1]="";print} 
> >!($1 in a){a[$1]=$0}' file 
> >
> >
> 
> Care to explain?  That looks a little hinky (which is not to say it is
> wrong, but just hard to parse).  It would probably help both me and the OP
> if you did some explainin'.

It just saves the record containing the first occurrence of each key, and
if that key is seen again, it prints the line (but if there was something
saved, print that first). But as you correctly point out below, that's not
what was requested. 
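
For anyone following along, here is the same one-liner spread out with comments and run on the sample data from the original post. The logic is unchanged; only the layout, the comments and the temporary file are additions for illustration:

```shell
# Sample data from the original post, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
65895 135.55 02/02/2011
65895 198.25 02/02/2011
65895 118.35 02/02/2011
76446 177.25 03/02/2011
88958 115.95 04/02/2011
88958 198.85 02/02/2011
88958 136.75 04/02/2011
36987 115.95 03/02/2011
36987 115.05 08/02/2011
EOF

out=$(awk '
# Key seen before: print the saved first line (once), then the current line.
$1 in a    { if (a[$1]) print a[$1]; a[$1] = ""; print }
# First occurrence of a key: save the line and print nothing yet.
!($1 in a) { a[$1] = $0 }
' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

On that input it should print every line except the single 76446 record, so the two 36987 lines are included as well.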
 
> In terms of results, I think it is not to spec, because the spec says
> "more than twice" and your program displayed the 36987 lines (where count
> == 2).

Right, I missed that (should have read more carefully). Mine prints those
that appear more than once. I suppose it can thus be changed as follows,
based on the same idea:

awk '++c[$1] <= 2 {s[c[$1]]=$0; next}
{if(s[1]!=""){print s[1];print s[2]};s[1]="";print}' file

This can trivially be generalized to print lines with keys occurring more
than N times. It can also be changed to accommodate the case where
lines with the same key are not consecutive in the input (using s[$1,c[$1]]
etc.)
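
A rough sketch of that generalization, assuming the threshold is passed in as a variable N (print keys occurring more than N times) and that the first N lines of every key are buffered under s[$1,c[$1]]. The reordered input below is made up from the sample lines purely to show keys that are not consecutive:

```shell
# Lines from the sample, deliberately interleaved (an assumption for this demo).
sample=$(mktemp)
cat > "$sample" <<'EOF'
65895 135.55 02/02/2011
36987 115.95 03/02/2011
65895 198.25 02/02/2011
76446 177.25 03/02/2011
36987 115.05 08/02/2011
65895 118.35 02/02/2011
EOF

out=$(awk -v N=2 '
# Buffer the first N lines of each key under the pair (key, occurrence number).
++c[$1] <= N { s[$1, c[$1]] = $0; next }
# From occurrence N+1 on: flush the buffered lines once, then print directly.
{
    if (($1, 1) in s)
        for (i = 1; i <= N; i++) { print s[$1, i]; delete s[$1, i] }
    print
}
' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

With N=2 this should print only the three 65895 lines. Note that buffered lines are emitted when their key crosses the threshold, so with interleaved keys the output order can differ from the input order.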

If lines with the same key are assumed to be consecutive, as seems to be
the case in the sample input, another way is

awk '$1!=p{if(b"" && c > 2)print b; p=$1;c=0;b=s=""}{b=b s $0;s=RS;c++}
END{if(b"" && c > 2)print b}'
Reply pk 3/20/2011 12:00:56 PM

"nag" <visitnag@gmail.com> wrote in message 
news:04e0686b-3da3-449e-9eed-3e609bdc82b6@w9g2000prg.googlegroups.com...
> Dear all,
>
> I have a file like,
>
> 65895 135.55 02/02/2011
> 65895 198.25 02/02/2011
> 65895 118.35 02/02/2011
> 76446 177.25 03/02/2011
> 88958 115.95 04/02/2011
> 88958 198.85 02/02/2011
> 88958 136.75 04/02/2011
> 36987 115.95 03/02/2011
> 36987 115.05 08/02/2011
>
> I want to list out the records which are having same field 1 value and
> repeated more than twice.
>
> In the given example I want to list out the first three records and
> skip the fourth one and next three records and skip the last two. Give
> me some idea. Thank you.

Well, I saw the others but found them a bit hard to follow. I think this
works for the sample data:

$1 != tag { tag = $1; line1 = $0; line2 = ""; next }
!line2 { line2 = $0; next }
line1 { print line1; print line2; line1 = "" }
{ print }

Its lack of arrays makes it harder to generalize to other N, though.

- Anton Treuenfels 

Reply Anton 3/22/2011 4:50:05 AM

"Anton Treuenfels" <teamtempest@yahoo.com> wrote in message 
news:Bt6dncGg6pgdtxXQnZ2dnUVZ_jadnZ2d@earthlink.com...
>
> $1 != tag { tag = $1; line1 = $0; line2 = ""; next }
> !line2 { line2 = $0; next }
> line1 { print line1; print line2; line1 = "" }
> { print }
>
> Its lack of arrays makes it harder to generalize to other N, though.

But then for my own amusement I added an array:

$1 != tag { tag = $1; ndx = 1 }
ndx < N { d[ndx++] = $0; next }
ndx == N { for ( i = 1; i < N; i++ ) print d[i]; ndx++ }
{ print }
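
The script above leaves N undefined, so it has to be supplied when awk is invoked. Reading the rules, N appears to be the smallest group size that gets printed, which would make N=3 the setting for "more than twice". A minimal way to try it on the sample data (the temporary file and the -v N=3 choice are assumptions of this sketch):

```shell
# Sample data from the original post, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
65895 135.55 02/02/2011
65895 198.25 02/02/2011
65895 118.35 02/02/2011
76446 177.25 03/02/2011
88958 115.95 04/02/2011
88958 198.85 02/02/2011
88958 136.75 04/02/2011
36987 115.95 03/02/2011
36987 115.05 08/02/2011
EOF

out=$(awk -v N=3 '
# New key: restart the per-group line index.
$1 != tag { tag = $1; ndx = 1 }
# Hold back the first N-1 lines of the group.
ndx < N   { d[ndx++] = $0; next }
# Nth line reached: release the held lines once.
ndx == N  { for (i = 1; i < N; i++) print d[i]; ndx++ }
# Print the Nth and every later line of the group.
{ print }
' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

Like the original, this relies on lines with the same key being consecutive. On the sample it should print the three 65895 lines and the three 88958 lines.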

- Anton Treuenfels 

Reply Anton 3/23/2011 1:23:48 AM
