Re: PROC SQL Duplicate Record Selection

  • Follow


Group by will only return summarized rows UNLESS you also include a
non-group-by field in the select clause.  If that's the case you'll receive
"NOTE: The query requires remerging summary statistics back with the
original data" and all your "detail" rows that satisfy the having condition
will be returned.  e.g. in the event mytable contained a variable "d" and
the select read "select *, d ..."

I'm not sure why the grouped query result presents a problem to you.  If the
table contains only 3 variables, and you're returning only combinations of
variables that are duplicated ( by virtue of the having clause ), you DO
have enough information to go back to the original table.  i.e. having ALL
rows returned of duplicate a,b,c values will not result in any a,b,c
combinations not returned by the summarized query.

If you're really looking for the number of duplicates for each a,b,c
combination, add the count(*) as cnt to your select clause.

proc sql;
  create table mydupes1 as
   select *, count(*) as cnt
     from mytable
    group by a,b,c
   having calculated cnt > 1
  ;

  /*  Want the total number of duplicated records?  */

  select sum(cnt) as total_duplicated_records
    from mydupes1
  ;
quit;

In any case, the following detail to summary join does return what you're
looking for:

proc sql;
  create table mydupes1 as
   select all.*
     from mytable all,
          ( select *
              from mytable
         group by a, b, c
        having count(*) > 1
          ) summ
        where all.a = summ.a
          and all.b = summ.b
          and all.c = summ.c
   order by a, b, c;
  ;
quit;

-----Original Message-----
From: Talbot Michael Katz [mailto:topkatz@MSN.COM]
Sent: Friday, October 29, 2004 9:43 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: PROC SQL Duplicate Record Selection


Hi.

I have a data set with three fields that I pulled from a larger table.
There were some duplicate records, and I wanted to see a few of them so
that I could go back to the original table and investigate.  I issued the
following query:

proc sql;
  create table mydupes1 as
  select *
  from mytable
  group by a, b, c
  having count(*) > 1
  order by a, b, c;
quit;

I thought this would return all instances of the duplicate records, but it
returned only a single instance of each record that has duplicates.  I
thought that's what I would get if I put "select distinct" and, indeed, I
tried that and got identical results.

I even tried the following:

proc sql;
  create table mydupes2 as
  select *
  from mytable
  except
  select *
  from mytable
  group by a, b, c
  having count(*) = 1
  order by a, b, c;
quit;

and, to my surprise, got the same thing (only one representative for each
duplicate).  I could get all the duplicates if I matched mytable and
mydupes1, and I could also use a DATA step and SET BY with FIRST and LAST
syntax if the data set MYTABLE were already sorted, but can it be done in
a "single step" PROC SQL without matching?

Thanks!

--  TMK  --

------------------------------------------------------------
This e-mail may be privileged and/or confidential, and the sender does not waive any related rights and obligations. Any distribution, use or copying of this e-mail or the information it contains by other than an intended recipient is unauthorized. If you received this e-mail in error, please advise me (by return e-mail or otherwise) immediately.

Ce courrier �lectronique est confidentiel et prot�g�. L'exp�diteur ne renonce pas aux droits et obligations qui s'y rapportent. Toute diffusion, utilisation ou copie de ce message ou des renseignements qu'il contient par une personne autre que le (les) destinataire(s) d�sign�(s) est interdite. Si vous recevez ce courrier �lectronique par erreur, veuillez m'en aviser imm�diatement, par retour de courrier �lectronique ou par un autre moyen.

============================================================
0
Reply harry.droogendyk1 (872) 10/30/2004 2:09:08 AM

Droogendyk, Harry wrote:
> Group by will only return summarized rows UNLESS you also include a
> non-group-by field in the select clause.

Excellent explanation.

> In any case, the following detail to summary join does return what
> you're
> looking for:
>
> proc sql;
>   create table mydupes1 as
>    select all.*
>      from mytable all,
>           ( select *
>               from mytable
>          group by a, b, c
>         having count(*) > 1
>           ) summ
>         where all.a = summ.a
>           and all.b = summ.b
>           and all.c = summ.c
>    order by a, b, c;
>   ;
> quit;

Prior thread about SPDE indicates this is probably the best form for large
data.

For robustness I show two forms that select via IN .  The second form is
invalid but demonstrates a (concatenated) IN construct that would be
desirable.


%let seed=1234;

data foo;
  do a = 1 to 5;
  do b = 1 to 5;
  do c = 1 to 5;
    do _n_ = 1 to 3 * ranuni(&seed);
      output;
    end;
  end; end; end;
run;

* today this is valid;
* most robust way to concatenate numeric fields is to use RB8.;

proc sql;
  create table dups as
  select *
  from foo
  where put(a,rb8.)||put(b,rb8.)||put(c,rb8.)
        in
        (select put(a,rb8.)||put(b,rb8.)||put(c,rb8.)
         from   foo
         group by a,b,c
         having count(*) > 1
        )
  order by monotonic()
  ;
quit;


* this will be valid someday?;
* (v1,...vN) in (select a1,...,aN from ...)
* - return true if N values match N values of a row in select set of another
query;

proc sql;
  create table dups as
  select *
  from foo
  where (a,b,c)   %* comma separated list indicating implicit concatenation;
        in
        (select a,b,c
         from   foo
         group by a,b,c
         having count(*) > 1
        )
  ;
quit;

-- 
Richard A. DeVenezia
http://www.devenezia.com/


0
Reply radevenz (1536) 10/30/2004 5:12:18 AM


1 Replies
62 Views

(page loaded in 0.102 seconds)

Similiar Articles:













7/25/2012 7:53:22 AM


Reply: