delete dupes with a unique id using a temp table

I need to delete the dupes across bu's.  The records are unique based
on bu+ndc but there are 400+ duplicates..
There is a sequential id that is unique.  I am trying to remove these
400+ records.

1.  SELECT *
FROM ci
GROUP BY bu,ndc
HAVING COUNT(concat(bu,ndc) )>1

This query yields 400+ dupes which I need to remove.


2.
I could do this query in Oracle but not in MySQL.
You cannot delete from the same table as in the subquery.


 DELETE FROM ci  b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING  COUNT(1)>1
) a
where  a.bu=b.bu
and a.ndc = b.ndc



3.   I need to create a temp table with the unique ids in order to
compare the ci table to the duplicates.
I create the temp table with the SELECT from above that now has the
duplicate records in it with the id.
CREATE TABLE  ci_tmp AS
(SELECT c.id,c.bu,c.zonenm,c.ndc
FROM ci c
GROUP BY c.bu,c.ndc
HAVING  COUNT(concat(c.bu,c.ndc) )>1)

4.  Next I have to run my delete query matching against the ids in the
temp table.
AT this point I am not sure exactly how to do this or what I need to
delete.
This is what I have but it isn't right because it doesn't delete any
records.


DELETE FROM cart_inventory
WHERE id IN (select id
FROM cart_inventory_tmp
GROUP BY bu,nationaldrugcode
HAVING COUNT(concat(bu,nationaldrugcode) )>1)


I also tried   deleting using max(id) but I don't know if it requires
a self-join and I think the above process might be easier?

delete  FROM cart_inventory as as a
max(id)
where  concat(a.bu, a.ndc)=concat(b.bu,b.ndc)

If you could explain the process of Query#4 what I am trying to match
against that might help but I basically understand
that you need to get the id of one of the duplicates from the temp
table and match it against the ci table and delete that id.
It seems like it should work but it doesn't because I run query # 4
and do the select for the dupes and they are all still there..



thanks,
0
jr
1/4/2011 12:37:05 AM
comp.databases.mysql 6920 articles. 1 followers. laredotornado (869) is leader. Post Follow

16 Replies
466 Views

Similar Articles

[PageSpeed] 54

On Jan 3, 7:37=A0pm, jr <jlro...@yahoo.com> wrote:
> I need to delete the dupes across bu's. =A0The records are unique based
> on bu+ndc but there are 400+ duplicates..
> There is a sequential id that is unique. =A0I am trying to remove these
> 400+ records.
>
> 1. =A0SELECT *
> FROM ci
> GROUP BY bu,ndc
> HAVING COUNT(concat(bu,ndc) )>1
>
> This query yields 400+ dupes which I need to remove.
>
> 2.
> I could do this query in Oracle but not in MySQL.
> You cannot delete from the same table as in the subquery.
>
> =A0DELETE FROM ci =A0b
> WHERE EXISTS ((
> SELECT bu,ndc, count(1)
> FROM ci
> GROUP BY bu,ndc
> HAVING =A0COUNT(1)>1
> ) a
> where =A0a.bu=3Db.bu
> and a.ndc =3D b.ndc
>
> 3. =A0 I need to create a temp table with the unique ids in order to
> compare the ci table to the duplicates.
> I create the temp table with the SELECT from above that now has the
> duplicate records in it with the id.
> CREATE TABLE =A0ci_tmp AS
> (SELECT c.id,c.bu,c.zonenm,c.ndc
> FROM ci c
> GROUP BY c.bu,c.ndc
> HAVING =A0COUNT(concat(c.bu,c.ndc) )>1)
>
> 4. =A0Next I have to run my delete query matching against the ids in the
> temp table.
> AT this point I am not sure exactly how to do this or what I need to
> delete.
> This is what I have but it isn't right because it doesn't delete any
> records.
>
> DELETE FROM cart_inventory
> WHERE id IN (select id
> FROM cart_inventory_tmp
> GROUP BY bu,nationaldrugcode
> HAVING COUNT(concat(bu,nationaldrugcode) )>1)
>
> I also tried =A0 deleting using max(id) but I don't know if it requires
> a self-join and I think the above process might be easier?
>
> delete =A0FROM cart_inventory as as a
> max(id)
> where =A0concat(a.bu, a.ndc)=3Dconcat(b.bu,b.ndc)
>
> If you could explain the process of Query#4 what I am trying to match
> against that might help but I basically understand
> that you need to get the id of one of the duplicates from the temp
> table and match it against the ci table and delete that id.
> It seems like it should work but it doesn't because I run query # 4
> and do the select for the dupes and they are all still there..
>
> thanks,


1) are the rows truely DUPLICATE or do you just have a bu,ndc
duplicate - are there other columns and are they also duplicate or are
they unique?  If "unique", which one is correct?

See: http://dev.mysql.com/doc/refman/5.5/en/alter-table.html and
search for IGNORE

"IGNORE is a MySQL extension to standard SQL. It controls how ALTER
TABLE works if there are duplicates on unique keys in the new table or
if warnings occur when strict mode is enabled. If IGNORE is not
specified, the copy is aborted and rolled back if duplicate-key errors
occur. If IGNORE is specified, only the first row is used of rows with
duplicates on a unique key, The other conflicting rows are deleted.
Incorrect values are truncated to the closest matching acceptable
value."

ALTER IGNORE TABLE ci ADD UNIQUE KEY (bu,ndc ) ;

How it knows which one is "first" I do not know...

I would test this before actually executingto ensure you are purging
the correct data.

Next step - figure out why you are getting unwanted dupes and fix the
problem.

0
onedbguru
1/4/2011 1:13:49 AM
On Jan 3, 5:13=A0pm, onedbguru <onedbg...@yahoo.com> wrote:
> On Jan 3, 7:37=A0pm, jr <jlro...@yahoo.com> wrote:
>
>
>
>
>
>
>
>
>
> > I need to delete the dupes across bu's. =A0The records are unique based
> > on bu+ndc but there are 400+ duplicates..
> > There is a sequential id that is unique. =A0I am trying to remove these
> > 400+ records.
>
> > 1. =A0SELECT *
> > FROM ci
> > GROUP BY bu,ndc
> > HAVING COUNT(concat(bu,ndc) )>1
>
> > This query yields 400+ dupes which I need to remove.
>
> > 2.
> > I could do this query in Oracle but not in MySQL.
> > You cannot delete from the same table as in the subquery.
>
> > =A0DELETE FROM ci =A0b
> > WHERE EXISTS ((
> > SELECT bu,ndc, count(1)
> > FROM ci
> > GROUP BY bu,ndc
> > HAVING =A0COUNT(1)>1
> > ) a
> > where =A0a.bu=3Db.bu
> > and a.ndc =3D b.ndc
>
> > 3. =A0 I need to create a temp table with the unique ids in order to
> > compare the ci table to the duplicates.
> > I create the temp table with the SELECT from above that now has the
> > duplicate records in it with the id.
> > CREATE TABLE =A0ci_tmp AS
> > (SELECT c.id,c.bu,c.zonenm,c.ndc
> > FROM ci c
> > GROUP BY c.bu,c.ndc
> > HAVING =A0COUNT(concat(c.bu,c.ndc) )>1)
>
> > 4. =A0Next I have to run my delete query matching against the ids in th=
e
> > temp table.
> > AT this point I am not sure exactly how to do this or what I need to
> > delete.
> > This is what I have but it isn't right because it doesn't delete any
> > records.
>
> > DELETE FROM cart_inventory
> > WHERE id IN (select id
> > FROM cart_inventory_tmp
> > GROUP BY bu,nationaldrugcode
> > HAVING COUNT(concat(bu,nationaldrugcode) )>1)
>
> > I also tried =A0 deleting using max(id) but I don't know if it requires
> > a self-join and I think the above process might be easier?
>
> > delete =A0FROM cart_inventory as as a
> > max(id)
> > where =A0concat(a.bu, a.ndc)=3Dconcat(b.bu,b.ndc)
>
> > If you could explain the process of Query#4 what I am trying to match
> > against that might help but I basically understand
> > that you need to get the id of one of the duplicates from the temp
> > table and match it against the ci table and delete that id.
> > It seems like it should work but it doesn't because I run query # 4
> > and do the select for the dupes and they are all still there..
>
> > thanks,
>
> 1) are the rows truely DUPLICATE or do you just have a bu,ndc
> duplicate - are there other columns and are they also duplicate or are
> they unique? =A0If "unique", which one is correct?
>
> See:http://dev.mysql.com/doc/refman/5.5/en/alter-table.htmland
> search for IGNORE
>
> "IGNORE is a MySQL extension to standard SQL. It controls how ALTER
> TABLE works if there are duplicates on unique keys in the new table or
> if warnings occur when strict mode is enabled. If IGNORE is not
> specified, the copy is aborted and rolled back if duplicate-key errors
> occur. If IGNORE is specified, only the first row is used of rows with
> duplicates on a unique key, The other conflicting rows are deleted.
> Incorrect values are truncated to the closest matching acceptable
> value."
>
> ALTER IGNORE TABLE ci ADD UNIQUE KEY (bu,ndc ) ;
>
> How it knows which one is "first" I do not know...
>
> I would test this before actually executingto ensure you are purging
> the correct data.
>
> Next step - figure out why you are getting unwanted dupes and fix the
> problem.

it is bu,ndc duplicates I am trying to get rid of, this is data
cleansing, the first or the 2nd duplicate doesn't matter, there is no
duplicate on the unique key
only duplicate on the bu,ndc.  The concat(bu,ndc) is how I determine
the row is unique.  The id was assigned as I uploaded the data.
Should I just use the max(id) if so how do I set up the self-join?
thanks,
0
jr
1/4/2011 1:27:29 AM
On Jan 3, 5:13=A0pm, onedbguru <onedbg...@yahoo.com> wrote:
> On Jan 3, 7:37=A0pm, jr <jlro...@yahoo.com> wrote:
>
>
>
>
>
>
>
>
>
> > I need to delete the dupes across bu's. =A0The records are unique based
> > on bu+ndc but there are 400+ duplicates..
> > There is a sequential id that is unique. =A0I am trying to remove these
> > 400+ records.
>
> > 1. =A0SELECT *
> > FROM ci
> > GROUP BY bu,ndc
> > HAVING COUNT(concat(bu,ndc) )>1
>
> > This query yields 400+ dupes which I need to remove.
>
> > 2.
> > I could do this query in Oracle but not in MySQL.
> > You cannot delete from the same table as in the subquery.
>
> > =A0DELETE FROM ci =A0b
> > WHERE EXISTS ((
> > SELECT bu,ndc, count(1)
> > FROM ci
> > GROUP BY bu,ndc
> > HAVING =A0COUNT(1)>1
> > ) a
> > where =A0a.bu=3Db.bu
> > and a.ndc =3D b.ndc
>
> > 3. =A0 I need to create a temp table with the unique ids in order to
> > compare the ci table to the duplicates.
> > I create the temp table with the SELECT from above that now has the
> > duplicate records in it with the id.
> > CREATE TABLE =A0ci_tmp AS
> > (SELECT c.id,c.bu,c.zonenm,c.ndc
> > FROM ci c
> > GROUP BY c.bu,c.ndc
> > HAVING =A0COUNT(concat(c.bu,c.ndc) )>1)
>
> > 4. =A0Next I have to run my delete query matching against the ids in th=
e
> > temp table.
> > AT this point I am not sure exactly how to do this or what I need to
> > delete.
> > This is what I have but it isn't right because it doesn't delete any
> > records.
>
> > DELETE FROM cart_inventory
> > WHERE id IN (select id
> > FROM cart_inventory_tmp
> > GROUP BY bu,nationaldrugcode
> > HAVING COUNT(concat(bu,nationaldrugcode) )>1)
>
> > I also tried =A0 deleting using max(id) but I don't know if it requires
> > a self-join and I think the above process might be easier?
>
> > delete =A0FROM cart_inventory as as a
> > max(id)
> > where =A0concat(a.bu, a.ndc)=3Dconcat(b.bu,b.ndc)
>
> > If you could explain the process of Query#4 what I am trying to match
> > against that might help but I basically understand
> > that you need to get the id of one of the duplicates from the temp
> > table and match it against the ci table and delete that id.
> > It seems like it should work but it doesn't because I run query # 4
> > and do the select for the dupes and they are all still there..
>
> > thanks,
>
> 1) are the rows truely DUPLICATE or do you just have a bu,ndc
> duplicate - are there other columns and are they also duplicate or are
> they unique? =A0If "unique", which one is correct?
>
> See:http://dev.mysql.com/doc/refman/5.5/en/alter-table.htmland
> search for IGNORE
>
> "IGNORE is a MySQL extension to standard SQL. It controls how ALTER
> TABLE works if there are duplicates on unique keys in the new table or
> if warnings occur when strict mode is enabled. If IGNORE is not
> specified, the copy is aborted and rolled back if duplicate-key errors
> occur. If IGNORE is specified, only the first row is used of rows with
> duplicates on a unique key, The other conflicting rows are deleted.
> Incorrect values are truncated to the closest matching acceptable
> value."
>
> ALTER IGNORE TABLE ci ADD UNIQUE KEY (bu,ndc ) ;
>
> How it knows which one is "first" I do not know...
>
> I would test this before actually executingto ensure you are purging
> the correct data.
>
> Next step - figure out why you are getting unwanted dupes and fix the
> problem.

I tried the ALTER IGNORE  and changed the unique key to the (bu,ndc)
on a backup table.
It changed the number of records in ci by 169 so I am afraid to do it
like that.
If I could figure out how to delete using the max(id) I would do it
that way.
0
jr
1/4/2011 1:59:19 AM
On 2011-01-04 01:37, jr wrote:
[...]
> 2.
> I could do this query in Oracle but not in MySQL.
> You cannot delete from the same table as in the subquery.
> 
> 
>  DELETE FROM ci  b
> WHERE EXISTS ((
> SELECT bu,ndc, count(1)
> FROM ci
> GROUP BY bu,ndc
> HAVING  COUNT(1)>1
> ) a
> where  a.bu=b.bu
> and a.ndc = b.ndc
> 

You can hide the table at a deeper nesting level, but it will probably
be very slow. Another option is to delete from a join (non-standard)
like (untested):

delete b.*
from ci as b
join (
  SELECT bu,ndc
  FROM ci
  GROUP BY bu,ndc
  HAVING  COUNT(1)>1
) as a
    on a.bu=b.bu
   and a.ndc = b.ndc

If bu, ndu is a candidate key, you might want to add a unique constraint:

    alter table ci add constraint ... unique (bu, ndc);

to enforce this.

[...]

/Lennart
0
Lennart
1/4/2011 8:35:16 AM
On Jan 4, 12:35=A0am, Lennart Jonsson <erik.lennart.jons...@gmail.com>
wrote:
> On 2011-01-04 01:37, jr wrote:
> [...]
>
> > 2.
> > I could do this query in Oracle but not in MySQL.
> > You cannot delete from the same table as in the subquery.
>
> > =A0DELETE FROM ci =A0b
> > WHERE EXISTS ((
> > SELECT bu,ndc, count(1)
> > FROM ci
> > GROUP BY bu,ndc
> > HAVING =A0COUNT(1)>1
> > ) a
> > where =A0a.bu=3Db.bu
> > and a.ndc =3D b.ndc
>
> You can hide the table at a deeper nesting level, but it will probably
> be very slow. Another option is to delete from a join (non-standard)
> like (untested):
>
> delete b.*
> from ci as b
> join (
> =A0 SELECT bu,ndc
> =A0 FROM ci
> =A0 GROUP BY bu,ndc
> =A0 HAVING =A0COUNT(1)>1
> ) as a
> =A0 =A0 on a.bu=3Db.bu
> =A0 =A0and a.ndc =3D b.ndc
>
> If bu, ndu is a candidate key, you might want to add a unique constraint:
>
> =A0 =A0 alter table ci add constraint ... unique (bu, ndc);
>
> to enforce this.
>
> [...]
>
> /Lennart

The query is beautiful although it deleted 1085 rows.   When I did the
first query to locate duplicates it found only 487.  Why is
that? ?????
Where did the extra
This is the query I used to find the dupes.  Is that wrong or is your
query wrong?  Did it delete the dupe and the original record as well?

select bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,nationaldrugcode)) > 1



I wouldn't hide the table at a deeper level because users are using
the web page and it is somewhat slow already because of a UNION
query.  I could try it if all else fails.  I will try the unique
constraint.

thanks,
0
jr
1/4/2011 3:37:05 PM
On 2011-01-04 16:37, jr wrote:
[...]
> 
> The query is beautiful although it deleted 1085 rows.   When I did the
> first query to locate duplicates it found only 487.  Why is
> that? ?????
> Where did the extra
> This is the query I used to find the dupes.  Is that wrong or is your
> query wrong?  Did it delete the dupe and the original record as well?
> 
> select bu,ndc, count(*)
> from ci
> group by bu,ndc
> having COUNT(concat(bu,nationaldrugcode)) > 1
> 
> 

I should have look more carefully, I just translated your attempt:

DELETE FROM ci  b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING  COUNT(1)>1
) a
where  a.bu=b.bu
and a.ndc = b.ndc

into

delete b.*
from ci as b
join (
  SELECT bu,ndc
  FROM ci
  GROUP BY bu,ndc
  HAVING  COUNT(1)>1
) as a
    on a.bu=b.bu
   and a.ndc = b.ndc

It is probably not what you want because it will delete *all* rows that
are duplicated (including the first occurrence). Compare the result from:

SELECT *
FROM ci  b
WHERE EXISTS (
    SELECT bu,ndc, count(1)
    FROM ci
    GROUP BY bu,ndc
    HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

with:

select b.*
from ci as b
join (
  SELECT bu,ndc
  FROM ci
  GROUP BY bu,ndc
  HAVING  COUNT(1)>1
) as a
    on a.bu=b.bu
   and a.ndc = b.ndc

> 
> I wouldn't hide the table at a deeper level because users are using
> the web page and it is somewhat slow already because of a UNION
> query.  I could try it if all else fails.

No, you misunderstood me. What I meant was that you can hide the table
inside the delete stmt as in:

SELECT *
FROM ci  b
WHERE EXISTS (
    select * FROM (
        SELECT bu,ndc, count(1)
        FROM ci
        GROUP BY bu,ndc
        HAVING COUNT(1)>1
    ) AS X
) a
where a.bu=b.bu
and a.ndc = b.ndc

but this will also delete more rows than you want (I guess). You will
have to enumerate the duplicate rows somehow (the sequential unique id
is probably your best option). Here's one attempt:

delete b.*
from ci as b
where exists (
    select 1
    from ci as c
    where c.bu = b.bu
      and c.ndc = b.ndc
      and c.id < b.id
)

But mysql does not allow this, so we'll rewrite that as a join:

delete b.*
from ci as b
join ci as c
    on c.bu = b.bu
    and c.ndc = b.ndc
    and c.id < b.id

Example:

create table ci (
    id int not null primary key,
    bu int not null,
    ndc int not null
) engine = innodb;

insert into ci (id, bu, ndc)
values (1,1,1),(2,1,1),(3,2,1),(4,2,2),(5,2,2),(6,1,1);

The first query will delete *all* duplicates, i.e. all rows but the one
with id = 3:

mysql> select b.*
    -> from ci as b
    -> join (
    ->   SELECT bu,ndc
    ->   FROM ci
    ->   GROUP BY bu,ndc
    ->   HAVING  COUNT(1)>1
    -> ) as a
    ->     on a.bu=b.bu
    ->    and a.ndc = b.ndc;
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
|  1 |  1 |   1 |
|  2 |  1 |   1 |
|  4 |  2 |   2 |
|  5 |  2 |   2 |
|  6 |  1 |   1 |
+----+----+-----+

The rewritten query leaves the duplicates with the lowest id:

mysql> select b.* from ci as b
    -> join ci as c
    ->     on c.bu = b.bu
    ->     and c.ndc = b.ndc
    ->     and c.id < b.id;
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
|  2 |  1 |   1 |
|  5 |  2 |   2 |
|  6 |  1 |   1 |
|  6 |  1 |   1 |
+----+----+-----+
4 rows in set (0.05 sec)

mysql> delete b.*
    -> from ci as b
    -> join ci as c
    ->    on c.bu = b.bu
    ->   and c.ndc = b.ndc
    ->   and c.id < b.id;
Query OK, 3 rows affected (0.05 sec)

mysql> select * from ci;
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
|  1 |  1 |   1 |
|  3 |  2 |   1 |
|  4 |  2 |   2 |
+----+----+-----+
3 rows in set (0.00 sec)

Don't just take the code and execute it, try understand the similarities
and differences from the first attempt.

HTH
/Lennart
0
Lennart
1/4/2011 9:19:57 PM
On 2011-01-04 22:19, Lennart Jonsson wrote:
[...]
> 
> SELECT *
> FROM ci  b
> WHERE EXISTS (
>     SELECT bu,ndc, count(1)
>     FROM ci
>     GROUP BY bu,ndc
>     HAVING COUNT(1)>1
> ) a
> where a.bu=b.bu
> and a.ndc = b.ndc
> 

I get an syntax error here, so I guess uncorrelated exists subqueries
are not allowed either. You can rewrite this to:

SELECT b.* FROM ci as b where exists ( select bu,ndc from ci as a where
a.bu = b.bu and a.ndc = b.ndc group by  bu,ndc having count(1) > 1);
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
|  1 |  1 |   1 |
|  2 |  1 |   1 |
|  4 |  2 |   2 |
|  5 |  2 |   2 |
|  6 |  1 |   1 |
+----+----+-----+

/Lennart
0
Lennart
1/4/2011 9:38:30 PM
On 2011-01-04 22:38, Lennart Jonsson wrote:
> On 2011-01-04 22:19, Lennart Jonsson wrote:
> [...]
>> 
>> SELECT *
>> FROM ci  b
>> WHERE EXISTS (
>>     SELECT bu,ndc, count(1)
>>     FROM ci
>>     GROUP BY bu,ndc
>>     HAVING COUNT(1)>1
>> ) a
>> where a.bu=b.bu
>> and a.ndc = b.ndc
>> 
> 
> I get an syntax error here

Never mind, there where several other reasons for the syntax error, I
just did not notice them.
0
Lennart
1/5/2011 11:53:49 PM
On Jan 5, 3:53=A0pm, Lennart Jonsson <erik.lennart.jons...@gmail.com>
wrote:
> On 2011-01-04 22:38, Lennart Jonsson wrote:
>
>
>
>
>
>
>
>
>
> > On 2011-01-04 22:19, Lennart Jonsson wrote:
> > [...]
>
> >> SELECT *
> >> FROM ci =A0b
> >> WHERE EXISTS (
> >> =A0 =A0 SELECT bu,ndc, count(1)
> >> =A0 =A0 FROM ci
> >> =A0 =A0 GROUP BY bu,ndc
> >> =A0 =A0 HAVING COUNT(1)>1
> >> ) a
> >> where a.bu=3Db.bu
> >> and a.ndc =3D b.ndc
>
> > I get an syntax error here
>
> Never mind, there where several other reasons for the syntax error, I
> just did not notice them.

I do get it, thank, you finished the first query that I was trying to
do even though it was wrong.
If I could have done it then I could have seen it was wrong.  I did
want to do the self-join.
  I get it you can't delete from the same table as a subquery.  Thanks
for explaining the rest of it.
0
jr
1/6/2011 1:55:16 PM
On Jan 4, 1:38=A0pm, Lennart Jonsson <erik.lennart.jons...@gmail.com>
wrote:
> On 2011-01-04 22:19, Lennart Jonsson wrote:
> [...]
>
>
>
> > SELECT *
> > FROM ci =A0b
> > WHERE EXISTS (
> > =A0 =A0 SELECT bu,ndc, count(1)
> > =A0 =A0 FROM ci
> > =A0 =A0 GROUP BY bu,ndc
> > =A0 =A0 HAVING COUNT(1)>1
> > ) a
> > where a.bu=3Db.bu
> > and a.ndc =3D b.ndc
>
> I get an syntax error here, so I guess uncorrelated exists subqueries
> are not allowed either. You can rewrite this to:
>
> SELECT b.* FROM ci as b where exists ( select bu,ndc from ci as a where
> a.bu =3D b.bu and a.ndc =3D b.ndc group by =A0bu,ndc having count(1) > 1)=
;
> +----+----+-----+
> | id | bu | ndc |
> +----+----+-----+
> | =A01 | =A01 | =A0 1 |
> | =A02 | =A01 | =A0 1 |
> | =A04 | =A02 | =A0 2 |
> | =A05 | =A02 | =A0 2 |
> | =A06 | =A01 | =A0 1 |
> +----+----+-----+
>
> /Lennart

I ran through all your exercises ;-)
I see why my first query didn't work there were 2 WHERE clauses.
The 2nd query of yours deletes all duplicates because it is an equi-
join on bu&ndc.
The 3rd query,  I did get a syntax error on the select:
select *
from ci  b
where exists(
select bu,ndc, count(1)
from ci
group by bu, ndc
having count(1)>1
)a
where a.bu=3Db.bu
and a.ndc=3Db.ndc

and I also tried it as a delete:
delete * from ci b
where exists(
select bu,ndc, count(1)
from ci
group by bu, ndc
having count(1)>1
)a
where a.bu=3Db.bu
and a.ndc=3Db.ndc,
I got syntax errors.  You mentioned some other reason other than a
syntax error?

The last query worked as you said.
joining also on the id.
It worked really well on one pass!!!!!!!!!!!

0
jr
1/12/2011 11:36:42 PM
On 2011-01-13 00:36, jr wrote:
[...]
> The 3rd query,  I did get a syntax error on the select:
> select *
> from ci  b
> where exists(
> select bu,ndc, count(1)
> from ci
> group by bu, ndc
> having count(1)>1
> )a
> where a.bu=b.bu
> and a.ndc=b.ndc
>
[...]
> I got syntax errors.  You mentioned some other reason other than a
> syntax error?
> 

You already mentioned the two where clauses, fixing that leaves us with:

.... exists (
    select bu,ndc, count(1)
    from ci
    group by bu, ndc
    having count(1)>1
) a

exists is a predicate ( sometimes referred to as propositional function,
you can think of it as an anonymous function returning true/false ) and
you can't give it a name ( there is no point in doing so either ).
Furthermore bu, ndc, etc inside the predicate is not visible outside the
predicate, so you can't reference them like you do in:

  where a.bu=b.bu
    and a.ndc=b.ndc

You can however reference variables outside of the predicate from within
the predicate, so

select *
from ci  b
where exists(
    select bu,ndc, count(1)
    from ci as a

    where a.bu=b.bu
    and a.ndc=b.ndc

    group by bu, ndc
    having count(1)>1
)

is syntactically correct (will still give you the wrong result).


/Lennart
0
Lennart
1/13/2011 6:10:26 AM
On Jan 4, 1:19=A0pm, Lennart Jonsson <erik.lennart.jons...@gmail.com>
wrote:
> On 2011-01-04 16:37, jr wrote:
> [...]
>
>
>
> > The query is beautiful although it deleted 1085 rows. =A0 When I did th=
e
> > first query to locate duplicates it found only 487. =A0Why is
> > that? ?????
> > Where did the extra
> > This is the query I used to find the dupes. =A0Is that wrong or is your
> > query wrong? =A0Did it delete the dupe and the original record as well?
>
> > select bu,ndc, count(*)
> > from ci
> > group by bu,ndc
> > having COUNT(concat(bu,nationaldrugcode)) > 1
>
> I should have look more carefully, I just translated your attempt:
>
> DELETE FROM ci =A0b
> WHERE EXISTS ((
> SELECT bu,ndc, count(1)
> FROM ci
> GROUP BY bu,ndc
> HAVING =A0COUNT(1)>1
> ) a
> where =A0a.bu=3Db.bu
> and a.ndc =3D b.ndc
>
> into
>
> delete b.*
> from ci as b
> join (
> =A0 SELECT bu,ndc
> =A0 FROM ci
> =A0 GROUP BY bu,ndc
> =A0 HAVING =A0COUNT(1)>1
> ) as a
> =A0 =A0 on a.bu=3Db.bu
> =A0 =A0and a.ndc =3D b.ndc
>
> It is probably not what you want because it will delete *all* rows that
> are duplicated (including the first occurrence). Compare the result from:
>
> SELECT *
> FROM ci =A0b
> WHERE EXISTS (
> =A0 =A0 SELECT bu,ndc, count(1)
> =A0 =A0 FROM ci
> =A0 =A0 GROUP BY bu,ndc
> =A0 =A0 HAVING COUNT(1)>1
> ) a
> where a.bu=3Db.bu
> and a.ndc =3D b.ndc
>
> with:
>
> select b.*
> from ci as b
> join (
> =A0 SELECT bu,ndc
> =A0 FROM ci
> =A0 GROUP BY bu,ndc
> =A0 HAVING =A0COUNT(1)>1
> ) as a
> =A0 =A0 on a.bu=3Db.bu
> =A0 =A0and a.ndc =3D b.ndc
>
>
>
> > I wouldn't hide the table at a deeper level because users are using
> > the web page and it is somewhat slow already because of a UNION
> > query. =A0I could try it if all else fails.
>
> No, you misunderstood me. What I meant was that you can hide the table
> inside the delete stmt as in:
>
> SELECT *
> FROM ci =A0b
> WHERE EXISTS (
> =A0 =A0 select * FROM (
> =A0 =A0 =A0 =A0 SELECT bu,ndc, count(1)
> =A0 =A0 =A0 =A0 FROM ci
> =A0 =A0 =A0 =A0 GROUP BY bu,ndc
> =A0 =A0 =A0 =A0 HAVING COUNT(1)>1
> =A0 =A0 ) AS X
> ) a
> where a.bu=3Db.bu
> and a.ndc =3D b.ndc
>
> but this will also delete more rows than you want (I guess). You will
> have to enumerate the duplicate rows somehow (the sequential unique id
> is probably your best option). Here's one attempt:
>
> delete b.*
> from ci as b
> where exists (
> =A0 =A0 select 1
> =A0 =A0 from ci as c
> =A0 =A0 where c.bu =3D b.bu
> =A0 =A0 =A0 and c.ndc =3D b.ndc
> =A0 =A0 =A0 and c.id < b.id
> )
>
> But mysql does not allow this, so we'll rewrite that as a join:
>
> delete b.*
> from ci as b
> join ci as c
> =A0 =A0 on c.bu =3D b.bu
> =A0 =A0 and c.ndc =3D b.ndc
> =A0 =A0 and c.id < b.id
>
> Example:
>
> create table ci (
> =A0 =A0 id int not null primary key,
> =A0 =A0 bu int not null,
> =A0 =A0 ndc int not null
> ) engine =3D innodb;
>
> insert into ci (id, bu, ndc)
> values (1,1,1),(2,1,1),(3,2,1),(4,2,2),(5,2,2),(6,1,1);
>
> The first query will delete *all* duplicates, i.e. all rows but the one
> with id =3D 3:
>
> mysql> select b.*
> =A0 =A0 -> from ci as b
> =A0 =A0 -> join (
> =A0 =A0 -> =A0 SELECT bu,ndc
> =A0 =A0 -> =A0 FROM ci
> =A0 =A0 -> =A0 GROUP BY bu,ndc
> =A0 =A0 -> =A0 HAVING =A0COUNT(1)>1
> =A0 =A0 -> ) as a
> =A0 =A0 -> =A0 =A0 on a.bu=3Db.bu
> =A0 =A0 -> =A0 =A0and a.ndc =3D b.ndc;
> +----+----+-----+
> | id | bu | ndc |
> +----+----+-----+
> | =A01 | =A01 | =A0 1 |
> | =A02 | =A01 | =A0 1 |
> | =A04 | =A02 | =A0 2 |
> | =A05 | =A02 | =A0 2 |
> | =A06 | =A01 | =A0 1 |
> +----+----+-----+
>
> The rewritten query leaves the duplicates with the lowest id:
>
> mysql> select b.* from ci as b
> =A0 =A0 -> join ci as c
> =A0 =A0 -> =A0 =A0 on c.bu =3D b.bu
> =A0 =A0 -> =A0 =A0 and c.ndc =3D b.ndc
> =A0 =A0 -> =A0 =A0 and c.id < b.id;
> +----+----+-----+
> | id | bu | ndc |
> +----+----+-----+
> | =A02 | =A01 | =A0 1 |
> | =A05 | =A02 | =A0 2 |
> | =A06 | =A01 | =A0 1 |
> | =A06 | =A01 | =A0 1 |
> +----+----+-----+
> 4 rows in set (0.05 sec)
>
> mysql> delete b.*
> =A0 =A0 -> from ci as b
> =A0 =A0 -> join ci as c
> =A0 =A0 -> =A0 =A0on c.bu =3D b.bu
> =A0 =A0 -> =A0 and c.ndc =3D b.ndc
> =A0 =A0 -> =A0 and c.id < b.id;
> Query OK, 3 rows affected (0.05 sec)
>
> mysql> select * from ci;
> +----+----+-----+
> | id | bu | ndc |
> +----+----+-----+
> | =A01 | =A01 | =A0 1 |
> | =A03 | =A02 | =A0 1 |
> | =A04 | =A02 | =A0 2 |
> +----+----+-----+
> 3 rows in set (0.00 sec)
>
> Don't just take the code and execute it, try understand the similarities
> and differences from the first attempt.
>
> HTH
> /Lennart

Lennart,
Regarding this last delete query, it worked on your small sample
database but  I tried it on my db of 46825 records and it didn't work.
It left me with only 24,571 records.  I did a query to get the num of
duplicates and there were only 9996.  Most of those were only
duplicates, but there were some triplicates etc.  So it deleted way
too many and I had to kill the server eventually.   I came up with 3
other queries that does the deleting in passes by comparing a table
with the duplicate ids with a table containing the max(id)s.  It
seemed to work except something is wrong there also. It deleted all
the duplicates, triplcates, quadruplicates, etc.  It should have only
deleted one of the duplicates with the max(id) and left the original
duplicate.

1.create table ci_tmp1 as
(select id,bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,ndc)) > 1
)

2.create table ci_tmp2 as(select max(c.id) from ci c, ci_tmp1 t
WHERE t.bu=3Dc.bu AND t.ndc=3Dc.ndc
GROUP BY  t.bu, t.ndc)

3.DELETE FROM ci WHERE id IN (select * from ci_tmp2)

This delete query got near to the correct amount of duplicates
34,576.   If you subtract 9996 from 46825  you get 36,829.
The only problem with this result was I was expecting to have the
triplicates and quadrupleticates left in the ci table and I would have
to run through
the 3 queries again.   When I checked the ci table there were no more
duplicates.   What I was hoping was that your query results number
would match with
my way of doing it in 3 passes and I would know I could use your query
but now I'm not even sure if my queries are correct?  What do you
think?
Regards,  Janis
..
0
jr
1/17/2011 9:41:17 PM
On Jan 12, 10:10=A0pm, Lennart Jonsson <erik.lennart.jons...@gmail.com>
wrote:
> On 2011-01-13 00:36, jr wrote:
> [...]
>
>
>
>
>
>
>
>
>
> > The 3rd query, =A0I did get a syntax error on the select:
> > select *
> > from ci =A0b
> > where exists(
> > select bu,ndc, count(1)
> > from ci
> > group by bu, ndc
> > having count(1)>1
> > )a
> > where a.bu=3Db.bu
> > and a.ndc=3Db.ndc
>
> [...]
> > I got syntax errors. =A0You mentioned some other reason other than a
> > syntax error?
>
> You already mentioned the two where clauses, fixing that leaves us with:
>
> ... exists (
> =A0 =A0 select bu,ndc, count(1)
> =A0 =A0 from ci
> =A0 =A0 group by bu, ndc
> =A0 =A0 having count(1)>1
> ) a
>
> exists is a predicate ( sometimes referred to as propositional function,
> you can think of it as an anonymous function returning true/false ) and
> you can't give it a name ( there is no point in doing so either ).
> Furthermore bu, ndc, etc inside the predicate is not visible outside the
> predicate, so you can't reference them like you do in:
>
> =A0 where a.bu=3Db.bu
> =A0 =A0 and a.ndc=3Db.ndc
>
> You can however reference variables outside of the predicate from within
> the predicate, so
>
> select *
> from ci =A0b
> where exists(
> =A0 =A0 select bu,ndc, count(1)
> =A0 =A0 from ci as a
>
> =A0 =A0 where a.bu=3Db.bu
> =A0 =A0 and a.ndc=3Db.ndc
>
> =A0 =A0 group by bu, ndc
> =A0 =A0 having count(1)>1
> )
>
> is syntactically correct (will still give you the wrong result).
>
> /Lennart

thanks,
0
jr
1/17/2011 9:45:02 PM
On 2011-01-17 22:41, jr wrote:
[...]
> Lennart,
> Regarding this last delete query, it worked on your small sample
> database but  I tried it on my db of 46825 records and it didn't work.
> It left me with only 24,571 records.  I did a query to get the num of
> duplicates and there were only 9996.  Most of those were only
> duplicates, but there were some triplicates etc.  So it deleted way
> too many and I had to kill the server eventually.   I came up with 3
> other queries that does the deleting in passes by comparing a table
> with the duplicate ids with a table containing the max(id)s.  It
> seemed to work except something is wrong there also. It deleted all
> the duplicates, triplcates, quadruplicates, etc.  It should have only
> deleted one of the duplicates with the max(id) and left the original
> duplicate.
> 
> 1.create table ci_tmp1 as
> (select id,bu,ndc, count(*)
> from ci
> group by bu,ndc
> having COUNT(concat(bu,ndc)) > 1
> )
> 

This query is illegal and will give you random results. It will only
work if id is functionally dependent of bu, ndc. You can prevent this
misbehaviour by enabling ONLY_FULL_GROUP_BY in sql_mode, mysql will then
return an error instead of a wrong result.

/Lennart

[...]

0
Lennart
1/18/2011 5:54:07 AM
On Jan 17, 9:54=A0pm, Lennart Jonsson <erik.lennart.jons...@gmail.com>
wrote:
> On 2011-01-17 22:41, jr wrote:
> [...]
>
>
>
>
>
>
>
>
>
> > Lennart,
> > Regarding this last delete query, it worked on your small sample
> > database but =A0I tried it on my db of 46825 records and it didn't work=
..
> > It left me with only 24,571 records. =A0I did a query to get the num of
> > duplicates and there were only 9996. =A0Most of those were only
> > duplicates, but there were some triplicates etc. =A0So it deleted way
> > too many and I had to kill the server eventually. =A0 I came up with 3
> > other queries that does the deleting in passes by comparing a table
> > with the duplicate ids with a table containing the max(id)s. =A0It
> > seemed to work except something is wrong there also. It deleted all
> > the duplicates, triplcates, quadruplicates, etc. =A0It should have only
> > deleted one of the duplicates with the max(id) and left the original
> > duplicate.
>
> > 1.create table ci_tmp1 as
> > (select id,bu,ndc, count(*)
> > from ci
> > group by bu,ndc
> > having COUNT(concat(bu,ndc)) > 1
> > )
>
> This query is illegal and will give you random results. It will only
> work if id is functionally dependent of bu, ndc. You can prevent this
> misbehaviour by enabling ONLY_FULL_GROUP_BY in sql_mode, mysql will then
> return an error instead of a wrong result.
>
> /Lennart
>
> [...]

thanks for explaining it.  It was very useful.
0
jr
1/18/2011 5:09:23 PM
On 2011-01-18 18:09, jr wrote:
[...]
>> This query is illegal and will give you random results. It will only
>> work if id is functionally dependent of bu, ndc. You can prevent this
>> misbehaviour by enabling ONLY_FULL_GROUP_BY in sql_mode, mysql will then
>> return an error instead of a wrong result.
>>
>> /Lennart
>>
>> [...]
> 
> thanks for explaining it.  It was very useful.

You can verify by:

set sql_mode='';

create table t (
    id int not null primary key,
    bu int not null,
    ndc int not null
) engine = innodb;

insert into t (id, bu, ndc) values (1,1,1),(2,1,1),(3,1,1);

select id, bu, ndc, count(1) from t group by bu, ndc;
+----+----+-----+----------+
| id | bu | ndc | count(1) |
+----+----+-----+----------+
|  1 |  1 |   1 |        3 |
+----+----+-----+----------+
1 row in set (0.00 sec)

id might be any one of 1, 2 or 3

/Lennart

0
Lennart
1/18/2011 7:25:10 PM
Reply: