Thread: Removing duplicate entries

Removing duplicate entries

From
"Scott Ford"
Date:
Hi all -

I have two tables - customers and documents.  Each customer can be
related to several different documents.  There has been a bug in some
software that's been duplicating the addition of a certain type of
document for the customers.  Even though each customer can have
documents of the same type, in this case I don't want it.

customers
    customer_id
    ...

documents
    customer_id
    document_id
    document_type_id
    ...

So, for example, there are two documents with the same document_type_id
associated with one customer.

Can someone help me with a SQL statement that might help me remove the
duplicate documents for a certain document_type_id?

Thanks,
Scott.

Re: Removing duplicate entries

From
Jaime Casanova
Date:
On 1/11/06, Scott Ford <Scott.Ford@bullfrogpower.com> wrote:
> Hi all -
>
> I have two tables - customers and documents.  Each customer can be
> related to several different documents.  There has been a bug in some
> software that's been duplicating the addition of a certain type of
> document for the customers.  Even though each customer can have
> documents of the same type, in this case I don't want it.
>
> customers
>        customer_id
>        ...
>
> documents
>        customer_id
>        document_id
>        document_type_id
>        ...
>
> So, for example, there are two documents with the same document_type_id
> associated with one customer.
>
> Can someone help me with a SQL statement that might help me remove the
> duplicate documents for a certain document_type_id?
>
> Thanks,
> Scott.
>

i think what you need is a SQL to _search_ all the cases...
then look at them... and deletes them one at a time... that's the
safer way to go...

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: Removing duplicate entries

From
Michael Fuhr
Date:
On Wed, Jan 11, 2006 at 02:06:53PM -0500, Scott Ford wrote:
> customers
>     customer_id
>     ...
>
> documents
>     customer_id
>     document_id
>     document_type_id
>     ...
>
> So, for example, there are two documents with the same document_type_id
> associated with one customer.
>
> Can someone help me with a SQL statement that might help me remove the
> duplicate documents for a certain document_type_id?

Is document_id a primary key (or otherwise unique)?  If so then
something like this might work:

DELETE FROM documents WHERE document_id NOT IN (
  SELECT min(document_id)
  FROM documents
  GROUP BY customer_id, document_type_id
);

Be sure to understand what this query does before running it; I
might be making assumptions about your data that aren't correct.
I'd advise trying this or any other suggestion against test data
before using it on data you don't want to lose, and I'd also recommend
using a transaction that you can roll back if necessary (i.e., start
a transaction, run the delete, run some queries to make sure the
changes are correct, then either commit or roll back the transaction).

--
Michael Fuhr

Re: Removing duplicate entries

From
"Scott Ford"
Date:
Okay, let me elaborate a little more and maybe it will help.  Although I
think this is along the right track.

Okay - here's the problem in more detail:

customer
    customer_id (pk)
    ...

documentation
    documentation_id (pk)
    customer_id (fk)
    document_type_id (fk)
    accepted

document_types
    document_type_id (pk)
    document_name
    ...

What I want to do is remove duplicate entries for the same
customer_id/document_type_id for only a certain document_type_id.

-----Original Message-----
From: Michael Fuhr [mailto:mike@fuhr.org]
Sent: January 11, 2006 3:18 PM
To: Scott Ford
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] Removing duplicate entries

On Wed, Jan 11, 2006 at 02:06:53PM -0500, Scott Ford wrote:
> customers
>     customer_id
>     ...
>
> documents
>     customer_id
>     document_id
>     document_type_id
>     ...
>
> So, for example, there are two documents with the same
document_type_id
> associated with one customer.
>
> Can someone help me with a SQL statement that might help me remove the
> duplicate documents for a certain document_type_id?

Is document_id a primary key (or otherwise unique)?  If so then
something like this might work:

DELETE FROM documents WHERE document_id NOT IN (
  SELECT min(document_id)
  FROM documents
  GROUP BY customer_id, document_type_id
);

Be sure to understand what this query does before running it; I
might be making assumptions about your data that aren't correct.
I'd advise trying this or any other suggestion against test data
before using it on data you don't want to lose, and I'd also recommend
using a transaction that you can roll back if necessary (i.e., start
a transaction, run the delete, run some queries to make sure the
changes are correct, then either commit or roll back the transaction).

--
Michael Fuhr

Re: Removing duplicate entries

From
Michael Fuhr
Date:
On Wed, Jan 11, 2006 at 04:21:45PM -0500, Scott Ford wrote:
> customer
>     customer_id (pk)
>     ...
>
> documentation
>     documentation_id (pk)
>     customer_id (fk)
>     document_type_id (fk)
>     accepted
>
> document_types
>     document_type_id (pk)
>     document_name
>     ...
>
> What I want to do is remove duplicate entries for the same
> customer_id/document_type_id for only a certain document_type_id.

Will a modified version of my earlier query work?  For example,
suppose I start with this data:

SELECT * FROM doc ORDER BY cust_id, doc_type_id, doc_id;
 doc_id | cust_id | doc_type_id
--------+---------+-------------
      1 |       1 |           1
      2 |       1 |           1
      3 |       1 |           2
      4 |       1 |           2
      5 |       1 |           2
      6 |       1 |           3
      7 |       1 |           3
      8 |       2 |           1
      9 |       2 |           2
     10 |       2 |           3
(10 rows)

Then I run this delete to remove duplicates for doc_type_id 2:

DELETE FROM doc
WHERE doc_type_id = 2 AND doc_id NOT IN (
  SELECT min(doc_id)
  FROM doc
  WHERE doc_type_id = 2 -- not necessary but probably more efficient
  GROUP BY cust_id, doc_type_id
);

I end up with this:

SELECT * FROM doc ORDER BY cust_id, doc_type_id, doc_id;
 doc_id | cust_id | doc_type_id
--------+---------+-------------
      1 |       1 |           1
      2 |       1 |           1
      3 |       1 |           2
      6 |       1 |           3
      7 |       1 |           3
      8 |       2 |           1
      9 |       2 |           2
     10 |       2 |           3
(8 rows)

cust_id 1's duplicates for doc_type_id 2 have been removed (doc_id
4 and 5) but cust_id 1's duplicates for doc_type_id 1 and 3 remain.
cust_id 2 had no duplicates; all of its records remain.

Is that what you're looking for?  If not then please post some
sample data and describe exactly which records you want to delete.
It would be helpful if you post the example as CREATE TABLE and
INSERT statements that people can load into their own database.

--
Michael Fuhr

Re: Removing duplicate entries

From
Michael Fuhr
Date:
[Please copy the mailing list on replies.]

On Thu, Jan 12, 2006 at 10:24:10AM -0500, Scott Ford wrote:
> That is exactly the result that I'm looking for.  But won't that
> actually remove the documents with doc_type_id where there is only one
> for a given customer too?

No, it shouldn't (see below).

> Maybe I need to read up on my GROUP BY clause.
>
> I thought that:
>
> SELECT min(doc_id)
>   FROM doc
>   WHERE doc_type_id = 2 -- not necessary but probably more efficient
>   GROUP BY cust_id, doc_type_id
>
> would select one instance of doc_type_id = 2 from each customer if there
> was one, and would select the one with the lowest doc_id if there was
> more then one.

Right.  This subquery generates the list of doc_id's *not* to delete
(you could use max instead of min; the point is to choose exactly
one out of the set).  The WHERE clause in the subquery isn't necessary
because the DELETE already restricts the affected records to those
with doc_type_id 2; however, adding the restriction to the subquery
generates a smaller list of doc_id's to check against, which might
make the delete faster.  In other words, the delete statement says
"delete all records with doc_type_id 2 except for these", where
"these" is the list generated by the subquery.  "These" can contain
but doesn't need to contain doc_id's for documents with a doc_type_id
other than 2 because we've already said they're not eligible for
deletion.

> So then wouldn't the row:
>
> doc_id | cust_id | doc_type_id
> --------+---------+-------------
>       9 |       2 |           2
>
> get removed too?

No, as the example I posted shows (the output came from sample data
and actually running the queries shown).

> I'll test this out some test data in the meant time.

Please do: it's important that you understand how the delete works
and be satisfied that it *does* work.  As I mentioned before, I
might be misunderstanding your requirements or making unwarranted
assumptions about your data; ultimately it's up to you to determine
whether the delete works or not.

--
Michael Fuhr

Re: Removing duplicate entries

From
"Scott Ford"
Date:
Right.  Missed the *not*.  It worked just fine on the test data.  Thanks
again, Mike.

Scott.

-----Original Message-----
From: Michael Fuhr [mailto:mike@fuhr.org]
Sent: January 12, 2006 12:00 PM
To: Scott Ford
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] Removing duplicate entries

[Please copy the mailing list on replies.]

On Thu, Jan 12, 2006 at 10:24:10AM -0500, Scott Ford wrote:
> That is exactly the result that I'm looking for.  But won't that
> actually remove the documents with doc_type_id where there is only one
> for a given customer too?

No, it shouldn't (see below).

> Maybe I need to read up on my GROUP BY clause.
>
> I thought that:
>
> SELECT min(doc_id)
>   FROM doc
>   WHERE doc_type_id = 2 -- not necessary but probably more efficient
>   GROUP BY cust_id, doc_type_id
>
> would select one instance of doc_type_id = 2 from each customer if
there
> was one, and would select the one with the lowest doc_id if there was
> more then one.

Right.  This subquery generates the list of doc_id's *not* to delete
(you could use max instead of min; the point is to choose exactly
one out of the set).  The WHERE clause in the subquery isn't necessary
because the DELETE already restricts the affected records to those
with doc_type_id 2; however, adding the restriction to the subquery
generates a smaller list of doc_id's to check against, which might
make the delete faster.  In other words, the delete statement says
"delete all records with doc_type_id 2 except for these", where
"these" is the list generated by the subquery.  "These" can contain
but doesn't need to contain doc_id's for documents with a doc_type_id
other than 2 because we've already said they're not eligible for
deletion.

> So then wouldn't the row:
>
> doc_id | cust_id | doc_type_id
> --------+---------+-------------
>       9 |       2 |           2
>
> get removed too?

No, as the example I posted shows (the output came from sample data
and actually running the queries shown).

> I'll test this out some test data in the meant time.

Please do: it's important that you understand how the delete works
and be satisfied that it *does* work.  As I mentioned before, I
might be misunderstanding your requirements or making unwarranted
assumptions about your data; ultimately it's up to you to determine
whether the delete works or not.

--
Michael Fuhr