Thread: duplicates

duplicates

From

Allan Kelly

Date:

07 April 2000, 06:46:54

I have a system bug which means that I have occasional dupicate entries in
my 'subscribers' table. I'm trying to find a query which performs something
like a 'where user_name is not unique' query.

At the moment we use this cludge:

select count(*), user_name from subscribersgroup by user_name order by count;

(I'm surprised I can't add 'where count > 1' but it seems conditionals on 
aggregate fields are not allowed).

This gives me a very long list with the 'not unique' entries at the bottom, eg

count | user_name
------+------------------
1     | bill.hicks [ ..cut 9 zillion results.. ]
1     | margaret.thatcher
4     | linus.torvalds
9     | bill.gates

I then have to do 

select oid from subscribers where user_name = 'linus.torvalds';

and delete all but one of the records. Is there a better way to do this?
And yes, we're working on a system fix to avoid the problem in the 1st place!

TIA, al.

-- 
# Allan Kelly                                http://www.plotsearch.co.uk# (+44) (0)131 524 8500#
allan.kelly@buildstore.co.uk...    ..# /Software Engineer/i            . .    .    . .# ------------------------------
*      . .     .    . .# "If you are a Visual Basic programmer,   *       . .     .#  these details are none of your
business."       *       .  . .# Mr Bunny's Guide to Active X, by Carlton Egremont III      *     . .#
------------------------------     vi: set noet tw=80 sts=4 ts=8  : .

Re: duplicates

From

Patrik Kudo

Date:

07 April 2000, 07:28:57

Hi Allan

Firstly I'd suggest a unique index on the column that should be unique
to force uniqueness on that column. But since you already have double
tuples, you can do the following:

delete from subscribers where exists (select 1 from subscribers s where s.user_name = subscribers.user_name  and s.oid
>subscribers.oid)
 

I'm not 100% certain it'll do the work right, so PLEASE try it out with
a testtable first =)

Regards,
Patrik Kudo

Allan Kelly wrote:
> 
> I have a system bug which means that I have occasional dupicate entries in
> my 'subscribers' table. I'm trying to find a query which performs something
> like a 'where user_name is not unique' query.
> 
> At the moment we use this cludge:
> 
> select count(*), user_name from subscribers
>         group by user_name order by count;
> 
> (I'm surprised I can't add 'where count > 1' but it seems conditionals on
> aggregate fields are not allowed).
> 
> This gives me a very long list with the 'not unique' entries at the bottom, eg
> 
> count | user_name
> ------+------------------
> 1     | bill.hicks
>   [ ..cut 9 zillion results.. ]
> 1     | margaret.thatcher
> 4     | linus.torvalds
> 9     | bill.gates
> 
> I then have to do
> 
> select oid from subscribers where user_name = 'linus.torvalds';
> 
> and delete all but one of the records. Is there a better way to do this?
> And yes, we're working on a system fix to avoid the problem in the 1st place!
> 
> TIA, al.

Re: duplicates

From

Patrik Kudo

Date:

07 April 2000, 07:33:58

Hi again.

Your query:

> select count(*), user_name from subscribers
>         group by user_name order by count;
> 

would work if you instead wrote:

select count(*), user_name from subscribersgroup by user_name having count(*) > 1

Regards,
Patrik Kudo

Re: duplicates

From

"Moray McConnachie"

Date:

07 April 2000, 08:08:00

----- Original Message -----
From: "Allan Kelly" <allan.kelly@buildstore.co.uk>
To: <pgsql-sql@postgresql.org>
Sent: Friday, April 07, 2000 11:34 AM
Subject: [SQL] duplicates


> I have a system bug which means that I have occasional dupicate entries in
> my 'subscribers' table. I'm trying to find a query which performs
something
> like a 'where user_name is not unique' query.
>
> At the moment we use this cludge:
>
> select count(*), user_name from subscribers
> group by user_name order by count;
>
> (I'm surprised I can't add 'where count > 1' but it seems conditionals on
> aggregate fields are not allowed).
>
> This gives me a very long list with the 'not unique' entries at the
bottom, eg
>
> count | user_name
> ------+------------------
> 1     | bill.hicks
>   [ ..cut 9 zillion results.. ]
> 1     | margaret.thatcher
> 4     | linus.torvalds
> 9     | bill.gates
>
> I then have to do
>
> select oid from subscribers where user_name = 'linus.torvalds';
>
> and delete all but one of the records. Is there a better way to do this?

DELETE FROM subscribers WHERE EXISTS(SELECT x.username FROM subscribers x
WHERE subscribers.oid
<x.oid AND subscribers.username LIKE x.username);

ought to do it, I *think*. It will delete all but the most recently inserted
copy, i.e. that with the highest oid. Self-joins (which this is,
effectively), are often quicker and more efficient than counts.

Yours,
Moray

Re: duplicates

From

"Bryan White"

Date:

07 April 2000, 10:07:27

> I have a system bug which means that I have occasional dupicate entries in
> my 'subscribers' table. I'm trying to find a query which performs
something
> like a 'where user_name is not unique' query.
>
> At the moment we use this cludge:
>
> select count(*), user_name from subscribers
> group by user_name order by count;
>
> (I'm surprised I can't add 'where count > 1' but it seems conditionals on
> aggregate fields are not allowed).
>
> This gives me a very long list with the 'not unique' entries at the
bottom, eg
>
> count | user_name
> ------+------------------
> 1     | bill.hicks
>   [ ..cut 9 zillion results.. ]
> 1     | margaret.thatcher
> 4     | linus.torvalds
> 9     | bill.gates
>
> I then have to do
>
> select oid from subscribers where user_name = 'linus.torvalds';
>
> and delete all but one of the records. Is there a better way to do this?
> And yes, we're working on a system fix to avoid the problem in the 1st
place!

I would create a unique index on the user_name field (it sounds like a field
you want indexed anyway).
This way the insert will fail when it is a duplicate.  Of course you may
have to alter the code that does the insert to ckeck for failure and handle
it.  Also you will have to delete your duplicates before creating the unique
index.