Home > mailing lists

Re: SQL for removing duplicates? - Mailing list pgsql-novice

From
Subject	Re: SQL for removing duplicates?
Date	June 13, 2006 19:22:47
Msg-id	200606132222.k5DMMbv21040@panix3.panix.com Whole thread Raw
In response to	Re: SQL for removing duplicates? (Brad Nicholson <bnichols@ca.afilias.info>)
List	pgsql-novice

Tree view

   From: Brad Nicholson <bnichols@ca.afilias.info>
   Date: Jun 13, 2006 2:26 PM
   Subject: Re: [NOVICE] SQL for removing duplicates?
   To: kynn@panix.com

   kynn@panix.com wrote:
   > Hi.  I'm stumped.  I have a large table (about 8.5M records), let's
   > call it t, whose columns include x and y.  I want to remove records
   > from this table so that any pair of values for these two fields appear
   > only once.  (This will get rid of about 15% of the records in t.)
   >
   > One simple solution would be something like
   >
   >   CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;
   >   DROP TABLE t;
   >   ALTER TABLE tmp RENAME TO t;
   >
   > This works, but it uses a lot of space.  I would prefer to simply cull
   > the unwanted records from t, but I just can't figure out the SQL for
   > it.  Any help with it would be *much* appreciated.

   If your table is created with OIDs, this should work.  If not add a
   unique column to the table and use that in place of oid.

   DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT
   max(t3.oid) from t t3 group by t3.x, t3.y);

   Also note, the query plan for this is going to be very ugly, it might
   very well be cheaper to use the solution that you initially mentioned.

Thanks!  That's just what I was looking for.

kj

   --
   Brad Nicholson  416-673-4106
   Database Administrator, Afilias Canada Corp.

   ------=_Part_9560_6324061.1150236298505
   Content-Type: text/html; charset=ISO-8859-1
   Content-Transfer-Encoding: 7bit
   Content-Disposition: inline

   <br><br>---------- Forwarded message ----------<br><span class="gmail_quote">From: <b class="gmail_sendername">Brad
Nicholson</b><<a href="mailto:bnichols@ca.afilias.info">bnichols@ca.afilias.info</a>><br>Date: Jun 13, 2006 2:26
PM
   <br>Subject: Re: [NOVICE] SQL for removing duplicates?<br>To: <a
href="mailto:kynn@panix.com">kynn@panix.com</a><br>Cc:<a
href="mailto:pgsql-novice@postgresql.org">pgsql-novice@postgresql.org</a><br><br></span><ahref="mailto:kynn@panix.com"> 
   kynn@panix.com</a> wrote:<br>> Hi.  I'm stumped.  I have a large table (about 8.5M records),
let's<br>>call it t, whose columns include x and y.  I want to remove records<br>> from this table so
thatany pair of values for these two fields appear 
   <br>> only once.  (This will get rid of about 15% of the records in t.)<br>><br>> One simple
solutionwould be something like<br>><br>>   CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM
t;<br>>  DROP TABLE t; 
   <br>>   ALTER TABLE tmp RENAME TO t;<br>><br>> This works, but it uses a lot of
space.  Iwould prefer to simply cull<br>> the unwanted records from t, but I just can't figure out the SQL
for<br>>it.  Any help with it would be *much* appreciated. 
   <br><br><br>If your table is created with OIDs, this should work.  If not add a<br>unique column to the
tableand use that in place of oid.<br><br>DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT
SELECT<br>max(t3.oid
   ) from t t3 group by t3.x, t3.y);<br><br>Also note, the query plan for this is going to be very ugly, it
might<br>verywell be cheaper to use the solution that you initially mentioned.<br><br><br><br><br><br><br>--<br>Brad
Nicholson  416-673-4106
   <br>Database Administrator, Afilias Canada Corp.<br><br>

   ------=_Part_9560_6324061.1150236298505--

pgsql-novice by date:

From: Brad Nicholson
Date: 13 June 2006, 15:26:23
Subject: Re: SQL for removing duplicates?

From: Tom Allison
Date: 13 June 2006, 21:19:35
Subject: ruby

Re: SQL for removing duplicates? - Mailing list pgsql-novice

Previous

Next