Re: SQL for removing duplicates? - Mailing list pgsql-novice
From | |
---|---|
Subject | Re: SQL for removing duplicates? |
Date | |
Msg-id | 200606132222.k5DMMbv21040@panix3.panix.com Whole thread Raw |
In response to | Re: SQL for removing duplicates? (Brad Nicholson <bnichols@ca.afilias.info>) |
List | pgsql-novice |
From: Brad Nicholson <bnichols@ca.afilias.info> Date: Jun 13, 2006 2:26 PM Subject: Re: [NOVICE] SQL for removing duplicates? To: kynn@panix.com kynn@panix.com wrote: > Hi. I'm stumped. I have a large table (about 8.5M records), let's > call it t, whose columns include x and y. I want to remove records > from this table so that any pair of values for these two fields appear > only once. (This will get rid of about 15% of the records in t.) > > One simple solution would be something like > > CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t; > DROP TABLE t; > ALTER TABLE tmp RENAME TO t; > > This works, but it uses a lot of space. I would prefer to simply cull > the unwanted records from t, but I just can't figure out the SQL for > it. Any help with it would be *much* appreciated. If your table is created with OIDs, this should work. If not add a unique column to the table and use that in place of oid. DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT max(t3.oid) from t t3 group by t3.x, t3.y); Also note, the query plan for this is going to be very ugly, it might very well be cheaper to use the solution that you initially mentioned. Thanks! That's just what I was looking for. kj -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp. ------=_Part_9560_6324061.1150236298505 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline <br><br>---------- Forwarded message ----------<br><span class="gmail_quote">From: <b class="gmail_sendername">Brad Nicholson</b><<a href="mailto:bnichols@ca.afilias.info">bnichols@ca.afilias.info</a>><br>Date: Jun 13, 2006 2:26 PM <br>Subject: Re: [NOVICE] SQL for removing duplicates?<br>To: <a href="mailto:kynn@panix.com">kynn@panix.com</a><br>Cc:<a href="mailto:pgsql-novice@postgresql.org">pgsql-novice@postgresql.org</a><br><br></span><ahref="mailto:kynn@panix.com"> kynn@panix.com</a> wrote:<br>> Hi. I'm stumped. I have a large table (about 8.5M records), let's<br>>call it t, whose columns include x and y. I want to remove records<br>> from this table so thatany pair of values for these two fields appear <br>> only once. (This will get rid of about 15% of the records in t.)<br>><br>> One simple solutionwould be something like<br>><br>> CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;<br>> DROP TABLE t; <br>> ALTER TABLE tmp RENAME TO t;<br>><br>> This works, but it uses a lot of space. Iwould prefer to simply cull<br>> the unwanted records from t, but I just can't figure out the SQL for<br>>it. Any help with it would be *much* appreciated. <br><br><br>If your table is created with OIDs, this should work. If not add a<br>unique column to the tableand use that in place of oid.<br><br>DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT<br>max(t3.oid ) from t t3 group by t3.x, t3.y);<br><br>Also note, the query plan for this is going to be very ugly, it might<br>verywell be cheaper to use the solution that you initially mentioned.<br><br><br><br><br><br><br>--<br>Brad Nicholson 416-673-4106 <br>Database Administrator, Afilias Canada Corp.<br><br> ------=_Part_9560_6324061.1150236298505--
pgsql-novice by date: