Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) - Mailing list pgsql-general

From Tom Lane
Subject Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Msg-id 14733.1418093544@sss.pgh.pa.us
In response to Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-general
Scott Marlowe <scott.marlowe@gmail.com> writes:
> If you're de-duping a whole table, no need to create indexes, as it's
> gonna have to hit every row anyway. Fastest way I've found has been:

> select a,b,c into newtable from oldtable group by a,b,c;

> One pass, done.

> If you want to use less than the whole row, you can use select
> distinct on (col1, col2) * into newtable from oldtable;
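For reference, the two approaches quoted above look like this written out in full (table and column names are placeholders from the quoted message):

```sql
-- Whole-row de-duplication: one pass over the table, no index needed.
SELECT a, b, c
INTO newtable
FROM oldtable
GROUP BY a, b, c;

-- De-duplication on a subset of columns: retains one arbitrary row
-- per (col1, col2) group unless an ORDER BY is added.
SELECT DISTINCT ON (col1, col2) *
INTO newtable
FROM oldtable;
```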

Also, the DISTINCT ON method can be refined to control which of a set of
duplicate keys is retained, if you can identify additional columns that
constitute a preference order for retaining/discarding dupes.  See the
"latest weather reports" example in the SELECT reference page.
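A sketch of that refinement, loosely modeled on the "latest weather reports" example from the SELECT reference page (the table and column names here are illustrative, not from the original thread):

```sql
-- Rows within each (location) group are ordered so the preferred row
-- comes first; DISTINCT ON then keeps that first row per group,
-- i.e. the most recent report for each location.
SELECT DISTINCT ON (location)
       location, report_time, temperature
INTO   deduped_reports
FROM   weather_reports
ORDER  BY location, report_time DESC;
```

The trailing columns of the ORDER BY (everything after the DISTINCT ON columns) are what define the preference order for which duplicate is retained.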

In any case, it's advisable to crank up work_mem while performing this
operation.
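For example, work_mem can be raised for just the one transaction, so the sort or hash used for de-duplication is more likely to stay in memory (the value below is illustrative; size it to your data and available RAM):

```sql
BEGIN;
-- SET LOCAL applies only to this transaction and reverts at COMMIT/ROLLBACK.
SET LOCAL work_mem = '1GB';
SELECT a, b, c INTO newtable FROM oldtable GROUP BY a, b, c;
COMMIT;
```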

            regards, tom lane

