Re: Bulkloading using COPY - ignore duplicates? - Mailing list pgsql-hackers

From	Ross J. Reedstrom
Subject	Re: Bulkloading using COPY - ignore duplicates?
Date	December 13, 2001 11:36:25
Msg-id	20011213102259.B10349@rice.edu
In response to	Re: Bulkloading using COPY - ignore duplicates? (Lee Kindness <lkindness@csl.co.uk>)
List	pgsql-hackers

On Thu, Dec 13, 2001 at 03:44:31PM +0000, Lee Kindness wrote:
> Patrick Welche writes:
>  > On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
>  > > That's what I'm currently doing as a workaround - a SELECT DISTINCT
>  > > from a temporary table into the real table with the unique index on
>  > > it. However this takes absolute ages - say 5 seconds for the copy
>  > > (which is the ballpark figure I'm aiming toward and can achieve with
>  > > Ingres) plus another 30ish seconds for the SELECT DISTINCT.
>  > Then your column really isn't unique,
> 
> That's another discussion entirely ;) - it's spat out by a real-time
> system which doesn't have the time or resources to check this. Further
> precision loss later in the data's life adds more duplicates...

Hmm, the data has a later life - sounds like you'll need to remove dups
then, anyway, so can you get away with just letting the dups in? Remove
the UNIQUE requirement, and let the real time system just dump away.
How critical is it to later steps that there be no dups? And how many
(potential) dups is your RTS producing, anyway?

Your later processing (which apparently can _generate_ dups) might be
the place, outside the critical time path, to worry about removing dups.

Ross

P.S. This falls into the class of problem solving characterized by
"if you can't solve the problem as stated, restate the problem to be
one you _can_ solve" ;-)
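
A minimal sketch of the "let the dups in, clean up later" idea, using
Python's sqlite3 module as a stand-in for PostgreSQL. That substitution is
an assumption made only to keep the example self-contained; COPY and the
real-time feed are not modelled. The table and index names follow Lee's
example quoted below, while load_then_dedupe and the sample rows are
hypothetical.

```python
import sqlite3


def load_then_dedupe(rows):
    """Bulk-load rows with no unique constraint, then dedupe and index.

    sqlite3 stands in for PostgreSQL here (an assumption for illustration).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT)")

    # Critical path: plain bulk insert, nothing is checked for uniqueness.
    conn.executemany("INSERT INTO tab VALUES (?, ?, ?, ?)", rows)

    # Later, off the critical path: keep the first-loaded row per (p1, p2).
    conn.execute(
        "DELETE FROM tab WHERE rowid NOT IN "
        "(SELECT MIN(rowid) FROM tab GROUP BY p1, p2)"
    )

    # Only now enforce uniqueness for whatever consumes the table next.
    conn.execute("CREATE UNIQUE INDEX tab_idx ON tab (p1, p2)")
    conn.commit()
    return conn


if __name__ == "__main__":
    sample = [
        (1, 1, 10, 20),
        (1, 1, 11, 21),  # duplicate key (1, 1)
        (2, 3, 12, 22),
        (2, 3, 13, 23),  # duplicate key (2, 3)
        (4, 5, 14, 24),
    ]
    conn = load_then_dedupe(sample)
    for row in conn.execute("SELECT * FROM tab ORDER BY p1, p2"):
        print(row)
```

Whether the cleanup step is cheap enough depends on how many dups the
feed actually produces, which is the open question above.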

> 
>  > so how about dropping the unique index, import the data, fix the
>  > duplicates, recreate the unique index - just as another possible
>  > work around ;) 
> 
> This is just going to be the same(ish) time, no?
> 
>  CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
>  COPY tab FROM 'file';
>  -- keep one row per (p1, p2); ctid is PostgreSQL's physical row id
>  DELETE FROM tab a USING tab b
>   WHERE a.p1 = b.p1 AND a.p2 = b.p2 AND a.ctid > b.ctid;
>  CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);
> 
> or am I missing something?
> 
> Thanks, Lee.
> 
> -- 
>  Lee Kindness, Senior Software Engineer, Concept Systems Limited.
>  http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
> http://www.postgresql.org/users-lounge/docs/faq.html
