Re: Bulkloading using COPY - ignore duplicates? - Mailing list pgsql-hackers

From Thomas Swan
Subject Re: Bulkloading using COPY - ignore duplicates?
Date
Msg-id 3BB87BE2.3000406@olemiss.edu
In response to Bulkloading using COPY - ignore duplicates?  (Lee Kindness <lkindness@csl.co.uk>)
List pgsql-hackers
Lee Kindness wrote:

>Tom Lane writes:
> > Lee Kindness <lkindness@csl.co.uk> writes:
> > > Would this seem a reasonable thing to do? Does anyone rely on COPY
> > > FROM causing an ERROR on duplicate input?
> > Yes.  This change will not be acceptable unless it's made an optional
> > (and not default, IMHO, though perhaps that's negotiable) feature of
> > COPY.
>
>I see where you're coming from, but seriously what's the use/point of
>COPY aborting and doing a rollback if one duplicate key is found? I
>think it's quite reasonable to presume the input to COPY has had as
>little processing done on it as possible. I could loop through the
>input file before sending it to COPY but that's just wasting cycles
>and effort - Postgres has btree lookup built in, I don't want to roll
>my own before giving Postgres my input file!
>
> > The implementation might be rather messy too.  I don't much care
> > for the notion of a routine as low-level as bt_check_unique knowing
> > that the context is or is not COPY.  We might have to do some
> > restructuring.
>
>Well in reality it wouldn't be "you're getting run from copy" but
>rather "notice on duplicate, rather than error & exit". There is a
>telling comment in nbtinsert.c just before _bt_check_unique() is
>called:
>
>    /*
>     * If we're not allowing duplicates, make sure the key isn't already
>     * in the index.  XXX this belongs somewhere else, likely
>     */
>
>So perhaps dupes should be searched for before _bt_doinsert is called,
>or somewhere more appropriate?
>
> > > Would:
> > >  WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > > need to be added to the COPY command (I hope not)?
> > It occurs to me that skip-the-insert might be a useful option for
> > INSERTs that detect a unique-key conflict, not only for COPY.  (Cf.
> > the regular discussions we see on whether to do INSERT first or
> > UPDATE first when the key might already exist.)  Maybe a SET variable
> > that applies to all forms of insertion would be appropriate.
>
>That makes quite a bit of sense.
>
This is trying to avoid one step.

IMHO, you should copy into a temporary table and then do a select 
distinct from it into the table that you want.

A.  You can validate your data before you put it into your permanent table.
B.  This doesn't cost you much.

Don't assume that bulk copies haven't been checked or validated.  COPY 
should be able to assume correct data; if the data isn't correct, you 
shouldn't be using COPY.
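The two-step approach above can be sketched as follows (table and file names are hypothetical; note that SELECT DISTINCT only removes rows duplicated within the input file, not rows that collide with keys already in the permanent table):

```sql
-- 1. Load the raw file into a temporary staging table with the
--    same column layout as the permanent table.
CREATE TEMP TABLE staging (LIKE target_table);
COPY staging FROM '/path/to/bulk.dat';

-- 2. Move only distinct rows into the permanent table, then clean up.
INSERT INTO target_table
SELECT DISTINCT * FROM staging;

DROP TABLE staging;
```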







