Thread: Bulkloading using COPY - ignore duplicates?
Hello, I'm in the process of porting a large application from Ingres to PostgreSQL. We make heavy use of bulkloading using the 'COPY' statement in ESQL/C. Consider the SQL statements below (in a psql session on an arbitrary database):

  CREATE TABLE copytest(f1 INTEGER, f2 INTEGER);
  CREATE UNIQUE INDEX copytest_idx ON copytest USING BTREE(f1, f2);
  COPY copytest FROM '/tmp/copytest';

Given the file /tmp/copytest:

  1	1
  2	2
  3	3
  4	4
  4	4
  5	5
  6	6

will result in the following output:

  ERROR: copy: line 5, Cannot insert a duplicate key into unique index copytest_idx

However my application code is assuming that duplicate rows will simply be ignored (this is the case in Ingres, and I believe Oracle's bulkloader too). I propose modifying _bt_check_unique() in /backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than ERROR) elog() and return NULL (or appropriate) to the calling function if a duplicate key is detected and a 'COPY FROM' is in progress (add a new parameter to flag this).

Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input? Would:

  WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)

need to be added to the COPY command (I hope not)?

Thanks,

-- 
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Lee Kindness wrote:
> <snip>
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)

I would suggest:

  WITH ON_DUPLICATE = IGNORE|TERMINATE

Or maybe IGNORE_DUPLICATE, purely for easier understanding, given there is no present standard nor other databases' syntax to conform to. :)

Regards and best wishes,

Justin Clift

> > need to be added to the COPY command (I hope not)?
>
> Thanks,
>
> --
> Lee Kindness, Senior Software Engineer
> Concept Systems Limited.

-- 
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
Lee Kindness <lkindness@csl.co.uk> writes: > Would this seem a reasonable thing to do? Does anyone rely on COPY > FROM causing an ERROR on duplicate input? Yes. This change will not be acceptable unless it's made an optional (and not default, IMHO, though perhaps that's negotiable) feature of COPY. The implementation might be rather messy too. I don't much care for the notion of a routine as low-level as bt_check_unique knowing that the context is or is not COPY. We might have to do some restructuring. > Would: > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar) > need to be added to the COPY command (I hope not)? It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate. regards, tom lane
Justin Clift writes:
> Lee Kindness wrote:
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> I would suggest:
> WITH ON_DUPLICATE = IGNORE|TERMINATE
> purely for easier understanding, given there is no present standard nor other databases' syntax to conform to.

Personally I don't see the need, and think that 'COPY FROM' could well just go with the new semantics...

Onto an implementation issue - _bt_check_unique() returns a TransactionId; my plan was to return NullTransactionId on a duplicate key, but naturally this is used in the success scenario. Looking in backend/transam/transam.c I see:

  TransactionId NullTransactionId = (TransactionId) 0;
  TransactionId AmiTransactionId = (TransactionId) 512;
  TransactionId FirstTransactionId = (TransactionId) 514;

From this I'd gather values below 514 can be used as magic values/constants. So would I be safe doing:

  TransactionId XXXXTransactionId = (TransactionId) 1;

and returning XXXXTransactionId from _bt_check_unique() back to _bt_doinsert()? Naturally XXXX is something meaningful. I presume all I need to know is whether 'xwait' in _bt_check_unique() is ever '1'...

Thanks,

-- 
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Tom Lane writes:
> Lee Kindness <lkindness@csl.co.uk> writes:
> > Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input?
> Yes. This change will not be acceptable unless it's made an optional (and not default, IMHO, though perhaps that's negotiable) feature of COPY.

I see where you're coming from, but seriously, what's the use/point of COPY aborting and doing a rollback if one duplicate key is found? I think it's quite reasonable to presume the input to COPY has had as little processing done on it as possible. I could loop through the input file before sending it to COPY, but that's just wasting cycles and effort - Postgres has btree lookup built in, I don't want to roll my own before giving Postgres my input file!

> The implementation might be rather messy too. I don't much care for the notion of a routine as low-level as bt_check_unique knowing that the context is or is not COPY. We might have to do some restructuring.

Well, in reality it wouldn't be "you're getting run from copy" but rather "notice on duplicate, rather than error & exit". There is a telling comment in nbtinsert.c just before _bt_check_unique() is called:

  /*
   * If we're not allowing duplicates, make sure the key isn't already
   * in the index. XXX this belongs somewhere else, likely
   */

So perhaps dupes should be searched for before _bt_doinsert is called, or somewhere more appropriate?

> > Would:
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > need to be added to the COPY command (I hope not)?
> It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate.

That makes quite a bit of sense.

-- 
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Lee Kindness <lkindness@csl.co.uk> writes: > I see where you're coming from, but seriously what's the use/point of > COPY aborting and doing a rollback if one duplicate key is found? Error detection. If I'm loading what I think is valid data, having the system silently ignore certain types of errors is not acceptable --- I'm especially not pleased at the notion of removing an error check that's always been there because someone else thinks that would make it more convenient for his application. > I think it's quite reasonable to presume the input to COPY has had as > little processing done on it as possible. The primary and traditional use of COPY has always been to reload dumped data. That's why it doesn't do any fancy processing like DEFAULT insertion, and that's why it should be quite strict about error conditions. In a reload scenario, any sort of problem deserves careful investigation. regards, tom lane
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > I thought that the problem was, that you cannot simply skip the > insert, because at that time the tuple (pointer) might have already > been successfully inserted into an other index/heap, and thus this was > only sanely possible with savepoints/undo. Hmm, good point. If we don't error out the transaction then that tuple would become good when we commit. This is nastier than it appears. regards, tom lane
Tom Lane writes:
> I'm especially not pleased at the notion of removing an error check that's always been there because someone else thinks that would make it more convenient for his application.

Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario in industry. And I never said the duplicate should be silently ignored - an elog(NOTICE) should still be output.

Lee.
Lee Kindness wrote:
> Tom Lane writes:
> > Lee Kindness <lkindness@csl.co.uk> writes:
> > > Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input?
> > Yes. This change will not be acceptable unless it's made an optional (and not default, IMHO, though perhaps that's negotiable) feature of COPY.
>
> I see where you're coming from, but seriously what's the use/point of COPY aborting and doing a rollback if one duplicate key is found? I think it's quite reasonable to presume the input to COPY has had as little processing done on it as possible. I could loop through the input file before sending it to COPY but that's just wasting cycles and effort - Postgres has btree lookup built in, I don't want to roll my own before giving Postgres my input file!
>
> > The implementation might be rather messy too. I don't much care for the notion of a routine as low-level as bt_check_unique knowing that the context is or is not COPY. We might have to do some restructuring.
>
> Well in reality it wouldn't be "you're getting run from copy" but rather "notice on duplicate, rather than error & exit". There is a telling comment in nbtinsert.c just before _bt_check_unique() is called:
>
>   /*
>    * If we're not allowing duplicates, make sure the key isn't already
>    * in the index. XXX this belongs somewhere else, likely
>    */
>
> So perhaps dupes should be searched for before _bt_doinsert is called, or somewhere more appropriate?
>
> > > Would:
> > > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > > need to be added to the COPY command (I hope not)?
> > It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate.
>
> That makes quite a bit of sense.

This is trying to avoid one step. IMHO, you should copy into a temporary table and then do a select distinct from it into the table that you want.

A. You can validate your data before you put it into your permanent table.
B. This doesn't cost you much.

Don't make the assumption that bulk copies have not been checked or validated. The assumption should be correct data or you shouldn't be using COPY.
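A minimal sketch of the staging-table approach described above, using the copytest table and /tmp/copytest file from the start of the thread; the staging table name is illustrative only:

  -- load the raw file into a staging table that has no unique index
  CREATE TEMPORARY TABLE copytest_stage (f1 INTEGER, f2 INTEGER);
  COPY copytest_stage FROM '/tmp/copytest';

  -- move only distinct rows into the real table, so the unique index
  -- on copytest never sees a duplicate
  INSERT INTO copytest SELECT DISTINCT f1, f2 FROM copytest_stage;

  DROP TABLE copytest_stage;

As later messages in the thread point out, it is the SELECT DISTINCT step that dominates the run time for large loads.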
> > Would this seem a reasonable thing to do? Does anyone rely on COPY > > FROM causing an ERROR on duplicate input? > > Yes. This change will not be acceptable unless it's made an optional > (and not default, IMHO, though perhaps that's negotiable) feature of > COPY. > > The implementation might be rather messy too. I don't much > care for the > notion of a routine as low-level as bt_check_unique knowing that the > context is or is not COPY. We might have to do some restructuring. > > > Would: > > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar) > > need to be added to the COPY command (I hope not)? > > It occurs to me that skip-the-insert might be a useful option for > INSERTs that detect a unique-key conflict, not only for COPY. (Cf. > the regular discussions we see on whether to do INSERT first or > UPDATE first when the key might already exist.) Maybe a SET variable > that applies to all forms of insertion would be appropriate. Imho yes, but: I thought that the problem was, that you cannot simply skip the insert, because at that time the tuple (pointer) might have already been successfully inserted into an other index/heap, and thus this was only sanely possible with savepoints/undo. An idea would probably be to at once mark the new tuple dead, and proceed normally? Andreas
Zeugswetter Andreas SB SD wrote:
> > IMHO, you should copy into a temporary table and then do a select distinct from it into the table that you want.
>
> Which would be way too slow for normal operation :-(
> We are talking about a "fast as possible" data load from a flat file that may have duplicates (or even data errors, but that is another issue).
>
> Andreas

Then the IGNORE_DUPLICATE would definitely be the way to go, if speed is the question...
> IMHO, you should copy into a temporary table and then do a select distinct from it into the table that you want.

Which would be way too slow for normal operation :-(

We are talking about a "fast as possible" data load from a flat file that may have duplicates (or even data errors, but that is another issue).

Andreas
Okay, if I'm going to modify 'COPY FROM' to include 'ignore duplicates' functionality it looks like I'll have to add to the COPY syntax. The most obvious way is to add:

  WITH IGNORE DUPLICATES

to the syntax. I'm going to need my hand held a bit for this! The grammar for COPY will need updating in gram.y, and specifically the 'WITH' clause will need to accept 'IGNORE DUPLICATES' as well as 'NULL AS'. Any pointers?

Thanks, Lee.
Lee Kindness writes:
> If I'm going to modify 'COPY FROM' to include 'ignore duplicates' functionality it looks like I'll have to add to the COPY syntax. The most obvious way is to add:
> WITH IGNORE DUPLICATES

Or does it make more sense to add a 'COPY_IGNORE_DUPLICATES' SET parameter?

Lee.
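For reference, the two proposed interfaces would look roughly like this at the SQL level. Both are sketches of syntax that does not exist at this point; the SET variable name is simply Lee's suggestion:

  -- per-statement option on COPY itself
  COPY copytest FROM '/tmp/copytest' WITH IGNORE DUPLICATES;

  -- session-level setting applying to subsequent loads
  SET copy_ignore_duplicates = on;
  COPY copytest FROM '/tmp/copytest';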
> However my application code is assuming that duplicate rows will
> simply be ignored (this is the case in Ingres, and I believe Oracle's
> bulkloader too). I propose modifying _bt_check_unique() in
> /backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than
> ERROR) elog() and return NULL (or appropriate) to the calling function
> if a duplicate key is detected and a 'COPY FROM' is in progress (add
> new parameter to flag this).

If you have a UNIQUE index on the table, just throwing away duplicates seems really bad to me. I know Ingres had that heapsort structure that would remove duplicates. That may be an interesting feature to add as an operation that can be performed.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
I said: > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: >> I thought that the problem was, that you cannot simply skip the >> insert, because at that time the tuple (pointer) might have already >> been successfully inserted into an other index/heap, and thus this was >> only sanely possible with savepoints/undo. > Hmm, good point. If we don't error out the transaction then that tuple > would become good when we commit. This is nastier than it appears. On further thought, I think it *would* be possible to do this without savepoints, but it'd take some restructuring of the index AM API. What'd have to happen is that a unique index could not raise an elog ERROR when it detects a uniqueness conflict. Instead, it'd return a uniqueness-conflict indication back to its caller. This would have to propagate up to the level of the executor. At that point we'd make the choice of whether to raise an error or not. If not, we'd need to modify the just-created tuple to mark it deleted by the current transaction. We can't remove it, since that would leave any already-created entries in other indexes pointing to nothing. But marking it deleted by the same xact and command ID that inserted it would leave things in a valid state until VACUUM comes along to do the janitorial work. To support backoff in the case of a conflict during UPDATE, it'd also be necessary to un-mark the prior version of the tuple, which we'd already marked as deleted. This might create some concurrency issues in case there are other updaters waiting to see if we commit or not. (The same issue would arise for savepoint-based undo, though.) We might want to punt on that part for now. The effects don't stop propagating there, either. The decision not to insert the tuple must be reported up still further, so that the executor knows not to run any AFTER INSERT/UPDATE triggers and knows not to count the tuple as inserted/updated for the command completion report. In short, quite a lot of code to touch to make this happen ... regards, tom lane
> -----Original Message-----
> From: Tom Lane
>
> I said:
> > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> >> I thought that the problem was, that you cannot simply skip the insert, because at that time the tuple (pointer) might have already been successfully inserted into an other index/heap, and thus this was only sanely possible with savepoints/undo.
>
> > Hmm, good point. If we don't error out the transaction then that tuple would become good when we commit. This is nastier than it appears.
>
> On further thought, I think it *would* be possible to do this without savepoints,

It's a very well known issue that partial rollback functionality is at the basis of this kind of problem, and it's the reason I've mentioned that UNDO functionality has the highest priority. IMHO we shouldn't implement partial rollback functionality specific to an individual problem.

regards,
Hiroshi Inoue
> The effects don't stop propagating there, either. The decision > not to insert the tuple must be reported up still further, so > that the executor knows not to run any AFTER INSERT/UPDATE > triggers and knows not to count the tuple as inserted/updated > for the command completion report. But what about BEFORE insert/update triggers which could insert records too? Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> The effects don't stop propagating there, either. The decision >> not to insert the tuple must be reported up still further, so >> that the executor knows not to run any AFTER INSERT/UPDATE >> triggers and knows not to count the tuple as inserted/updated >> for the command completion report. > But what about BEFORE insert/update triggers which could insert > records too? Well, what about them? It's already possible for a later BEFORE trigger to cause the actual insertion to be suppressed, so I don't see any difference from what we have now. If a BEFORE trigger takes actions on the assumption that the insert will happen, it's busted already. Mind you, I'm not actually advocating that we do any of this ;-). I was just sketching a possible implementation approach in case someone wants to try it. regards, tom lane
> > But what about BEFORE insert/update triggers which could > > insert records too? > > Well, what about them? It's already possible for a later > BEFORE trigger to cause the actual insertion to be suppressed, > so I don't see any difference from what we have now. > If a BEFORE trigger takes actions on the assumption that the > insert will happen, it's busted already. This problem could be solved now by implementing *single* trigger. In future, we could give users ability to specify trigger execution order. But with proposed feature ... > Mind you, I'm not actually advocating that we do any of this ;-). I understand -:) > I was just sketching a possible implementation approach in > case someone wants to try it. And I'm just sketching possible problems -:) Vadim
Tom Lane writes: > It occurs to me that skip-the-insert might be a useful option for > INSERTs that detect a unique-key conflict, not only for COPY. (Cf. > the regular discussions we see on whether to do INSERT first or > UPDATE first when the key might already exist.) Maybe a SET variable > that applies to all forms of insertion would be appropriate. What we need is: 1. Make errors not abort the transaction. 2. Error codes Then you can make your client deal with this in which ever way you want, at least for single-value inserts. However, it seems to me that COPY ignoring duplicates can easily be done by preprocessing the input file. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
I have used Oracle SQL*Loader for many years now. It has the ability to put rejects/discards/bad records into an output file and keep on going; maybe this should be added to the copy command:

  COPY [ BINARY ] table [ WITH OIDS ]
      FROM { 'filename' | stdin }
      [ [USING] DELIMITERS 'delimiter' ]
      [ WITH NULL AS 'nullstring' ]
      [ DISCARDS 'filename' ]

What do you think?

> Tom Lane writes:
>
> > It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate.
>
> What we need is:
>
> 1. Make errors not abort the transaction.
>
> 2. Error codes
>
> Then you can make your client deal with this in which ever way you want, at least for single-value inserts.
>
> However, it seems to me that COPY ignoring duplicates can easily be done by preprocessing the input file.
>
> --
> Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
Peter Eisentraut writes:
> However, it seems to me that COPY ignoring duplicates can easily be done by preprocessing the input file.

Or by post-processing, like (error checking cut):

  void import_shots(char *impfile, int lineshoot_id)
  {
      char tab_name[128];
      char tab_temp[128];
      char cmd[256];      /* command buffer; declaration assumed, not in the original excerpt */

      frig_file(impfile); /* add the postgres header */
      sprintf(tab_name, "shot_%d", lineshoot_id);
      sprintf(tab_temp, "shot_%d_tmp", lineshoot_id);

      sprintf(cmd, "CREATE TEMPORARY TABLE %s AS SELECT * FROM shot", tab_temp);
      EXEC SQL EXECUTE IMMEDIATE :cmd;
      EXEC SQL COMMIT WORK;   /* will not work without commit here! */

      sprintf(cmd, "COPY BINARY %s FROM '%s'", tab_temp, impfile);
      append_page_alloc(cmd, tab_name, impfile, 1);
      EXEC SQL EXECUTE IMMEDIATE :cmd;

      sprintf(cmd, "INSERT INTO %s SELECT DISTINCT ON (shot_time) * FROM %s", tab_name, tab_temp);
      EXEC SQL EXECUTE IMMEDIATE :cmd;

      sprintf(cmd, "DROP TABLE %s", tab_temp);
      EXEC SQL EXECUTE IMMEDIATE :cmd;

      EXEC SQL COMMIT WORK;
      remove(impfile);
  }

However this is adding significant time to the import operation. Likewise I could loop round the input file first and hunt for duplicates, again with a performance hit. My main point is that Postgres can easily and quickly check for duplicates during the COPY (as it does currently) and it adds zero execution time to simply ignore these duplicate rows. Obviously this is a useful feature, otherwise Oracle, Ingres and other commercial relational databases wouldn't feature similar functionality.

Yes, in an ideal world the input to COPY should be clean and consistent with defined indexes. However this is only really the case when COPY is used for database/table backup and restore. It misses the point that a major use of COPY is in speed optimisation on bulk inserts...

Lee.
Guys,

I've made some inroads towards adding 'ignore duplicates' functionality to PostgreSQL's COPY command. I've updated the parser grammar for COPY FROM to now accept:

  COPY [ BINARY ] table [ WITH OIDS ]
      FROM { 'filename' | stdin }
      [ [USING] DELIMITERS 'delimiter' ]
      [ WITH [NULL AS 'null string'] [IGNORE DUPLICATES] ]

and added code to propagate this setting down to the CopyFrom function in backend/commands/copy.c.

I also played around with _bt_check_unique, _bt_doinsert and btinsert to return NULL on duplicate rather than elog(ERROR). Likewise ExecInsertIndexTuples and index_insert were passed the ignore_duplicate flag, and index_insert changed to elog(ERROR) if the return from the insert function was NULL and the ignore_duplicate flag was false. These changes worked and gave the desired result for the COPY FROM command, however as many mentioned these changes are far too low level...

After assessing the situation more fully, I believe the following change in CopyFrom would be more suitable:

      /* BEFORE ROW INSERT Triggers */
      if (resultRelInfo->ri_TrigDesc &&
          resultRelInfo->ri_TrigDesc->n_before_row[TRIGGER_EVENT_INSERT] > 0)
      {
          HeapTuple   newtuple;

          newtuple = ExecBRInsertTriggers(estate, resultRelInfo, tuple);

          if (newtuple == NULL)        /* "do nothing" */
              skip_tuple = true;
          else if (newtuple != tuple)  /* modified by Trigger(s) */
          {
              heap_freetuple(tuple);
              tuple = newtuple;
          }
      }

      /* new code */
      if (ignore_duplicates == true)
      {
          if (duplicate index value)
              skip_tuple = true;
      }

      if (!skip_tuple)
      {

Now I imagine 'duplicate index value' would be functionally similar to _bt_check_unique but obviously higher level. Is there any existing code with the functionality I desire? Can anyone point me in the right way...

Thanks, Lee Kindness.

Lee Kindness writes:
> I'm in the process of porting a large application from Ingres to PostgreSQL. We make heavy use of bulkloading using the 'COPY' statement in ESQL/C. Consider the SQL statements below (in a psql session on an arbitrary database):
>
> CREATE TABLE copytest(f1 INTEGER, f2 INTEGER);
> CREATE UNIQUE INDEX copytest_idx ON copytest USING BTREE(f1, f2);
> COPY copytest FROM '/tmp/copytest';
>
> Given the file /tmp/copytest:
>
> 1	1
> 2	2
> 3	3
> 4	4
> 4	4
> 5	5
> 6	6
>
> will result in the following output:
>
> ERROR: copy: line 5, Cannot insert a duplicate key into unique index copytest_idx
>
> However my application code is assuming that duplicate rows will simply be ignored (this is the case in Ingres, and I believe Oracle's bulkloader too). I propose modifying _bt_check_unique() in /backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than ERROR) elog() and return NULL (or appropriate) to the calling function if a duplicate key is detected and a 'COPY FROM' is in progress (add new parameter to flag this).
>
> Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input? Would:
>
> WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
>
> need to be added to the COPY command (I hope not)?
>
> Thanks,
>
> --
> Lee Kindness, Senior Software Engineer
> Concept Systems Limited.
Gents,

I started quite a long thread about this back in September. To summarise, I was proposing that COPY FROM would not abort the transaction when it encountered data which would cause a uniqueness violation on the table index(s). Generally I think this was seen as a 'Good Thing'(TM) for a number of reasons:

1. Performance enhancements when doing bulk inserts - pre- or post-processing the data to remove duplicates is very time consuming. Likewise the best tool should always be used for the job at hand, and for searching/removing things that's a database.

2. Feature parity with other database systems. For example Oracle's SQL*Loader has a feature to not insert duplicates and rather move them to another file for later investigation.

Naturally the default behaviour would be the current one of assuming valid data. Also the duplicate check would not add anything to the current code path for COPY FROM - it would not take any longer.

I attempted to add this functionality to PostgreSQL myself but got as far as an updated parser and a COPY FROM which resulted in a database recovery! So (here's the question finally) is it worthwhile adding this enhancement to the TODO list?

Thanks, Lee.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
> Tom Lane writes:
> > I'm especially not pleased at the notion of removing an error check that's always been there because someone else thinks that would make it more convenient for his application.
>
> Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario in industry.

Isn't that when you do your bulk copy into a holding table, then clean it up, and then insert into your live system?

Patrick
Patrick Welche writes:
> On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
> > Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario.
> Isn't that when you do your bulk copy into a holding table, then clean it up, and then insert into your live system?

That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.

The majority of database systems out there handle this situation in one manner or another (MySQL ignores or replaces; Ingres ignores; Oracle ignores or logs; others...). Indeed PostgreSQL currently checks for duplicates in the COPY code but throws an elog(ERROR) rather than ignoring the row, or passing the error back up the call chain.

My use of PostgreSQL is very time critical, and sadly this issue alone may force an evaluation of Oracle's performance in this respect!

Best regards, Lee Kindness.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Lee Kindness wrote:
>
> Patrick Welche writes:
> > On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
> > > Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario.
> > Isn't that when you do your bulk copy into a holding table, then clean it up, and then insert into your live system?
>
> That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.
>
> The majority of database systems out there handle this situation in one manner or another (MySQL ignores or replaces; Ingres ignores; Oracle ignores or logs; others...). Indeed PostgreSQL currently checks for duplicates in the COPY code but throws an elog(ERROR) rather than ignoring the row, or passing the error back up the call chain.

I guess postgresql will be able to do it once savepoints get implemented.

> My use of PostgreSQL is very time critical, and sadly this issue alone may force an evaluation of Oracle's performance in this respect!

Can't you clean the duplicates _outside_ postgresql, say

  cat dumpfile | sort | uniq | psql db -c 'copy mytable from stdin'

with your version of uniq. Or perhaps

  psql db -c 'copy mytable to stdout' >> dumpfile
  sort dumpfile | uniq | psql db -c 'copy mytable from stdin'

if you already have something in mytable.

------------
Hannu
Hannu Krosing writes:
> Lee Kindness wrote:
> > The majority of database systems out there handle this situation in one manner or another (MySQL ignores or replaces; Ingres ignores; Oracle ignores or logs; others...). Indeed PostgreSQL currently checks for duplicates in the COPY code but throws an elog(ERROR) rather than ignoring the row, or passing the error back up the call chain.
> I guess postgresql will be able to do it once savepoints get implemented.

This is encouraging to hear. I can see how this would make the code changes relatively minimal and more manageable - the changes to the current code are simply over my head! Are savepoints relatively high up on the TODO list, once 7.2 is out the door?

> > My use of PostgreSQL is very time critical, and sadly this issue alone may force an evaluation of Oracle's performance in this respect!
> Can't you clean the duplicates _outside_ postgresql, say
> cat dumpfile | sort | uniq | psql db -c 'copy mytable from stdin'

This is certainly a possibility, however it's just really moving the processing elsewhere. The combined time is still around the same. I've/we've done a lot of investigation with approaches like this and also with techniques assuming the locality of the duplicates (which is a no-goer). None improve the situation. I'm not going to compare the time of just using INSERTs rather than COPY...

Thanks for your response, Lee Kindness.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Lee Kindness writes:
> I'm not going to compare the time of just using INSERTs rather than COPY...

Ooops, I'm NOW going to... Obviously my subconscious is telling me otherwise - bring on the Christmas party!

Lee.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.

Then your column really isn't unique, so how about dropping the unique index, import the data, fix the duplicates, recreate the unique index - just as another possible work around ;)

Patrick
Patrick Welche writes:
> On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> > That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.
> Then your column really isn't unique,

That's another discussion entirely ;) - it's spat out by a real-time system which doesn't have the time or resources to check this. Further precision loss later in the data's life adds more duplicates...

> so how about dropping the unique index, import the data, fix the duplicates, recreate the unique index - just as another possible work around ;)

This is just going to be the same(ish) time, no?

  CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
  COPY tab FROM 'file';
  DELETE FROM tab WHERE (p1, p2) NOT IN (SELECT DISTINCT p1, p2 FROM tab);
  CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);

or am I missing something?

Thanks, Lee.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
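As an aside on the snippet above: written that way the DELETE removes nothing, since every (p1, p2) value is necessarily in the SELECT DISTINCT list. A sketch of a clean-up step that keeps one row per key, assuming the table was created with its default oid column (and making no claim about how it compares speed-wise):

  -- keep the row with the smallest oid for each (p1, p2) pair
  DELETE FROM tab
   WHERE oid NOT IN (SELECT min(oid) FROM tab GROUP BY p1, p2);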
On Thu, Dec 13, 2001 at 03:44:31PM +0000, Lee Kindness wrote:
> Patrick Welche writes:
> > On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> > > That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.
> > Then your column really isn't unique,
>
> That's another discussion entirely ;) - it's spat out by a real-time system which doesn't have the time or resources to check this. Further precision loss later in the data's life adds more duplicates...

Hmm, the data has a later life - sounds like you'll need to remove dups then anyway, so can you get away with just letting the dups in? Remove the UNIQUE requirement, and let the real-time system just dump away. How critical is it to later steps that there be no dups? And how many (potential) dups is your RTS producing, anyway? Your later processing (which apparently can _generate_ dups) might be the out-of-the-critical-time-path place to worry about removing dups.

Ross

P.S. This falls into the class of problem solving characterized by "if you can't solve the problem as stated, restate the problem to be one you _can_ solve" ;-)

> > so how about dropping the unique index, import the data, fix the duplicates, recreate the unique index - just as another possible work around ;)
>
> This is just going to be the same(ish) time, no?
>
> CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
> COPY tab FROM 'file';
> DELETE FROM tab WHERE (p1, p2) NOT IN (SELECT DISTINCT p1, p2 FROM tab);
> CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);
>
> or am I missing something?
>
> Thanks, Lee.
Lee Kindness writes:

> Yes, in an ideal world the input to COPY should be clean and consistent with defined indexes. However this is only really the case when COPY is used for database/table backup and restore. It misses the point that a major use of COPY is in speed optimisation on bulk inserts...

I think allowing this feature would open up a world of new dangerous ideas, such as ignoring check constraints or foreign keys or magically massaging other tables so that the foreign keys are satisfied, or ignoring default values, or whatever. The next step would then be allowing the same optimizations in INSERT. I feel COPY should load the data and that's it. If you don't like the data you have then you have to fix it first.

-- 
Peter Eisentraut   peter_e@gmx.net
Lee Kindness writes:

> 1. Performance enhancements when doing bulk inserts - pre- or post-processing the data to remove duplicates is very time consuming. Likewise the best tool should always be used for the job at hand, and for searching/removing things that's a database.

Arguably, a better tool for this is sort(1). For instance, if you have a typical copy input file with tab-separated fields and the primary key is in columns 1 and 2, you can remove duplicates with

  sort -k 1,2 -u INFILE > OUTFILE

To get a record of what duplicates were removed, use diff.

-- 
Peter Eisentraut   peter_e@gmx.net
O.K., time to start looking into the _nature_ of the dups in your data, to see if there's anything specific to take advantage of, since the general solution (tell the DBMS to ignore dups) isn't available, and isn't likely to get there real soon.

So what does your data look like, and how do the dups occur? Any chance it's in a really simple format, and the dups are also really simple, like 'one record per line, dups occur as identical adjacent lines'? If so, 'uniq' will solve the problem with little to no speed penalty. (It's the sort that kills...)

Or are you only getting a dup'ed field, and the rule is 'ignore later records'? I could see this happening if the data is timestamped at a granularity that doesn't _exactly_ match the repetition rate: e.g. stamp to the second, record once a second.

So, what's it look like? Since it's one format, I bet a small, simple pipe filter could handle dup elimination on the fly.

Ross

On Thu, Dec 13, 2001 at 05:02:15PM +0000, Lee Kindness wrote:
>
> The RTS outputs to a file which is then subsequently used as input to other packages, one of which is the application I'm concerned with. While fixing at source is the ideal solution there are terabytes of legacy data around (this is raw seismic navigational data). Also there is more than one competing package...
>
> Our package post-processes (we're still very concerned about speed as this is normally done while 'shooting' the seismic data) this data to produce the final seismic navigational data, which is then later used by other products...
>
> The problem at hand is importing the initial data - no duplicates are produced by the program itself later (nor in its output data).
>
> Sadly a large number of later SQL queries assume no duplicates and would result in incorrect processing calculations, amongst other things. The sheer number of these queries makes changing them impractical.
>
> > P.S. This falls into the class of problem solving characterized by "if you can't solve the problem as stated, restate the problem to be one you _can_ solve" ;-)
>
> Which is what I've been knocking my head against for the last few weeks ;) The real problem is that a move away from our current RDBMS (Ingres) to PostgreSQL will not happen if the performance of the product significantly decreases (which it currently has for the import stage), and Ingres already just ignores the duplicates...
>
> I really want to move to PostgreSQL...
>
> Thanks for your input,
>
> --
> Lee Kindness, Senior Software Engineer, Concept Systems Limited.
> http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Peter Eisentraut writes:
> I think allowing this feature would open up a world of new dangerous ideas, such as ignoring check constraints or foreign keys or magically massaging other tables so that the foreign keys are satisfied, or ignoring default values, or whatever. The next step would then be allowing the same optimizations in INSERT. I feel COPY should load the data and that's it. If you don't like the data you have then you have to fix it first.

I agree that PostgreSQL's checks during COPY are a bonus and I wouldn't dream of not having them. Many database systems provide a fast bulkload by ignoring these constraints and cross references - that's a tricky/horrid situation. However I suppose the question is whether such 'invalid data' should abort the transaction; it seems a bit drastic... I suppose I'm not really after an IGNORE DUPLICATES option, but rather a CONTINUE ON ERROR kind of thing.

Regards, Lee.