Thread: Re: Bulkloading using COPY - ignore duplicates?

Re: Bulkloading using COPY - ignore duplicates?

From
"Jim Buttafuoco"
Date:
I agree with Lee,  I also like Oracle's options for a discard file, so
you can look at what was rejected, fix your problem and reload if
necessary just the rejects.

Jim


> Peter Eisentraut writes:
>  > I think allowing this feature would open up a world of new
>  > dangerous ideas, such as ignoring check constraints or foreign keys
>  > or magically massaging other tables so that the foreign keys are
>  > satisfied, or ignoring default values, or whatever.  The next step
>  > would then be allowing the same optimizations in INSERT.  I feel
>  > COPY should load the data and that's it.  If you don't like the
>  > data you have then you have to fix it first.
> 
> I agree that PostgreSQL's checks during COPY are a bonus and I
> wouldn't dream of not having them. Many database systems provide a
> fast bulkload by ignoring these constraints and cross references -
> that's a tricky/horrid situation.
> 
> However, I suppose the question is whether such 'invalid data' should abort
> the transaction - it seems a bit drastic...
> 
> I suppose I'm not really after an IGNORE DUPLICATES option, but rather
> a CONTINUE ON ERROR kind of thing.
> 
> Regards, Lee.
> 
> 




Re: Bulkloading using COPY - ignore duplicates?

From
Peter Eisentraut
Date:
Jim Buttafuoco writes:

> I agree with Lee,  I also like Oracle's options for a discard file, so
> you can look at what was rejected, fix your problem and reload if
> necessary just the rejects.

How do you know which one is the duplicate and which one is the good one?
More likely you will have to fix the entire thing.  Anything else would
undermine the general data model except in specific use cases.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Bulkloading using COPY - ignore duplicates?

From
Lee Kindness
Date:
Peter Eisentraut writes:
> Jim Buttafuoco writes:
> > I agree with Lee,  I also like Oracle's options for a discard file, so
> > you can look at what was rejected, fix your problem and reload if
> > necessary just the rejects.
> How do you know which one is the duplicate and which one is the good one?
> More likely you will have to fix the entire thing.  Anything else would
> undermine the general data model except in specific use cases.

In the general case most data is sequential, in which case it would be
normal to assume that the first record is the definitive one. Most
database systems go with this assumption, apart from MySQL, which gives
the user a choice between IGNORE and UPDATE...
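
For comparison, a rough sketch of the MySQL behaviour described above, using
MySQL's LOAD DATA syntax (shown with REPLACE rather than UPDATE; the file and
table names are made up for illustration):

    -- keep the first row seen for each key, silently dropping later duplicates
    LOAD DATA INFILE '/tmp/x' IGNORE INTO TABLE target;

    -- or let later duplicates overwrite the earlier rows
    LOAD DATA INFILE '/tmp/x' REPLACE INTO TABLE target;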

Lee.


Re: Bulkloading using COPY - ignore duplicates?

From
Lee Kindness
Date:
Peter Eisentraut writes:
> Jim Buttafuoco writes:
> > I agree with Lee,  I also like Oracle's options for a discard file, so
> > you can look at what was rejected, fix your problem and reload if
> > necessary just the rejects.
> How do you know which one is the duplicate and which one is the good one?
> More likely you will have to fix the entire thing.  Anything else would
> undermine the general data model except in specific use cases.

Consider SELECT DISTINCT - which is the 'duplicate' and which one is
the good one?

Lee.


Re: Bulkloading using COPY - ignore duplicates?

From
Peter Eisentraut
Date:
Lee Kindness writes:

> Consider SELECT DISTINCT - which is the 'duplicate' and which one is
> the good one?

It's not the same thing.  SELECT DISTINCT only eliminates rows that are
completely the same, not only equal in their unique constraints.

Maybe you're thinking of SELECT DISTINCT ON ().  Observe the big warning
that the result of that statement is random unless ORDER BY is used.  --
But that's not the same thing either.  We've never claimed that the COPY
input has an ordering assumption.  In fact you're asking for a bit more
than an ordering assumption, you're saying that the earlier data is better
than the later data.  I think in a random use case that is more likely
*not* to be the case because the data at the end is newer.


Btw., here's another concern about this proposed feature:  If I do a
client-side COPY, how will you send the "ignored" rows back to the client?

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Bulkloading using COPY - ignore duplicates?

From
Lee Kindness
Date:
Peter Eisentraut writes:
> Lee Kindness writes:
> > Consider SELECT DISTINCT - which is the 'duplicate' and which one is
> > the good one?
> It's not the same thing.  SELECT DISTINCT only eliminates rows that are
> completely the same, not only equal in their unique constraints.
> Maybe you're thinking of SELECT DISTINCT ON ().  Observe the big warning
> that the result of that statement is random unless ORDER BY is used.  --
> But that's not the same thing either.  We've never claimed that the COPY
> input has an ordering assumption.  In fact you're asking for a bit more
> than an ordering assumption, you're saying that the earlier data is better
> than the later data.  I think in a random use case that is more likely
> *not* to be the case because the data at the end is newer.

You're right - I was meaning 'SELECT DISTINCT ON ()'. However I'm only
using it as an example of where the database is choosing (be it
randomly) the data to be discarded. While I've said in this thread that
'COPY FROM IGNORE DUPLICATES' would ignore later duplicates, I'm not
really that concerned about what it ignores: first, later, random,
... I agree that if it was of concern then it should be pre-processed.
> Btw., here's another concern about this proposed feature:  If I do
> a client-side COPY, how will you send the "ignored" rows back to
> the client?

Again a number of different ideas have been mixed up in the
discussion. Oracle's logging option was only given as an example of
how other database systems deal with this option - If it wasn't
explicitly given then it's reasonable to discard the extra
information.

What really would be nice in the SQL-world is a standardised COPY
statement...

Best regards, Lee Kindness.


Re: Bulkloading using COPY - ignore duplicates?

From
Tom Lane
Date:
Lee Kindness <lkindness@csl.co.uk> writes:
> You're right - I was meaning 'SELECT DISTINCT ON ()'. However I'm only
> using it as an example of where the database is choosing (be it
> randomly) the data to be discarded.

Not a good example to support your argument.  The entire point of
DISTINCT ON (imho) is that the rows that are kept or discarded are
*not* random, but can be selected by the user by specifying additional
sort columns.  DISTINCT ON would be pretty useless if it weren't for
that flexibility.  The corresponding concept in COPY will need to
provide flexible means for deciding which row to keep and which to
drop, else it'll be pretty useless.
        regards, tom lane
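
A minimal PostgreSQL sketch of that point - it is the ORDER BY that decides
which of the duplicate rows DISTINCT ON keeps (table and column names here
are made up for illustration):

    CREATE TEMP TABLE readings (sensor_id integer, ts timestamp, val float8);
    INSERT INTO readings VALUES (1, '2001-10-01 10:00', 1.0);
    INSERT INTO readings VALUES (1, '2001-10-01 11:00', 2.0);

    -- keep the newest row per sensor_id; reverse the ts sort to keep the oldest
    SELECT DISTINCT ON (sensor_id) *
    FROM readings
    ORDER BY sensor_id, ts DESC;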


Re: Bulkloading using COPY - ignore duplicates?

From
Lee Kindness
Date:
Tom Lane writes:
> Lee Kindness <lkindness@csl.co.uk> writes:
> > You're right - I was meaning 'SELECT DISTINCT ON ()'. However I'm only
> > using it as an example of where the database is choosing (be it
> > randomly) the data to be discarded.
> Not a good example to support your argument.  The entire point of
> DISTINCT ON (imho) is that the rows that are kept or discarded are
> *not* random, but can be selected by the user by specifying additional
> sort columns.  DISTINCT ON would be pretty useless if it weren't for
> that flexibility.  The corresponding concept in COPY will need to
> provide flexible means for deciding which row to keep and which to
> drop, else it'll be pretty useless.

At which point it becomes quicker to resort to INSERT...

Here's the crux question - how can I get management to go with
PostgreSQL when a core operation (import of data into a transient
database) is at least 6 times slower than the current version?

With a lot of work investigating the incoming data, the number of
incoming duplicates has been massively reduced by fixing/tackling at
source. However rogue values do still crop up (the data originates
from a real-time system with multiple hardware inputs from multiple
hardware vendors) and when they do (even just one) the performance dies.
Add to this terabytes of legacy data...

While you may see the option of ignoring duplicates in COPY as 'pretty
useless', it obviously has its place/use otherwise every other
database system wouldn't have support for it! (not that following the
pack is always a good idea)

In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!

Best regards,

--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/


Re: Bulkloading using COPY - ignore duplicates?

From
Tom Lane
Date:
Lee Kindness <lkindness@csl.co.uk> writes:
> In an ideal world 'COPY FROM' would only be used with data output by
> 'COPY TO' and it would be nice and sanitised. However in some fields
> this often is not a possibility due to performance constraints!

Of course, the more bells and whistles we add to COPY, the slower it
will get, which rather defeats the purpose no?
        regards, tom lane


Re: Bulkloading using COPY - ignore duplicates?

From
Lee Kindness
Date:
Tom Lane writes:
> Lee Kindness <lkindness@csl.co.uk> writes:
> > In an ideal world 'COPY FROM' would only be used with data output by
> > 'COPY TO' and it would be nice and sanitised. However in some fields
> > this often is not a possibility due to performance constraints!
> Of course, the more bells and whistles we add to COPY, the slower it
> will get, which rather defeats the purpose no?

Indeed, but as I've mentioned in this thread in the past, the code
path for COPY FROM already does a check against the unique index (if
there is one) but bombs-out rather than handling it...

It wouldn't add any execution time if there were no duplicates in the
input!

regards, Lee.


Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Lee Kindness wrote:
> Tom Lane writes:
>  > Lee Kindness <lkindness@csl.co.uk> writes:
>  > > In an ideal world 'COPY FROM' would only be used with data output by
>  > > 'COPY TO' and it would be nice and sanitised. However in some fields
>  > > this often is not a possibility due to performance constraints!
>  > Of course, the more bells and whistles we add to COPY, the slower it
>  > will get, which rather defeats the purpose no?
> 
> Indeed, but as I've mentioned in this thread in the past, the code
> path for COPY FROM already does a check against the unique index (if
> there is one) but bombs-out rather than handling it...
> 
> It wouldn't add any execution time if there were no duplicates in the
> input!

I know many purists object to allowing COPY to discard invalid rows in
the input, but it seems we have lots of requests for this feature, with
few workarounds except pre-processing the flat file.  Of course, if they
use INSERT, they will get errors that they can just ignore.  I don't see
how allowing errors in COPY is any more illegal, except that COPY is one
command while multiple INSERTs are separate commands.
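
For reference, the pre-processing can also be done inside the database
rather than on the flat file. A minimal sketch, assuming a hypothetical
target table target(id integer PRIMARY KEY, val text) and an input file
/tmp/x:

    -- load into a staging table with no unique index, so COPY cannot fail
    -- on duplicate keys
    CREATE TEMP TABLE staging (id integer, val text);
    COPY staging FROM '/tmp/x';

    -- move the data across, keeping one arbitrary row per key and skipping
    -- keys already present in the target table
    INSERT INTO target
        SELECT DISTINCT ON (id) id, val
        FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id);

    DROP TABLE staging;

This keeps one arbitrary row per duplicate key and costs an extra pass over
the data.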

Seems we need to allow such a capability, if only crudely.  I don't
think we can create a discard file because of the problem with remote
COPY.

I think we can allow something like:
COPY FROM '/tmp/x' WITH ERRORS 2

meaning we will allow at most two errors and will report the error line
numbers to the user.  I think this syntax clearly indicates that errors
are being accepted in the input.  An alternate syntax would allow an
unlimited number of errors:
COPY FROM '/tmp/x' WITH ERRORS

The errors can be non-unique errors, or even CHECK constraint errors.

Unless I hear complaints, I will add it to TODO.



Re: Bulkloading using COPY - ignore duplicates?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I think we can allow something like:
>     COPY FROM '/tmp/x' WITH ERRORS 2

This is not going to happen, at least not until after there's a
wholesale revision of error handling.  As things stand we do not
have a choice: elog(ERROR) must abort the transaction, because we
can't guarantee that things are in good enough shape to continue.
See the archives for previous discussions.
        regards, tom lane


Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think we can allow something like:
> >     COPY FROM '/tmp/x' WITH ERRORS 2
> 
> This is not going to happen, at least not until after there's a
> wholesale revision of error handling.  As things stand we do not
> have a choice: elog(ERROR) must abort the transaction, because we
> can't guarantee that things are in good enough shape to continue.
> See the archives for previous discussions.

Yes, I realize we need subtransactions or something, but we should add
it to the TODO list if it is a valid request, right?



Re: Bulkloading using COPY - ignore duplicates?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I think we can allow something like:
> COPY FROM '/tmp/x' WITH ERRORS 2

> Yes, I realize we need subtransactions or something, but we should add
> it to the TODO list if it is a valid request, right?

Well, I don't like that particular API in any case.  Why would I think
that 2 errors are okay and 3 are not, if I'm loading a
many-thousand-line COPY file?  Wouldn't it matter *what* the errors
are, at least as much as how many there are?  "Discard duplicate rows"
is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
is not the same animal at all.

As someone already remarked, the correct, useful form of such a feature
is to echo the rejected lines to some sort of output file that I can
look at afterwards.  How many errors there are is not the issue.
        regards, tom lane


Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think we can allow something like:
> > COPY FROM '/tmp/x' WITH ERRORS 2
> 
> > Yes, I realize we need subtransactions or something, but we should add
> > it to the TODO list if it is a valid request, right?
> 
> Well, I don't like that particular API in any case.  Why would I think
> that 2 errors are okay and 3 are not, if I'm loading a
> many-thousand-line COPY file?  Wouldn't it matter *what* the errors

I threw the count idea in as a possible compromise.  :-)

> are, at least as much as how many there are?  "Discard duplicate rows"
> is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
> is not the same animal at all.

Yes, when we have error codes, it would be nice to specify certain
errors to ignore.

> As someone already remarked, the correct, useful form of such a feature
> is to echo the rejected lines to some sort of output file that I can
> look at afterwards.  How many errors there are is not the issue.

How about for TODO:
    * Allow COPY to report error lines and continue; requires
    nested transactions;  optionally allow error codes to be specified




Re: Bulkloading using COPY - ignore duplicates?

From
Tom Lane
Date:
> How about for TODO:
>     * Allow COPY to report error lines and continue; requires
>     nested transactions;  optionally allow error codes to be specified

Okay, that seems reasonable.
        regards, tom lane


Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Tom Lane wrote:
> > How about for TODO:
> >     * Allow COPY to report error lines and continue; requires
> >     nested transactions;  optionally allow error codes to be specified
> 
> Okay, that seems reasonable.

Good.  Now that I think of it, nested transactions don't seem required. 
We already allow pg_dump to dump a database using INSERTs, and we don't
put those inserts in a single transaction when we load them:

    CREATE TABLE "test" (
            "x" integer
    );

    INSERT INTO "test" VALUES (1);
    INSERT INTO "test" VALUES (2);

Should we be wrapping these INSERTs in a transaction?  Can we do COPY
with each row being its own transaction?



Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
> > > How about for TODO:
> > >     * Allow COPY to report error lines and continue; requires
> > >     nested transactions;  optionally allow error codes to be specified
> > 
> > Okay, that seems reasonable.
> 
> Good.  Now that I think of it, nested transactions don't seem required. 
> We already allow pg_dump to dump a database using INSERTs, and we don't
> put those inserts in a single transaction when we load them:
>     
>     CREATE TABLE "test" (
>             "x" integer
>     );
>     
>     INSERT INTO "test" VALUES (1);
>     INSERT INTO "test" VALUES (2);
> 
> Should we be wrapping these INSERTs in a transaction?  Can we do COPY
> with each row being its own transaction?

OK, added to TODO:
       o Allow COPY to report error lines and continue;  optionally
         allow error codes to be specified

Seems nested transactions are not required if we load each COPY line in
its own transaction, like we do with INSERT from pg_dump.



Re: Bulkloading using COPY - ignore duplicates?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Seems nested transactions are not required if we load each COPY line in
> its own transaction, like we do with INSERT from pg_dump.

I don't think that's an acceptable answer.  Consider
    BEGIN;
    other stuff;
    COPY ....;
    other stuff;
    ROLLBACK;
        regards, tom lane


Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seems nested transactions are not required if we load each COPY line in
> > its own transaction, like we do with INSERT from pg_dump.
> 
> I don't think that's an acceptable answer.  Consider

Oh, very good point.  "Requires nested transactions" added to TODO.



Re: Bulkloading using COPY - ignore duplicates?

From
"Mikheev, Vadim"
Date:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Seems nested transactions are not required if we load
> > > each COPY line in its own transaction, like we do with
> > > INSERT from pg_dump.
> > 
> > I don't think that's an acceptable answer.  Consider
> 
> Oh, very good point.  "Requires nested transactions" added to TODO.

Also add performance issue with per-line-commit...

Also-II - there is more common name for required feature - savepoints.

Vadim
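
A minimal sketch of what savepoints would buy here, using standard SQL
savepoint syntax (not available in PostgreSQL at the time of this thread)
and a made-up table:

    CREATE TABLE target (id integer PRIMARY KEY);

    BEGIN;
    INSERT INTO target VALUES (1);

    SAVEPOINT copy_row;
    INSERT INTO target VALUES (1);     -- duplicate key: this row fails
    ROLLBACK TO SAVEPOINT copy_row;    -- undo only the failed row

    INSERT INTO target VALUES (2);     -- the load carries on
    COMMIT;                            -- rows 1 and 2 are kept

A COPY that skipped bad rows would in effect set a savepoint like this
around each row (or each batch of rows) instead of aborting the whole
transaction.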


Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Mikheev, Vadim wrote:
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > Seems nested transactions are not required if we load
> > > > each COPY line in its own transaction, like we do with
> > > > INSERT from pg_dump.
> > > 
> > > I don't think that's an acceptable answer.  Consider
> > 
> > Oh, very good point.  "Requires nested transactions" added to TODO.
> 
> Also add performance issue with per-line-commit...
> 
> Also-II - there is more common name for required feature - savepoints.

OK, updated TODO to prefer savepoints term.



Re: Bulkloading using COPY - ignore duplicates?

From
Daniel Kalchev
Date:
>>>Bruce Momjian said:
> Mikheev, Vadim wrote:
> > > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > > Seems nested transactions are not required if we load
> > > > > each COPY line in its own transaction, like we do with
> > > > > INSERT from pg_dump.
> > > >
> > > > I don't think that's an acceptable answer.  Consider
> > >
> > > Oh, very good point.  "Requires nested transactions" added to TODO.
> >
> > Also add performance issue with per-line-commit...
> >
> > Also-II - there is more common name for required feature - savepoints.
>
> OK, updated TODO to prefer savepoints term.

Now, how about the same functionality for

INSERT into table1 SELECT * from table2 ... WITH ERRORS;

Should allow the insert to complete, even if table1 has unique indexes and we 
try to insert duplicate rows. Might save LOTS of time in bulkloading scripts 
not having to do single INSERTs.

Guess all this will be available in 7.3?

Daniel



Re: Bulkloading using COPY - ignore duplicates?

From
"Vadim Mikheev"
Date:
> Now, how about the same functionality for
>
> INSERT into table1 SELECT * from table2 ... WITH ERRORS;
>
> Should allow the insert to complete, even if table1 has unique indexes
> and we try to insert duplicate rows. Might save LOTS of time in
> bulkloading scripts not having to do single INSERTs.

1. I prefer Oracle's (and others, I believe) way - put statement(s) in PL
block and define for what exceptions (errors) what actions should be taken
(ie IGNORE for NON_UNIQ_KEY error, etc).

2. For INSERT ... SELECT statement one can put DISTINCT in the select's
target list.

> Guess all this will be available in 7.3?

We'll see.

Vadim




Re: Bulkloading using COPY - ignore duplicates?

From
Daniel Kalchev
Date:
>>>"Vadim Mikheev" said:> 1. I prefer Oracle' (and others, I believe) way - put statement(s) in PL> block and define>
forwhat exceptions (errors) what actions should be taken (ie IGNORE for> NON_UNIQ_KEY> error, etc).
 

Some people prefer 'pure' SQL. Anyway, it can be argued which is worse - the 
usage of non-SQL language, or usage of extended SQL language. I guess the SQL 
standard does not provide for such functionality?
> 2. For INSERT ... SELECT statement one can put DISTINCT in the select's
> target list.

With this construct, you are effectively copying rows from one table to 
another - or constructing rows from various sources (constants, other tables 
etc) and inserting these in the table. If the target table has unique indexes 
(or constraints), and some of the rows returned by SELECT violate the 
restrictions - you are supposed to get errors - and unfortunately the entire 
INSERT is aborted. I fail to see how DISTINCT can help here... Perhaps it is 
possible to include checking for already existing tuples in the destination 
table in the select... but this will significantly increase the runtime, 
especially when the destination table is huge.
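
For what it's worth, the 'checking for already existing tuples in the
destination table' can be written today, assuming table1 and table2 share
a unique key column (a made-up column "id" below):

    INSERT INTO table1
        SELECT t2.*
        FROM table2 t2
        WHERE NOT EXISTS (SELECT 1 FROM table1 t1 WHERE t1.id = t2.id);

As noted above, the extra subquery costs time on a huge destination table,
and it still fails if table2 itself contains duplicate keys.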

My idea is to let this INSERT statement insert as many of its rows as
possible, perhaps returning NOTICEs or ignoring the errors (with an IGNORE
ERRORS syntax for example :)

I believe all this functionality will have to consider the syntax first.

Daniel



Re: Bulkloading using COPY - ignore duplicates?

From
Bruce Momjian
Date:
Vadim Mikheev wrote:
> > Now, how about the same functionality for
> >
> > INSERT into table1 SELECT * from table2 ... WITH ERRORS;
> >
> > Should allow the insert to complete, even if table1 has unique indexes
> > and we try to insert duplicate rows. Might save LOTS of time in
> > bulkloading scripts not having to do single INSERTs.
> 
> 1. I prefer Oracle's (and others, I believe) way - put statement(s) in PL
> block and define for what exceptions (errors) what actions should be taken
> (ie IGNORE for NON_UNIQ_KEY error, etc).

Added to TODO:
* Allow command blocks that can ignore certain types of errors



Re: Bulkloading using COPY - ignore duplicates?

From
"Mikheev, Vadim"
Date:
>  > 1. I prefer Oracle's (and others, I believe) way - put
>  > statement(s) in PL block and define for what exceptions
>  > (errors) what actions should be taken (ie IGNORE for
>  > NON_UNIQ_KEY error, etc).
> 
> Some people prefer 'pure' SQL. Anyway, it can be argued which
> is worse - the usage of non-SQL language, or usage of extended
> SQL language. I guess the SQL standard does not provide for such
> functionality?

Yes, there is no such syntax in the standard. And imho when some
feature is not in the standard then it's better to implement it
how others do (for as much compatibility as possible/significant).

>  > 2. For INSERT ... SELECT statement one can put DISTINCT in the
>  > select's target list.
> 
> With this construct, you are effectively copying rows from
> one table to another - or constructing rows from various
> sources (constants, other tables etc) and inserting these
> in the table. If the target table has unique indexes 
> (or constraints), and some of the rows returned by SELECT violate

Sorry, I didn't consider this case, you're right.

> I believe all this functionality will have to consider the
> syntax first.

All this functionality will have to consider savepoints
implementation first. As for syntax - we could implement both.

Vadim