Thread: Re: Bulkloading using COPY - ignore duplicates?
I agree with Lee, and I also like Oracle's option of a discard file, so
you can look at what was rejected, fix your problem and, if necessary,
reload just the rejects.

Jim

> Peter Eisentraut writes:
> > I think allowing this feature would open up a world of new
> > dangerous ideas, such as ignoring check constraints or foreign keys
> > or magically massaging other tables so that the foreign keys are
> > satisfied, or ignoring default values, or whatever.  The next step
> > would then be allowing the same optimizations in INSERT.  I feel
> > COPY should load the data and that's it.  If you don't like the
> > data you have then you have to fix it first.
>
> I agree that PostgreSQL's checks during COPY are a bonus and I
> wouldn't dream of not having them. Many database systems provide a
> fast bulkload by ignoring these constraints and cross references -
> that's a tricky/horrid situation.
>
> However I suppose the question is whether such 'invalid data' should
> abort the transaction; it seems a bit drastic...
>
> I suppose I'm not really after an IGNORE DUPLICATES option, but rather
> a CONTINUE ON ERROR kind of thing.
>
> Regards, Lee.
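For concreteness, a sketch of the kind of option being asked for is
below. The syntax is purely hypothetical - nothing like it exists in
PostgreSQL's COPY - and the table and file names are made up:

	-- Hypothetical syntax: load every row that passes the checks and
	-- silently skip rows that would violate the unique constraint,
	-- instead of aborting the whole COPY.
	COPY mytable FROM '/tmp/input.dat' WITH IGNORE DUPLICATES;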
Jim Buttafuoco writes:
> I agree with Lee, I also like Oracle's options for a discard file, so
> you can look at what was rejected, fix your problem and reload if
> necessary just the rejects.

How do you know which one is the duplicate and which one is the good one?
More likely you will have to fix the entire thing.  Anything else would
undermine the general data model except in specific use cases.

-- 
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut writes:
> Jim Buttafuoco writes:
> > I agree with Lee, I also like Oracle's options for a discard file, so
> > you can look at what was rejected, fix your problem and reload if
> > necessary just the rejects.
> How do you know which one is the duplicate and which one is the good one?
> More likely you will have to fix the entire thing.  Anything else would
> undermine the general data model except in specific use cases.

In the general case most data is sequential, in which case it would be
normal to assume that the first record is the definitive one. Most
database systems go with this assumption, apart from MySQL which gives
the user a choice between IGNORE and UPDATE...

Lee.
Peter Eisentraut writes:
> Jim Buttafuoco writes:
> > I agree with Lee, I also like Oracle's options for a discard file, so
> > you can look at what was rejected, fix your problem and reload if
> > necessary just the rejects.
> How do you know which one is the duplicate and which one is the good one?
> More likely you will have to fix the entire thing.  Anything else would
> undermine the general data model except in specific use cases.

Consider SELECT DISTINCT - which one is the 'duplicate' and which one is
the good one?

Lee.
Lee Kindness writes:
> Consider SELECT DISTINCT - which is the 'duplicate' and which one is
> the good one?

It's not the same thing.  SELECT DISTINCT only eliminates rows that are
completely the same, not rows that are merely equal in their unique
constraints.

Maybe you're thinking of SELECT DISTINCT ON ().  Observe the big warning
that the result of that statement is random unless ORDER BY is used. --
But that's not the same thing either.  We've never claimed that the COPY
input has an ordering assumption.  In fact you're asking for a bit more
than an ordering assumption: you're saying that the earlier data is
better than the later data.  I think in a random use case that is more
likely *not* to be the case, because the data at the end is newer.

Btw., here's another concern about this proposed feature:  If I do a
client-side COPY, how will you send the "ignored" rows back to the
client?

-- 
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut writes:
> Lee Kindness writes:
> > Consider SELECT DISTINCT - which is the 'duplicate' and which one is
> > the good one?
> It's not the same thing.  SELECT DISTINCT only eliminates rows that are
> completely the same, not rows that are merely equal in their unique
> constraints.
> Maybe you're thinking of SELECT DISTINCT ON ().  Observe the big warning
> that the result of that statement is random unless ORDER BY is used. --
> But that's not the same thing either.  We've never claimed that the COPY
> input has an ordering assumption.  In fact you're asking for a bit more
> than an ordering assumption: you're saying that the earlier data is
> better than the later data.  I think in a random use case that is more
> likely *not* to be the case, because the data at the end is newer.

You're right - I meant 'SELECT DISTINCT ON ()'. However I'm only using
it as an example of where the database is choosing (be it randomly) the
data to be discarded. While I've said in this thread that 'COPY FROM
IGNORE DUPLICATES' would ignore later duplicates, I'm not really that
concerned about what it ignores: first, later, random, ...  I agree
that if the choice mattered then the data should be pre-processed.

> Btw., here's another concern about this proposed feature:  If I do a
> client-side COPY, how will you send the "ignored" rows back to the
> client?

Again, a number of different ideas have been mixed up in this
discussion. Oracle's logging option was only given as an example of how
other database systems deal with this situation - if it wasn't
explicitly requested then it's reasonable to simply discard the extra
information.

What really would be nice in the SQL world is a standardised COPY
statement...

Best regards, Lee Kindness.
Lee Kindness <lkindness@csl.co.uk> writes:
> You're right - I meant 'SELECT DISTINCT ON ()'. However I'm only using
> it as an example of where the database is choosing (be it randomly) the
> data to be discarded.

Not a good example to support your argument.  The entire point of
DISTINCT ON (imho) is that the rows that are kept or discarded are
*not* random, but can be selected by the user by specifying additional
sort columns.  DISTINCT ON would be pretty useless if it weren't for
that flexibility.

The corresponding concept in COPY will need to provide flexible means
for deciding which row to keep and which to drop, else it'll be pretty
useless.

			regards, tom lane
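To illustrate that point with actual PostgreSQL syntax (the table and
column names here are invented for the example): the extra sort column
is what decides which row per key survives.

	-- One row per sensor_id is kept; ORDER BY makes the choice
	-- deterministic - here the most recent reading wins.
	SELECT DISTINCT ON (sensor_id) sensor_id, reading, recorded_at
	FROM readings
	ORDER BY sensor_id, recorded_at DESC;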
Tom Lane writes:
> Lee Kindness <lkindness@csl.co.uk> writes:
> > You're right - I meant 'SELECT DISTINCT ON ()'. However I'm only using
> > it as an example of where the database is choosing (be it randomly) the
> > data to be discarded.
> Not a good example to support your argument.  The entire point of
> DISTINCT ON (imho) is that the rows that are kept or discarded are
> *not* random, but can be selected by the user by specifying additional
> sort columns.  DISTINCT ON would be pretty useless if it weren't for
> that flexibility.  The corresponding concept in COPY will need to
> provide flexible means for deciding which row to keep and which to
> drop, else it'll be pretty useless.

At which point it becomes quicker to resort to INSERT...

Here's the crux question - how can I get management to go with
PostgreSQL when a core operation (import of data into a transient
database) is at least 6 times slower than in the current version?

With a lot of work investigating the incoming data, the number of
incoming duplicates has been massively reduced by fixing the problem at
source. However rogue values do still crop up (the data originates from
a real-time system with multiple hardware inputs from multiple hardware
vendors), and when they do (even just one) the performance dies. Add to
this terabytes of legacy data...

While you may see the option of ignoring duplicates in COPY as 'pretty
useless', it obviously has its place/use, otherwise every other database
system wouldn't have support for it! (Not that following the pack is
always a good idea.)

In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!

Best regards,

-- 
 Lee Kindness, Senior Software Engineer, Concept Systems Limited.
 http://services.csl.co.uk/   http://www.csl.co.uk/
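For reference, the pre-processing workaround being discussed can also be
done inside the database with a staging table. A rough sketch, with
invented table and column names (substitute the real key for 'id'); this
is of course exactly the extra pass Lee is trying to avoid on
performance grounds:

	-- Load the raw file into a scratch table with no unique constraint ...
	CREATE TEMPORARY TABLE staging (id integer, val text);
	COPY staging FROM '/tmp/bulk.dat';

	-- ... then copy it across, keeping one row per key and skipping
	-- keys that are already present in the real table.
	INSERT INTO target
	SELECT DISTINCT ON (id) id, val
	FROM staging s
	WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
	ORDER BY id;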
Lee Kindness <lkindness@csl.co.uk> writes:
> In an ideal world 'COPY FROM' would only be used with data output by
> 'COPY TO' and it would be nice and sanitised. However in some fields
> this often is not a possibility due to performance constraints!

Of course, the more bells and whistles we add to COPY, the slower it
will get, which rather defeats the purpose, no?

			regards, tom lane
Tom Lane writes:
> Lee Kindness <lkindness@csl.co.uk> writes:
> > In an ideal world 'COPY FROM' would only be used with data output by
> > 'COPY TO' and it would be nice and sanitised. However in some fields
> > this often is not a possibility due to performance constraints!
> Of course, the more bells and whistles we add to COPY, the slower it
> will get, which rather defeats the purpose, no?

Indeed, but as I've mentioned earlier in this thread, the code path for
COPY FROM already does a check against the unique index (if there is
one) but bombs out rather than handling the failure...

It wouldn't add any execution time if there were no duplicates in the
input!

regards, Lee.
Lee Kindness wrote:
> Tom Lane writes:
> > Lee Kindness <lkindness@csl.co.uk> writes:
> > > In an ideal world 'COPY FROM' would only be used with data output by
> > > 'COPY TO' and it would be nice and sanitised. However in some fields
> > > this often is not a possibility due to performance constraints!
> > Of course, the more bells and whistles we add to COPY, the slower it
> > will get, which rather defeats the purpose, no?
>
> Indeed, but as I've mentioned earlier in this thread, the code path
> for COPY FROM already does a check against the unique index (if there
> is one) but bombs out rather than handling the failure...
>
> It wouldn't add any execution time if there were no duplicates in the
> input!

I know many purists object to allowing COPY to discard invalid rows in
its input, but it seems we have lots of requests for this feature, with
few workarounds except pre-processing the flat file.

Of course, if they use INSERT, they will get errors that they can just
ignore.  I don't see how allowing errors in COPY is any more illegal,
except that COPY is one command while multiple INSERTs are separate
commands.

Seems we need to allow such a capability, if only crudely.  I don't
think we can create a discard file because of the problem with remote
COPY.  I think we can allow something like:

	COPY FROM '/tmp/x' WITH ERRORS 2

meaning we will allow at most two errors and will report the error line
numbers to the user.  I think this syntax clearly indicates that errors
are being accepted in the input.  An alternate syntax would allow an
unlimited number of errors:

	COPY FROM '/tmp/x' WITH ERRORS

The errors can be non-unique errors, or even CHECK constraint errors.
Unless I hear complaints, I will add it to TODO.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I think we can allow something like:
> 	COPY FROM '/tmp/x' WITH ERRORS 2

This is not going to happen, at least not until after there's a
wholesale revision of error handling.  As things stand we do not
have a choice: elog(ERROR) must abort the transaction, because we
can't guarantee that things are in good enough shape to continue.
See the archives for previous discussions.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think we can allow something like:
> > 	COPY FROM '/tmp/x' WITH ERRORS 2
>
> This is not going to happen, at least not until after there's a
> wholesale revision of error handling.  As things stand we do not
> have a choice: elog(ERROR) must abort the transaction, because we
> can't guarantee that things are in good enough shape to continue.
> See the archives for previous discussions.

Yes, I realize we need subtransactions or something, but we should add
it to the TODO list if it is a valid request, right?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think we can allow something like:
> > 	COPY FROM '/tmp/x' WITH ERRORS 2
> Yes, I realize we need subtransactions or something, but we should add
> it to the TODO list if it is a valid request, right?

Well, I don't like that particular API in any case.  Why would I think
that 2 errors are okay and 3 are not, if I'm loading a
many-thousand-line COPY file?  Wouldn't it matter *what* the errors
are, at least as much as how many there are?  "Discard duplicate rows"
is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
is not the same animal at all.

As someone already remarked, the correct, useful form of such a feature
is to echo the rejected lines to some sort of output file that I can
look at afterwards.  How many errors there are is not the issue.

			regards, tom lane
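Short of such a feature, the nearest approximation available today is to
COPY into a staging table and then ask the database which rows would be
rejected for duplicate keys (bad data that fails to parse at all is a
different matter). A sketch with invented table and column names:

	-- After COPYing the raw file into 'staging', list the would-be
	-- rejects: rows whose key already exists in the real table.
	SELECT s.*
	FROM staging s, target t
	WHERE s.id = t.id;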
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think we can allow something like:
> > 	COPY FROM '/tmp/x' WITH ERRORS 2
> > Yes, I realize we need subtransactions or something, but we should add
> > it to the TODO list if it is a valid request, right?
>
> Well, I don't like that particular API in any case.  Why would I think
> that 2 errors are okay and 3 are not, if I'm loading a
> many-thousand-line COPY file?  Wouldn't it matter *what* the errors

I threw the count idea in as a possible compromise.  :-)

> are, at least as much as how many there are?  "Discard duplicate rows"
> is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
> is not the same animal at all.

Yes, when we have error codes, it would be nice to specify certain
errors to ignore.

> As someone already remarked, the correct, useful form of such a feature
> is to echo the rejected lines to some sort of output file that I can
> look at afterwards.  How many errors there are is not the issue.

How about for TODO:

	* Allow COPY to report error lines and continue; requires
	  nested transactions; optionally allow error codes to be specified

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> How about for TODO:
> 	* Allow COPY to report error lines and continue; requires
> 	  nested transactions; optionally allow error codes to be specified

Okay, that seems reasonable.

			regards, tom lane
Tom Lane wrote:
> > How about for TODO:
> > 	* Allow COPY to report error lines and continue; requires
> > 	  nested transactions; optionally allow error codes to be specified
>
> Okay, that seems reasonable.

Good.  Now that I think of it, nested transactions don't seem required.
We already allow pg_dump to dump a database using INSERTs, and we don't
put those inserts in a single transaction when we load them:

	CREATE TABLE "test" (
		"x" integer
	);
	INSERT INTO "test" VALUES (1);
	INSERT INTO "test" VALUES (2);

Should we be wrapping these INSERTs in a transaction?  Can we do COPY
with each row being its own transaction?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
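For reference, wrapping the same dump in a single transaction is just a
matter of bracketing the statements (whether pg_dump should emit this by
default is the open question here):

	BEGIN;
	CREATE TABLE "test" (
		"x" integer
	);
	INSERT INTO "test" VALUES (1);
	INSERT INTO "test" VALUES (2);
	COMMIT;
	-- Now a single failing INSERT aborts the transaction and the whole
	-- load is lost, rather than just the offending row.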
Bruce Momjian wrote:
> Tom Lane wrote:
> > > How about for TODO:
> > > 	* Allow COPY to report error lines and continue; requires
> > > 	  nested transactions; optionally allow error codes to be specified
> >
> > Okay, that seems reasonable.
>
> Good.  Now that I think of it, nested transactions don't seem required.
> We already allow pg_dump to dump a database using INSERTs, and we don't
> put those inserts in a single transaction when we load them:
>
> 	CREATE TABLE "test" (
> 		"x" integer
> 	);
> 	INSERT INTO "test" VALUES (1);
> 	INSERT INTO "test" VALUES (2);
>
> Should we be wrapping these INSERTs in a transaction?  Can we do COPY
> with each row being its own transaction?

OK, added to TODO:

	o Allow COPY to report error lines and continue; optionally
	  allow error codes to be specified

Seems nested transactions are not required if we load each COPY line in
its own transaction, like we do with INSERT from pg_dump.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Seems nested transactions are not required if we load each COPY line in
> its own transaction, like we do with INSERT from pg_dump.

I don't think that's an acceptable answer.  Consider

	BEGIN;
	other stuff;
	COPY ....;
	other stuff;
	ROLLBACK;

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seems nested transactions are not required if we load each COPY line in
> > its own transaction, like we do with INSERT from pg_dump.
>
> I don't think that's an acceptable answer.  Consider

Oh, very good point.  "Requires nested transactions" added to TODO.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Seems nested transactions are not required if we load
> > > each COPY line in its own transaction, like we do with
> > > INSERT from pg_dump.
> >
> > I don't think that's an acceptable answer.  Consider
>
> Oh, very good point.  "Requires nested transactions" added to TODO.

Also add the performance issue with per-line commit...

Also-II - there is a more common name for the required feature:
savepoints.

Vadim
Mikheev, Vadim wrote:
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > Seems nested transactions are not required if we load
> > > > each COPY line in its own transaction, like we do with
> > > > INSERT from pg_dump.
> > >
> > > I don't think that's an acceptable answer.  Consider
> >
> > Oh, very good point.  "Requires nested transactions" added to TODO.
>
> Also add the performance issue with per-line commit...
>
> Also-II - there is a more common name for the required feature:
> savepoints.

OK, updated TODO to prefer the savepoints term.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>>>Bruce Momjian said:
> Mikheev, Vadim wrote:
> > > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > > Seems nested transactions are not required if we load
> > > > > each COPY line in its own transaction, like we do with
> > > > > INSERT from pg_dump.
> > > >
> > > > I don't think that's an acceptable answer.  Consider
> >
> > Oh, very good point.  "Requires nested transactions" added to TODO.
> >
> > Also add the performance issue with per-line commit...
> >
> > Also-II - there is a more common name for the required feature:
> > savepoints.
>
> OK, updated TODO to prefer the savepoints term.

Now, how about the same functionality for

	INSERT INTO table1 SELECT * FROM table2 ... WITH ERRORS;

It should allow the insert to complete even if table1 has unique
indexes and we try to insert duplicate rows.  Might save LOTS of time
in bulkloading scripts by not having to do single INSERTs.

Guess all this will be available in 7.3?

Daniel
> Now, how about the same functionality for
>
> 	INSERT INTO table1 SELECT * FROM table2 ... WITH ERRORS;
>
> It should allow the insert to complete even if table1 has unique
> indexes and we try to insert duplicate rows.  Might save LOTS of time
> in bulkloading scripts by not having to do single INSERTs.

1. I prefer Oracle's (and others', I believe) way - put the statement(s)
in a PL block and define, for each exception (error) of interest, what
action should be taken (i.e. IGNORE for a NON_UNIQ_KEY error, etc).

2. For an INSERT ... SELECT statement one can put DISTINCT in the
SELECT's target list.

> Guess all this will be available in 7.3?

We'll see.

Vadim
>>>"Vadim Mikheev" said:> 1. I prefer Oracle' (and others, I believe) way - put statement(s) in PL> block and define> forwhat exceptions (errors) what actions should be taken (ie IGNORE for> NON_UNIQ_KEY> error, etc). Some people prefer 'pure' SQL. Anyway, it can be argued which is worse - the usage of non-SQL language, or usage of extended SQL language. I guess the SQL standard does not provide for such functionality? > 2. For INSERT ... SELECT statement one can put DISTINCT in select' target> list. With this construct, you are effectively copying rows from one table to another - or constructing rows from various sources (constants, other tables etc) and inserting these in the table. If the target table has unique indexes (or constraints), and some of the rows returned by SELECT violate the restrictions - you are supposed to get errors - and unfortunately the entire INSERT is aborted. I fail to see how DISTINCT can help here... Perhaps it is possible to include checking for already existing tuples in the destination table in the select... but this will significantly increase the runtime, especially when the destination table is huge. My idea is to let this INSERT statement insert as much of its rows as possible, eventually returning NOTICEs or ignoring the errors (with an IGNORE ERRORS syntax for example :) I believe all this functionality will have to consider the syntax firts. Daniel
Vadim Mikheev wrote:
> > Now, how about the same functionality for
> >
> > 	INSERT INTO table1 SELECT * FROM table2 ... WITH ERRORS;
> >
> > It should allow the insert to complete even if table1 has unique
> > indexes and we try to insert duplicate rows.  Might save LOTS of time
> > in bulkloading scripts by not having to do single INSERTs.
>
> 1. I prefer Oracle's (and others', I believe) way - put the statement(s)
> in a PL block and define, for each exception (error) of interest, what
> action should be taken (i.e. IGNORE for a NON_UNIQ_KEY error, etc).

Added to TODO:

	* Allow command blocks that can ignore certain types of errors

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > 1. I prefer Oracle's (and others', I believe) way - put the
> > statement(s) in a PL block and define, for each exception (error) of
> > interest, what action should be taken (i.e. IGNORE for a
> > NON_UNIQ_KEY error, etc).
>
> Some people prefer 'pure' SQL.  Anyway, it can be argued which is
> worse - the usage of a non-SQL language, or the usage of an extended
> SQL language.  I guess the SQL standard does not provide for such
> functionality?

Yes, there is no such syntax in the standard.  And imho when some
feature is not in the standard then it's better to implement it the way
others do, for as much compatibility as possible.

> > 2. For an INSERT ... SELECT statement one can put DISTINCT in the
> > SELECT's target list.
>
> With this construct, you are effectively copying rows from one table
> to another - or constructing rows from various sources (constants,
> other tables etc) and inserting these into the table.  If the target
> table has unique indexes (or constraints), and some of the rows
> returned by the SELECT violate

Sorry, I didn't consider this case; you're right.

> I believe all this functionality will have to consider the syntax
> first.

All this functionality will have to consider the savepoints
implementation first.  As for syntax - we could implement both.

Vadim