Thread: Bulkloading using COPY - ignore duplicates?
Hello, I'm in the process of porting a large application from Ingres to PostgreSQL. We make heavy use of bulkloading using the 'COPY' statement in ESQL/C. Consider the SQL statements below (in a psql session on an arbitrary database):

  CREATE TABLE copytest(f1 INTEGER, f2 INTEGER);
  CREATE UNIQUE INDEX copytest_idx ON copytest USING BTREE(f1, f2);
  COPY copytest FROM '/tmp/copytest';

Given the file /tmp/copytest:

  1	1
  2	2
  3	3
  4	4
  4	4
  5	5
  6	6

will result in the following output:

  ERROR: copy: line 5, Cannot insert a duplicate key into unique index copytest_idx

However my application code is assuming that duplicate rows will simply be ignored (this is the case in Ingres, and I believe Oracle's bulkloader too). I propose modifying _bt_check_unique() in /backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than ERROR) elog() and return NULL (or appropriate) to the calling function if a duplicate key is detected and a 'COPY FROM' is in progress (add a new parameter to flag this).

Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input? Would:

  WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)

need to be added to the COPY command (I hope not)?

Thanks,

-- 
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Lee Kindness wrote:
> <snip>
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)

I would suggest:

  WITH ON_DUPLICATE = IGNORE|TERMINATE

Or maybe IGNORE_DUPLICATE, purely for easier understanding, given there is no present standard nor other databases' syntax to conform to. :)

Regards and best wishes,

Justin Clift

> > need to be added to the COPY command (I hope not)?
>
> Thanks,
>
> --
> Lee Kindness, Senior Software Engineer
> Concept Systems Limited.

-- 
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
Lee Kindness <lkindness@csl.co.uk> writes: > Would this seem a reasonable thing to do? Does anyone rely on COPY > FROM causing an ERROR on duplicate input? Yes. This change will not be acceptable unless it's made an optional (and not default, IMHO, though perhaps that's negotiable) feature of COPY. The implementation might be rather messy too. I don't much care for the notion of a routine as low-level as bt_check_unique knowing that the context is or is not COPY. We might have to do some restructuring. > Would: > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar) > need to be added to the COPY command (I hope not)? It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate. regards, tom lane
Justin Clift writes:
> Lee Kindness wrote:
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> I would suggest:
> WITH ON_DUPLICATE = IGNORE|TERMINATE
> purely for easier understanding, given there is no present standard nor other databases' syntax to conform to.

Personally I don't see the need, and think that 'COPY FROM' could well just go with the new semantics...

Onto an implementation issue - _bt_check_unique() returns a TransactionId; my plan was to return NullTransactionId on a duplicate key, but naturally this is used in the success scenario. Looking in backend/transam/transam.c I see:

  TransactionId NullTransactionId = (TransactionId) 0;
  TransactionId AmiTransactionId = (TransactionId) 512;
  TransactionId FirstTransactionId = (TransactionId) 514;

From this I'd gather values below 514 can be used as magic values/constants. So would I be safe doing:

  TransactionId XXXXTransactionId = (TransactionId) 1;

and returning XXXXTransactionId from _bt_check_unique() back to _bt_doinsert()? Naturally XXXX is something meaningful. I presume all I need to know is whether 'xwait' in _bt_check_unique() is ever '1'...

Thanks,

-- 
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Tom Lane writes:
> Lee Kindness <lkindness@csl.co.uk> writes:
> > Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input?
> Yes. This change will not be acceptable unless it's made an optional (and not default, IMHO, though perhaps that's negotiable) feature of COPY.

I see where you're coming from, but seriously, what's the use/point of COPY aborting and doing a rollback if one duplicate key is found? I think it's quite reasonable to presume the input to COPY has had as little processing done on it as possible. I could loop through the input file before sending it to COPY, but that's just wasting cycles and effort - Postgres has btree lookup built in, I don't want to roll my own before giving Postgres my input file!

> The implementation might be rather messy too. I don't much care for the notion of a routine as low-level as bt_check_unique knowing that the context is or is not COPY. We might have to do some restructuring.

Well, in reality it wouldn't be "you're getting run from copy" but rather "notice on duplicate, rather than error & exit". There is a telling comment in nbtinsert.c just before _bt_check_unique() is called:

  /*
   * If we're not allowing duplicates, make sure the key isn't already
   * in the index. XXX this belongs somewhere else, likely
   */

So perhaps dupes should be searched for before _bt_doinsert is called, or somewhere more appropriate?

> > Would:
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > need to be added to the COPY command (I hope not)?
> It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate.

That makes quite a bit of sense.

-- 
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Lee Kindness <lkindness@csl.co.uk> writes: > I see where you're coming from, but seriously what's the use/point of > COPY aborting and doing a rollback if one duplicate key is found? Error detection. If I'm loading what I think is valid data, having the system silently ignore certain types of errors is not acceptable --- I'm especially not pleased at the notion of removing an error check that's always been there because someone else thinks that would make it more convenient for his application. > I think it's quite reasonable to presume the input to COPY has had as > little processing done on it as possible. The primary and traditional use of COPY has always been to reload dumped data. That's why it doesn't do any fancy processing like DEFAULT insertion, and that's why it should be quite strict about error conditions. In a reload scenario, any sort of problem deserves careful investigation. regards, tom lane
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > I thought that the problem was, that you cannot simply skip the > insert, because at that time the tuple (pointer) might have already > been successfully inserted into an other index/heap, and thus this was > only sanely possible with savepoints/undo. Hmm, good point. If we don't error out the transaction then that tuple would become good when we commit. This is nastier than it appears. regards, tom lane
Tom Lane writes:
> I'm especially not pleased at the notion of removing an error check that's always been there because someone else thinks that would make it more convenient for his application.

Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario in industry. And I never said the duplicate should be silently ignored - an elog(NOTICE) should still be output.

Lee.
Lee Kindness wrote:
> Tom Lane writes:
> > Lee Kindness <lkindness@csl.co.uk> writes:
> > > Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input?
> > Yes. This change will not be acceptable unless it's made an optional (and not default, IMHO, though perhaps that's negotiable) feature of COPY.
>
> I see where you're coming from, but seriously what's the use/point of COPY aborting and doing a rollback if one duplicate key is found? I think it's quite reasonable to presume the input to COPY has had as little processing done on it as possible. I could loop through the input file before sending it to COPY but that's just wasting cycles and effort - Postgres has btree lookup built in, I don't want to roll my own before giving Postgres my input file!
>
> > The implementation might be rather messy too. I don't much care for the notion of a routine as low-level as bt_check_unique knowing that the context is or is not COPY. We might have to do some restructuring.
>
> Well in reality it wouldn't be "you're getting run from copy" but rather "notice on duplicate, rather than error & exit". There is a telling comment in nbtinsert.c just before _bt_check_unique() is called:
>
>   /*
>    * If we're not allowing duplicates, make sure the key isn't already
>    * in the index. XXX this belongs somewhere else, likely
>    */
>
> So perhaps dupes should be searched for before _bt_doinsert is called, or somewhere more appropriate?
>
> > > Would:
> > > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > > need to be added to the COPY command (I hope not)?
> > It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate.
>
> That makes quite a bit of sense.

This is trying to avoid one step. IMHO, you should copy into a temporary table and then do a select distinct from it into the table that you want.

A. You can validate your data before you put it into your permanent table.
B. This doesn't cost you much.

Don't make the assumption that bulk copies have not been checked or validated. The assumption should be correct data or you shouldn't be using COPY.
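A minimal sketch of the staging-table approach described above, using the copytest table and /tmp/copytest file from the start of the thread; the staging table name is illustrative only:

  -- load the raw file into a staging table that has no unique index
  CREATE TEMPORARY TABLE copytest_stage (f1 INTEGER, f2 INTEGER);
  COPY copytest_stage FROM '/tmp/copytest';

  -- move only distinct rows into the real table, so the unique index
  -- on copytest never sees a duplicate
  INSERT INTO copytest SELECT DISTINCT f1, f2 FROM copytest_stage;

  DROP TABLE copytest_stage;

As later messages in the thread point out, it is the SELECT DISTINCT step that dominates the run time for large loads.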
> > Would this seem a reasonable thing to do? Does anyone rely on COPY > > FROM causing an ERROR on duplicate input? > > Yes. This change will not be acceptable unless it's made an optional > (and not default, IMHO, though perhaps that's negotiable) feature of > COPY. > > The implementation might be rather messy too. I don't much > care for the > notion of a routine as low-level as bt_check_unique knowing that the > context is or is not COPY. We might have to do some restructuring. > > > Would: > > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar) > > need to be added to the COPY command (I hope not)? > > It occurs to me that skip-the-insert might be a useful option for > INSERTs that detect a unique-key conflict, not only for COPY. (Cf. > the regular discussions we see on whether to do INSERT first or > UPDATE first when the key might already exist.) Maybe a SET variable > that applies to all forms of insertion would be appropriate. Imho yes, but: I thought that the problem was, that you cannot simply skip the insert, because at that time the tuple (pointer) might have already been successfully inserted into an other index/heap, and thus this was only sanely possible with savepoints/undo. An idea would probably be to at once mark the new tuple dead, and proceed normally? Andreas
Zeugswetter Andreas SB SD wrote:
> > IMHO, you should copy into a temporary table and then do a select distinct from it into the table that you want.
>
> Which would be way too slow for normal operation :-(
> We are talking about a "fast as possible" data load from a flat file that may have duplicates (or even data errors, but that is another issue).
>
> Andreas

Then the IGNORE_DUPLICATE would definitely be the way to go, if speed is the question...
> IMHO, you should copy into a temporary table and then do a select distinct from it into the table that you want.

Which would be way too slow for normal operation :-(

We are talking about a "fast as possible" data load from a flat file that may have duplicates (or even data errors, but that is another issue).

Andreas
Okay, if I'm going to modify 'COPY FROM' to include 'ignore duplicates' functionality it looks like I'll have to add to the COPY syntax. The most obvious way is to add:

  WITH IGNORE DUPLICATES

to the syntax. I'm going to need my hand held a bit for this! The grammar for COPY will need updating in gram.y, and specifically the 'WITH' clause will need to accept 'IGNORE DUPLICATES' as well as 'NULL AS'. Any pointers?

Thanks, Lee.
Lee Kindness writes:
> If I'm going to modify 'COPY FROM' to include 'ignore duplicates' functionality it looks like I'll have to add to the COPY syntax. The most obvious way is to add:
> WITH IGNORE DUPLICATES

Or does it make more sense to add a 'COPY_IGNORE_DUPLICATES' SET parameter?

Lee.
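For reference, the two proposed interfaces would look roughly like this at the SQL level. Both are sketches of syntax that does not exist at this point; the SET variable name is simply Lee's suggestion:

  -- per-statement option on COPY itself
  COPY copytest FROM '/tmp/copytest' WITH IGNORE DUPLICATES;

  -- session-level setting applying to subsequent loads
  SET copy_ignore_duplicates = on;
  COPY copytest FROM '/tmp/copytest';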
> However my application code is assuming that duplicate rows will
> simply be ignored (this is the case in Ingres, and I believe Oracle's
> bulkloader too). I propose modifying _bt_check_unique() in
> /backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than
> ERROR) elog() and return NULL (or appropriate) to the calling function
> if a duplicate key is detected and a 'COPY FROM' is in progress (add
> new parameter to flag this).

If you have a UNIQUE index on the table, just throwing away duplicates seems really bad to me. I know Ingres had that heapsort structure that would remove duplicates. That may be an interesting feature to add as an operation that can be performed.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
I said: > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: >> I thought that the problem was, that you cannot simply skip the >> insert, because at that time the tuple (pointer) might have already >> been successfully inserted into an other index/heap, and thus this was >> only sanely possible with savepoints/undo. > Hmm, good point. If we don't error out the transaction then that tuple > would become good when we commit. This is nastier than it appears. On further thought, I think it *would* be possible to do this without savepoints, but it'd take some restructuring of the index AM API. What'd have to happen is that a unique index could not raise an elog ERROR when it detects a uniqueness conflict. Instead, it'd return a uniqueness-conflict indication back to its caller. This would have to propagate up to the level of the executor. At that point we'd make the choice of whether to raise an error or not. If not, we'd need to modify the just-created tuple to mark it deleted by the current transaction. We can't remove it, since that would leave any already-created entries in other indexes pointing to nothing. But marking it deleted by the same xact and command ID that inserted it would leave things in a valid state until VACUUM comes along to do the janitorial work. To support backoff in the case of a conflict during UPDATE, it'd also be necessary to un-mark the prior version of the tuple, which we'd already marked as deleted. This might create some concurrency issues in case there are other updaters waiting to see if we commit or not. (The same issue would arise for savepoint-based undo, though.) We might want to punt on that part for now. The effects don't stop propagating there, either. The decision not to insert the tuple must be reported up still further, so that the executor knows not to run any AFTER INSERT/UPDATE triggers and knows not to count the tuple as inserted/updated for the command completion report. In short, quite a lot of code to touch to make this happen ... regards, tom lane
> -----Original Message-----
> From: Tom Lane
>
> I said:
> > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> >> I thought that the problem was, that you cannot simply skip the insert, because at that time the tuple (pointer) might have already been successfully inserted into an other index/heap, and thus this was only sanely possible with savepoints/undo.
>
> > Hmm, good point. If we don't error out the transaction then that tuple would become good when we commit. This is nastier than it appears.
>
> On further thought, I think it *would* be possible to do this without savepoints,

It's a very well known issue that partial rollback functionality is at the basis of this kind of problem, and it's the reason I've mentioned that UNDO functionality has the highest priority. IMHO we shouldn't implement partial rollback functionality specific to an individual problem.

regards,
Hiroshi Inoue
> The effects don't stop propagating there, either. The decision > not to insert the tuple must be reported up still further, so > that the executor knows not to run any AFTER INSERT/UPDATE > triggers and knows not to count the tuple as inserted/updated > for the command completion report. But what about BEFORE insert/update triggers which could insert records too? Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> The effects don't stop propagating there, either. The decision >> not to insert the tuple must be reported up still further, so >> that the executor knows not to run any AFTER INSERT/UPDATE >> triggers and knows not to count the tuple as inserted/updated >> for the command completion report. > But what about BEFORE insert/update triggers which could insert > records too? Well, what about them? It's already possible for a later BEFORE trigger to cause the actual insertion to be suppressed, so I don't see any difference from what we have now. If a BEFORE trigger takes actions on the assumption that the insert will happen, it's busted already. Mind you, I'm not actually advocating that we do any of this ;-). I was just sketching a possible implementation approach in case someone wants to try it. regards, tom lane
> > But what about BEFORE insert/update triggers which could > > insert records too? > > Well, what about them? It's already possible for a later > BEFORE trigger to cause the actual insertion to be suppressed, > so I don't see any difference from what we have now. > If a BEFORE trigger takes actions on the assumption that the > insert will happen, it's busted already. This problem could be solved now by implementing *single* trigger. In future, we could give users ability to specify trigger execution order. But with proposed feature ... > Mind you, I'm not actually advocating that we do any of this ;-). I understand -:) > I was just sketching a possible implementation approach in > case someone wants to try it. And I'm just sketching possible problems -:) Vadim
Tom Lane writes: > It occurs to me that skip-the-insert might be a useful option for > INSERTs that detect a unique-key conflict, not only for COPY. (Cf. > the regular discussions we see on whether to do INSERT first or > UPDATE first when the key might already exist.) Maybe a SET variable > that applies to all forms of insertion would be appropriate. What we need is: 1. Make errors not abort the transaction. 2. Error codes Then you can make your client deal with this in which ever way you want, at least for single-value inserts. However, it seems to me that COPY ignoring duplicates can easily be done by preprocessing the input file. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
I have used Oracle SQL*Loader for many years now. It has the ability to put rejects/discards/bad records into an output file and keep on going; maybe this should be added to the copy command:

  COPY [ BINARY ] table [ WITH OIDS ]
      FROM { 'filename' | stdin }
      [ [USING] DELIMITERS 'delimiter' ]
      [ WITH NULL AS 'nullstring' ]
      [ DISCARDS 'filename' ]

What do you think?

> Tom Lane writes:
>
> > It occurs to me that skip-the-insert might be a useful option for INSERTs that detect a unique-key conflict, not only for COPY. (Cf. the regular discussions we see on whether to do INSERT first or UPDATE first when the key might already exist.) Maybe a SET variable that applies to all forms of insertion would be appropriate.
>
> What we need is:
>
> 1. Make errors not abort the transaction.
>
> 2. Error codes
>
> Then you can make your client deal with this in which ever way you want, at least for single-value inserts.
>
> However, it seems to me that COPY ignoring duplicates can easily be done by preprocessing the input file.
>
> --
> Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
Peter Eisentraut writes:
> However, it seems to me that COPY ignoring duplicates can easily be done by preprocessing the input file.

Or by post-processing, like (error checking cut):

  void import_shots(char *impfile, int lineshoot_id)
  {
      char tab_name[128];
      char tab_temp[128];
      char cmd[256];      /* command buffer; declaration assumed, not in the original excerpt */

      frig_file(impfile); /* add the postgres header */
      sprintf(tab_name, "shot_%d", lineshoot_id);
      sprintf(tab_temp, "shot_%d_tmp", lineshoot_id);

      sprintf(cmd, "CREATE TEMPORARY TABLE %s AS SELECT * FROM shot", tab_temp);
      EXEC SQL EXECUTE IMMEDIATE :cmd;
      EXEC SQL COMMIT WORK;   /* will not work without commit here! */

      sprintf(cmd, "COPY BINARY %s FROM '%s'", tab_temp, impfile);
      append_page_alloc(cmd, tab_name, impfile, 1);
      EXEC SQL EXECUTE IMMEDIATE :cmd;

      sprintf(cmd, "INSERT INTO %s SELECT DISTINCT ON (shot_time) * FROM %s", tab_name, tab_temp);
      EXEC SQL EXECUTE IMMEDIATE :cmd;

      sprintf(cmd, "DROP TABLE %s", tab_temp);
      EXEC SQL EXECUTE IMMEDIATE :cmd;

      EXEC SQL COMMIT WORK;
      remove(impfile);
  }

However this is adding significant time to the import operation. Likewise I could loop round the input file first and hunt for duplicates, again with a performance hit. My main point is that Postgres can easily and quickly check for duplicates during the COPY (as it does currently) and it adds zero execution time to simply ignore these duplicate rows. Obviously this is a useful feature, otherwise Oracle, Ingres and other commercial relational databases wouldn't feature similar functionality.

Yes, in an ideal world the input to COPY should be clean and consistent with defined indexes. However this is only really the case when COPY is used for database/table backup and restore. It misses the point that a major use of COPY is in speed optimisation on bulk inserts...

Lee.
Guys,

I've made some inroads towards adding 'ignore duplicates' functionality to PostgreSQL's COPY command. I've updated the parser grammar for COPY FROM to now accept:

  COPY [ BINARY ] table [ WITH OIDS ]
      FROM { 'filename' | stdin }
      [ [USING] DELIMITERS 'delimiter' ]
      [ WITH [NULL AS 'null string'] [IGNORE DUPLICATES] ]

and added code to propagate this setting down to the CopyFrom function in backend/commands/copy.c.

I also played around with _bt_check_unique, _bt_doinsert and btinsert to return NULL on duplicate rather than elog(ERROR). Likewise ExecInsertIndexTuples and index_insert were passed the ignore_duplicate flag, and index_insert changed to elog(ERROR) if the return from the insert function was NULL and the ignore_duplicate flag was false. These changes worked and gave the desired result for the COPY FROM command, however as many mentioned these changes are far too low level...

After assessing the situation more fully, I believe the following change in CopyFrom would be more suitable:

      /* BEFORE ROW INSERT Triggers */
      if (resultRelInfo->ri_TrigDesc &&
          resultRelInfo->ri_TrigDesc->n_before_row[TRIGGER_EVENT_INSERT] > 0)
      {
          HeapTuple   newtuple;

          newtuple = ExecBRInsertTriggers(estate, resultRelInfo, tuple);

          if (newtuple == NULL)        /* "do nothing" */
              skip_tuple = true;
          else if (newtuple != tuple)  /* modified by Trigger(s) */
          {
              heap_freetuple(tuple);
              tuple = newtuple;
          }
      }

      /* new code */
      if (ignore_duplicates == true)
      {
          if (duplicate index value)
              skip_tuple = true;
      }

      if (!skip_tuple)
      {

Now I imagine 'duplicate index value' would be functionally similar to _bt_check_unique but obviously higher level. Is there any existing code with the functionality I desire? Can anyone point me in the right way...

Thanks, Lee Kindness.

Lee Kindness writes:
> I'm in the process of porting a large application from Ingres to PostgreSQL. We make heavy use of bulkloading using the 'COPY' statement in ESQL/C. Consider the SQL statements below (in a psql session on an arbitrary database):
>
> CREATE TABLE copytest(f1 INTEGER, f2 INTEGER);
> CREATE UNIQUE INDEX copytest_idx ON copytest USING BTREE(f1, f2);
> COPY copytest FROM '/tmp/copytest';
>
> Given the file /tmp/copytest:
>
> 1	1
> 2	2
> 3	3
> 4	4
> 4	4
> 5	5
> 6	6
>
> will result in the following output:
>
> ERROR: copy: line 5, Cannot insert a duplicate key into unique index copytest_idx
>
> However my application code is assuming that duplicate rows will simply be ignored (this is the case in Ingres, and I believe Oracle's bulkloader too). I propose modifying _bt_check_unique() in /backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than ERROR) elog() and return NULL (or appropriate) to the calling function if a duplicate key is detected and a 'COPY FROM' is in progress (add new parameter to flag this).
>
> Would this seem a reasonable thing to do? Does anyone rely on COPY FROM causing an ERROR on duplicate input? Would:
>
> WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
>
> need to be added to the COPY command (I hope not)?
>
> Thanks,
>
> --
> Lee Kindness, Senior Software Engineer
> Concept Systems Limited.
Gents,

I started quite a long thread about this back in September. To summarise, I was proposing that COPY FROM would not abort the transaction when it encountered data which would cause a uniqueness violation on the table index(s). Generally I think this was seen as a 'Good Thing'(TM) for a number of reasons:

1. Performance enhancements when doing bulk inserts - pre- or post-processing the data to remove duplicates is very time consuming. Likewise the best tool should always be used for the job at hand, and for searching/removing things that's a database.

2. Feature parity with other database systems. For example Oracle's SQL*Loader has a feature to not insert duplicates and rather move them to another file for later investigation.

Naturally the default behaviour would be the current one of assuming valid data. Also the duplicate check would not add anything to the current code path for COPY FROM - it would not take any longer.

I attempted to add this functionality to PostgreSQL myself but got as far as an updated parser and a COPY FROM which resulted in a database recovery! So (here's the question finally) is it worthwhile adding this enhancement to the TODO list?

Thanks, Lee.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
> Tom Lane writes:
> > I'm especially not pleased at the notion of removing an error check that's always been there because someone else thinks that would make it more convenient for his application.
>
> Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario in industry.

Isn't that when you do your bulk copy into a holding table, then clean it up, and then insert into your live system?

Patrick
Patrick Welche writes:
> On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
> > Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario.
> Isn't that when you do your bulk copy into a holding table, then clean it up, and then insert into your live system?

That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.

The majority of database systems out there handle this situation in one manner or another (MySQL ignores or replaces; Ingres ignores; Oracle ignores or logs; others...). Indeed PostgreSQL currently checks for duplicates in the COPY code but throws an elog(ERROR) rather than ignoring the row, or passing the error back up the call chain.

My use of PostgreSQL is very time critical, and sadly this issue alone may force an evaluation of Oracle's performance in this respect!

Best regards, Lee Kindness.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Lee Kindness wrote:
>
> Patrick Welche writes:
> > On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
> > > Please, don't get me wrong - I don't want to come across as arrogant. I'm simply trying to improve the 'COPY FROM' command in a situation where speed is a critical issue and the data is dirty... And that must be a relatively common scenario.
> > Isn't that when you do your bulk copy into a holding table, then clean it up, and then insert into your live system?
>
> That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.
>
> The majority of database systems out there handle this situation in one manner or another (MySQL ignores or replaces; Ingres ignores; Oracle ignores or logs; others...). Indeed PostgreSQL currently checks for duplicates in the COPY code but throws an elog(ERROR) rather than ignoring the row, or passing the error back up the call chain.

I guess postgresql will be able to do it once savepoints get implemented.

> My use of PostgreSQL is very time critical, and sadly this issue alone may force an evaluation of Oracle's performance in this respect!

Can't you clean the duplicates _outside_ postgresql, say

  cat dumpfile | sort | uniq | psql db -c 'copy mytable from stdin'

with your version of uniq. Or perhaps

  psql db -c 'copy mytable to stdout' >> dumpfile
  sort dumpfile | uniq | psql db -c 'copy mytable from stdin'

if you already have something in mytable.

------------
Hannu
Hannu Krosing writes:
> Lee Kindness wrote:
> > The majority of database systems out there handle this situation in one manner or another (MySQL ignores or replaces; Ingres ignores; Oracle ignores or logs; others...). Indeed PostgreSQL currently checks for duplicates in the COPY code but throws an elog(ERROR) rather than ignoring the row, or passing the error back up the call chain.
> I guess postgresql will be able to do it once savepoints get implemented.

This is encouraging to hear. I can see how this would make the code changes relatively minimal and more manageable - the changes to the current code are simply over my head! Are savepoints relatively high up on the TODO list, once 7.2 is out the door?

> > My use of PostgreSQL is very time critical, and sadly this issue alone may force an evaluation of Oracle's performance in this respect!
> Can't you clean the duplicates _outside_ postgresql, say
> cat dumpfile | sort | uniq | psql db -c 'copy mytable from stdin'

This is certainly a possibility, however it's just really moving the processing elsewhere. The combined time is still around the same. I've/we've done a lot of investigation with approaches like this and also with techniques assuming the locality of the duplicates (which is a no-goer). None improve the situation. I'm not going to compare the time of just using INSERTs rather than COPY...

Thanks for your response, Lee Kindness.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Lee Kindness writes:
> I'm not going to compare the time of just using INSERTs rather than COPY...

Ooops, I'm NOW going to... Obviously my subconscious is telling me otherwise - bring on the Christmas party!

Lee.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.

Then your column really isn't unique, so how about dropping the unique index, import the data, fix the duplicates, recreate the unique index - just as another possible work around ;)

Patrick
Patrick Welche writes:
> On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> > That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.
> Then your column really isn't unique,

That's another discussion entirely ;) - it's spat out by a real-time system which doesn't have the time or resources to check this. Further precision loss later in the data's life adds more duplicates...

> so how about dropping the unique index, import the data, fix the duplicates, recreate the unique index - just as another possible work around ;)

This is just going to be the same(ish) time, no?

  CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
  COPY tab FROM 'file';
  DELETE FROM tab WHERE (p1, p2) NOT IN (SELECT DISTINCT p1, p2 FROM tab);
  CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);

or am I missing something?

Thanks, Lee.

-- 
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
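As an aside on the snippet above: written that way the DELETE removes nothing, since every (p1, p2) value is necessarily in the SELECT DISTINCT list. A sketch of a clean-up step that keeps one row per key, assuming the table was created with its default oid column (and making no claim about how it compares speed-wise):

  -- keep the row with the smallest oid for each (p1, p2) pair
  DELETE FROM tab
   WHERE oid NOT IN (SELECT min(oid) FROM tab GROUP BY p1, p2);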
On Thu, Dec 13, 2001 at 03:44:31PM +0000, Lee Kindness wrote:
> Patrick Welche writes:
> > On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> > > That's what I'm currently doing as a workaround - a SELECT DISTINCT from a temporary table into the real table with the unique index on it. However this takes absolute ages - say 5 seconds for the copy (which is the ballpark figure I'm aiming toward and can achieve with Ingres) plus another 30ish seconds for the SELECT DISTINCT.
> > Then your column really isn't unique,
>
> That's another discussion entirely ;) - it's spat out by a real-time system which doesn't have the time or resources to check this. Further precision loss later in the data's life adds more duplicates...

Hmm, the data has a later life - sounds like you'll need to remove dups then anyway, so can you get away with just letting the dups in? Remove the UNIQUE requirement, and let the real-time system just dump away. How critical is it to later steps that there be no dups? And how many (potential) dups is your RTS producing, anyway? Your later processing (which apparently can _generate_ dups) might be the out-of-the-critical-time-path place to worry about removing dups.

Ross

P.S. This falls into the class of problem solving characterized by "if you can't solve the problem as stated, restate the problem to be one you _can_ solve" ;-)

> > so how about dropping the unique index, import the data, fix the duplicates, recreate the unique index - just as another possible work around ;)
>
> This is just going to be the same(ish) time, no?
>
> CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
> COPY tab FROM 'file';
> DELETE FROM tab WHERE (p1, p2) NOT IN (SELECT DISTINCT p1, p2 FROM tab);
> CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);
>
> or am I missing something?
>
> Thanks, Lee.
Lee Kindness writes:

> Yes, in an ideal world the input to COPY should be clean and consistent with defined indexes. However this is only really the case when COPY is used for database/table backup and restore. It misses the point that a major use of COPY is in speed optimisation on bulk inserts...

I think allowing this feature would open up a world of new dangerous ideas, such as ignoring check constraints or foreign keys or magically massaging other tables so that the foreign keys are satisfied, or ignoring default values, or whatever. The next step would then be allowing the same optimizations in INSERT. I feel COPY should load the data and that's it. If you don't like the data you have then you have to fix it first.

-- 
Peter Eisentraut   peter_e@gmx.net
Lee Kindness writes:

> 1. Performance enhancements when doing bulk inserts - pre- or post-processing the data to remove duplicates is very time consuming. Likewise the best tool should always be used for the job at hand, and for searching/removing things that's a database.

Arguably, a better tool for this is sort(1). For instance, if you have a typical copy input file with tab-separated fields and the primary key is in columns 1 and 2, you can remove duplicates with

  sort -k 1,2 -u INFILE > OUTFILE

To get a record of what duplicates were removed, use diff.

-- 
Peter Eisentraut   peter_e@gmx.net
O.K., time to start looking into the _nature_ of the dups in your data, to see if there's anything specific to take advantage of, since the general solution (tell the DBMS to ignore dups) isn't available, and isn't likely to get there real soon.

So what does your data look like, and how do the dups occur? Any chance it's in a really simple format, and the dups are also really simple, like 'one record per line, dups occur as identical adjacent lines'? If so, 'uniq' will solve the problem with little to no speed penalty. (It's the sort that kills...)

Or are you only getting a dup'ed field, and the rule is 'ignore later records'? I could see this happening if the data is timestamped at a granularity that doesn't _exactly_ match the repetition rate: e.g. stamp to the second, record once a second.

So, what's it look like? Since it's one format, I bet a small, simple pipe filter could handle dup elimination on the fly.

Ross

On Thu, Dec 13, 2001 at 05:02:15PM +0000, Lee Kindness wrote:
>
> The RTS outputs to a file which is then subsequently used as input to other packages, one of which is the application I'm concerned with. While fixing at source is the ideal solution there are terabytes of legacy data around (this is raw seismic navigational data). Also there is more than one competing package...
>
> Our package post-processes (we're still very concerned about speed as this is normally done while 'shooting' the seismic data) this data to produce the final seismic navigational data, which is then later used by other products...
>
> The problem at hand is importing the initial data - no duplicates are produced by the program itself later (nor in its output data).
>
> Sadly a large number of later SQL queries assume no duplicates and would result in incorrect processing calculations, amongst other things. The sheer number of these queries makes changing them impractical.
>
> > P.S. This falls into the class of problem solving characterized by "if you can't solve the problem as stated, restate the problem to be one you _can_ solve" ;-)
>
> Which is what I've been knocking my head against for the last few weeks ;) The real problem is that a move away from our current RDBMS (Ingres) to PostgreSQL will not happen if the performance of the product significantly decreases (which it currently has for the import stage), and Ingres already just ignores the duplicates...
>
> I really want to move to PostgreSQL...
>
> Thanks for your input,
>
> --
> Lee Kindness, Senior Software Engineer, Concept Systems Limited.
> http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Peter Eisentraut writes:
> I think allowing this feature would open up a world of new dangerous ideas, such as ignoring check constraints or foreign keys or magically massaging other tables so that the foreign keys are satisfied, or ignoring default values, or whatever. The next step would then be allowing the same optimizations in INSERT. I feel COPY should load the data and that's it. If you don't like the data you have then you have to fix it first.

I agree that PostgreSQL's checks during COPY are a bonus and I wouldn't dream of not having them. Many database systems provide a fast bulkload by ignoring these constraints and cross references - that's a tricky/horrid situation. However I suppose the question is whether such 'invalid data' should abort the transaction; it seems a bit drastic... I suppose I'm not really after an IGNORE DUPLICATES option, but rather a CONTINUE ON ERROR kind of thing.

Regards, Lee.