Thread: Practical error logging for very large COPY statements
If you've ever loaded 100 million rows, you'll know just how annoying it is to find that you have a duplicate row somewhere in there. Experience shows that there is always one, whatever oath the analyst swears beforehand.

It's hard to find out which row is the duplicate, plus you've just screwed up a big table. It needs a VACUUM, then a reload.

I'd like to find a way to handle this manual task programmatically.

What I'd like to do is add an ERRORTABLE clause to COPY. The main problem is how we detect a duplicate row violation, yet prevent it from aborting the transaction.

What I propose is to log uniqueness violations only when there is only a single unique index on a table.

Flow of control would be to:

    locate page of index where value should go
    lock index block
    _bt_check_unique, but don't error
    if violation then insert row into ERRORTABLE
    else
    insert row into data block
    insert row into unique index
    unlock index block
    do other indexes

Which is very similar code to the recently proposed MERGE logic.

With that logic, a COPY will run to completion, yet be able to report the odd couple of unique index violations it found along the way. More importantly, we can then handle those rows with another program to locate where those errors came from and resolve them.

In most cases with a single unique index, the index inserts are rightmost index entries anyway, so there is scope here for an additional optimisation: keep both index and data blocks locked across multiple row inserts until either the unique index or the data block fills. That's better than taking a full table lock, since it allows concurrent access to the rest of the table, but it's also more efficient than continually re-requesting the same blocks (which looks like about a 10-15% saving on performance from hash lookups, lock/unlock, etc).

Best Regards, Simon Riggs
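For context, this is roughly the manual routine the proposal wants to make unnecessary: load into an unconstrained staging copy of the table and hunt the duplicates by hand before attempting the real load. The table and column names here are made up purely for illustration.

    -- Illustrative only; "orders" and "order_id" are hypothetical names.
    CREATE TABLE orders_staging (LIKE orders);   -- same columns, no unique index

    COPY orders_staging FROM '/data/orders.dat';

    -- find the offending keys before loading the real table
    SELECT order_id, count(*)
    FROM orders_staging
    GROUP BY order_id
    HAVING count(*) > 1;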
Simon Riggs <simon@2ndquadrant.com> writes:
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.

If this only solves the problem of duplicate keys, and not any other kind of COPY error, it's not going to be much of an advance.

> Flow of control would be to:
> locate page of index where value should go
> lock index block
> _bt_check_unique, but don't error
> if violation then insert row into ERRORTABLE
> else
> insert row into data block
> insert row into unique index
> unlock index block
> do other indexes

Ugh. Do you realize how many levels of modularity violation are implied by that sketch? Have you even thought about the fact that we have more than one kind of index?

    regards, tom lane
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> What I'd like to do is add an ERRORTABLE clause to COPY. The main
>> problem is how we detect a duplicate row violation, yet prevent it from
>> aborting the transaction.
>
> If this only solves the problem of duplicate keys, and not any other
> kind of COPY error, it's not going to be much of an advance.

Yeah, and I see errors from bad data as often as from violating constraints. Maybe the best way, if we do something like this, would be to have the error table contain a single text, or maybe bytea, field which contained the raw offending input line.

cheers

andrew
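A rough sketch of the sort of error table Andrew is describing; nothing like it exists in COPY today, and the table and column names are invented.

    -- Hypothetical reject table: one row per input line COPY could not load,
    -- with the offending line kept verbatim so no information is lost.
    CREATE TABLE copy_rejects (
        lineno   bigint,   -- position of the line in the COPY input
        errmsg   text,     -- the error message that caused the rejection
        rawline  text      -- the raw input line (bytea if it may not be valid text)
    );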
Seems similar to the pgloader project on pgfoundry.org.

Chris

Simon Riggs wrote:
> If you've ever loaded 100 million rows, you'll know just how annoying it
> is to find that you have a duplicate row somewhere in there. Experience
> shows that there is always one, whatever oath the analyst swears
> beforehand.
>
> It's hard to find out which row is the duplicate, plus you've just
> screwed up a big table. It needs a VACUUM, then a reload.
>
> I'd like to find a way to handle this manual task programmatically.
>
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.
>
> What I propose is to log uniqueness violations only when there is only a
> single unique index on a table.
>
> Flow of control would be to:
>
> locate page of index where value should go
> lock index block
> _bt_check_unique, but don't error
> if violation then insert row into ERRORTABLE
> else
> insert row into data block
> insert row into unique index
> unlock index block
> do other indexes
>
> Which is very similar code to the recently proposed MERGE logic.
>
> With that logic, a COPY will run to completion, yet be able to report
> the odd couple of unique index violations it found along the way. More
> importantly, we can then handle those rows with another program to locate
> where those errors came from and resolve them.
>
> In most cases with a single unique index, the index inserts are
> rightmost index entries anyway, so there is scope here for an additional
> optimisation: keep both index and data blocks locked across multiple row
> inserts until either the unique index or the data block fills. That's
> better than taking a full table lock, since it allows concurrent access
> to the rest of the table, but it's also more efficient than continually
> re-requesting the same blocks (which looks like about a 10-15% saving on
> performance from hash lookups, lock/unlock, etc).
>
> Best Regards, Simon Riggs
On Mon, 2005-11-21 at 19:38 -0500, Andrew Dunstan wrote:
> Tom Lane wrote:
>> Simon Riggs <simon@2ndquadrant.com> writes:
>>> What I'd like to do is add an ERRORTABLE clause to COPY. The main
>>> problem is how we detect a duplicate row violation, yet prevent it from
>>> aborting the transaction.
>>
>> If this only solves the problem of duplicate keys, and not any other
>> kind of COPY error, it's not going to be much of an advance.
>
> Yeah, and I see errors from bad data as often as from violating
> constraints. Maybe the best way if we do something like this would be to
> have the error table contain a single text, or maybe bytea, field which
> contained the raw offending input line.

I have committed the sin of omission again.

Duplicate row violation is the big challenge, but not the only function planned. Formatting errors occur much more frequently, so yes, we'd want to log all of that too. And yes, it would be done in the way you suggest.

Here's a fuller, but still brief sketch:

    COPY ... FROM ....
        [ERRORTABLES format1 [uniqueness1]
        [ERRORLIMIT percent]]

where format1 and uniqueness1 would be created anew by this command (or error if they already exist).

format1 would hold formatting errors, so it would be a blob table with cols (line number, col number, error number, full row string).

uniqueness1 would have the same definition as the target table, but with no indexes. This table would be optional, indicating that no uniqueness violation checks need to be carried out. If present and yet no unique indexes exist, then uniqueness1 would be ignored (and not created).

ERRORLIMIT percent would abort the COPY if more than that percentage of errors were found, after the first 1000 records (that limit could also be stated if required).

Without the ERRORTABLES clause, COPY would work exactly as it does now.

How does that sound?

Best Regards, Simon Riggs
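Spelled out against the made-up "orders" example, the sketch above might read as follows. None of this syntax exists; ERRORTABLES and ERRORLIMIT are the proposed clauses only.

    -- Hypothetical only: ERRORTABLES and ERRORLIMIT are proposed, not implemented.
    COPY orders FROM '/data/orders.dat'
        ERRORTABLES orders_fmt orders_uniq
        ERRORLIMIT 1;   -- abort if more than 1% of rows error out (after the first 1000)

    -- Per the sketch, COPY would create these two tables itself:
    --   orders_fmt  (lineno, colno, errnum, fullrowstring)  -- formatting errors
    --   orders_uniq (same columns as orders, no indexes)    -- unique-key rejects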
On Tue, 2005-11-22 at 10:00 +0800, Christopher Kings-Lynne wrote:
> Seems similar to the pgloader project on pgfoundry.org.

It is similar and good, but I regard that as a workaround rather than the way forward.

Best Regards, Simon Riggs
On Mon, 2005-11-21 at 19:05 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> Flow of control would be to:
>>
>> locate page of index where value should go
>> lock index block
>> _bt_check_unique, but don't error
>> if violation then insert row into ERRORTABLE
>> else
>> insert row into data block
>> insert row into unique index
>> unlock index block
>> do other indexes
>
> Ugh. Do you realize how many levels of modularity violation are implied
> by that sketch?

IMHO the above is fairly ugly, but I suggest it now because:
1. I want to avoid uniqueness violations in COPY
2. The logic used is very similar to that recently proposed for MERGE.

If anybody has a better idea for (1), shout it out now.

If the logic is OK for MERGE, then it should be OK for COPY with uniqueness violation trapping also. Both use uniqueness checking first, so you'd need to argue against both or neither.

> Have you even thought about the fact that we have more
> than one kind of index?

Yes, but they don't support unique indexes, do they?

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> I have committed the sin of omission again.
> Duplicate row violation is the big challenge, but not the only function
> planned. Formatting errors occur much more frequently, so yes we'd want
> to log all of that too. And yes, it would be done in the way you
> suggest.
>
> Here's a fuller, but still brief sketch:
>
> COPY ... FROM ....
> [ERRORTABLES format1 [uniqueness1]
> [ERRORLIMIT percent]]

This is getting worse, not better :-(

The general problem that needs to be solved is "trap any error that occurs during attempted insertion of a COPY row, and instead of aborting the copy, record the data and the error message someplace else". Seen in that light, implementing a special path for uniqueness violations is pretty pointless.

You could almost do this today in about five minutes with a PG_TRY construct. The hard part is to distinguish errors that COPY can safely trap from errors that must be allowed to abort the transaction anyway (usually because the backend won't be in a consistent state if it's not allowed to do post-abort cleanup). I think the latter class would mostly be "internal" errors, and so not trapping them shouldn't be a big problem for usefulness; but we can't simply ignore the possibility that they would occur during COPY.

    regards, tom lane
On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
> The general problem that needs to be solved is "trap any error that
> occurs during attempted insertion of a COPY row, and instead of aborting
> the copy, record the data and the error message someplace else". Seen
> in that light, implementing a special path for uniqueness violations is
> pretty pointless.

Actually, there are really only a few errors people want to trap, I imagine:

- CHECK constraints (all handled in ExecConstraints)
- Duplicate keys
- Foreign key violations (all handled by triggers)

Rather than worry about all the events we can't safely trap, how about we simply deal with the handful that are trappable. For example, we let people create an ON ERROR trigger and use the existing trigger interface. We have three possibilities:

- They return the same tuple: throw the error
- They return NULL: ignore the error, go to the next tuple
- They return a different tuple: retest the conditions

The trigger can then do anything a normal trigger can do, including copying to another table if people like that.

This doesn't seem like awfully hard work, does it? Initially at least, no TRY blocks needed...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
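A very rough sketch of the trigger-based idea, reusing the hypothetical copy_rejects table from earlier. The trigger function itself is ordinary plpgsql; the ON ERROR trigger event is only Martijn's proposal and does not exist, so that part is shown as a comment.

    -- Hypothetical: follows the normal trigger interface Martijn describes.
    -- Returning NULL would mean "ignore the error, skip this tuple";
    -- returning NEW unchanged would mean "raise the error as usual".
    CREATE OR REPLACE FUNCTION log_copy_error() RETURNS trigger AS $$
    BEGIN
        INSERT INTO copy_rejects (errmsg, rawline)
        VALUES ('constraint violation during COPY', NEW::text);
        RETURN NULL;   -- drop the offending tuple and carry on
    END;
    $$ LANGUAGE plpgsql;

    -- The trigger event is the proposed, non-existent part:
    -- CREATE TRIGGER trap_copy_errors ON ERROR TO orders
    --     FOR EACH ROW EXECUTE PROCEDURE log_copy_error();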
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
>> The general problem that needs to be solved is "trap any error that
>> occurs during attempted insertion of a COPY row, and instead of aborting
>> the copy, record the data and the error message someplace else".
>
> Actually, there are really only a few errors people want to trap I
> imagine:

You've forgotten bad data, eg "foo" in an integer field, or an untranslatable multibyte character. The bad-data problem is what lets out trigger-based solutions, or indeed anything that presumes that the bad data can be forced into a particular representation.

    regards, tom lane
On Tue, Nov 22, 2005 at 10:45:50AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> Actually, there are really only a few errors people want to trap I
>> imagine:
>
> You've forgotten bad data, eg "foo" in an integer field, or an
> untranslatable multibyte character. The bad-data problem is what lets
> out trigger-based solutions, or indeed anything that presumes that the
> bad data can be forced into a particular representation.

So don't pass the row in that case. The trigger still has the opportunity to return a null tuple to have the error ignored. I don't think it diminishes the benefits of the idea, being that a general trigger mechanism is way better than adding special exception blocks to INSERT and/or COPY to handle special cases.

I've looked around some other RDBMSs and they don't tell you in the exception handler the row that caused the error, so we're hardly behind the pack here.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
>>> The general problem that needs to be solved is "trap any error that
>>> occurs during attempted insertion of a COPY row, and instead of aborting
>>> the copy, record the data and the error message someplace else".
>
>> Actually, there are really only a few errors people want to trap I
>> imagine:
>
> You've forgotten bad data, eg "foo" in an integer field, or an
> untranslatable multibyte character. The bad-data problem is what lets
> out trigger-based solutions, or indeed anything that presumes that the
> bad data can be forced into a particular representation.

I think that's precisely the point here though. There are basically two categories of errors:

1) Data that can be parsed and loaded but generates some sort of constraint violation such as a UNIQUE violation, foreign key violation, or other constraint violation.

2) Data that can't be parsed as the correct data type at all.

It would be nice to be able to have the former loaded into an actual table where it can be queried and perhaps fixed and reloaded. The latter clearly cannot. I would say it should just generate a log entry.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> I think that's precisely the point here though. There are basically two
> categories of errors:
> 1) Data that can be parsed and loaded but generates some sort of constraint
> violation such as a UNIQUE violation, foreign key violation, or other
> constraint violation.
> 2) Data that can't be parsed as the correct data type at all.
>
> It would be nice to be able to have the former loaded into an actual table
> where it can be queried and perhaps fixed and reloaded.
> The latter clearly cannot.

Sure it can --- you just have to dump it as raw text (or perhaps bytea, as someone suggested upthread).

I think the distinction you are proposing between constraint errors and datatype errors is entirely artificial. Who's to say what is a constraint error and what is a datatype error, especially when you start thinking about cases like varchar length constraints or domain-type constraints? If we create a mechanism that behaves differently depending on whether the error is detected before or after we try to form a tuple containing the data, we're going to have something that is exceedingly awkward to use, because the behavior will be nearly arbitrary from the user's viewpoint.

    regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
>> It would be nice to be able to have the former loaded into an actual table
>> where it can be queried and perhaps fixed and reloaded.
>>
>> The latter clearly cannot.
>
> Sure it can --- you just have to dump it as raw text (or perhaps bytea,
> as someone suggested upthread).

I didn't just say "loaded into an actual table", I said "loaded into an actual table where it can be queried and perhaps fixed and reloaded".

From a practical point of view, having the data in the already parsed format is a whole lot more useful. You can then do a query to look up the record it conflicted with or look up possible foreign key values that would work instead of the failed reference. You can also insert it directly into the table instead of having to dump it out to a text file and load it with COPY again.

Actually I think it would be useful to be able to do this to constraints generally, not just during COPY. If I update or insert a record and it fails due to a constraint violation, it would be handy to be able to view the failed record. Perhaps what's really needed is something like CREATE TRIGGER AFTER CONSTRAINT VIOLATION which can then go ahead and insert the record into some other table if it feels like. COPY then would just need an option to proceed even after an error. Presumably only to be used if you're inserting into a clean ETL table, not directly into production tables.

> I think the distinction you are proposing between constraint errors
> and datatype errors is entirely artificial. Who's to say what is a
> constraint error and what is a datatype error, especially when you
> start thinking about cases like varchar length constraints or
> domain-type constraints? If we create a mechanism that behaves
> differently depending on whether the error is detected before or after
> we try to form a tuple containing the data, we're going to have
> something that is exceedingly awkward to use, because the behavior will
> be nearly arbitrary from the user's viewpoint.

Well sure, from a theoretical point of view. However from a practical point of view there's a whole lot more that can be done with the data once it's in a meaningful format. There's not much you can do with text other than stare at it (and you can't even necessarily do that with bytea).

--
greg
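The kind of follow-up query Greg has in mind, sketched against the made-up orders example from earlier and assuming the parseable rejects landed in a table with the same columns as the target (column names invented):

    -- Find which existing row each rejected row collided with...
    SELECT r.order_id,
           r.customer_id AS rejected_customer,
           o.customer_id AS existing_customer
    FROM   orders_uniq r
    JOIN   orders o USING (order_id);

    -- ...and, once fixed, feed the rejects straight back without another COPY:
    -- INSERT INTO orders SELECT * FROM orders_uniq WHERE <row has been corrected>;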
On Tue, Nov 22, 2005 at 12:16:00PM -0500, Tom Lane wrote:
> I think the distinction you are proposing between constraint errors
> and datatype errors is entirely artificial. Who's to say what is a
> constraint error and what is a datatype error, especially when you
> start thinking about cases like varchar length constraints or
> domain-type constraints? If we create a mechanism that behaves
> differently depending on whether the error is detected before or after
> we try to form a tuple containing the data, we're going to have
> something that is exceedingly awkward to use, because the behavior will
> be nearly arbitrary from the user's viewpoint.

By that reasoning, to be consistent, we should never pass any data at all, which seems even more useless. This is however what other databases do, but I think we can do better.

I don't think the distinction is really that arbitrary. If the data can be represented as a tuple in the correct datatypes, you're fine. Domain types are tricky, but to be consistent they should get the tuple data either way.

My main reasoning is that while you can write a 3-line Perl script to verify the format of all your integers, you can't write a script in any language to check for foreign key constraints or duplicate key errors. So those are the important actions. If it comes down to making datatype errors untrappable, then I think I can live with that.

My thinking was that if it's happening in the executor or parser, tough. That we only handle errors happening at the final insert/update/delete.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
>> Seems similar to the pgloader project on pgfoundry.org.
>
> It is similar and good, but I regard that as a workaround rather than
> the way forward.

Yes, your way would be rad :)
> Actually, there are really only a few errors people want to trap I
> imagine:
>
> - CHECK constraints (all handled in ExecConstraints)
> - Duplicate keys
> - Foreign key violations (all handled by triggers)
>
> Rather than worry about all the events we can't safely trap, how about
> we simply deal with the handful that are trappable. For example, we let
> people create an ON ERROR trigger and use the existing trigger
> interface. We have three possibilities:

Trap as many as we can, and in the 'rejects' table have an 'sqlstate' field that has the SQLSTATE code generated by the failure. That way you can trivially look for all the ones that failed for whatever reason you like.

Chris
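With such a field, the triage Chris describes comes down to a couple of queries. The table and column names are still the hypothetical ones from earlier; the codes themselves are the standard SQLSTATE values.

    -- How many rejects of each kind did the load produce?
    SELECT sqlstate, count(*)
    FROM   copy_rejects
    GROUP  BY sqlstate;

    -- Pull out just the failures you care about:
    SELECT * FROM copy_rejects WHERE sqlstate = '23505';  -- unique_violation
    SELECT * FROM copy_rejects WHERE sqlstate = '23503';  -- foreign_key_violation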