Thread: Practical error logging for very large COPY statements

Practical error logging for very large COPY statements

From
Simon Riggs
Date:
If you've ever loaded 100 million rows, you'll know just how annoying it
is to find that you have a duplicate row somewhere in there. Experience
shows that there is always one, whatever oath the analyst swears
beforehand.

It's hard to find out which row is the duplicate, plus you've just
screwed up a big table. It needs a VACUUM, then a reload.
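
Today the manual hunt looks roughly like this: reload into a staging
copy that has no unique index (so the COPY itself succeeds), then go
digging. staging_table and id below are hypothetical names:

    -- staging_table is a copy of the target, created without the unique
    -- index; "id" stands in for whatever column was supposed to be unique.
    SELECT id, count(*) AS copies
    FROM staging_table
    GROUP BY id
    HAVING count(*) > 1
    ORDER BY copies DESC;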

I'd like to find a way to handle this manual task programmatically. 

What I'd like to do is add an ERRORTABLE clause to COPY. The main
problem is how we detect a duplicate row violation, yet prevent it from
aborting the transaction.

What I propose is to log uniqueness violations only when there is only a
single unique index on a table.

Flow of control would be to:
    locate page of index where value should go
    lock index block
    _bt_check_unique, but don't error
        if violation then insert row into ERRORTABLE
    else
        insert row into data block
        insert row into unique index
        unlock index block
        do other indexes
 

Which is very similar code to the recently proposed MERGE logic.

With that logic, a COPY will run to completion, yet be able to report
the odd couple of unique index violations it found along the way. More
importantly, we can then handle those rows with another program to locate
where the errors came from and resolve them.



In most cases with a single unique index, the index inserts are
rightmost index entries anyway, so there is scope here for an additional
optimisation: keep both index and data blocks locked across multiple row
inserts until either the unique index or the data block fills. That's
better than taking a full table lock, since it allows concurrent access
to the rest of the table, but it's also more efficient than continually
re-requesting the same blocks (which looks like about 10-15% saving on
performance from hash lookups, lock/unlock, etc).

Best Regards, Simon Riggs



Re: Practical error logging for very large COPY statements

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.

If this only solves the problem of duplicate keys, and not any other
kind of COPY error, it's not going to be much of an advance.

> Flow of control would be to:

>     locate page of index where value should go
>     lock index block
>     _bt_check_unique, but don't error
>         if violation then insert row into ERRORTABLE
>     else
>         insert row into data block
>         insert row into unique index
>         unlock index block
>         do other indexes

Ugh.  Do you realize how many levels of modularity violation are implied
by that sketch?  Have you even thought about the fact that we have more
than one kind of index?
        regards, tom lane


Re: Practical error logging for very large COPY statements

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Simon Riggs <simon@2ndquadrant.com> writes:
>
>>What I'd like to do is add an ERRORTABLE clause to COPY. The main
>>problem is how we detect a duplicate row violation, yet prevent it from
>>aborting the transaction.
>
>If this only solves the problem of duplicate keys, and not any other
>kind of COPY error, it's not going to be much of an advance.

Yeah, and I see errors from bad data as often as from violating 
constraints. Maybe the best way if we do something like this would be to 
have the error table contain a single text, or maybe bytea, field which 
contained the raw offending input line.
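
Something along these lines, say; the column names are just illustrative:

    -- Hypothetical shape for such an error table: one row per rejected
    -- input line, stored verbatim so nothing has to be parseable.
    CREATE TABLE copy_errors (
        lineno   bigint,   -- position of the line in the COPY input
        message  text,     -- the error message that was raised
        rawline  bytea     -- the offending input line, unparsed
    );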

cheers

andrew



Re: Practical error logging for very large COPY statements

From
Christopher Kings-Lynne
Date:
Seems similar to the pgloader project on pgfoundry.org.

Chris

Simon Riggs wrote:
> If you've ever loaded 100 million rows, you'll know just how annoying it
> is to find that you have a duplicate row somewhere in there. Experience
> shows that there is always one, whatever oath the analyst swears
> beforehand.
> 
> It's hard to find out which row is the duplicate, plus you've just
> screwed up a big table. It needs a VACUUM, then a reload.
> 
> I'd like to find a way to handle this manual task programmatically. 
> 
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.
> 
> What I propose is to log uniqueness violations only when there is only a
> single unique index on a table.
> 
> Flow of control would be to:
> 
>     locate page of index where value should go
>     lock index block
>     _bt_check_unique, but don't error
>         if violation then insert row into ERRORTABLE
>     else
>         insert row into data block
>         insert row into unique index
>         unlock index block
>         do other indexes
> 
> Which is very similar code to the recently proposed MERGE logic.
> 
> With that logic, a COPY will run to completion, yet be able to report
> the odd couple of unique index violations it found along the way. More
> importantly, we can then handle those rows with another program to locate
> where the errors came from and resolve them.
> 
> 
> 
> In most cases with a single unique index, the index inserts are
> rightmost index entries anyway, so there is scope here for an additional
> optimisation: keep both index and data blocks locked across multiple row
> inserts until either the unique index or the data block fills. That's
> better than taking a full table lock, since it allows concurrent access
> to the rest of the table, but it's also more efficient than continually
> re-requesting the same blocks (which looks like about 10-15% saving on
> performance from hash lookups, lock/unlock, etc).
> 
> Best Regards, Simon Riggs
> 
> 



Re: Practical error logging for very large COPY

From
Simon Riggs
Date:
On Mon, 2005-11-21 at 19:38 -0500, Andrew Dunstan wrote:
> Tom Lane wrote:
> 
> >Simon Riggs <simon@2ndquadrant.com> writes:
> >
> >>What I'd like to do is add an ERRORTABLE clause to COPY. The main
> >>problem is how we detect a duplicate row violation, yet prevent it from
> >>aborting the transaction.
> >
> >If this only solves the problem of duplicate keys, and not any other
> >kind of COPY error, it's not going to be much of an advance.

> Yeah, and I see errors from bad data as often as from violating 
> constraints. Maybe the best way if we do something like this would be to 
> have the error table contain a single text, or maybe bytea, field which 
> contained the raw offending input line.

I have committed the sin of omission again.

Duplicate row violation is the big challenge, but not the only function
planned. Formatting errors occur much more frequently, so yes we'd want
to log all of that too. And yes, it would be done in the way you
suggest.

Here's a fuller, but still brief sketch:

COPY ... FROM .... [ERRORTABLES format1 [uniqueness1] [ERRORLIMIT percent]]

where Format1 and Uniqueness1 would be created anew by this command (or
error if they already exist)

Format1 would hold formatting errors, so it would be a blob table with
cols (line number, col number, error number, fullrowstring)

Uniqueness1 would have the same definition as the target table, but with
no indexes. This table would be optional; omitting it would indicate that
no uniqueness violation checks need to be carried out. If present and yet
no unique indexes exist, then Uniqueness1 would be ignored (and not
created).

ERRORLIMIT percent would abort the COPY if more than percent errors were
found, after the first 1000 records (that limit could also be stated if
required).
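
To make that concrete, a hypothetical invocation and the tables it would
create might look roughly like this (all names are examples only):

    -- Proposed syntax, illustrated with example names:
    --   COPY mytable FROM '/data/load.dat'
    --       ERRORTABLES load_fmt load_uniq ERRORLIMIT 1;

    -- load_fmt would be created by the COPY itself, roughly:
    CREATE TABLE load_fmt (
        lineno        bigint,    -- line number in the input
        colno         integer,   -- column at which parsing failed
        errno         integer,   -- error number
        fullrowstring text       -- the raw input row
    );

    -- load_uniq would simply mirror the target table's columns, with no
    -- indexes, and receive rows rejected by the unique index check:
    CREATE TABLE load_uniq (LIKE mytable);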

Without the ERRORTABLES clause, COPY would work exactly as it does now.

How does that sound?

Best Regards, Simon Riggs



Re: Practical error logging for very large COPY

From
Simon Riggs
Date:
On Tue, 2005-11-22 at 10:00 +0800, Christopher Kings-Lynne wrote:
> Seems similar to the pgloader project on pgfoundry.org.

It is similar and good, but I regard that as a workaround rather than
the way forward.

Best Regards, Simon Riggs




Re: Practical error logging for very large COPY

From
Simon Riggs
Date:
On Mon, 2005-11-21 at 19:05 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Flow of control would be to:
> 
> >     locate page of index where value should go
> >     lock index block
> >     _bt_check_unique, but don't error
> >         if violation then insert row into ERRORTABLE
> >     else
> >         insert row into data block
> >         insert row into unique index
> >         unlock index block
> >         do other indexes
> 
> Ugh.  Do you realize how many levels of modularity violation are implied
> by that sketch?  

IMHO the above is fairly ugly, but I suggest it now because:
1. I want to avoid uniqueness violations in COPY
2. The logic used is very similar to that recently proposed for MERGE. 

If anybody has a better idea for (1), shout it out now.

If the logic is OK for MERGE, then it should be OK for COPY with
uniqueness violation trapping also. Both use uniqueness checking first,
so you'd need to argue against both or neither.

> Have you even thought about the fact that we have more
> than one kind of index?

Yes, but they don't support unique indexes, do they?

Best Regards, Simon Riggs



Re: Practical error logging for very large COPY

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I have committed the sin of omission again.

> Duplicate row violation is the big challenge, but not the only function
> planned. Formatting errors occur much more frequently, so yes we'd want
> to log all of that too. And yes, it would be done in the way you
> suggest.

> Here's a fuller, but still brief sketch:

> COPY ... FROM ....
>     [ERRORTABLES format1 [uniqueness1]
>      [ERRORLIMIT percent]]


This is getting worse, not better :-(

The general problem that needs to be solved is "trap any error that
occurs during attempted insertion of a COPY row, and instead of aborting
the copy, record the data and the error message someplace else".  Seen
in that light, implementing a special path for uniqueness violations is
pretty pointless.

You could almost do this today in about five minutes with a PG_TRY
construct.  The hard part is to distinguish errors that COPY can safely
trap from errors that must be allowed to abort the transaction anyway
(usually because the backend won't be in a consistent state if it's not
allowed to do post-abort cleanup).  I think the latter class would
mostly be "internal" errors, and so not trapping them shouldn't be a big
problem for usefulness; but we can't simply ignore the possibility that
they would occur during COPY.
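
At the SQL level, the behaviour being described (trap each row's error
and record the row plus the message) looks roughly like the following
sketch. It uses a per-row subtransaction rather than the PG_TRY path,
and staging, target and load_errors are hypothetical names:

    -- Sketch only: each row is attempted inside its own subtransaction;
    -- on failure the raw row and the error message are recorded instead.
    CREATE FUNCTION load_with_errors() RETURNS void AS $$
    DECLARE
        r record;
    BEGIN
        FOR r IN SELECT * FROM staging LOOP
            BEGIN
                INSERT INTO target (id, payload) VALUES (r.id, r.payload);
            EXCEPTION WHEN OTHERS THEN
                INSERT INTO load_errors (message, rawrow)
                VALUES (SQLERRM, r::text);
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;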
        regards, tom lane


Re: Practical error logging for very large COPY

From
Martijn van Oosterhout
Date:
On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
> The general problem that needs to be solved is "trap any error that
> occurs during attempted insertion of a COPY row, and instead of aborting
> the copy, record the data and the error message someplace else".  Seen
> in that light, implementing a special path for uniqueness violations is
> pretty pointless.

Actually, there are really only a few errors people want to trap I
imagine:

- CHECK constraints (all handled in ExecConstraints)
- Duplicate keys
- Foreign key violations (all handled by triggers)

Rather than worry about all the events we can't safely trap, how about
we simply deal with the handful that are trappable? For example, we let
people create an ON ERROR trigger and use the existing trigger
interface. We have three possibilities:

- They return the same tuple, throw the error
- They return NULL, ignore error, goto next tuple
- They return a different tuple, retest the conditions

The trigger can then do anything a normal trigger can do, including
copying to another table if people like that.
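
As a rough sketch, under those semantics such a handler might look like
this; the ON ERROR syntax does not exist today, and rejected_rows and
mytable are made-up names:

    -- Sketch only: the function would receive the failed tuple in NEW.
    CREATE FUNCTION trap_copy_error() RETURNS trigger AS $$
    BEGIN
        -- possibility 2: siphon the bad row into a side table and
        -- return NULL so the error is ignored and we go to the next tuple
        INSERT INTO rejected_rows SELECT NEW.*;
        RETURN NULL;
        -- possibility 1 would be RETURN NEW (throw the error after all);
        -- possibility 3 would be returning a repaired tuple to retest.
    END;
    $$ LANGUAGE plpgsql;

    -- The trigger itself would need new syntax, something like:
    --   CREATE TRIGGER copy_rejects ON ERROR ON mytable
    --       FOR EACH ROW EXECUTE PROCEDURE trap_copy_error();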

This doesn't seem like awfully hard work, does it? Initially at least,
no TRY blocks needed...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Practical error logging for very large COPY

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
>> The general problem that needs to be solved is "trap any error that
>> occurs during attempted insertion of a COPY row, and instead of aborting
>> the copy, record the data and the error message someplace else".

> Actually, there are really only a few errors people want to trap I
> imagine:

You've forgotten bad data, eg "foo" in an integer field, or an
untranslatable multibyte character.  The bad-data problem is what lets
out trigger-based solutions, or indeed anything that presumes that the
bad data can be forced into a particular representation.
        regards, tom lane


Re: Practical error logging for very large COPY

From
Martijn van Oosterhout
Date:
On Tue, Nov 22, 2005 at 10:45:50AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Actually, there are really only a few errors people want to trap I
> > imagine:
>
> You've forgotten bad data, eg "foo" in an integer field, or an
> untranslatable multibyte character.  The bad-data problem is what lets
> out trigger-based solutions, or indeed anything that presumes that the
> bad data can be forced into a particular representation.

So don't pass the row in that case. The trigger still has the
opportunity to return a null tuple to have the error ignored. I don't
think it diminishes the benefits of the idea, given that a general
trigger mechanism is way better than adding special exception blocks to
INSERT and/or COPY to handle special cases.

I've looked at some other RDBMSs and they don't tell you in the
exception handler the row that caused the error, so we're hardly behind
the pack here.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Practical error logging for very large COPY

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Martijn van Oosterhout <kleptog@svana.org> writes:
> > On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
> >> The general problem that needs to be solved is "trap any error that
> >> occurs during attempted insertion of a COPY row, and instead of aborting
> >> the copy, record the data and the error message someplace else".
> 
> > Actually, there are really only a few errors people want to trap I
> > imagine:
> 
> You've forgotten bad data, eg "foo" in an integer field, or an
> untranslatable multibyte character.  The bad-data problem is what lets
> out trigger-based solutions, or indeed anything that presumes that the
> bad data can be forced into a particular representation.

I think that's precisely the point here though. There are basically two
categories of errors:

1) Data that can be parsed and loaded but generates some sort of constraint
   violation such as a UNIQUE violation, foreign key violation, or other
   constraint violation.

2) Data that can't be parsed as the correct data type at all.

It would be nice to be able to have the former loaded into an actual table
where it can be queried and perhaps fixed and reloaded.

The latter clearly cannot. I would say it should just generate a log entry.


-- 
greg



Re: Practical error logging for very large COPY

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I think that's precisely the point here though. There are basically two
> categories of errors:

> 1) Data that can be parsed and loaded but generates some sort of constraint
>    violation such as a UNIQUE violation, foreign key violation, or other
>    constraint violation.

> 2) Data that can't be parsed as the correct data type at all.

> It would be nice to be able to have the former loaded into an actual table
> where it can be queried and perhaps fixed and reloaded.

> The latter clearly cannot.

Sure it can --- you just have to dump it as raw text (or perhaps bytea,
as someone suggested upthread).

I think the distinction you are proposing between constraint errors
and datatype errors is entirely artificial.  Who's to say what is a
constraint error and what is a datatype error, especially when you
start thinking about cases like varchar length constraints or
domain-type constraints?  If we create a mechanism that behaves
differently depending on whether the error is detected before or after
we try to form a tuple containing the data, we're going to have
something that is exceedingly awkward to use, because the behavior will
be nearly arbitrary from the user's viewpoint.
        regards, tom lane


Re: Practical error logging for very large COPY

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > It would be nice to be able to have the former loaded into an actual table
> > where it can be queried and perhaps fixed and reloaded.
> 
> > The latter clearly cannot.
> 
> Sure it can --- you just have to dump it as raw text (or perhaps bytea,
> as someone suggested upthread).

I didn't just say "loaded into an actual table" I said "loaded into an actual
table where it can be queried and perhaps fixed and reloaded". From a
practical point of view having the data in the already parsed format is a
whole lot more useful. You can then do a query to look up the record it
conflicted with or look up possible foreign key values that would work instead
of the failed reference. You can also insert it directly into the table
instead of having to dump it out to a text file and load it with COPY again.
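
For example, with a typed rejects table and hypothetical orders and
customers tables:

    -- Which existing row did each rejected row collide with?
    SELECT r.id, o.*
    FROM rejects r
    JOIN orders o USING (id);

    -- Which referenced customer ids don't exist (the FK failures)?
    SELECT DISTINCT r.customer_id
    FROM rejects r
    LEFT JOIN customers c ON c.id = r.customer_id
    WHERE c.id IS NULL;

    -- After fixing a row, push it straight back in:
    INSERT INTO orders SELECT * FROM rejects WHERE id = 42;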

Actually I think it would be useful to be able to do this to constraints
generally, not just during COPY. If I update or insert a record and it fails
due to a constraint violation it would be handy to be able to view the failed
record.

Perhaps what's really needed is something like CREATE TRIGGER AFTER CONSTRAINT
VIOLATION which can then go ahead and insert the record into some other table
if it feels like.

COPY then would just need an option to proceed even after an error. Presumably
only to be used if you're inserting into a clean ETL table, not directly into
production tables.

> I think the distinction you are proposing between constraint errors
> and datatype errors is entirely artificial.  Who's to say what is a
> constraint error and what is a datatype error, especially when you
> start thinking about cases like varchar length constraints or
> domain-type constraints?  If we create a mechanism that behaves
> differently depending on whether the error is detected before or after
> we try to form a tuple containing the data, we're going to have
> something that is exceedingly awkward to use, because the behavior will
> be nearly arbitrary from the user's viewpoint.

Well sure from a theoretical point of view. However from a practical point of
view there's a whole lot more that can be done with the data once it's in a
meaningful format. There's not much you can do with text other than stare at
it (and you can't even necessarily do that with bytea).

-- 
greg



Re: Practical error logging for very large COPY

From
Martijn van Oosterhout
Date:
On Tue, Nov 22, 2005 at 12:16:00PM -0500, Tom Lane wrote:
> I think the distinction you are proposing between constraint errors
> and datatype errors is entirely artificial.  Who's to say what is a
> constraint error and what is a datatype error, especially when you
> start thinking about cases like varchar length constraints or
> domain-type constraints?  If we create a mechanism that behaves
> differently depending on whether the error is detected before or after
> we try to form a tuple containing the data, we're going to have
> something that is exceedingly awkward to use, because the behavior will
> be nearly arbitrary from the user's viewpoint.

By that reasoning, to be consistent, we should never pass any data at
all, which seems even more useless. This is, however, what other
databases do, but I think we can do better.

I don't think the distinction is really that arbitrary. If the data can
be represented as a tuple in the correct datatypes, you're fine. Domain
types are tricky, but to be consistent they should get the tuple data
too.

My main reasoning is that while you can write a 3 line Perl script to
verify the format of all your integers, you can't write a script in any
language to check for foreign key constraints or duplicate key errors.
So those are the important actions. If it comes down to making datatype
errors untrappable then I think I can live with that.

My thinking was that if it happens in the executor or parser, tough;
we only handle errors happening at the final insert/update/delete.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Practical error logging for very large COPY statements

From
Christopher Kings-Lynne
Date:
>>Seems similar to the pgloader project on pgfoundry.org.
> 
> 
> It is similar and good, but I regard that as a workaround rather than
> the way forward.

Yes, your way would be rad :)



Re: Practical error logging for very large COPY

From
Christopher Kings-Lynne
Date:
> Actually, there are really only a few errors people want to trap I
> imagine:
> 
> - CHECK constraints (all handled in ExecConstraints)
> - Duplicate keys
> - Foreign key violations (all handled by triggers)
> 
> Rather than worry about all the events we can't safely trap, how about
> we simply deal with the handful that are trappable. For example, we let
> people create an ON ERROR trigger and use the existing trigger
> interface. We have three possibilities:

Trap as many as we can and in the 'rejects' table have an 'sqlstate' 
field that has the SQLSTATE code generated by the failure.  That way you 
can trivially look for all the ones that failed for whatever reason you 
like.
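
For example, against a hypothetical copy_rejects table:

    -- 23505 = unique_violation, 23503 = foreign_key_violation,
    -- 22P02 = invalid_text_representation (bad input syntax)
    SELECT sqlstate, count(*)
    FROM copy_rejects
    GROUP BY sqlstate;

    SELECT *
    FROM copy_rejects
    WHERE sqlstate = '23505';   -- just the duplicate-key failures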

Chris