Practical error logging for very large COPY statements - Mailing list pgsql-hackers

From Simon Riggs
Subject Practical error logging for very large COPY statements
Date
Msg-id 1132612879.4959.474.camel@localhost.localdomain
List pgsql-hackers
If you've ever loaded 100 million rows, you'll know just how annoying it
is to find that you have a duplicate row somewhere in there. Experience
shows that there is always one, whatever oath the analyst swears
beforehand.

It's hard to find out which row is the duplicate, plus you've just
screwed up a big table. It needs a VACUUM, then a reload.

I'd like to find a way to handle this manual task programmatically. 

What I'd like to do is add an ERRORTABLE clause to COPY. The main
problem is how we detect a duplicate row violation, yet prevent it from
aborting the transaction.

What I propose is to log uniqueness violations only when there is a
single unique index on the table.
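
As a rough sketch only (the clause name, its placement, and the error
table name are just for illustration, nothing is settled), usage might
look something like this:

    -- Hypothetical illustration of the proposed clause; not implemented.
    -- (load_errors is sketched further below.)
    CREATE TABLE bigtable (
        id      bigint PRIMARY KEY,    -- the single unique index
        payload text
    );

    COPY bigtable FROM '/data/bigtable.dat'
        WITH ERRORTABLE load_errors;   -- duplicate-key rows land here
                                       -- instead of aborting the load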

Flow of control would be to:

    locate page of index where value should go
    lock index block
    _bt_check_unique, but don't error
    if violation then
        insert row into ERRORTABLE
    else
        insert row into data block
        insert row into unique index
    unlock index block
    do other indexes
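
For concreteness, the error table itself could just be an ordinary
table; the proposal doesn't define its shape, so the columns below are
purely an assumption for illustration:

    -- Hypothetical shape for the ERRORTABLE target; column names and
    -- types are assumptions, not part of the proposal.
    CREATE TABLE load_errors (
        relname    name,                        -- table being loaded
        indexname  name,                        -- unique index violated
        logged_at  timestamptz DEFAULT now(),   -- when the row was rejected
        rawline    text                         -- the rejected input line
    );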
 

The flow of control above is very similar to the recently proposed MERGE logic.

With that logic, a COPY will run to completion, yet still be able to
report the odd couple of unique index violations found along the way.
More importantly, we can then handle those rows with another program to
locate where the errors came from and resolve them.
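
Continuing the hypothetical load_errors sketch above, that follow-up
could be little more than a query that pulls the rejected lines back
out and matches them against what did get loaded, for example:

    -- Hypothetical clean-up query; assumes tab-delimited input and the
    -- load_errors shape sketched above.
    SELECT e.rawline, b.*
    FROM load_errors AS e
    JOIN bigtable AS b
      ON b.id = split_part(e.rawline, E'\t', 1)::bigint
    WHERE e.relname = 'bigtable';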



In most cases with a single unique index, the index inserts are
rightmost index entries anyway, so there is scope here for an additional
optimisation: keep both the index and data blocks locked across multiple
row inserts until either the unique index block or the data block fills.
That's better than taking a full table lock, since it allows concurrent
access to the rest of the table, but it's also more efficient than
continually re-requesting the same blocks (which looks like about a
10-15% performance saving from hash lookups, lock/unlock, etc.).

Best Regards, Simon Riggs


