Practical error logging for very large COPY statements - Mailing list pgsql-hackers

From Simon Riggs
Subject Practical error logging for very large COPY statements
Date
Msg-id 1132612879.4959.474.camel@localhost.localdomain
Responses Re: Practical error logging for very large COPY statements  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Practical error logging for very large COPY statements  (Christopher Kings-Lynne <chriskl@familyhealth.com.au>)
List pgsql-hackers
If you've ever loaded 100 million rows, you'll know just how annoying it
is to find that you have a duplicate row somewhere in there. Experience
shows that there is always one, whatever oath the analyst swears
beforehand.

It's hard to find out which row is the duplicate, plus you've just
screwed up a big table. It needs a VACUUM, then a reload.

I'd like to find a way to handle this manual task programmatically. 

What I'd like to do is add an ERRORTABLE clause to COPY. The main
problem is how we detect a duplicate row violation, yet prevent it from
aborting the transaction.
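
For illustration, the syntax might look something like the sketch below
(the ERRORTABLE keyword and the error-table columns are just one way it
could look, not a settled design):

   -- purely illustrative: a table to collect rejected rows
   CREATE TABLE copy_errors (
       relname    text,    -- table being loaded
       errmessage text,    -- why the row was rejected
       rawline    text     -- the offending input line, verbatim
   );

   -- hypothetical clause: rows that violate the unique index are
   -- diverted into copy_errors instead of aborting the COPY
   COPY bigtable FROM '/data/bigtable.dat' ERRORTABLE copy_errors;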

What I propose is to log uniqueness violations only when there is only a
single unique index on a table.

Flow of control would be to:
   locate page of index where value should go
   lock index block
   _bt_check_unique, but don't error
   if violation then
       insert row into ERRORTABLE
   else
       insert row into data block
       insert row into unique index
   unlock index block
   do other indexes
 

This is very similar to the logic recently proposed for MERGE.

With that logic, a COPY will run to completion, yet be able to report
the odd couple of unique index violations it found along the way. More
importantly, we can then handle those rows with another program to locate
where those errors came from and resolve them.
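
For example (assuming the illustrative copy_errors table sketched above),
the follow-up could be as simple as:

   -- see which rows were rejected and why
   SELECT errmessage, rawline
     FROM copy_errors
    WHERE relname = 'bigtable';

   -- after fixing and re-inserting the offending rows,
   -- clear the error table for the next load
   DELETE FROM copy_errors WHERE relname = 'bigtable';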



In most cases with a single unique index, the index inserts are
rightmost index entries anyway, so there is scope here for an additional
optimisation: keep both index and data blocks locked across multiple row
inserts until either the unique index or the data block fills. That's
better than taking a full table lock, since it allows concurrent access
to the rest of the table, but it's also more efficient than continually
re-requesting the same blocks (which looks like about a 10-15% saving in
performance from hash lookups, lock/unlock, etc.).

Best Regards, Simon Riggs


