Copy From & Insert UNLESS - Mailing list pgsql-hackers

From James William Pye
Subject Copy From & Insert UNLESS
Date
Msg-id 20060204013938.GA96565@lit.jwp.name
Whole thread Raw
Responses Re: Copy From & Insert UNLESS
Re: Copy From & Insert UNLESS
Re: Copy From & Insert UNLESS
List pgsql-hackers
Greets folks, [YABLP: Yet Another Bulk Loading Proposal]

The subject of this letter is referring to giving INSERT and COPY FROM STDIN
the ability to alter the destination of rows that violate any constraints named
in a user specified set.

I am seeking, as many others are or have, to improve the performance on bulk
loads to live systems where constraint violations may occur and filtering can
be done more efficiently within the backend.

Primarily, I'm concerned with UNIQUE violations. However, I think tackling the
general case is the wiser action.

The attached patch is *not* being included for any sort of application, and I
make no claims of it functioning as I intended it to or as I may imply it to. =)
The patch only served the purpose of providing rough numbers for the case of
unique violations.

Despite the fact that my experimental patch uses error trapping, that is *not*
what I have in mind for the implementation. I do not want to trap errors upon
insert or copy from. Rather, I wish to implement functionality that would allow
alternate destinations for tuples that violate user specified constraints on
the table, which, by default, will be to simply drop the tuple.

My proposed syntax is along the lines of:

   INSERT INTO table [ ( column [, ...] ) ]
*   [UNLESS CONSTRAINT VIOLATION
     [ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
    { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) | query }

 and

   COPY tablename [ ( column [, ...] ) ]
    FROM { 'filename' | STDIN }
*   [UNLESS CONSTRAINT VIOLATION
     [ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
   ...

The purpose of the arguably unintuitive THEN INSERT INTO x portion is to provide
the mechanism in which a user can specify the destination table for tuples that
violated the associated set of constraints. Using the OR portion allows the user
to specify additional sets of constraints for different destinations.

A tuple will be withheld from the target table if ANY of the constraints
listed in any of the constraint_name sets is violated. Constraint sets should
not [may not?] reference the same constraint multiple times, even among
different sets.

Example:

 \d dest_table
   Table "public.dest_table"
  Column |  Type   | Modifiers
 --------+---------+-----------
  i      | integer | not null
  j      | integer |
 Indexes:
     "dest_table_pkey" PRIMARY KEY, btree (i)
 Check constraints:
     "dest_table_j_check" CHECK (j > 0)

 CREATE TEMP TABLE pkey_failures (i int, j int);
 CREATE TEMP TABLE check_failures (i int, j int);

 COPY dest_table FROM STDIN
  UNLESS CONSTRAINT VIOLATION
   ON (dest_table_pkey) THEN INSERT INTO pkey_failures
   OR (dest_table_j_check) THEN INSERT INTO check_failures;

For most constraints, this proposed implementation should be fairly easy to
implement. However, the B-Tree index has the uniqueness check within its
insert access method, _bt_check_unique. Perhaps the best solution here is to
expose this check function--with changes, of course--and define a new access
method entry, 'amcheck' or, perhaps, 'amscanforinsert' where state information
would be given back to the caller for later use in the actual insert. (Perhaps
evident, but I'm not as familiar with the index code as I would like to be for
this sort of speculation, so please excuse me if I am not making good sense.)

There is one other annoying change. The constraints specified in an INSERT
UNLESS should be checked before all other unspecified constraints. This is to
elegantly handle the insertion case where two violations can occur, one with a
constraint that the user specified, and one that the user didn't. Regardless of
the order in which constraints are collected for checking, the user specified
ones should be checked first to avoid unwelcome errors from being thrown when
the tuple was going to be tossed anyways.

This proposal purposefully does not discuss bad data errors as I think that
should be seen as a separate issue. Perhaps a future feature within the UNLESS
syntax.


Prior Discussions or Mentions [See the last two.]

implicit abort harmful?
http://archives.postgresql.org/pgsql-general/2003-05/msg00962.php
how to continue a transaction after an error?
http://archives.postgresql.org/pgsql-sql/2000-11/msg00097.php
mass import to table with unique index
http://archives.postgresql.org/pgsql-general/2003-01/msg01465.php
Duplicate key insert question
http://archives.postgresql.org/pgsql-general/2003-07/msg00056.php

Ignore when using COPY FROM (Matthew Kennedy) (small, complaint/fr)
http://archives.postgresql.org/pgsql-general/2000-08/msg00681.php
COPY and duplicates (Ryan Mahoney) (small, complaint/fr)
http://archives.postgresql.org/pgsql-general/2001-07/msg00569.php

Bulk loading using COPY - ignore duplicates? (Lee Kindness?)
http://archives.postgresql.org/pgsql-hackers/2002-01/msg00029.php
Practical error logging for very large COPY statements (Simon Riggs)
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01100.php


Extant Solutions

There are quite a few solutions to this problem as I'm sure many (all?) know:

. Temporary table that filters out the evil tuples.
. BEFORE TRIGGER handling the tuple if the constraint of interest is violated.
. INSERT wrapped in a subtransaction.
. (Other variations)

Temporary tables are probably the fastest here. However, it still exhibits
redundancy, and requires post-load tuple movement(extra step).

Savepoints require client side logic in order to detect the appropriate
error code to trap or raise. (Also, this seems to be quite slow, regardless.)

A before trigger is going to require that interested constraints be tested
twice and for code to be effectively duplicated.

So, there are quite a few ways to do error controlled bulk loading. Temporary
tables appear to be the best current solution. However, I think the
implementation described in this proposal will yield improvements by simply
reducing redundancy.


Simple Numbers [Using the attached, *very* experimental patch]:

(PostgreSQL 8.2 [~HEAD], redhat 9)

These numbers were rendered from a simple single client trial where psql
and the postmaster were running on the same machine. I ran each trial a
few times and I would cleanup and VACUUM FULL between runs. The numbers do
not include the cleanup or the vacuum.

The "Insert unless" trials are drawn from my attached patch.
(For the INSERT UNLESS cases I used the patch, otherwise a clean PG.)

CREATE TABLE foo (i int PRIMARY KEY);

[Straight SQL INSERTs (via psql and an SQL file)]
    (100000 Violations)
        Auto-Commit Transactions: 22.213
        Savepoints: (ran out of shared memory)
        Insert into temporary table, Merge: 24.225
*        Insert Unless: 14.668

    (50000 Violations, 50000 New)
        Auto-Commit Transactions: 33.342
        Savepoints: (untested)
        Insert into temporary table, Merge: 24.243
*        Insert Unless: 14.260

    (100000 New)
        Auto-Commit Transactions: 47.990
        Savepoints: 3:05.60 (three minutes!!)
        Temporary table: 26.178
*        Insert Unless: 14.283

The numbers here look pretty good, especially for such a hackish patch.
[btw, I hope I screwed up somehow on the savepoints.]

However:

[COPY FROM STDIN (via psql and a file of data)]
    (100000 Violations)
        Copy Unless: 2.4132
        Copy to temporary, Merge: 0.72675

    (50000 Conflicts, 50000 New)
        Copy Unless: 2.1145
        Copy to temporary, Merge: 1.469

    (100000 New)
        Copy Unless: 1.6386
        Copy to temporary, Merge: 2.4305

The numbers here don't look so good now. :(
However, I'm convinced that this is showing the inefficiencies in my current
hack, rather than dissolving the likelihood of the discussed implementation being
an improvement.

(The files that rendered these results are available on request. They are 100K a
piece after being bzip'd)
--
Regards, James William Pye

Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [PORTS] Failed install - libgen.so doesn't exist
Next
From: Bruce Momjian
Date:
Subject: Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and