Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date
Msg-id CAM3SWZRzWH5gbAVoDHcCovf75Pjz9GsOhDOZBA1KNM4S2jgcSQ@mail.gmail.com
Whole thread Raw
Responses Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
List pgsql-hackers
Someone suggested to me that I solicit opinion on the chosen syntax of
INSERT...ON DUPLICATE KEY LOCK FOR UPDATE on a separate thread. I'll
do that here, while also giving a formal user-level definition of the
feature. I'd like to solicit feedback from a wider set of people than
those participating in the main thread, while avoiding talking about
arcane details around locking which have dominated discussions up
until this point.

The patch proposed adds an optional ON DUPLICATE KEY LOCK FOR UPDATE
clause. It also extends the RETURNING clause to support projecting
"REJECTS".

Idiomatic wCTE usage, through which it's possible to build an
"UPSERT", might look like:

CREATE TABLE foo
( a int4 PRIMARY KEY, b int4, c text
);

WITH rej AS (
INSERT INTO foo (a, b, c)
-- Multiple rows proposed for insertion - may insert or update each
VALUES (44, 1078, 'insert'), (55, 1088, 'insert')
ON DUPLICATE KEY LOCK FOR UPDATE
RETURNING REJECTS a)
UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a;

This has what I like to call the fundamental upsert property: You
always either get an insert, or an update. People are already
incorrectly using wCTEs like this; I thought I'd make it work
correctly.

I believe that we'll be well served by the flexibility of what I've
proposed. In many respects it is similar to SQL MERGE. For example, I
could have deleted rather than updated, or I could have had a more
complex predicate in the update clause, so that updates did not occur
on all rejecting rows. While I'm not particularly attached to the
syntax, I do think that the flexibility is a key strength.

I recently suggested that rather than RETURNING REJECTS, we could have
a REJECTING clause, which would see a DML statement project strictly
the complement of what RETURNING projects in the same context. So
perhaps you could also see what RETURNING would not have projected
because a before row trigger returned NULL (i.e. when a before trigger
indicates to not proceed with insertion). That is certainly more
general, and so is perhaps preferable. It's also less verbose, and it
seems less likely to matter that we'll need to make REJECTING a fully
reserved keyword, as compared to REJECTS. (RETURNING is already a
fully reserved keyword not described by the standard, so this makes a
certain amount of sense to me). If nothing else, REJECTING is more
terse than RETURNING REJECTS.

Like the MySQL feature "INSERT...ON DUPLICATE KEY UPDATE", duplicates
are defined as would-be violators of any and all unique indexes. Like
the MySQL feature, the user is generally obligated to make sure they
know ahead of time which unique index any violation may come from, or
else they may update the wrong row (i.e. the row they ultimately
update is not already locked). Unlike the MySQL feature, expert users
have some capacity to recover if that problem is anticipated, because
we also may project out the rejecting row's ctid.

Notably the wCTE pattern, with a ctid join condition doesn't work very
well, because, as src/backend/optimizer/path/tidpath.c says:
* There is currently no special support for joins involving CTID; in* particular nothing corresponding to
best_inner_indexscan().   Since it's* not very useful to store TIDs of one table in another table, there* doesn't seem
tobe enough use-case to justify adding a lot of code* for that.
 

You end up with a seqscan, not a tidscan, so I don't think every
novice user is going to try this as a premature optimization, without
appreciating the hazards of tid updates. Projecting the rejecting
row's tid is an expert level feature, added mostly with things like
multi-master replication conflict resolution in mind. Those use-cases
will find this feature quite important, and will particularly value
the flexibility. In fact, I think for that use-case, it's even more
useful than SQL MERGE for a couple of reasons, in particular the
capability to defer doing anything with the locked row until later
commands.

This feature is not supposed to fully satisfy those calling for SQL
MERGE. I anticipate that we'll still get MERGE in a future release,
and the implementation anticipates this as well.

More formally, what the feature does is:

* Ensure that a row is either inserted successfully, or that if an
effort to do so was unsuccessful, the first conclusively committed
tuple with a conflicting value is exclusive locked. Not all
conflicting tuples are locked, just the first, although the order in
which we check unique indexes for conflicts is well defined (at least
with my implementation, where when we find a conflicting TID, it must
really be the first one at that juncture, because all previous unique
indexes are value locked).

* Ensure that in READ COMMITTED mode, the locked row is always
visible. Since we can get a lock on the tuple blamed for insertion not
proceeding, it must be conclusively committed and not updated or
deleted by anyone else, but in and of itself that isn't sufficient.
The tuple's xact may logically be still-in-progress to our snapshot,
and it would not be acceptable to have it be impossible to update for
that reason. So there is a special case adjustment to the general
semantics of MVCC snapshots, just to facilitate this feature. Higher
isolation levels actively forbid themselves from proceeding when they
observe that they'll have to avail of this special rule to update, by
throwing a serialization failure.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: "David E. Wheeler"
Date:
Subject: Re: nested hstore patch
Next
From: Peter Geoghegan
Date:
Subject: Re: Race condition in b-tree page deletion