MERGE SQL Statement - Mailing list pgsql-hackers

From Simon Riggs
Subject MERGE SQL Statement
Date
Msg-id 1208372338.4259.202.camel@ebony.site
Whole thread Raw
Responses Re: MERGE SQL Statement
List pgsql-hackers
I've analysed various flavours of MERGE command to understand and
propose what we should use for PostgreSQL.

The results aren't what you'd expect from a quick flick through the
standard, so lets look at my main concerns:

1. The simplest syntax is for SQL:2003. The syntax for DB2, SQL Server
and Oracle is more complex, with SQL:2008(final draft) being very
similar to DB2 and SQL Server, so unlikely to be a point of contention
in the standard. I suggest we go with the latter, but yes, its still in
draft (yawn).

2. MySQL and Teradata have their own syntax for the row-oriented Upsert
operation. Both of those are more useful (IMHO) than MERGE for OLTP
apps, while MERGE is very useful for bulk data loads. I'm open to the
idea that we do something like this in addition to MERGE.

3. The way MERGE works is to define a left outer join between source and
target, then specify a series of WHEN clauses that may or may not apply.
It **isn't** just a simple Update/Insert and so much of what we have
discussed previously goes straight in the trash. AFAICS the way it is
specified to work it would be fairly straightforward to cause race
conditions and failures when using multiple concurrent MERGE statements.

General Example of the recommended syntax for PostgreSQL

MERGE INTO Stock S  /* target */
       USING DailySales DS   /* source table */
       ON S.Item = DS.Item   /* left outer join source to target */
       WHEN MATCHED AND (QtyOnHand - QtySold = 0) THEN
    /* delete item if no stock remaining */               DELETE 
       WHEN MATCHED THEN /* No AND clause, so drop thru */
                /* update value if some remains */               UPDATE SET QtyOnHand = QtyOnHand - QtySold
       WHEN NOT MATCHED THEN
               /* insert a row if the stock is new */               INSERT VALUES (Item, QtySold)
;

So lets look at the syntaxes and then review how it might work.

SYNTAX
======
SQL:2003
--------
MERGE INTO target [AS correlation-name]
USING [table-ref | subquery]
ON <search-condition>
[WHEN MATCHED THEN MergeUpdate]
[WHEN NOT MATCHED THEN MergeInsert]

Oracle 11g 
----------
MERGE INTO target [AS correlation-name]
USING [table-ref | subquery]
ON <search-condition>
[WHEN MATCHED THEN MergeUpdate  WHERE <where-clause> DELETE WHERE <where-clause>]
[WHEN NOT MATCHED THEN MergeInsert WHERE <where-clause>]

Differences from SQL:2003 are
* Update and Insert have WHERE clauses on them
* Oracle allows multiple WHEN ... WHERE clauses
* Oracle allows an error logging clause also
* optional DELETE statement as part of the UPDATE, so you can only
DELETE what you update (yeh, really)
* WHEN MATCHED/WHEN NOT MATCHED must be in fixed order, only

IBM DB2
-------
MERGE INTO target [AS correlation-name]
USING [table-ref | subquery]
ON <search-condition>
[WHEN MATCHED [AND <where-clause>] THEN MergeUpdate | MergeDelete]
[WHEN NOT MATCHED [AND <where-clause>] THEN MergeInsert | SignalClause]
[ELSE IGNORE]

Differences from SQL:2003 are
* Update and Insert have AND clauses on them (like WHERE...)
* DB2 allows multiple WHEN ... AND clauses
* DELETE is also a full-strength option, not part of the MergeUpdate
clause as it is in Oracle
* DB2 allows a SIGNAL statement, similar to RAISE
* ELSE IGNORE is an optional syntax, which does nothing

SQL Server 2008
---------------

MERGE [INTO] target [AS correlation-name]
USING [table-ref | subquery]
ON <search-condition>
[WHEN MATCHED [AND <where-clause>] THEN MergeUpdate | MergeDelete]
[WHEN NOT MATCHED [AND <where-clause>] THEN MergeInsert]

Differences from SQL:2003 are
* Update and Insert have AND clauses on them (like WHERE...)
* DB2 allows multiple WHEN ... AND clauses
* DELETE is also a full-strength option, not part of the MergeUpdate
clause as it is in Oracle

SQL:2008
--------

MERGE INTO target [AS correlation-name]
USING [table-ref | subquery]
ON <search-condition>
[WHEN MATCHED [AND <where-clause>] THEN MergeUpdate]
[WHEN NOT MATCHED [AND <where-clause>] THEN MergeInsert]

Differences from SQL:2003 are
* Update and Insert have AND clauses on them (like WHERE...)
* Allows multiple WHEN ... AND clauses

Alternate Syntax
----------------

MySQL supports
* REPLACE INTO
* INSERT ... ON DUPLICATE KEY UPDATE ... 

Teradata supports
* UPDATE ... ELSE INSERT ...
* MERGE with an additional error logging clause

The Teradata and Oracle error logging clauses are very cute and I
suggest we do something similar for COPY, at least.

Proposed Syntax for PostgreSQL
==============================

MERGE INTO table [[AS] alias]
USING [table-ref | query]
ON join-condition
[WHEN MATCHED [AND condition] THEN MergeUpdate | DELETE]
[WHEN NOT MATCHED [AND condition] THEN MergeInsert]

MergeUpdate is
UPDATE SET { column = { expression | DEFAULT } |         ( column [, ...] ) = ( { expression | DEFAULT } [, ...] ) }
[, ...]
(yes, there is no WHERE clause here)

MergeInsert is
INSERT [ ( column [, ...] ) ]   { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] )
[, ...]}
(no subquery allowed)


Notes and behaviours
--------------------

* It is possible for concurrent MERGE statements to cause duplicate
INSERT violations because of a race condition between when we check
whether the row is matching/not matching and when we apply the
appropriate WHEN clause, if any. This is just the same as what we do now
with try-UPDATE-then-INSERT logic. (This seems to end the discussion
about whether we do inserts/updates first because the matching test is
always performed before we take the action; or perhaps it means we don't
like MERGE as much as we did before and would prefer alternate
syntaxes...). Maybe we could avoid some problems by applying
heap_lock_tuple() to each matched row, so we know it will stay matched
while we evaluate the WHEN clauses? Maybe not.

* USING query can be a VALUES clause if we wish to do single/few row
operations, so MERGE can be used for bulk-loading and OLTP

* There is no RETURNING option for MERGE, nor for any INSERT/UPDATE
sub-clauses 
* WHERE CURRENT OF cursor is not supported anywhere
* The join can't be recursive, so no WITH support (common expressions,
i.e. non-recursive WITH are supported by SQLServer 2008)
* conditions would not allow sub-selects 

* MERGE would work on base tables only, just like COPY
* Changes are made only to that single table
* Cannot update a column mentioned in the ON clause cos that would make
my head hurt too much

* MERGE will perform a left outer join between source on left and target
on right. There must be no more than 1 row from table-ref for each row
in the table. Each row in the table can only be updated once during each
MERGE statement. Each non-matching row in the table-ref will result in
one INSERT into table.
* WHEN clauses are tested in the order specified. If the AND condition
returns false then we skip onto the next WHEN clause. We stop once a
WHEN clause activates, so only *one* action is ever activated for each
row.
* AND clauses need not form a complete set, i.e. it is possible that no
action will result. It is also possible that some WHEN clauses will
never activate because of the execution order; we would not try to
prevent this, just document it as a possible user error.

* MERGE will respect Triggers, but not Rules since the rules behaviour
is roughly orthogonal to the WHEN clauses 
* MERGE fires UPDATE and INSERT triggers according to which WHEN clause
is activated (if any)

* It's unclear whether MERGE should activate statement-level triggers,
or not. Few of the above sources are explicit either way on this point.
DB2 always runs both UPDATE and INSERT statement-level triggers, whether
or not rows have been changed; I would suggest we do that also for
before triggers. For after statement triggers I would suggest that we
track how many updates and inserts are caused and if updates > 0 then we
activate the after statement for update triggers, and if inserts > 0
then we activate the after statement for insert triggers. If a statement
level trigger is activated by both update and insert then it would be
possible for both TRIGGER_FIRED_BY_UPDATE() and
TRIGGER_FIRED_BY_DELETE() to be true (for statement level triggers
only), which would be a change from what we do now, even if the old
behaviour was not explicitly mutually exclusive. In both cases I suggest
we run Update triggers before Insert triggers consistently for both
before and after statement triggers.

* The number of rows changed should be (inserts + updates) which should
be < number of rows returned by table-ref. It would be good to get
access to the number of rows inserted and updated, so I propose that we
return a NOTICE statement with this information.

* The way MERGE is specified, the internals design seems to fall out
fairly clearly:
- set up a left outer join between source and target, with a junk
attribute that might be NULL to indicate NOT MATCHED. Let planner
optimise that as it wishes, nothing special needed
- in ExecutePlan have a new operation type of CMD_MERGE, which then
allows us to apply the WHEN clauses to decide what kind of final
operation will result for that tuple (U, D or I)

I'm planning to agree the syntax and write regression tests first, so we
all agree the behaviour we are aiming towards.

Thoughts? (Apart from jokes about having a WHY clause...)

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: How to submit a patch
Next
From: Bruce Momjian
Date:
Subject: Re: How to submit a patch