Thread: Internal design of MERGE, with Rules

Internal design of MERGE, with Rules

From
Simon Riggs
Date:
MERGE looks like it may need some new infrastructure to support it,
depending upon the implementation route. Guidance and discussion
requested from main hackers, if possible.

This is a separate post because there's no further discussion here on
the external behaviour of MERGE, or its concurrency/locking.

Assumptions
-----------

* MERGE must be a parameterisable, preparable command. The parameters
might be anywhere in the statement i.e. in the USING clause, or in the
action statements, or both.

* MERGE must work like any other statement, e.g. multi-statement
requests must work also e.g. UPDATE...; MERGE ... ;INSERT ...

* MERGE works on tables only, not views. PostgreSQL doesn't support
updateable views, so this isn't a problem to solve as part of this
implementation.

* MERGE contains possibly multiple INSERTs, UPDATEs and DELETEs, each of
which can invoke a rule to generate even more commands. 

* We would not add MERGE to the list of possible events that invoke a
Rule. (The data changing events would remain as INSERT, UPDATE, DELETE).
However, a MERGE statement might be used *within* a Rule, giving a tree
of plans rather than a single plan or a list of plans. (I can't see any
way or any reason to say "no allowed" to that...)

Implications
------------

To fully support prepared queries, MERGE needs to be a "query" not a
utility command as I had originally envisaged. MERGE internally consists
of one driving query that is always executed in full to retrieve all
tuples generated, plus a number of other queries that may or may not get
executed according to the evaluation of the WHEN clauses.

Rule re-writing happens at the plan level. That means we need to
recognise MERGE as being a new form of query: a query that has multiple
other optional queries buried within it, each of which can be
independently re-written. Only by doing that can we apply the rules
correctly to the various parts of MERGE.

That means we probably need to introduce new infrastructure in the tcop
or executor modules to handle queries-within-queries. This isn't
special-casing MERGE so much as introducing infrastructure for a new
class of query, such as MERGE, REPLACE, INSERT ELSE UPDATE. (Merge
itself does cover almost all cases of this type of query, but we'd be
able to fairly easily support all of the different syntax).

MERGE would then be represented by a query that has many "side
queries" (so called so we don't confused calling them sub-queries). Each
side query can be rewritten into a list of queries. Any of those queries
could be a MERGE itself, so we might make each Query a tree of Queries.

The main query and the side queries are related, in that the values used
for the side queries come from the main query. The side queries'
external references will be replaced by parameters, in addition to other
parameters already supplied. So the side queries are parameterised, even
if the MERGE statement wasn't parameterised. The main driving query will
be executed once, the side queries once per row (if at all).

Let's look back at the infrastructure aspects. My options so far for
implementation are

0. Write MERGE in PL/pgSQL. Err, no.

1. We treat the MERGE similar to an Update-with-conditional-effects and
have the MERGE's main plan deliver tuples that we then act on during
ExecutePlan() using a new ExecMerge() function. We would preserve the
text of the side queries and execute them using SPI from within
ExecMerge(). That takes care of the rule rewriting, since we effectively
don't know what's happening inside SPI. Feels ugly, is all I can say,
though actually very similar to the way RI works.

2. Similar to (1), but instead of using SPI we start another portal from
within ExecMerge() function. Portals within portals. Seems very likely
for something to break in an unexpected way.

3. We treat a MERGE as a new kind of portal query. Any query containing
a MERGE will be forced to be a PORTAL_MULTI_QUERY. We then introduce a
new query processing routine for MERGE, ProcessQueryWithActions() rather
than running ProcessQuery() during PortalRunMulti(). This can then
execute the merge-query similarly to a cursor, retrieving tuples that
are then fed to the side queries that then run in separate portals.
Doing it this way will mean that we don't need to run the executor
recursively, we just switch from main query to retrieve next row and
then execute the appropriate side query. The executor level calls will
look almost exactly as it would if we wrote MERGE in PL/pgSQL - fetching
from a main cursor and then executing side queries as appropriate. The
hacking will have similar-ish effects to that induced by the changes for
RETURNING.

Something like
ExecutorStart(mainQueryDesc, 0);for (;;){    /* FETCH 1 */    ExecutorRun(mainQueryDesc, ForwardScanDirection, 1L);
if(mainQueryDesc->estate->es_processed == 0)        break;
 
    //identify side query 
    //internal equivalent of exec_bind_message    //internal equivalent of exec_execute_message}

(3) sounds best to me, so far, though feel free to explain otherwise if
you see obvious holes.

For (2) or (3) we will need to alter the ReWriter and Planner to recurse
through the tree to rewrite and plan the queries. It's not going to
effect the main act of planning or rewriting, just how we initiate it.

I'll implement a very crude prototype to flush out problems in this
area.


The rest of this post is some general thoughts on the rest of the
implementation. I don't foresee any issues there, just lots of hacking.


Examples
--------

Some examples of how MERGE might be rewritten with rules are

MERGE
WHEN MATCHED UPDATE
WHEN NOT MATCHEDINSERT

might be re-written as

MERGE
WHEN MATCHED UPDATE; INSERT
WHEN NOT MATCHED INSERT; INSERT

or even

MERGE
WHEN MATCHED UPDATE; (MERGE      WHEN MATCHED UPDATE      WHEN NOT MATCHED INSERT)
WHEN NOT MATCHED INSERT; INSERT

Normally, an UPDATE that triggers another UPDATE on the same table will
create a rule recursion error. If we have an UPDATE that calls a MERGE
that generates an UPDATE on the target table, then we would like to be
able to throw a similar recursion error. To do that we would need to
expand all of the sub-actions of the MERGE during a single query
rewrite. If we defer the rewriting of the side queries until execution
then we will get an infinite recursion error, just slightly later.

Also, I can't see any sane set of rules that would generate a set of
statements that will violate the required implementation of MERGE. You
can always do something stupid like an ON UPDATE DO ALSO DELETE (on same
row), however. That would technically be a violation of MERGE, but would
have the same effect as ON UPDATE DO INSTEAD DELETE, which would be OK.

MERGE side queries
------------------

The main driving query will return the ctid plus all other required
columns. The side queries will be constructed by adding in the result
relation and a tidscan clause to each query, like this

UPDATE SET (set-clause) 

becomes

UPDATE result-relation SET (set-clause) WHERE ctid = $X;

We wouldn't use WHERE CURRENT OF because the main query doesn't qualify
as an updateable cursor, so that formulation shouldn't work. (Though
maybe we might choose to bend the rules a little).

In addition, I would translate the MERGE's WHEN clauses into a CASE WHEN
expression that can be run as part of the main driving query. This
expression will then tell us the subscript of the side query list that
will need to be executed for each row returned.
case
when <when-not-matched-condition-0> then 0
when <when-not-matched-condition-1> then 1
...
when <when-not-matched-condition-N> then N
else -1
end
,case 
when <when-matched-condition-0> then 0
when <when-matched-condition-1> then 1
...
when <when-matched-condition-N> then N
else -1
end

The main query will then look like this

select   target.ctid,case when-not-matched (as above),case when-matched (as above),(all other columns required for side
queries)   
 
from <source-query> left outer join <target> on <join-condition>
where (<when-matched-condition-0>
or <when-matched-condition-1>
...
or <when-matched-condition-N>)
or (<when-not-matched-condition-0>
or <when-not-matched-condition-1>
...
or <when-not-matched-condition-N>)

The WHERE clause is likely required in case we get queries like this

MERGE target t
USING (select * from source) s
ON (s.pkey = t.pkey)
WHEN MATCHED AND s.pkey = $1UPDATE SET col = $2;

which would be perfectly valid, even if we might hope that they had
coded like this

MERGE target
USING (select * from source WHERE index-column = $1)
ON (join-condition)
WHEN MATCHED UPDATE SET col = $2;

Other notes
--------------------------

* No, MERGE won't work on inherited table hierarchies. At least to begin
with.

* MERGE doesn't support RETURNING, if we did we'd have to enforce a
RETURNING clause to be the same target list on all sub-plans of the
query tree, which is unlikely to be easy. We could enforce a single
RETURNING clause on the whole MERGE statement, but that's another
thought entirely. We have no precedent from either the SQL Standard or
other implementations to do this

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Internal design of MERGE, with Rules

From
Sam Mason
Date:
On Wed, Apr 30, 2008 at 04:58:52PM +0100, Simon Riggs wrote:
> That means we probably need to introduce new infrastructure in the tcop
> or executor modules to handle queries-within-queries. This isn't
> special-casing MERGE so much as introducing infrastructure for a new
> class of query, such as MERGE, REPLACE, INSERT ELSE UPDATE. (Merge
> itself does cover almost all cases of this type of query, but we'd be
> able to fairly easily support all of the different syntax).
> 
> MERGE would then be represented by a query that has many "side
> queries" (so called so we don't confused calling them sub-queries).

Why make them special cases?  (I'm sure there's a good reason!)

I've sometimes wanted to be able to put DML statements inside SELECT
statements.  The following is a slightly reasonable example:
 INSERT INTO ilog (i,ts,n)   SELECT i, now(), COUNT(*)   FROM (     INSERT INTO bar (x,y)       SELECT 5, n       FROM
baz      WHERE i = 10       RETURNING i) x   GROUP BY i;
 

Hum, that implies a nasty schema (ilog.ts should be in bar).  Ah well, I
hope you get the idea.  The simplest example I can think of is:
 SELECT * FROM (INSERT INTO foo (n) VALUES (1) RETURNING 1) x;

Hum, I think I may have just thought of why.  What would:
 SELECT (INSERT INTO foo (n) VALUES (f.n) RETURNING 1) FROM foo f;

be expected to do?  I'm thinking of nasty cases where the outer code
reading foo never stops because there's new stuff being inserted ahead
of it.  But then again, you can put the insert into a stored procedure
and it does indeed do something sensible.

 Sam


Re: Internal design of MERGE, with Rules

From
Simon Riggs
Date:
On Thu, 2008-05-01 at 00:26 +0100, Sam Mason wrote:

> On Wed, Apr 30, 2008 at 04:58:52PM +0100, Simon Riggs wrote:
> > That means we probably need to introduce new infrastructure in the tcop
> > or executor modules to handle queries-within-queries. This isn't
> > special-casing MERGE so much as introducing infrastructure for a new
> > class of query, such as MERGE, REPLACE, INSERT ELSE UPDATE. (Merge
> > itself does cover almost all cases of this type of query, but we'd be
> > able to fairly easily support all of the different syntax).
> > 
> > MERGE would then be represented by a query that has many "side
> > queries" (so called so we don't confused calling them sub-queries).
> 
> Why make them special cases?  (I'm sure there's a good reason!)

Well, I'm not making them special cases. The infrastructure would be
generalised for any statement type that wanted to do roughly this.

> I've sometimes wanted to be able to put DML statements inside SELECT
> statements.  The following is a slightly reasonable example:
> 
>   INSERT INTO ilog (i,ts,n)
>     SELECT i, now(), COUNT(*)
>     FROM (
>       INSERT INTO bar (x,y)
>         SELECT 5, n
>         FROM baz
>         WHERE i = 10
>         RETURNING i) x
>     GROUP BY i;

OK, but that's not what I'm working on... useful as it sounds.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Internal design of MERGE, with Rules

From
Simon Riggs
Date:
On Wed, 2008-04-30 at 16:58 +0100, Simon Riggs wrote:
> The main query will then look like this
> 
> select   target.ctid
>         ,case when-not-matched (as above)
>         ,case when-matched (as above)
>         ,(all other columns required for side queries)  
> from <source-query> left outer join <target> on <join-condition>
> where (<when-matched-condition-0>
> or <when-matched-condition-1>
> ...
> or <when-matched-condition-N>)
> or (<when-not-matched-condition-0>
> or <when-not-matched-condition-1>
> ...
> or <when-not-matched-condition-N>)
> 
> The WHERE clause is likely required in case we get queries like this
> 
> MERGE target t
> USING (select * from source) s
> ON (s.pkey = t.pkey)
> WHEN MATCHED AND s.pkey = $1
>         UPDATE SET col = $2;
> 
> which would be perfectly valid, even if we might hope that they had
> coded like this
> 
> MERGE target
> USING (select * from source WHERE index-column = $1)
> ON (join-condition)
> WHEN MATCHED 
>         UPDATE SET col = $2; 

Peter has just jogged my memory about double evaluation of volatile
functions, so the above transformation isn't correct.

We would not be able to fully optimise a MERGE statement like this 

MERGE target t
USING (select * from source) sON (s.pkey = t.pkey)WHEN MATCHED AND s.key = $1       UPDATE SET col = $2;

since we won't be able to pass the clause "s.pkey = $1" down into the s
query so it would use an index. The following statement will be faster,
but will in all cases give an identical result:

MERGE target t
USING (select * from source WHERE key = $1) sON (s.pkey = t.pkey)WHEN MATCHED       UPDATE SET col = $2;

I don't think its too important, since the latter is the way people
would have used MERGE in SQL:2003 anyway.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com