Re: WIP patch for multiple column assignment in UPDATE - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: WIP patch for multiple column assignment in UPDATE
Date
Msg-id CAFj8pRBJko7YNNLNzk2MCQ50XJJzNf+1ODcgno4FuRRts9KOyg@mail.gmail.com
Whole thread Raw
In response to WIP patch for multiple column assignment in UPDATE  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: WIP patch for multiple column assignment in UPDATE  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hello

I did some tests and It looks so it allows only some form of nested loop.

postgres=# explain (analyze, timing off, buffers) update a1 set b = (select b from a2 where a1.a = a2.a);
                                                  QUERY PLAN                                                  
---------------------------------------------------------------------------------------------------------------
 Update on a1  (cost=0.00..8456925.00 rows=1000000 width=10) (actual rows=0 loops=1)
   Buffers: shared hit=9017134 read=14376 dirtied=58170 written=1014
   ->  Seq Scan on a1  (cost=0.00..8456925.00 rows=1000000 width=10) (actual rows=1000000 loops=1)
         Buffers: shared hit=4005465 read=4424 written=971
         SubPlan 1
           ->  Index Scan using a2_a_idx on a2  (cost=0.42..8.44 rows=1 width=4) (actual rows=1 loops=1000000)
                 Index Cond: (a1.a = a)
                 Buffers: shared hit=4005464
 Planning time: 0.212 ms
 Execution time: 30114.101 ms
(10 rows)

do you plan some sophisticated mechanism - like MERGE or some similar?

Regards

Pavel


2014-06-16 17:17 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Attached is a very-much-WIP patch for supporting
         UPDATE foo SET ..., (a,b,c) = (select x,y,z from ...), ...

It lacks documentation, ruleutils.c support, or any solution for the
rule NEW.* expansion problem I mentioned Saturday.  The reason I'm
posting it now is to get feedback about an implementation choice that
feels a bit klugy to me; but I don't see any clearly better idea.

The post-parse-analysis representation I've chosen here is that the
output columns of the sub-select are represented by PARAM_MULTIEXEC
Params, and the sub-select itself appears in a resjunk entry at the end
of the targetlist; that is, the UPDATE tlist is more or less like

   $1,  -- to be assigned to a
   $2,  -- to be assigned to b
   $3,  -- to be assigned to c
   (select x,y,z from ...),  -- resjunk entry, value will be discarded

If the sub-select is uncorrelated with the outer query, the planner
turns it into an InitPlan, replacing the resjunk tlist item with a
NULL constant, and then everything happens normally at execution.

But more usually, the sub-select will be correlated with the outer
query.  In this case, the subquery turns into a SubPlan tree that
stays in the resjunk item.  At the start of execution, the ParamExecData
structs for each of its output Params are marked with execPlan pointers
to the subplan, just as would happen for an InitPlan.  This causes the
subplan to get executed when the first of the output Params is evaluated;
it loads the ParamExecData structs for all its output Params, and then
the later Params just take data from the structs.  When execution reaches
the MULTIEXEC SubPlan in the resjunk tlist item, no evaluation of the
subplan is needed; but instead we re-mark all the output ParamExecData
structs with non-null execPlan pointers, so that a fresh execution of
the subplan will happen in the next evaluation of the targetlist.

The klugy aspect of this is that it assumes that the SubPlan item will
appear later in the tlist than all the Params referencing it.  This is
true at the moment because resjunk tlist items always appear after
non-resjunk ones.  There are a few other places that already depend on
this ordering, but we've mostly tried to avoid introducing new
dependencies on it.

The alternative that I'd originally had in mind, before put-it-in-a-
resjunk-item occurred to me, was to invent a new "secondary tlist"
field of Query and of ModifyTable plan nodes, as I sketched back in
http://www.postgresql.org/message-id/1783.1399054541@sss.pgh.pa.us
We'd put the MULTIEXEC SubPlans in this secondary tlist and expect
the executor to evaluate it just before evaluating the main tlist.
However, that way isn't terribly pretty either, because it extends
knowledge of this feature to a *whole lot* of places that don't have
to know about it in the attached approach; in particular, just about
every place that manipulates targetlist contents would have to also
manipulate the secondary tlist.

Another approach is to explicitly identify which of the Params will
be evaluated first and replace it with a node tree that evaluates
the subplan (and sets output Params for the remaining columns).
This is a bit messy because the first-to-be-evaluated is dependent
on the targetlist reordering that the planner does; so we don't want
parse analysis to try to do it.  (If we allowed parse analysis to
know that the planner will sort the tlist items into physical column
order, we could do it like that; but then it would break if we ever
get around to allowing ALTER TABLE to change the physical order.)
We could safely have setrefs.c determine the first-to-be-evaluated
Param, though, since it will traverse the tlist in final order.
So if we went with this approach I'd have setrefs.c replace the first
Param with a SubPlan node.  That seems a bit of a kluge too though.

Preferences, comments, better ideas?

                        regards, tom lane



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information)
Next
From: Michael Paquier
Date:
Subject: Re: Removing dependency to wsock32.lib when compiling code on WIndows