Re: [HACKERS] UPDATE of partition key - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] UPDATE of partition key
Date
Msg-id CA+TgmoYBJS-OPRjEcwjQj+XdqWNQ-C6LJHpMggsrh2jb+-KDeg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] UPDATE of partition key  (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses Re: [HACKERS] UPDATE of partition key
Re: [HACKERS] UPDATE of partition key
List pgsql-hackers
On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> I guess I don't see why it should work like this.  In the INSERT case,
>> we must build withCheckOption objects for each partition because those
>> partitions don't appear in the plan otherwise -- but in the UPDATE
>> case, they're already there, so why do we need to build anything at
>> all?  Similarly for RETURNING projections.  How are the things we need
>> for those cases not already getting built, associated with the
>> relevant resultRelInfos?  Maybe there's a concern if some children got
>> pruned - they could turn out later to be the children into which
>> tuples need to be routed. But the patch makes no distinction
>> between possibly-pruned children and any others.
>
> Yes, only a subset of the partitions appear in the UPDATE subplans. I
> think typically for updates, a very small subset of the total leaf
> partitions will be there in the plans, others would get pruned. IMHO,
> it would not be worth having an optimization where it opens only those
> leaf partitions which are not already there in the subplans. Without
> the optimization, we are able to re-use the INSERT infrastructure
> without additional changes.

Well, that is possible, but certainly not guaranteed.  I mean,
somebody could do a whole-table UPDATE, or an UPDATE that hits a
smattering of rows in every partition; e.g. the table is partitioned
on order number, and you do UPDATE lineitem SET product_code = 'K372B'
WHERE product_code = 'K372'.

Leaving that aside, the point here is that you're rebuilding
withCheckOptions and returningLists that have already been built in
the planner.  That's bad for two reasons.  First, it's inefficient,
especially if there are many partitions.  Second, it will amount to a
functional bug if you get a different answer than the planner did.
Note this comment in the existing code:
   /*    * Build WITH CHECK OPTION constraints for each leaf partition rel. Note    * that we didn't build the
withCheckOptionListfor each partition within    * the planner, but simple translation of the varattnos for each
partition   * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE    * cases are handled above.    */
 

The comment "UPDATE/DELETE cases are handled above" is referring to
the code that initializes the WCOs generated by the planner.  You've
modified the comment in your patch, but the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above.  And
then they are handled again here.  Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway.  I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken?  If it works
for some reason, the comments don't explain what that reason is.

>> ... I don't understand how you can *not* need a per-leaf-partition
>> mapping.  I mean, maybe you only need the mapping for the *unpruned*
>> leaf partitions
>
> Yes, we need the mapping only for the unpruned leaf partitions, and
> those partitions are available in the per-subplan resultRelInfo's.

OK.

>> but you certainly need a separate mapping for each one of those.
>
> You mean *each* of the leaf partitions ? I didn't get why we would
> need it for each one. The tuple targeted for update belongs to one of
> the per-subplan resultInfos. And this tuple is to be routed to another
> leaf partition. So the reverse mapping is for conversion from the
> source resultRelinfo to the root partition. I am unable to figure out
> a scenario where we would require this reverse mapping for partitions
> on which UPDATE is *not* going to be executed.

I agree - the reverse mapping is only needed for the partitions in
which UPDATE will be executed.

Some other things:

+             * The row was already deleted by a concurrent DELETE. So we don't
+             * have anything to update.

I find this explanation, and the surrounding comments, inadequate.  It
doesn't really explain why we're doing this.  I think it should say
something like this: For a normal UPDATE, the case where the tuple has
been the subject of a concurrent UPDATE or DELETE would be handled by
the EvalPlanQual machinery, but for an UPDATE that we've translated
into a DELETE from this partition and an INSERT into some other
partition, that's not available, because CTID chains can't span
relation boundaries.  We mimic the semantics to a limited extent by
skipping the INSERT if the DELETE fails to find a tuple.  This ensures
that two concurrent attempts to UPDATE the same tuple at the same time
can't turn one tuple into two, and that an UPDATE of a just-deleted
tuple can't resurrect it.

+            bool        partition_check_passed_with_trig_tuple;
+
+            partition_check_passed =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+            partition_check_passed_with_trig_tuple =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+            if (partition_check_passed)
+            {
+                /*
+                 * If it's the trigger that is causing partition constraint
+                 * violation, abort. We don't want a trigger to cause tuple
+                 * routing.
+                 */
+                if (!partition_check_passed_with_trig_tuple)
+                    ExecPartitionCheckEmitError(resultRelInfo,
+                                                trig_slot, estate);
+            }
+            else
+            {
+                /*
+                 * Partition constraint failed with original NEW tuple. But the
+                 * trigger might even have modifed the tuple such that it fits
+                 * back into the partition. So partition constraint check
+                 * should be based on *final* NEW tuple.
+                 */
+                partition_check_passed =
partition_check_passed_with_trig_tuple;
+            }

Maybe I inadvertently gave the contrary impression in some prior
review, but this logic doesn't seem right to me.  I don't think
there's any problem with a BR UPDATE trigger causing tuple routing.
What I want to avoid is repeatedly rerouting the same tuple, but I
don't think that could happen even without this guard. We've now fixed
insert tuple routing so that a BR INSERT trigger can't cause the
partition constraint to be violated (cf. commit
15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for
update tuple routing to trigger additional BR UPDATE triggers.  So I
don't see the point of checking the constraints twice here.  I think
what you want to do is get rid of all the changes here and instead
adjust the logic just before ExecConstraints() to invoke
ExecPartitionCheck() on the post-trigger version of the tuple.

Parenthetically, if we decided to keep this logic as you have it, the
code that sets partition_check_passed and
partition_check_passed_with_trig_tuple doesn't need to check
resultRelInfo->ri_PartitionCheck because the surrounding "if" block
already did.

+    for (i = 0; i < num_rels; i++)
+    {
+        ResultRelInfo *resultRelInfo = &result_rels[i];
+        Relation        rel = resultRelInfo->ri_RelationDesc;
+        Bitmapset     *expr_attrs = NULL;
+
+        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+            return true;
+    }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others; or in other words, why do we have to loop over
all of the result relations?  It seems to me that the user has typed
something like:

UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
AND thunk = ...

If either thingy or whatsit is a partitioning column, UPDATE tuple
routing might be needed - and it should be able to test that by a
*single* comparison between the set of columns being updated and the
partitioning columns, without needing to repeat for every partitions.
Perhaps that test needs to be done at plan time and saved in the plan,
rather than performed here -- or maybe it's easy enough to do it here.

One problem is that, if BR UPDATE triggers are in fact allowed to
cause tuple routing as I proposed above, the presence of a BR UPDATE
trigger for any partition could necessitate UPDATE tuple routing for
queries that wouldn't otherwise need it.  But even if you end up
inserting a test for that case, it can surely be a lot cheaper than
this, since it only involves checking a boolean flag, not a bitmapset.
It could be argue that we ought to prohibit BR UPDATE triggers from
causing tuple routing so that we don't have to do this test at all,
but I'm not sure that's a good trade-off.  It seems to necessitate
checking the partition constraint twice per tuple instead of once per
tuple, which like a very heavy price.

+#define GetUpdatedColumns(relinfo, estate) \
+    (rt_fetch((relinfo)->ri_RangeTableIndex,
(estate)->es_range_table)->updatedCols)

I think this should be moved to a header file (and maybe turned into a
static inline function) rather than copy-pasting the definition into a
new file.

-            List       *mapped_wcoList;
+            List       *mappedWco;            List       *wcoExprs = NIL;            ListCell   *ll;

-            /* varno = node->nominalRelation */
-            mapped_wcoList = map_partition_varattnos(wcoList,
-                                                     node->nominalRelation,
-                                                     partrel, rel);
-            foreach(ll, mapped_wcoList)
+            mappedWco = map_partition_varattnos(firstWco, firstVarno,
+                                                partrel, firstResultRel);
+            foreach(ll, mappedWco)            {                WithCheckOption *wco = castNode(WithCheckOption,
lfirst(ll));               ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
 
-                                                   plan);
+                                                   &mtstate->ps);
                wcoExprs = lappend(wcoExprs, wcoExpr);            }

-            resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+            resultRelInfo->ri_WithCheckOptions = mappedWco;

Renaming the variable looks fairly pointless, unless I'm missing something?

Regarding the tests, it seems like you've got a test case where you
update a sub-partition and it fails because the tuple would need to be
moved out of a sub-tree, which is good.  But I think it would also be
good to have a case where you update a sub-partition and it succeeds
in moving the tuple within the subtree.  I don't see one like that
presently; it seems all the others update the topmost root or the
leaf.  I also think it would be a good idea to make sub_parted's
column order different from both list_parted and its own children, and
maybe use a diversity of data types (e.g. int4, int8, text instead of
making everything int).

+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;

The extra space before the comma looks strange.

Also, please make a habit of checking patches for whitespace errors
using git diff --check.

[rhaas pgsql]$ git diff --check
src/backend/executor/nodeModifyTable.c:384: indent with spaces.
+                        tuple, &slot);
src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
+                IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));

You will notice these kinds of things if you read the diff you are
submitting before you press send, because git highlights them in
bright red.  That's a good practice for many other reasons, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: [HACKERS] Typo in insert.sgml
Next
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Optional message to user when terminating/cancellingbackend