Re: [HACKERS] UPDATE of partition key - Mailing list pgsql-hackers

From Amit Khandekar
Subject Re: [HACKERS] UPDATE of partition key
Date
Msg-id CAJ3gD9d_DukdMW4mMq-qTpXy8bJhJTDRVS4O4Ue952-8PO9wkQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] UPDATE of partition key  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] UPDATE of partition key
List pgsql-hackers
On 21 June 2017 at 00:23, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed. But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans. I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans, others would get pruned. IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans. Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
>
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition;

I am not saying that it's guaranteed to be a small subset. I am saying
that it would be typically a small subset for
update-of-partitioned-key case. Seems weird if a user causes an
update-row-movement for multiple partitions at the same time.
Generally it would be an administrative task where some/all of the
rows of a partition need to have their partition key updated that
cause them to change their partition, and so there would be probably a
where clause that would narrow down the update to that particular
partition, because without the where clause the update is anyway
slower and it's redundant to scan all other partitions.

But, point taken, that there can always be certain cases involving
multiple table partition-key updates.

> e.g. the table is partitioned on order number, and you do UPDATE
> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update order number, so here there is no
partition-key-update. Are you thinking that the patch is generating
the per-leaf-partition WCO expressions even for a update not involving
a partition key ?

>
> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.

Yeah, I agree that this becomes more and more redundant if the update
involves more partitions.

> Second, it will amount to a functional bug if you get a
> different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfo are meant for the
resultRelinfo used for the UPDATE plans. So actually it should not
matter whether they look same or different, because they are fired at
different objects. Now these objects can happen to be the same
relations though.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both the same
relation. I am possibly missing something. The planner has already
generated the withCheckOptions for each of the resultRelInfo. And then
we are using one of those to re-generate the WCO for a leaf partition
by only adjusting the attnos. If there is already a WCO generated in
the planner for that leaf partition (because that partition was
present in mtstate->resultRelInfo), then the re-built WCO should be
exactly look same as the earlier one, because they are the same
relations, and so the attnos generated in them would be same since the
Relation TupleDesc is the same.

> Note this comment in the existing code:
>
>     /*
>      * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>      * that we didn't build the withCheckOptionList for each partition within
>      * the planner, but simple translation of the varattnos for each partition
>      * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>      * cases are handled above.
>      */
>
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the comment in your patch, but the associated code: your
> updated comment says that only "DELETEs and local UPDATES are handled
> above", but in reality, *all* updates are still handled above.  And

Actually I meant, "above works for only local updates. For
row-movement-updates, we need per-leaf partition WCOs, because when
the row is inserted into target partition, that partition may be not
be included in the above planner resultRelInfo, so we need WCOs for
all partitions". I think this said comment should be sufficient if I
add this in the code ?

> then they are handled again here.
> Similarly for returning lists.
> It's certainly not OK for the comment to be inaccurate, but I think
> it's also bad to redo the work which the planner has already done,
> even if it makes the patch smaller.
>
> Also, I feel like it's probably not correct to use the first result
> relation as the nominal relation for building WCOs and returning lists
> anyway.  I mean, if the first result relation has a different column
> order than the parent relation, isn't this just broken?  If it works
> for some reason, the comments don't explain what that reason is.

Not sure why parent relation should come into picture. As long as the
first result relation belongs to one of the partitions in the whole
partition tree, we should be able to use that to build WCOs of any
other partitions, because they have a common set of attributes having
the same name. So we are bound to find each of the attributes of first
resultRelInfo in the other leaf partitions during attno mapping.

> Some other things:
>
> +             * The row was already deleted by a concurrent DELETE. So we don't
> +             * have anything to update.
>
> I find this explanation, and the surrounding comments, inadequate.  It
> doesn't really explain why we're doing this.  I think it should say
> something like this: For a normal UPDATE, the case where the tuple has
> been the subject of a concurrent UPDATE or DELETE would be handled by
> the EvalPlanQual machinery, but for an UPDATE that we've translated
> into a DELETE from this partition and an INSERT into some other
> partition, that's not available, because CTID chains can't span
> relation boundaries.  We mimic the semantics to a limited extent by
> skipping the INSERT if the DELETE fails to find a tuple.  This ensures
> that two concurrent attempts to UPDATE the same tuple at the same time
> can't turn one tuple into two, and that an UPDATE of a just-deleted
> tuple can't resurrect it.

Thanks, will put that comment in the next patch.

>
> +            bool        partition_check_passed_with_trig_tuple;
> +
> +            partition_check_passed =
> +                (resultRelInfo->ri_PartitionCheck &&
> +                 ExecPartitionCheck(resultRelInfo, slot, estate));
> +
> +            partition_check_passed_with_trig_tuple =
> +                (resultRelInfo->ri_PartitionCheck &&
> +                 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
> +            if (partition_check_passed)
> +            {
> +                /*
> +                 * If it's the trigger that is causing partition constraint
> +                 * violation, abort. We don't want a trigger to cause tuple
> +                 * routing.
> +                 */
> +                if (!partition_check_passed_with_trig_tuple)
> +                    ExecPartitionCheckEmitError(resultRelInfo,
> +                                                trig_slot, estate);
> +            }
> +            else
> +            {
> +                /*
> +                 * Partition constraint failed with original NEW tuple. But the
> +                 * trigger might even have modifed the tuple such that it fits
> +                 * back into the partition. So partition constraint check
> +                 * should be based on *final* NEW tuple.
> +                 */
> +                partition_check_passed =
> partition_check_passed_with_trig_tuple;
> +            }
>
> Maybe I inadvertently gave the contrary impression in some prior
> review, but this logic doesn't seem right to me.  I don't think
> there's any problem with a BR UPDATE trigger causing tuple routing.
> What I want to avoid is repeatedly rerouting the same tuple, but I
> don't think that could happen even without this guard. We've now fixed
> insert tuple routing so that a BR INSERT trigger can't cause the
> partition constraint to be violated (cf. commit
> 15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for
> update tuple routing to trigger additional BR UPDATE triggers.  So I
> don't see the point of checking the constraints twice here.  I think
> what you want to do is get rid of all the changes here and instead
> adjust the logic just before ExecConstraints() to invoke
> ExecPartitionCheck() on the post-trigger version of the tuple.

When I came up with this code, the intention was to make sure BR
UPDATE trigger does not cause tuple routing. But yeah, I can't recall
what made me think that the above changes would be needed to prevent
BR UPDATE trigger from causing tuple routing. With the latest code, it
indeed looks like we can get rid of these changes, and still prevent
that.

BTW, that code was not to avoid repeated re-routing.

Above, you seem to say that there's no problem with BR UPDATE trigger
causing the tuple routing. But, when none of the partition-key columns
are used in UPDATE, we don't set up for update-tuple-routing, so with
no partition-key update, tuple routing will not occur even if BR
UPDATE trigger would have caused UPDATE tuple routing. This is one
restriction we have to live with because we beforehand decide whether
to do the tuple-routing setup based on the columns modified in the
UPDATE query.

>
> Parenthetically, if we decided to keep this logic as you have it, the
> code that sets partition_check_passed and
> partition_check_passed_with_trig_tuple doesn't need to check
> resultRelInfo->ri_PartitionCheck because the surrounding "if" block
> already did.

Yes.

>
> +    for (i = 0; i < num_rels; i++)
> +    {
> +        ResultRelInfo *resultRelInfo = &result_rels[i];
> +        Relation        rel = resultRelInfo->ri_RelationDesc;
> +        Bitmapset     *expr_attrs = NULL;
> +
> +        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
> +
> +        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
> +        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
> +            return true;
> +    }
>
> This seems like an awfully expensive way of performing this test.
> Under what circumstances could this be true for some result relations
> and false for others;

One resultRelinfo can have no partition key column used in its quals,
but the next resultRelinfo can have quite different quals, and these
quals can have partition key referred. This is possible if the two of
them have different parents that have different partition-key columns.

> or in other words, why do we have to loop over all of the result
> relations?  It seems to me that the user has typed something like:
>
> UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
> AND thunk = ...
>
> If either thingy or whatsit is a partitioning column, UPDATE tuple
> routing might be needed

So, in the above code, bms_overlap() would return true if either
thingy or whatsit is a partitioning column.

> - and it should be able to test that by a
> *single* comparison between the set of columns being updated and the
> partitioning columns, without needing to repeat for every partitions.

If bms_overlap() returns true for the very first resultRelinfo, it
will return immediately. But yes, if there are no relations using
partition key, we will have to scan all of these relations. But again,
note that these are pruned leaf partitions, they typically will not
contain all the leaf partitions.

> Perhaps that test needs to be done at plan time and saved in the plan,
> rather than performed here -- or maybe it's easy enough to do it here.

Hmm, it looks convenient here because mtstate->resultRelInfo gets set only here.

>
> One problem is that, if BR UPDATE triggers are in fact allowed to
> cause tuple routing as I proposed above, the presence of a BR UPDATE
> trigger for any partition could necessitate UPDATE tuple routing for
> queries that wouldn't otherwise need it.

You mean always setup update tuple routing if there's a BR UPDATE
trigger ? Actually I was going for disallowing BR update trigger to
initiate tuple routing, as I described above.

> But even if you end up
> inserting a test for that case, it can surely be a lot cheaper than
> this,

I didn't exactly get why the bitmap_overlap() test needs to be
compared with the presence-of-trigger test.

> since it only involves checking a boolean flag, not a bitmapset.
> It could be argue that we ought to prohibit BR UPDATE triggers from
> causing tuple routing so that we don't have to do this test at all,
> but I'm not sure that's a good trade-off.
> It seems to necessitate checking the partition constraint twice per
> tuple instead of once per tuple, which like a very heavy price.

I think I didn't quite understand this paragraph as a whole. Can you
state the trade-off here again ?

>
> +#define GetUpdatedColumns(relinfo, estate) \
> +    (rt_fetch((relinfo)->ri_RangeTableIndex,
> (estate)->es_range_table)->updatedCols)
>
> I think this should be moved to a header file (and maybe turned into a
> static inline function) rather than copy-pasting the definition into a
> new file.

Will do that.

>
> -            List       *mapped_wcoList;
> +            List       *mappedWco;
>              List       *wcoExprs = NIL;
>              ListCell   *ll;
>
> -            /* varno = node->nominalRelation */
> -            mapped_wcoList = map_partition_varattnos(wcoList,
> -                                                     node->nominalRelation,
> -                                                     partrel, rel);
> -            foreach(ll, mapped_wcoList)
> +            mappedWco = map_partition_varattnos(firstWco, firstVarno,
> +                                                partrel, firstResultRel);
> +            foreach(ll, mappedWco)
>              {
>                  WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
>                  ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
> -                                                   plan);
> +                                                   &mtstate->ps);
>
>                  wcoExprs = lappend(wcoExprs, wcoExpr);
>              }
>
> -            resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
> +            resultRelInfo->ri_WithCheckOptions = mappedWco;
>
> Renaming the variable looks fairly pointless, unless I'm missing something?

We are converting from firstWco to mappedWco. So firstWco => mappedWco
looks more natural pairing than firstWco => mapped_wcoList.

And I renamed wcoList to firstWco because I wanted to emphasize that
is the first WCO out of the node->withCheckOptionLists. In the
existing code, it was only for INSERT; withCheckOptionLists was a
single element list, so firstWco name didn't sound suitable, but with
multiple elements, it is essential to have it named firstWco so as to
emphasize that we take the first one irrespective of whether it is
UPDATE or INSERT.

>
> Regarding the tests, it seems like you've got a test case where you
> update a sub-partition and it fails because the tuple would need to be
> moved out of a sub-tree, which is good.  But I think it would also be
> good to have a case where you update a sub-partition and it succeeds
> in moving the tuple within the subtree.  I don't see one like that
> presently; it seems all the others update the topmost root or the
> leaf.  I also think it would be a good idea to make sub_parted's
> column order different from both list_parted and its own children, and
> maybe use a diversity of data types (e.g. int4, int8, text instead of
> making everything int).
>
> +select tableoid::regclass , * from list_parted where a = 2 order by 1;
> +update list_parted set b = c + a where a = 2;
> +select tableoid::regclass , * from list_parted where a = 2 order by 1;
>
> The extra space before the comma looks strange.

Will do the above changes, thanks.

>
> Also, please make a habit of checking patches for whitespace errors
> using git diff --check.
>
> [rhaas pgsql]$ git diff --check
> src/backend/executor/nodeModifyTable.c:384: indent with spaces.
> +                        tuple, &slot);
> src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
> +                IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));
>
> You will notice these kinds of things if you read the diff you are
> submitting before you press send, because git highlights them in
> bright red.  That's a good practice for many other reasons, too.

Yeah, somehow I think I missed these because I must have checked only
the incremental diffs w.r.t. the earlier one where I must have
introduced them. Your point is very much true that we should make it a
habit to check complete patch with --check option, or apply it myself.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



pgsql-hackers by date:

Previous
From: Yugo Nagata
Date:
Subject: [HACKERS] Logical replication launcher never been restarted when terminated
Next
From: Amit Khandekar
Date:
Subject: Re: [HACKERS] UPDATE of partition key