Re: [HACKERS] UPDATE of partition key - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: [HACKERS] UPDATE of partition key
Msg-id: CA+TgmoYBJS-OPRjEcwjQj+XdqWNQ-C6LJHpMggsrh2jb+-KDeg@mail.gmail.com
In response to: Re: [HACKERS] UPDATE of partition key (Amit Khandekar <amitdkhan.pg@gmail.com>)
List: pgsql-hackers
On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> I guess I don't see why it should work like this.  In the INSERT case,
>> we must build withCheckOption objects for each partition because those
>> partitions don't appear in the plan otherwise -- but in the UPDATE
>> case, they're already there, so why do we need to build anything at
>> all?  Similarly for RETURNING projections.  How are the things we need
>> for those cases not already getting built, associated with the
>> relevant resultRelInfos?  Maybe there's a concern if some children got
>> pruned - they could turn out later to be the children into which
>> tuples need to be routed.  But the patch makes no distinction
>> between possibly-pruned children and any others.
>
> Yes, only a subset of the partitions appear in the UPDATE subplans.  I
> think typically for updates, a very small subset of the total leaf
> partitions will be there in the plans, others would get pruned.  IMHO,
> it would not be worth having an optimization where it opens only those
> leaf partitions which are not already there in the subplans.  Without
> the optimization, we are able to re-use the INSERT infrastructure
> without additional changes.

Well, that is possible, but certainly not guaranteed.  I mean, somebody
could do a whole-table UPDATE, or an UPDATE that hits a smattering of
rows in every partition; e.g. the table is partitioned on order number,
and you do UPDATE lineitem SET product_code = 'K372B' WHERE
product_code = 'K372'.

Leaving that aside, the point here is that you're rebuilding
withCheckOptions and returningLists that have already been built in the
planner.  That's bad for two reasons.  First, it's inefficient,
especially if there are many partitions.  Second, it will amount to a
functional bug if you get a different answer than the planner did.
Note this comment in the existing code:

    /*
     * Build WITH CHECK OPTION constraints for each leaf partition rel.
     * Note that we didn't build the withCheckOptionList for each
     * partition within the planner, but simple translation of the
     * varattnos for each partition will suffice.  This only occurs
     * for the INSERT case; UPDATE/DELETE cases are handled above.
     */

The comment "UPDATE/DELETE cases are handled above" is referring to the
code that initializes the WCOs generated by the planner.  You've
modified the comment in your patch, but not the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above.  And
then they are handled again here.  Similarly for returning lists.  It's
certainly not OK for the comment to be inaccurate, but I think it's
also bad to redo the work which the planner has already done, even if
it makes the patch smaller.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway.  I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken?  If it works
for some reason, the comments don't explain what that reason is.

>> ... I don't understand how you can *not* need a per-leaf-partition
>> mapping.  I mean, maybe you only need the mapping for the *unpruned*
>> leaf partitions
>
> Yes, we need the mapping only for the unpruned leaf partitions, and
> those partitions are available in the per-subplan resultRelInfo's.

OK.

>> but you certainly need a separate mapping for each one of those.
>
> You mean *each* of the leaf partitions ?  I didn't get why we would
> need it for each one.  The tuple targeted for update belongs to one of
> the per-subplan resultInfos.  And this tuple is to be routed to
> another leaf partition.  So the reverse mapping is for conversion from
> the source resultRelinfo to the root partition.
> I am unable to figure out a scenario where we would require this
> reverse mapping for partitions on which UPDATE is *not* going to be
> executed.

I agree - the reverse mapping is only needed for the partitions in
which UPDATE will be executed.

Some other things:

+         * The row was already deleted by a concurrent DELETE. So we don't
+         * have anything to update.

I find this explanation, and the surrounding comments, inadequate.  It
doesn't really explain why we're doing this.  I think it should say
something like this: For a normal UPDATE, the case where the tuple has
been the subject of a concurrent UPDATE or DELETE would be handled by
the EvalPlanQual machinery, but for an UPDATE that we've translated
into a DELETE from this partition and an INSERT into some other
partition, that's not available, because CTID chains can't span
relation boundaries.  We mimic the semantics to a limited extent by
skipping the INSERT if the DELETE fails to find a tuple.  This ensures
that two concurrent attempts to UPDATE the same tuple at the same time
can't turn one tuple into two, and that an UPDATE of a just-deleted
tuple can't resurrect it.

+       bool        partition_check_passed_with_trig_tuple;
+
+       partition_check_passed =
+           (resultRelInfo->ri_PartitionCheck &&
+            ExecPartitionCheck(resultRelInfo, slot, estate));
+
+       partition_check_passed_with_trig_tuple =
+           (resultRelInfo->ri_PartitionCheck &&
+            ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+
+       if (partition_check_passed)
+       {
+           /*
+            * If it's the trigger that is causing partition constraint
+            * violation, abort. We don't want a trigger to cause tuple
+            * routing.
+            */
+           if (!partition_check_passed_with_trig_tuple)
+               ExecPartitionCheckEmitError(resultRelInfo,
+                                           trig_slot, estate);
+       }
+       else
+       {
+           /*
+            * Partition constraint failed with original NEW tuple. But the
+            * trigger might even have modifed the tuple such that it fits
+            * back into the partition. So partition constraint check
+            * should be based on *final* NEW tuple.
+            */
+           partition_check_passed = partition_check_passed_with_trig_tuple;
+       }

Maybe I inadvertently gave the contrary impression in some prior
review, but this logic doesn't seem right to me.  I don't think there's
any problem with a BR UPDATE trigger causing tuple routing.  What I
want to avoid is repeatedly rerouting the same tuple, but I don't think
that could happen even without this guard.  We've now fixed insert
tuple routing so that a BR INSERT trigger can't cause the partition
constraint to be violated (cf. commit
15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for update
tuple routing to trigger additional BR UPDATE triggers.  So I don't see
the point of checking the constraints twice here.  I think what you
want to do is get rid of all the changes here and instead adjust the
logic just before ExecConstraints() to invoke ExecPartitionCheck() on
the post-trigger version of the tuple.

Parenthetically, if we decided to keep this logic as you have it, the
code that sets partition_check_passed and
partition_check_passed_with_trig_tuple doesn't need to check
resultRelInfo->ri_PartitionCheck because the surrounding "if" block
already did.

+   for (i = 0; i < num_rels; i++)
+   {
+       ResultRelInfo *resultRelInfo = &result_rels[i];
+       Relation    rel = resultRelInfo->ri_RelationDesc;
+       Bitmapset  *expr_attrs = NULL;
+
+       pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+       /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+       if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+           return true;
+   }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others; or in other words, why do we have to loop over
all of the result relations?  It seems to me that the user has typed
something like:

UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
AND thunk = ...
If either thingy or whatsit is a partitioning column, UPDATE tuple
routing might be needed - and it should be able to test that by a
*single* comparison between the set of columns being updated and the
partitioning columns, without needing to repeat for every partition.
Perhaps that test needs to be done at plan time and saved in the plan,
rather than performed here -- or maybe it's easy enough to do it here.
One problem is that, if BR UPDATE triggers are in fact allowed to cause
tuple routing as I proposed above, the presence of a BR UPDATE trigger
for any partition could necessitate UPDATE tuple routing for queries
that wouldn't otherwise need it.  But even if you end up inserting a
test for that case, it can surely be a lot cheaper than this, since it
only involves checking a boolean flag, not a bitmapset.  It could be
argued that we ought to prohibit BR UPDATE triggers from causing tuple
routing so that we don't have to do this test at all, but I'm not sure
that's a good trade-off.  It seems to necessitate checking the
partition constraint twice per tuple instead of once per tuple, which
seems like a very heavy price.

+#define GetUpdatedColumns(relinfo, estate) \
+   (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)

I think this should be moved to a header file (and maybe turned into a
static inline function) rather than copy-pasting the definition into a
new file.
-       List       *mapped_wcoList;
+       List       *mappedWco;
        List       *wcoExprs = NIL;
        ListCell   *ll;

-       /* varno = node->nominalRelation */
-       mapped_wcoList = map_partition_varattnos(wcoList,
-                                                node->nominalRelation,
-                                                partrel, rel);
-       foreach(ll, mapped_wcoList)
+       mappedWco = map_partition_varattnos(firstWco, firstVarno,
+                                           partrel, firstResultRel);
+       foreach(ll, mappedWco)
        {
            WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
            ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-                                              plan);
+                                              &mtstate->ps);

            wcoExprs = lappend(wcoExprs, wcoExpr);
        }

-       resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+       resultRelInfo->ri_WithCheckOptions = mappedWco;

Renaming the variable looks fairly pointless, unless I'm missing
something?

Regarding the tests, it seems like you've got a test case where you
update a sub-partition and it fails because the tuple would need to be
moved out of a sub-tree, which is good.  But I think it would also be
good to have a case where you update a sub-partition and it succeeds
in moving the tuple within the subtree.  I don't see one like that
presently; it seems all the others update the topmost root or the
leaf.  I also think it would be a good idea to make sub_parted's
column order different from both list_parted and its own children, and
maybe use a diversity of data types (e.g. int4, int8, text instead of
making everything int).

+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;

The extra space before the comma looks strange.  Also, please make a
habit of checking patches for whitespace errors using git diff --check.

[rhaas pgsql]$ git diff --check
src/backend/executor/nodeModifyTable.c:384: indent with spaces.
+                    tuple, &slot);
src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
+       IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));

You will notice these kinds of things if you read the diff you are
submitting before you press send, because git highlights them in
bright red.  That's a good practice for many other reasons, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company