Re: [HACKERS] UPDATE of partition key - Mailing list pgsql-hackers
From: Amit Khandekar
Subject: Re: [HACKERS] UPDATE of partition key
Msg-id: CAJ3gD9d_DukdMW4mMq-qTpXy8bJhJTDRVS4O4Ue952-8PO9wkQ@mail.gmail.com
In response to: Re: [HACKERS] UPDATE of partition key (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 21 June 2017 at 00:23, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed.  But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans.  I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans; others would get pruned.  IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans.  Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
>
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition;

I am not saying that it is guaranteed to be a small subset; I am saying
that it would typically be a small subset in the update-of-partition-key
case.  It seems unusual for a user to cause row movement across multiple
partitions at the same time.
Generally it would be an administrative task where some or all of the
rows of a partition need their partition key updated in a way that moves
them to a different partition, and so there would probably be a WHERE
clause narrowing the update down to that particular partition, because
without it the update is slower anyway and scanning all the other
partitions is redundant.  But point taken: there can always be cases
where a partition-key update involves multiple partitions.

> e.g. the table is partitioned on order number, and you do UPDATE
> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update the order number, so there is no
partition-key update here.  Are you thinking that the patch generates
the per-leaf-partition WCO expressions even for an update that does not
involve the partition key?

> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.

Yeah, I agree that this becomes more and more redundant as the update
involves more partitions.

> Second, it will amount to a functional bug if you get a
> different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the destination
partitions to which tuples are moved, while the WCOs belonging to the
per-subplan resultRelInfos are meant for the resultRelInfos used by the
UPDATE plans.  So it should not matter whether they look the same or
different, because they are fired against different objects (though
those objects can happen to be the same relation).  In any case, it's
not clear to me how the mapped WCO and the planner's WCO would yield a
different answer when they are for the same relation; I am possibly
missing something.  The planner has already generated the
withCheckOptions for each of the resultRelInfos.
We are then using one of those to re-generate the WCO for a leaf
partition by only adjusting the attnos.  If a WCO was already generated
in the planner for that leaf partition (because the partition was
present in mtstate->resultRelInfo), the re-built WCO should look exactly
the same as the earlier one, because it is the same relation, and so the
attnos generated in it would be the same, since the relation's TupleDesc
is the same.

> Note this comment in the existing code:
>
> /*
>  * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>  * that we didn't build the withCheckOptionList for each partition within
>  * the planner, but simple translation of the varattnos for each partition
>  * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>  * cases are handled above.
>  */
>
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the comment in your patch, but not the associated code: your
> updated comment says that only "DELETEs and local UPDATES are handled
> above", but in reality, *all* updates are still handled above.  And

Actually I meant: "the above works only for local updates; for
row-movement updates, we need per-leaf-partition WCOs, because when the
row is inserted into the target partition, that partition may not be
included in the planner resultRelInfos above, so we need WCOs for all
partitions".  I think this comment should be sufficient if I add it in
the code?

> then they are handled again here.
> Similarly for returning lists.
> It's certainly not OK for the comment to be inaccurate, but I think
> it's also bad to redo the work which the planner has already done,
> even if it makes the patch smaller.
>
> Also, I feel like it's probably not correct to use the first result
> relation as the nominal relation for building WCOs and returning lists
> anyway.
> I mean, if the first result relation has a different column
> order than the parent relation, isn't this just broken?  If it works
> for some reason, the comments don't explain what that reason is.

Not sure why the parent relation should come into the picture.  As long
as the first result relation belongs to one of the partitions in the
whole partition tree, we should be able to use it to build the WCOs of
any other partition, because the partitions have a common set of
attributes with the same names.  So we are bound to find each of the
attributes of the first resultRelInfo in the other leaf partitions
during attno mapping.

> Some other things:
>
> + * The row was already deleted by a concurrent DELETE. So we don't
> + * have anything to update.
>
> I find this explanation, and the surrounding comments, inadequate.  It
> doesn't really explain why we're doing this.  I think it should say
> something like this: For a normal UPDATE, the case where the tuple has
> been the subject of a concurrent UPDATE or DELETE would be handled by
> the EvalPlanQual machinery, but for an UPDATE that we've translated
> into a DELETE from this partition and an INSERT into some other
> partition, that's not available, because CTID chains can't span
> relation boundaries.  We mimic the semantics to a limited extent by
> skipping the INSERT if the DELETE fails to find a tuple.  This ensures
> that two concurrent attempts to UPDATE the same tuple at the same time
> can't turn one tuple into two, and that an UPDATE of a just-deleted
> tuple can't resurrect it.

Thanks, will put that comment in the next patch.
>
> +    bool partition_check_passed_with_trig_tuple;
> +
> +    partition_check_passed =
> +        (resultRelInfo->ri_PartitionCheck &&
> +         ExecPartitionCheck(resultRelInfo, slot, estate));
> +
> +    partition_check_passed_with_trig_tuple =
> +        (resultRelInfo->ri_PartitionCheck &&
> +         ExecPartitionCheck(resultRelInfo, trig_slot, estate));
> +    if (partition_check_passed)
> +    {
> +        /*
> +         * If it's the trigger that is causing partition constraint
> +         * violation, abort.  We don't want a trigger to cause tuple
> +         * routing.
> +         */
> +        if (!partition_check_passed_with_trig_tuple)
> +            ExecPartitionCheckEmitError(resultRelInfo,
> +                                        trig_slot, estate);
> +    }
> +    else
> +    {
> +        /*
> +         * Partition constraint failed with original NEW tuple.  But the
> +         * trigger might even have modified the tuple such that it fits
> +         * back into the partition.  So the partition constraint check
> +         * should be based on the *final* NEW tuple.
> +         */
> +        partition_check_passed =
> +            partition_check_passed_with_trig_tuple;
> +    }
>
> Maybe I inadvertently gave the contrary impression in some prior
> review, but this logic doesn't seem right to me.  I don't think
> there's any problem with a BR UPDATE trigger causing tuple routing.
> What I want to avoid is repeatedly rerouting the same tuple, but I
> don't think that could happen even without this guard.  We've now fixed
> insert tuple routing so that a BR INSERT trigger can't cause the
> partition constraint to be violated (cf. commit
> 15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for
> update tuple routing to trigger additional BR UPDATE triggers.  So I
> don't see the point of checking the constraints twice here.  I think
> what you want to do is get rid of all the changes here and instead
> adjust the logic just before ExecConstraints() to invoke
> ExecPartitionCheck() on the post-trigger version of the tuple.

When I came up with this code, the intention was to make sure a BR
UPDATE trigger does not cause tuple routing.
But yeah, I can't recall what made me think that the above changes would
be needed to prevent a BR UPDATE trigger from causing tuple routing.
With the latest code, it indeed looks like we can get rid of these
changes and still prevent that.  BTW, that code was not there to avoid
repeated re-routing.

Above, you seem to say that there's no problem with a BR UPDATE trigger
causing tuple routing.  But when none of the partition-key columns are
used in the UPDATE, we don't set up update tuple routing at all, so with
no partition-key update, tuple routing will not occur even if a BR
UPDATE trigger would have caused it.  This is one restriction we have to
live with, because we decide beforehand whether to do the tuple-routing
setup based on the columns modified in the UPDATE query.

> Parenthetically, if we decided to keep this logic as you have it, the
> code that sets partition_check_passed and
> partition_check_passed_with_trig_tuple doesn't need to check
> resultRelInfo->ri_PartitionCheck because the surrounding "if" block
> already did.

Yes.

> +    for (i = 0; i < num_rels; i++)
> +    {
> +        ResultRelInfo *resultRelInfo = &result_rels[i];
> +        Relation    rel = resultRelInfo->ri_RelationDesc;
> +        Bitmapset  *expr_attrs = NULL;
> +
> +        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
> +
> +        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
> +        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
> +            return true;
> +    }
>
> This seems like an awfully expensive way of performing this test.
> Under what circumstances could this be true for some result relations
> and false for others;

One resultRelInfo can have no partition-key column used in its quals,
while the next resultRelInfo can have quite different quals in which a
partition-key column is referred to.  This is possible when the two of
them have different parents that have different partition-key columns.
> or in other words, why do we have to loop over all of the result
> relations?  It seems to me that the user has typed something like:
>
> UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
> AND thunk = ...
>
> If either thingy or whatsit is a partitioning column, UPDATE tuple
> routing might be needed

So, in the above code, bms_overlap() would return true if either thingy
or whatsit is a partitioning column.

> - and it should be able to test that by a
> *single* comparison between the set of columns being updated and the
> partitioning columns, without needing to repeat for every partition.

If bms_overlap() returns true for the very first resultRelInfo, the
function returns immediately.  But yes, if no relation uses a partition
key, we will have to scan all of these relations.  Note, though, that
these are the pruned leaf partitions; they typically will not include
all the leaf partitions.

> Perhaps that test needs to be done at plan time and saved in the plan,
> rather than performed here -- or maybe it's easy enough to do it here.

Hmm, it looks convenient to do it here, because mtstate->resultRelInfo
gets set only here.

> One problem is that, if BR UPDATE triggers are in fact allowed to
> cause tuple routing as I proposed above, the presence of a BR UPDATE
> trigger for any partition could necessitate UPDATE tuple routing for
> queries that wouldn't otherwise need it.

You mean always set up update tuple routing if there's a BR UPDATE
trigger?  Actually, I was going to disallow a BR UPDATE trigger from
initiating tuple routing, as I described above.

> But even if you end up
> inserting a test for that case, it can surely be a lot cheaper than
> this,

I didn't exactly get why the bms_overlap() test needs to be compared
with the presence-of-trigger test.

> since it only involves checking a boolean flag, not a bitmapset.
> It could be argued that we ought to prohibit BR UPDATE triggers from
> causing tuple routing so that we don't have to do this test at all,
> but I'm not sure that's a good trade-off.
> It seems to necessitate checking the partition constraint twice per
> tuple instead of once per tuple, which seems like a very heavy price.

I think I didn't quite understand this paragraph as a whole.  Can you
state the trade-off here again?

> +#define GetUpdatedColumns(relinfo, estate) \
> +    (rt_fetch((relinfo)->ri_RangeTableIndex, \
> +              (estate)->es_range_table)->updatedCols)
>
> I think this should be moved to a header file (and maybe turned into a
> static inline function) rather than copy-pasting the definition into a
> new file.

Will do that.

> -        List       *mapped_wcoList;
> +        List       *mappedWco;
>          List       *wcoExprs = NIL;
>          ListCell   *ll;
>
> -        /* varno = node->nominalRelation */
> -        mapped_wcoList = map_partition_varattnos(wcoList,
> -                                                 node->nominalRelation,
> -                                                 partrel, rel);
> -        foreach(ll, mapped_wcoList)
> +        mappedWco = map_partition_varattnos(firstWco, firstVarno,
> +                                            partrel, firstResultRel);
> +        foreach(ll, mappedWco)
>          {
>              WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
>              ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
> -                                               plan);
> +                                               &mtstate->ps);
>
>              wcoExprs = lappend(wcoExprs, wcoExpr);
>          }
>
> -        resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
> +        resultRelInfo->ri_WithCheckOptions = mappedWco;
>
> Renaming the variable looks fairly pointless, unless I'm missing something?

We are converting from firstWco to mappedWco, so firstWco => mappedWco
looks like a more natural pairing than firstWco => mapped_wcoList.  And
I renamed wcoList to firstWco because I wanted to emphasize that it is
the first WCO out of node->withCheckOptionLists.
In the existing code, which was for INSERT only, withCheckOptionLists
was a single-element list, so the name firstWco didn't sound suitable;
but with multiple elements, it is worth naming it firstWco so as to
emphasize that we take the first one irrespective of whether it is an
UPDATE or an INSERT.

> Regarding the tests, it seems like you've got a test case where you
> update a sub-partition and it fails because the tuple would need to be
> moved out of a sub-tree, which is good.  But I think it would also be
> good to have a case where you update a sub-partition and it succeeds
> in moving the tuple within the subtree.  I don't see one like that
> presently; it seems all the others update the topmost root or the
> leaf.  I also think it would be a good idea to make sub_parted's
> column order different from both list_parted and its own children, and
> maybe use a diversity of data types (e.g. int4, int8, text instead of
> making everything int).
>
> +select tableoid::regclass , * from list_parted where a = 2 order by 1;
> +update list_parted set b = c + a where a = 2;
> +select tableoid::regclass , * from list_parted where a = 2 order by 1;
>
> The extra space before the comma looks strange.

Will do the above changes, thanks.

> Also, please make a habit of checking patches for whitespace errors
> using git diff --check.
>
> [rhaas pgsql]$ git diff --check
> src/backend/executor/nodeModifyTable.c:384: indent with spaces.
> +                                        tuple, &slot);
> src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
> +            IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));
>
> You will notice these kinds of things if you read the diff you are
> submitting before you press send, because git highlights them in
> bright red.  That's a good practice for many other reasons, too.

Yeah, I think I missed these because I must have checked only the
incremental diffs with respect to an earlier version in which I had
introduced them.
Your point is very much true: we should make it a habit to check the
complete patch with the --check option, or to apply it myself.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company