Thread: [HACKERS] UPDATE of partition key
Currently, an update of a partition key of a partition is not allowed, since it requires moving the row(s) into the applicable partition.

Attached is a WIP patch (update-partition-key.patch) that removes this restriction. When an UPDATE causes the row of a partition to violate its partition constraint, then a partition is searched in that subtree that can accommodate this row, and if found, the row is deleted from the old partition and inserted in the new partition. If not found, an error is reported.

There are a few things that can be discussed:

1. We can run an UPDATE using a child partition at any level in a nested partition tree. In such case, we should move the row only within that child subtree.

For e.g., in a tree such as:

tab ->
    t1 ->
        t1_1
        t1_2
    t2 ->
        t2_1
        t2_2

For "UPDATE t2 set col1 = 'AAA'", if the modified tuple does not fit in t2_1 but can fit in t1_1, it should not be moved to t1_1, because the UPDATE is fired using t2.

2. In the patch, as part of the row movement, ExecDelete() is called followed by ExecInsert(). This is done that way, because we want to have the ROW triggers on that (sub)partition executed. If a user has explicitly created DELETE and INSERT BR triggers for this partition, I think we should run those. While at the same time, another question is, what about UPDATE trigger on the same table ? Here again, one can argue that because this UPDATE has been transformed into a DELETE-INSERT, we should not run UPDATE trigger for row-movement. But there can be a counter-argument. For e.g. if a user needs to make sure about logging updates of particular columns of a row, he will expect the logging to happen even when that row was transparently moved. In the patch, I have retained the firing of UPDATE BR trigger.

3. In case of a concurrent update/delete, suppose session A has locked the row for deleting it. Now a session B has decided to update this row and that is going to cause row movement, which means it will delete it first. But when session A is finished deleting it, session B finds that it is already deleted. In such case, it should not go ahead with inserting a new row as part of the row movement. For that, I have added a new parameter 'already_delete' for ExecDelete().

Of course, this still won't completely solve the concurrency anomaly. In the above case, the UPDATE of Session B gets lost. Maybe, for a user that does not tolerate this, we can have a table-level option that disallows row movement, or will cause an error to be thrown for one of the concurrent sessions.

4. The ExecSetupPartitionTupleRouting() is re-used for routing the row that is to be moved. So in ExecInitModifyTable(), we call ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this only during execution time for the very first time we find that we need to do a row movement. I will think over that, but I am thinking it might complicate things, as compared to always doing the setup for UPDATE. Will check on that.

5. Regarding performance testing, I have compared the results of row-movement with partition versus row-movement with inheritance tree using triggers.
Below are the details :

Schema :

CREATE TABLE ptab (a date, b int, c int);

CREATE TABLE ptab (a date, b int, c int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab for values from ('1900-01-01', 1) to ('1900-01-01', 101) PARTITION BY range (c);
CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1 for values from (1) to (51);
CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1 for values from (51) to (101);
.....
.....
CREATE TABLE ptab_1_1_n PARTITION OF ptab_1_1 for values from (n) to (n+m);
......
......
CREATE TABLE ptab_5_n PARTITION OF ptab for values from ('1905-01-01', 101) to ('1905-01-01', 201) PARTITION BY range (c);
CREATE TABLE ptab_1_2_1 PARTITION OF ptab_1_2 for values from (1) to (51);
CREATE TABLE ptab_1_2_2 PARTITION OF ptab_1_2 for values from (51) to (101);
.....
.....
CREATE TABLE ptab_1_2_n PARTITION OF ptab_1_2 for values from (n) to (n+m);
.....
.....

Similarly for inheritance :

CREATE TABLE ptab_1_1 (constraint check_ptab_1_1 check (a = '1900-01-01' and b >= 1 and b < 8)) inherits (ptab);
create trigger brutrig_ptab_1_1 before update on ptab_1_1 for each row execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_1 (constraint check_ptab_1_1_1 check (c >= 1 and c < 51)) inherits (ptab_1_1);
create trigger brutrig_ptab_1_1_1 before update on ptab_1_1_1 for each row execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_2 (constraint check_ptab_1_1_2 check (c >= 51 and c < 101)) inherits (ptab_1_1);
create trigger brutrig_ptab_1_1_2 before update on ptab_1_1_2 for each row execute procedure ptab_upd_trig();

I had to have a BR UPDATE trigger on each of the leaf tables. Attached is the BR trigger function update_trigger.sql. There it generates the table name assuming a fixed pattern of distribution of data over the partitions. It first deletes the row and then inserts a new one. I also skipped the deletion part, and it did not show any significant change in results.

parts   partitioned    inheritance      no. of rows   subpartitions
=====   ===========    ===========      ===========   =============
 500    10 sec         3 min 02 sec     1,000,000     0
1000    10 sec         3 min 05 sec     1,000,000     0
1000    1 min 38 sec   30 min 50 sec    10,000,000    0
4000    28 sec         5 min 41 sec     1,000,000     10

parts : total number of partitions including subpartitions if any.
partitioned : Partitions created using declarative syntax.
inheritance : Partitions created using inheritance, check constraints and insert/update triggers.
subpartitions : Number of subpartitions for each partition (in a 2-level tree)

Overall the UPDATE in partitions is faster by 10-20 times compared with inheritance with triggers.

The UPDATE query moved all of the rows into another partition. It was something like this :

update ptab set a = '1949-01-1' where a <= '1924-01-01'

For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and with 10,000,000 rows, it took 1 min 32 sec. In general, for both partitioned and inheritance tables, the time taken rose linearly with the number of rows.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
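The attached update_trigger.sql is not reproduced inline; a minimal sketch of the kind of BR UPDATE trigger used for the inheritance comparison might look like the following. This is hypothetical and simplified: it hard-codes one source leaf and two destination leaves instead of generating the table name from the data-distribution pattern as the attached function does, and it assumes (a, b, c) identifies the row.

create or replace function ptab_upd_trig() returns trigger as $$
begin
    -- emulate row movement: remove the old version from this child and
    -- insert the new version into whichever child its new value of c fits
    delete from only ptab_1_1_1 where a = old.a and b = old.b and c = old.c;
    if new.c < 51 then
        insert into ptab_1_1_1 values (new.a, new.b, new.c);
    else
        insert into ptab_1_1_2 values (new.a, new.b, new.c);
    end if;
    return null;   -- suppress the original UPDATE on this child
end;
$$ language plpgsql;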
On Mon, Feb 13, 2017 at 7:01 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> parts   partitioned    inheritance      no. of rows   subpartitions
> =====   ===========    ===========      ===========   =============
>  500    10 sec         3 min 02 sec     1,000,000     0
> 1000    10 sec         3 min 05 sec     1,000,000     0
> 1000    1 min 38 sec   30 min 50 sec    10,000,000    0
> 4000    28 sec         5 min 41 sec     1,000,000     10

That's a big speedup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote: > Currently, an update of a partition key of a partition is not > allowed, since it requires to move the row(s) into the applicable > partition. > > Attached is a WIP patch (update-partition-key.patch) that removes > this restriction. When an UPDATE causes the row of a partition to > violate its partition constraint, then a partition is searched in > that subtree that can accommodate this row, and if found, the row is > deleted from the old partition and inserted in the new partition. If > not found, an error is reported. This is great! Would it be really invasive to HINT something when the subtree is a proper subtree? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote: > On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote: >> Currently, an update of a partition key of a partition is not >> allowed, since it requires to move the row(s) into the applicable >> partition. >> >> Attached is a WIP patch (update-partition-key.patch) that removes >> this restriction. When an UPDATE causes the row of a partition to >> violate its partition constraint, then a partition is searched in >> that subtree that can accommodate this row, and if found, the row is >> deleted from the old partition and inserted in the new partition. If >> not found, an error is reported. > > This is great! > > Would it be really invasive to HINT something when the subtree is a > proper subtree? I am not quite sure I understood this question. Can you please explain it a bit more ...
On Wed, Feb 15, 2017 at 01:06:32PM +0530, Amit Khandekar wrote: > On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote: > > On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote: > >> Currently, an update of a partition key of a partition is not > >> allowed, since it requires to move the row(s) into the applicable > >> partition. > >> > >> Attached is a WIP patch (update-partition-key.patch) that removes > >> this restriction. When an UPDATE causes the row of a partition to > >> violate its partition constraint, then a partition is searched in > >> that subtree that can accommodate this row, and if found, the row > >> is deleted from the old partition and inserted in the new > >> partition. If not found, an error is reported. > > > > This is great! > > > > Would it be really invasive to HINT something when the subtree is > > a proper subtree? > > I am not quite sure I understood this question. Can you please > explain it a bit more ... Sorry. When an UPDATE can't happen, there are often ways to hint at what went wrong and how to correct it. Violating a uniqueness constraint would be one example. When an UPDATE can't happen and the depth of the subtree is a plausible candidate for what prevents it, there might be a way to say so. Let's imagine a table called log with partitions on "stamp" log_YYYY and subpartitions, also on "stamp", log_YYYYMM. If you do something like UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ... it's possible to know that it might have worked had the UPDATE taken place on log rather than on log_2017. Does that make sense, and if so, is it super invasive to HINT that? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
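To make that layout concrete, a hypothetical sketch of the schema David describes (names, bounds and columns invented for illustration) would be:

CREATE TABLE log (stamp timestamptz NOT NULL, msg text) PARTITION BY RANGE (stamp);

CREATE TABLE log_2016 PARTITION OF log
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01') PARTITION BY RANGE (stamp);
CREATE TABLE log_2017 PARTITION OF log
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01') PARTITION BY RANGE (stamp);
CREATE TABLE log_201611 PARTITION OF log_2016
    FOR VALUES FROM ('2016-11-01') TO ('2016-12-01');
CREATE TABLE log_201701 PARTITION OF log_2017
    FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');

-- Errors even with the patch, because the new stamp falls outside the
-- subtree rooted at log_2017; this is the case where a HINT could point
-- the user at the root table:
UPDATE log_2017 SET stamp = '2016-11-08 23:03:00' WHERE msg = 'x';

-- Works with the patch, because the row can be routed anywhere under log
-- (here it would move into log_201611):
UPDATE log SET stamp = '2016-11-08 23:03:00' WHERE msg = 'x';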
On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote: > When an UPDATE can't happen, there are often ways to hint at > what went wrong and how to correct it. Violating a uniqueness > constraint would be one example. > > When an UPDATE can't happen and the depth of the subtree is a > plausible candidate for what prevents it, there might be a way to say > so. > > Let's imagine a table called log with partitions on "stamp" log_YYYY > and subpartitions, also on "stamp", log_YYYYMM. If you do something > like > > UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ... > > it's possible to know that it might have worked had the UPDATE taken > place on log rather than on log_2017. > > Does that make sense, and if so, is it super invasive to HINT that? Yeah, I think it should be possible to find the root partition with the help of pg_partitioned_table, and then run ExecFindPartition() again using the root. Will check. I am not sure right now how involved that would turn out to be, but I think that logic would not change the existing code, so in that sense it is not invasive. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2017/02/16 15:50, Amit Khandekar wrote: > On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote: >> When an UPDATE can't happen, there are often ways to hint at >> what went wrong and how to correct it. Violating a uniqueness >> constraint would be one example. >> >> When an UPDATE can't happen and the depth of the subtree is a >> plausible candidate for what prevents it, there might be a way to say >> so. >> >> Let's imagine a table called log with partitions on "stamp" log_YYYY >> and subpartitions, also on "stamp", log_YYYYMM. If you do something >> like >> >> UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ... >> >> it's possible to know that it might have worked had the UPDATE taken >> place on log rather than on log_2017. >> >> Does that make sense, and if so, is it super invasive to HINT that? > > Yeah, I think it should be possible to find the root partition with I assume you mean root *partitioned* table. > the help of pg_partitioned_table, The pg_partitioned_table catalog does not store parent-child relationships, just information about the partition key of a table. To get the root partitioned table, you might want to create a recursive version of get_partition_parent(), maybe called get_partition_root_parent(). By the way, get_partition_parent() scans pg_inherits to find the inheritance parent. > and then run ExecFindPartition() > again using the root. Will check. I am not sure right now how involved > that would turn out to be, but I think that logic would not change the > existing code, so in that sense it is not invasive. I couldn't understand why run ExecFindPartition() again on the root partitioned table, can you clarify? ISTM, we just want to tell the user in the HINT that trying the same update query with root partitioned table might work. I'm not sure if it would work instead to find some intermediate partitioned table (that is, between the root and the one that update query was tried with) to include in the HINT. Thanks, Amit
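For reference, the catalog walk such a get_partition_root_parent() would perform corresponds to a recursive lookup in pg_inherits. A rough SQL equivalent (just an illustration, not what the patch would do in C), using the hypothetical log_201611 leaf from the earlier example:

WITH RECURSIVE walk(relid) AS (
    SELECT 'log_201611'::regclass::oid
  UNION ALL
    SELECT i.inhparent
    FROM pg_inherits i
    JOIN walk w ON i.inhrelid = w.relid
)
SELECT w.relid::regclass AS root_parent
FROM walk w
WHERE NOT EXISTS (SELECT 1 FROM pg_inherits i WHERE i.inhrelid = w.relid);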
On 16 February 2017 at 12:57, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/02/16 15:50, Amit Khandekar wrote: >> On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote: >>> When an UPDATE can't happen, there are often ways to hint at >>> what went wrong and how to correct it. Violating a uniqueness >>> constraint would be one example. >>> >>> When an UPDATE can't happen and the depth of the subtree is a >>> plausible candidate for what prevents it, there might be a way to say >>> so. >>> >>> Let's imagine a table called log with partitions on "stamp" log_YYYY >>> and subpartitions, also on "stamp", log_YYYYMM. If you do something >>> like >>> >>> UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ... >>> >>> it's possible to know that it might have worked had the UPDATE taken >>> place on log rather than on log_2017. >>> >>> Does that make sense, and if so, is it super invasive to HINT that? >> >> Yeah, I think it should be possible to find the root partition with > > I assume you mean root *partitioned* table. > >> the help of pg_partitioned_table, > > The pg_partitioned_table catalog does not store parent-child > relationships, just information about the partition key of a table. To > get the root partitioned table, you might want to create a recursive > version of get_partition_parent(), maybe called > get_partition_root_parent(). By the way, get_partition_parent() scans > pg_inherits to find the inheritance parent. Yeah. But we also want to make sure that it's a part of declarative partition tree, and not just an inheritance tree ? I am not sure whether it is currently possible to have a mix of these two. May be it is easy to prevent that from happening. > >> and then run ExecFindPartition() >> again using the root. Will check. I am not sure right now how involved >> that would turn out to be, but I think that logic would not change the >> existing code, so in that sense it is not invasive. > > I couldn't understand why run ExecFindPartition() again on the root > partitioned table, can you clarify? ISTM, we just want to tell the user > in the HINT that trying the same update query with root partitioned table > might work. I'm not sure if it would work instead to find some > intermediate partitioned table (that is, between the root and the one that > update query was tried with) to include in the HINT. What I had in mind was : Give that hint only if there *was* a subpartition that could accommodate that row. And if found, we can only include the subpartition name. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2017/02/16 17:55, Amit Khandekar wrote: > On 16 February 2017 at 12:57, Amit Langote wrote: >> On 2017/02/16 15:50, Amit Khandekar wrote: >>> On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote: >>>> Does that make sense, and if so, is it super invasive to HINT that? >>> >>> Yeah, I think it should be possible to find the root partition with >> >> I assume you mean root *partitioned* table. >> >>> the help of pg_partitioned_table, >> >> The pg_partitioned_table catalog does not store parent-child >> relationships, just information about the partition key of a table. To >> get the root partitioned table, you might want to create a recursive >> version of get_partition_parent(), maybe called >> get_partition_root_parent(). By the way, get_partition_parent() scans >> pg_inherits to find the inheritance parent. > > Yeah. But we also want to make sure that it's a part of declarative > partition tree, and not just an inheritance tree ? I am not sure > whether it is currently possible to have a mix of these two. May be it > is easy to prevent that from happening. It is not possible to mix declarative partitioning and regular inheritance. So, you cannot have a table in a declarative partitioning tree that is not a (sub-) partition of the root table. >>> and then run ExecFindPartition() >>> again using the root. Will check. I am not sure right now how involved >>> that would turn out to be, but I think that logic would not change the >>> existing code, so in that sense it is not invasive. >> >> I couldn't understand why run ExecFindPartition() again on the root >> partitioned table, can you clarify? ISTM, we just want to tell the user >> in the HINT that trying the same update query with root partitioned table >> might work. I'm not sure if it would work instead to find some >> intermediate partitioned table (that is, between the root and the one that >> update query was tried with) to include in the HINT. > > What I had in mind was : Give that hint only if there *was* a > subpartition that could accommodate that row. And if found, we can > only include the subpartition name. Asking to try the update query with the root table sounds like a good enough hint. Trying to find the exact sub-partition (I assume you mean to imply sub-tree here) seems like an overkill, IMHO. Thanks, Amit
On 16 February 2017 at 14:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/02/16 17:55, Amit Khandekar wrote: >> On 16 February 2017 at 12:57, Amit Langote wrote: >>> On 2017/02/16 15:50, Amit Khandekar wrote: >>>> On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote: >>>>> Does that make sense, and if so, is it super invasive to HINT that? >>>> >>>> Yeah, I think it should be possible to find the root partition with >>> >>> I assume you mean root *partitioned* table. >>> >>>> the help of pg_partitioned_table, >>> >>> The pg_partitioned_table catalog does not store parent-child >>> relationships, just information about the partition key of a table. To >>> get the root partitioned table, you might want to create a recursive >>> version of get_partition_parent(), maybe called >>> get_partition_root_parent(). By the way, get_partition_parent() scans >>> pg_inherits to find the inheritance parent. >> >> Yeah. But we also want to make sure that it's a part of declarative >> partition tree, and not just an inheritance tree ? I am not sure >> whether it is currently possible to have a mix of these two. May be it >> is easy to prevent that from happening. > > It is not possible to mix declarative partitioning and regular > inheritance. So, you cannot have a table in a declarative partitioning > tree that is not a (sub-) partition of the root table. Ok, then that makes things easy. > >>>> and then run ExecFindPartition() >>>> again using the root. Will check. I am not sure right now how involved >>>> that would turn out to be, but I think that logic would not change the >>>> existing code, so in that sense it is not invasive. >>> >>> I couldn't understand why run ExecFindPartition() again on the root >>> partitioned table, can you clarify? ISTM, we just want to tell the user >>> in the HINT that trying the same update query with root partitioned table >>> might work. I'm not sure if it would work instead to find some >>> intermediate partitioned table (that is, between the root and the one that >>> update query was tried with) to include in the HINT. >> >> What I had in mind was : Give that hint only if there *was* a >> subpartition that could accommodate that row. And if found, we can >> only include the subpartition name. > > Asking to try the update query with the root table sounds like a good > enough hint. Trying to find the exact sub-partition (I assume you mean to > imply sub-tree here) seems like an overkill, IMHO. Yeah ... I was thinking , anyways it's an error condition, so why not let the server spend a bit more CPU and get the right sub-partition for the message. If we decide to write code to find the root partition, then it's just a matter of another function ExecFindPartition(). Also, I was thinking : give the hint *only* if we know there is a right sub-partition. Otherwise, it might distract the user. > > Thanks, > Amit > > -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> There are a few things that can be discussed:

If you do a normal update the new tuple is linked to the old one using the ctid forming a chain of tuple versions. This tuple movement breaks that chain. So the question I had reading this proposal is what behaviour depends on ctid and how is it affected by the ctid chain being broken.

I think the concurrent update case is just a symptom of this. If you try to update a row that's locked for a concurrent update you normally wait until the concurrent update finishes, then follow the ctid chain and recheck the where clause on the target of the link and if it still matches you perform the update there.

At least you do that if you have isolation_level set to repeatable_read. If you have isolation level set to serializable then you just fail with a serialization failure. I think that's what you should do if you come across a row that's been updated with a broken ctid chain even in repeatable read mode. Just fail with a serialization failure and document that in partitioned tables, if you perform updates that move tuples between partitions, then you need to ensure your updates are prepared for serialization failures.

I think this would require another bit in the tuple info mask indicating that this tuple is the last version before a broken ctid chain -- i.e. that it was updated by moving it to another partition. Maybe there's some combination of bits you could use though since this is only needed in a particular situation.

Offhand I don't know what other behaviours are dependent on the ctid chain. I think you need to go search the docs -- and probably the code just to be sure -- for any references to ctid to ensure you catch every impact of breaking the ctid chain.

--
greg
On Thu, Feb 16, 2017 at 03:39:30PM +0530, Amit Khandekar wrote: > >>>> and then run ExecFindPartition() > >>>> again using the root. Will check. I am not sure right now how involved > >>>> that would turn out to be, but I think that logic would not change the > >>>> existing code, so in that sense it is not invasive. > >>> > >>> I couldn't understand why run ExecFindPartition() again on the root > >>> partitioned table, can you clarify? ISTM, we just want to tell the user > >>> in the HINT that trying the same update query with root partitioned table > >>> might work. I'm not sure if it would work instead to find some > >>> intermediate partitioned table (that is, between the root and the one that > >>> update query was tried with) to include in the HINT. > >> > >> What I had in mind was : Give that hint only if there *was* a > >> subpartition that could accommodate that row. And if found, we can > >> only include the subpartition name. > > > > Asking to try the update query with the root table sounds like a good > > enough hint. Trying to find the exact sub-partition (I assume you mean to > > imply sub-tree here) seems like an overkill, IMHO. > Yeah ... I was thinking , anyways it's an error condition, so why not > let the server spend a bit more CPU and get the right sub-partition > for the message. If we decide to write code to find the root > partition, then it's just a matter of another function > ExecFindPartition(). > > Also, I was thinking : give the hint *only* if we know there is a > right sub-partition. Otherwise, it might distract the user. If this is relatively straight-forward, it'd be great. More actionable knowledge is better. Thanks for taking this on. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:
> On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> There are a few things that can be discussed:
>
> If you do a normal update the new tuple is linked to the old one using
> the ctid forming a chain of tuple versions. This tuple movement breaks
> that chain. So the question I had reading this proposal is what
> behaviour depends on ctid and how is it affected by the ctid chain
> being broken.

I think this is a good question.

> I think the concurrent update case is just a symptom of this. If you
> try to update a row that's locked for a concurrent update you normally
> wait until the concurrent update finishes, then follow the ctid chain
> and recheck the where clause on the target of the link and if it still
> matches you perform the update there.

Right. EvalPlanQual behavior, in short.

> At least you do that if you have isolation_level set to
> repeatable_read. If you have isolation level set to serializable then
> you just fail with a serialization failure. I think that's what you
> should do if you come across a row that's been updated with a broken
> ctid chain even in repeatable read mode. Just fail with a
> serialization failure and document that in partitioned tables if you
> perform updates that move tuples between partitions then you need to
> ensure your updates are prepared for serialization failures.

Now, this part I'm not sure about. What's pretty clear is that, barring some redesign of the heap format, we can't keep the CTID chain intact when the tuple moves to a different relfilenode. What's less clear is what to do about that. We can either (1) give up on EvalPlanQual behavior in this case and act just as we would if the row had been deleted; no update happens, or (2) throw a serialization error. You're advocating for #2, but I'm not sure that's right, because:

1. It's a lot more work,

2. Your proposed implementation needs an on-disk format change that uses up a scarce infomask bit, and

3. It's not obvious to me that it's clearly preferable from a user experience standpoint. I mean, either way the user doesn't get the behavior that they want. Either they're hoping for EPQ semantics and they instead do a no-op update, or they're hoping for EPQ semantics and they instead get an ERROR. Generally speaking, we don't throw serialization errors today at READ COMMITTED, so if we do so here, that's going to be a noticeable and perhaps unwelcome change.

More opinions welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16 February 2017 at 20:53, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:
>> On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> There are a few things that can be discussed:
>>
>> If you do a normal update the new tuple is linked to the old one using
>> the ctid forming a chain of tuple versions. This tuple movement breaks
>> that chain. So the question I had reading this proposal is what
>> behaviour depends on ctid and how is it affected by the ctid chain
>> being broken.
>
> I think this is a good question.
>
>> I think the concurrent update case is just a symptom of this. If you
>> try to update a row that's locked for a concurrent update you normally
>> wait until the concurrent update finishes, then follow the ctid chain
>> and recheck the where clause on the target of the link and if it still
>> matches you perform the update there.
>
> Right. EvalPlanQual behavior, in short.
>
>> At least you do that if you have isolation_level set to
>> repeatable_read. If you have isolation level set to serializable then
>> you just fail with a serialization failure. I think that's what you
>> should do if you come across a row that's been updated with a broken
>> ctid chain even in repeatable read mode. Just fail with a
>> serialization failure and document that in partitioned tables if you
>> perform updates that move tuples between partitions then you need to
>> ensure your updates are prepared for serialization failures.
>
> Now, this part I'm not sure about. What's pretty clear is that,
> barring some redesign of the heap format, we can't keep the CTID chain
> intact when the tuple moves to a different relfilenode. What's less
> clear is what to do about that. We can either (1) give up on
> EvalPlanQual behavior in this case and act just as we would if the row
> had been deleted; no update happens.

This is what the current patch has done.

> or (2) throw a serialization
> error. You're advocating for #2, but I'm not sure that's right,
> because:
>
> 1. It's a lot more work,
>
> 2. Your proposed implementation needs an on-disk format change that
> uses up a scarce infomask bit, and
>
> 3. It's not obvious to me that it's clearly preferable from a user
> experience standpoint. I mean, either way the user doesn't get the
> behavior that they want. Either they're hoping for EPQ semantics and
> they instead do a no-op update, or they're hoping for EPQ semantics
> and they instead get an ERROR. Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.
>
> More opinions welcome.

I am inclined to at least have some option for the user to decide the behaviour. In the future we can even consider support for walking through the ctid chain across multiple relfilenodes. But till then, we need to decide what default behaviour to keep. My inclination is more towards erroring out in an unfortunate event where there is an UPDATE while the row-movement is happening. One option is to not get into finding whether the DELETE was part of partition row-movement or it was indeed a DELETE, and always error out the UPDATE when heap_update() returns HeapTupleUpdated, but only if the table is a leaf partition. But this obviously will cause annoyance because of chances of getting such errors when there are concurrent updates and deletes in the same partition. But we can keep a table-level option for determining whether to error out or silently lose the UPDATE.

Another option I was thinking of: When the UPDATE is on a partition key, acquire ExclusiveLock (not AccessExclusiveLock) only on that partition, so that the selects will continue to execute, but UPDATE/DELETE will wait before opening the table for scan. The UPDATE on a partition key is not going to be a very routine operation; it sounds more like a DBA maintenance operation, so it does not look like it would come in between usual transactions.
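To spell out the interleaving under discussion, here is a rough sketch on a hypothetical table pt, range partitioned on column a, with both sessions at READ COMMITTED:

-- Hypothetical setup: the row with k = 1 currently lives in partition pt_1.

-- Session A:
BEGIN;
UPDATE pt SET a = a + 100 WHERE k = 1;  -- row moves: internally a DELETE
                                        -- from pt_1 followed by an INSERT
                                        -- into another partition

-- Session B (blocks on A's row lock in pt_1):
UPDATE pt SET b = b + 1 WHERE k = 1;

-- Session A:
COMMIT;

-- Session B now wakes up.  Had A's update kept the row in pt_1, B would
-- follow the ctid chain (EvalPlanQual), recheck its WHERE clause on the new
-- version and update it.  Because the chain ends at the deleted tuple, B's
-- UPDATE instead affects 0 rows: the change is silently lost rather than
-- retried or reported as an error.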
On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Generally speaking, we don't throw > serialization errors today at READ COMMITTED, so if we do so here, > that's going to be a noticeable and perhaps unwelcome change. Yes we do: https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ -- Thomas Munro http://www.enterprisedb.com
On Mon, Feb 20, 2017 at 3:36 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Generally speaking, we don't throw >> serialization errors today at READ COMMITTED, so if we do so here, >> that's going to be a noticeable and perhaps unwelcome change. > > Yes we do: > > https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ Oops -- please ignore, I misread that as repeatable read. -- Thomas Munro http://www.enterprisedb.com
Hi Amit,

Thanks for working on this.

On 2017/02/13 21:01, Amit Khandekar wrote:
> Currently, an update of a partition key of a partition is not allowed,
> since it requires moving the row(s) into the applicable partition.
>
> Attached is a WIP patch (update-partition-key.patch) that removes this
> restriction. When an UPDATE causes the row of a partition to violate
> its partition constraint, then a partition is searched in that subtree
> that can accommodate this row, and if found, the row is deleted from
> the old partition and inserted in the new partition. If not found, an
> error is reported.

That's clearly an improvement over what we have now.

> There are a few things that can be discussed:
>
> 1. We can run an UPDATE using a child partition at any level in a
> nested partition tree. In such case, we should move the row only
> within that child subtree.
>
> For e.g., in a tree such as:
>
> tab ->
>     t1 ->
>         t1_1
>         t1_2
>     t2 ->
>         t2_1
>         t2_2
>
> For "UPDATE t2 set col1 = 'AAA'", if the modified tuple does not fit
> in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
> the UPDATE is fired using t2.

Makes sense. One should perform the update by specifying tab such that the row moves from t2 to t1, before we could determine t1_1 as the target for the new row. Specifying t2 directly in that case is clearly the "violates partition constraint" situation. I wonder if that's enough of a hint for the user to try updating the parent (or better still, the root parent). Or, as we were discussing, should there be an actual HINT message spelling that out for the user?

> 2. In the patch, as part of the row movement, ExecDelete() is called
> followed by ExecInsert(). This is done that way, because we want to
> have the ROW triggers on that (sub)partition executed. If a user has
> explicitly created DELETE and INSERT BR triggers for this partition, I
> think we should run those. While at the same time, another question
> is, what about UPDATE trigger on the same table ? Here again, one can
> argue that because this UPDATE has been transformed into a
> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
> there can be a counter-argument. For e.g. if a user needs to make sure
> about logging updates of particular columns of a row, he will expect
> the logging to happen even when that row was transparently moved. In
> the patch, I have retained the firing of UPDATE BR trigger.

What of UPDATE AR triggers?

As a comment on how row-movement is being handled in code, I wonder if it could be made to look similar structurally to the code in ExecInsert() that handles ON CONFLICT DO UPDATE. That is,

if (partition constraint fails)
{
    /* row movement */
}
else
{
    /* ExecConstraints() */
    /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}

I see that ExecConstraints() won't get called on the source partition's constraints if row movement occurs. Maybe that's unnecessary, because the new row won't be inserted into that partition anyway.

ExecWithCheckOptions() for the RLS update check does happen *before* row movement though.

> 3. In case of a concurrent update/delete, suppose session A has locked
> the row for deleting it. Now a session B has decided to update this
> row and that is going to cause row movement, which means it will
> delete it first. But when session A is finished deleting it, session B
> finds that it is already deleted. In such case, it should not go ahead
> with inserting a new row as part of the row movement.
> For that, I have added a new parameter 'already_delete' for ExecDelete().

Makes sense. Maybe: already_deleted -> concurrently_deleted.

> Of course, this still won't completely solve the concurrency anomaly.
> In the above case, the UPDATE of Session B gets lost. Maybe, for a
> user that does not tolerate this, we can have a table-level option
> that disallows row movement, or will cause an error to be thrown for
> one of the concurrent sessions.

Will this table-level option be specified for a partitioned table once, or for individual partitions?

> 4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
> that is to be moved. So in ExecInitModifyTable(), we call
> ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
> only during execution time for the very first time we find that we
> need to do a row movement. I will think over that, but I am thinking
> it might complicate things, as compared to always doing the setup for
> UPDATE. Will check on that.

Hmm. ExecSetupPartitionTupleRouting(), which does a significant amount of setup work, is fine being called in ExecInitModifyTable() in the insert case, because there are often cases where that's a bulk insert and hence the cost of the setup work is amortized. Updates, OTOH, are seldom done in a bulk manner. So that might be an argument for doing it late, only when needed. But that starts to sound less attractive when one realizes that that will occur for every row that wants to move.

I wonder if updates that require row movement will, when done, be done in a bulk manner (as a maintenance op), so one-time tuple routing setup seems fine. Again, an enable_row_movement option specified for the parent sounds like it would be a nice-to-have. Only do the setup if it's turned on, which goes without saying.

> 5. Regarding performance testing, I have compared the results of
> row-movement with partition versus row-movement with inheritance tree
> using triggers. Below are the details :
>
> Schema : [ ... ]
>
> parts   partitioned    inheritance      no. of rows   subpartitions
> =====   ===========    ===========      ===========   =============
>  500    10 sec         3 min 02 sec     1,000,000     0
> 1000    10 sec         3 min 05 sec     1,000,000     0
> 1000    1 min 38 sec   30 min 50 sec    10,000,000    0
> 4000    28 sec         5 min 41 sec     1,000,000     10
>
> parts : total number of partitions including subpartitions if any.
> partitioned : Partitions created using declarative syntax.
> inheritance : Partitions created using inheritance, check constraints
> and insert/update triggers.
> subpartitions : Number of subpartitions for each partition (in a 2-level tree)
>
> Overall the UPDATE in partitions is faster by 10-20 times compared
> with inheritance with triggers.
>
> The UPDATE query moved all of the rows into another partition. It was
> something like this :
>
> update ptab set a = '1949-01-1' where a <= '1924-01-01'
>
> For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
> with 10,000,000 rows, it took 1 min 32 sec.

Nice!

> In general, for both partitioned and inheritance tables, the time
> taken rose linearly with the number of rows.

Hopefully not also with the number of partitions, though.

I will look more closely at the code soon.

Thanks,
Amit
On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > I am inclined to at least have some option for the user to decide the > behaviour. In the future we can even consider support for walking > through the ctid chain across multiple relfilenodes. But till then, we > need to decide what default behaviour to keep. My inclination is more > towards erroring out in an unfortunate even where there is an UPDATE > while the row-movement is happening. One option is to not get into > finding whether the DELETE was part of partition row-movement or it > was indeed a DELETE, and always error out the UPDATE when > heap_update() returns HeapTupleUpdated, but only if the table is a > leaf partition. But this obviously will cause annoyance because of > chances of getting such errors when there are concurrent updates and > deletes in the same partition. But we can keep a table-level option > for determining whether to error out or silently lose the UPDATE. I'm still a fan of the "do nothing and just document that this is a weirdness of partitioned tables" approach, because implementing something will be complicated, will ensure that this misses this release if not the next one, and may not be any better for users. But probably we need to get some more opinions from other people, since I can imagine people being pretty unhappy if the consensus happens to be at odds with my own preferences. > Another option I was thinking : When the UPDATE is on a partition key, > acquire ExclusiveLock (not AccessExclusiveLock) only on that > partition, so that the selects will continue to execute, but > UPDATE/DELETE will wait before opening the table for scan. The UPDATE > on partition key is not going to be a very routine operation, it > sounds more like a DBA maintenance operation; so it does not look like > it would come in between usual transactions. I think that's going to make users far more unhappy than breaking the EPQ behavior ever would. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Friday, February 24, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I am inclined to at least have some option for the user to decide the
> behaviour. In the future we can even consider support for walking
> through the ctid chain across multiple relfilenodes. But till then, we
> need to decide what default behaviour to keep. My inclination is more
> towards erroring out in an unfortunate event where there is an UPDATE
> while the row-movement is happening. One option is to not get into
> finding whether the DELETE was part of partition row-movement or it
> was indeed a DELETE, and always error out the UPDATE when
> heap_update() returns HeapTupleUpdated, but only if the table is a
> leaf partition. But this obviously will cause annoyance because of
> chances of getting such errors when there are concurrent updates and
> deletes in the same partition. But we can keep a table-level option
> for determining whether to error out or silently lose the UPDATE.
I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users. But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.
For my own sanity - the move update would complete successfully and break every ctid chain that it touches. Any update lined up behind it in the lock queue would discover their target record has been deleted and would experience whatever behavior their isolation level dictates for such a situation. So multi-partition update queries will fail to update some records if they happen to move between partitions even if they would otherwise match the query's predicate.
Is there any difference in behavior between this and a SQL writeable CTE performing the same thing via delete-returning-insert?
David J.
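For reference, the writable-CTE formulation David mentions would be roughly the following, expressed against the ptab benchmark table from upthread:

WITH moved AS (
    DELETE FROM ptab
    WHERE a <= '1924-01-01'
    RETURNING b, c
)
INSERT INTO ptab (a, b, c)
SELECT '1949-01-01', b, c
FROM moved;

-- Just like the proposed row movement, this breaks the ctid chain: a
-- concurrent UPDATE waiting on one of the deleted rows sees it as deleted
-- and does not follow it to the re-inserted row.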
On Fri, Feb 24, 2017 at 1:18 PM, David G. Johnston <david.g.johnston@gmail.com> wrote: > For my own sanity - the move update would complete successfully and break > every ctid chain that it touches. Any update lined up behind it in the lock > queue would discover their target record has been deleted and would > experience whatever behavior their isolation level dictates for such a > situation. So multi-partition update queries will fail to update some > records if they happen to move between partitions even if they would > otherwise match the query's predicate. Right. That's the behavior for which I am advocating, on the grounds that it's the simplest to implement and if we all agree on something else more complicated later, we can do it then. > Is there any difference in behavior between this and a SQL writeable CTE > performing the same thing via delete-returning-insert? Not to my knowledge. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 24 February 2017 at 07:02, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> I am inclined to at least have some option for the user to decide the
>> behaviour. In the future we can even consider support for walking
>> through the ctid chain across multiple relfilenodes. But till then, we
>> need to decide what default behaviour to keep. My inclination is more
>> towards erroring out in an unfortunate event where there is an UPDATE
>> while the row-movement is happening. One option is to not get into
>> finding whether the DELETE was part of partition row-movement or it
>> was indeed a DELETE, and always error out the UPDATE when
>> heap_update() returns HeapTupleUpdated, but only if the table is a
>> leaf partition. But this obviously will cause annoyance because of
>> chances of getting such errors when there are concurrent updates and
>> deletes in the same partition. But we can keep a table-level option
>> for determining whether to error out or silently lose the UPDATE.
>
> I'm still a fan of the "do nothing and just document that this is a
> weirdness of partitioned tables" approach, because implementing
> something will be complicated, will ensure that this misses this
> release if not the next one, and may not be any better for users. But
> probably we need to get some more opinions from other people, since I
> can imagine people being pretty unhappy if the consensus happens to be
> at odds with my own preferences.

I'd give the view that we cannot silently ignore this issue, bearing in mind the point that we're expecting partitioned tables to behave exactly like normal tables.

In my understanding the issue is that UPDATEs will fail to update a row when a valid row exists in the case where a row moved between partitions; that behaviour will be different to a standard table.

It is of course very good that we have something ready for this release and can make a choice of what to do.

Thoughts:

1. Reuse the tuple state HEAP_MOVED_OFF, which IIRC represents almost exactly the same thing. An UPDATE which gets to a HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition metadata, or I might be persuaded that in-this-release it is acceptable to fail when this occurs with an ERROR and a retryable SQLCODE, since the UPDATE will succeed on next execution.

2. I know that DB2 handles this by having the user specify WITH ROW MOVEMENT to explicitly indicate they accept the issue and want update to work even with that. We could have an explicit option to allow that. This appears to be the only way we could avoid silent errors for foreign table partitions.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > I'd give the view that we cannot silently ignore this issue, bearing > in mind the point that we're expecting partitioned tables to behave > exactly like normal tables. At the risk of repeating myself, I don't expect that, and I don't think it's a reasonable expectation. It's reasonable to expect partitioning to be notably better than inheritance (which I think it already is) and to provide a good base for future work (which I think it does), but I think getting them to behave exactly like normal tables (except for the things we want to be different) will take another ten years of development work. > In my understanding the issue is that UPDATEs will fail to update a > row when a valid row exists in the case where a row moved between > partitions; that behaviour will be different to a standard table. Right, when at READ COMMITTED and EvalPlanQual would have happened otherwise. > It is of course very good that we have something ready for this > release and can make a choice of what to do. > > Thoughts > > 1. Reuse the tuple state HEAP_MOVED_OFF which IIRC represent exactly > almost exactly the same thing. An UPDATE which gets to a > HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition > metadata, or I might be persuaded that in-this-release it is > acceptable to fail when this occurs with an ERROR and a retryable > SQLCODE, since the UPDATE will succeed on next execution. I've got my doubts about whether we can make that bit work that way, considering that we still support pg_upgrade (possibly in multiple steps) from old releases that had VACUUM FULL. We really ought to put some work into reclaiming those old bits, but there's probably no time for that in v10. > 2. I know that DB2 handles this by having the user specify WITH ROW > MOVEMENT to explicitly indicate they accept the issue and want update > to work even with that. We could have an explicit option to allow > that. This appears to be the only way we could avoid silent errors for > foreign table partitions. Yeah, that's a thought. We could give people a choice between (a) updates that cause rows to move between partitions just fail and (b) such updates work but with EPQ-related deficiencies. I had previously thought that, given those two choices, everybody would like (b) better than (a), but maybe not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Friday, February 24, 2017, Simon Riggs <simon@2ndquadrant.com> wrote:
2. I know that DB2 handles this by having the user specify WITH ROW
MOVEMENT to explicitly indicate they accept the issue and want update
to work even with that. We could have an explicit option to allow
that. This appears to be the only way we could avoid silent errors for
foreign table partitions.
This does, however, make the partitioning very non-transparent to every update query just because it is remotely possible a partition-moving update might occur concurrently.
I dislike an error. I'd say that making partition "just work" here is material for another patch. In this one an update of the partition key can be documented as shorthand for delete-returning-insert with all the limitations that go with that. If someone acceptably solves the ctid following logic later it can be committed - I'm assuming there would be no complaints to making things just work in a case where they only sorta worked.
David J.
On 24 February 2017 at 14:57, David G. Johnston <david.g.johnston@gmail.com> wrote:
> I dislike an error. I'd say that making partition "just work" here is
> material for another patch. In this one an update of the partition key can
> be documented as shorthand for delete-returning-insert with all the
> limitations that go with that. If someone acceptably solves the ctid
> following logic later it can be committed - I'm assuming there would be no
> complaints to making things just work in a case where they only sorta
> worked.

Personally I don't think there's any hope that there will ever be cross-table ctid links. Maybe one day there will be a major new table storage format with very different capabilities than today, but in the current architecture it seems like an impossible leap.

I would expect everyone to come to terms with the basic idea that partition key updates are always going to be a corner case. The user defined the partition key, and the docs should carefully explain to them the impact of that definition. As long as that explanation gives them something they can work with and manage the consequences of, that's going to be fine.

What I'm concerned about is that silently giving "wrong" answers in regular queries -- not even ones doing the partition key updates -- is something the user can't really manage. They have no way to rewrite the query to avoid the problem if some other user or part of their system is updating partition keys. They have no way to know the problem is even occurring.

Just to spell it out -- it's not just "no-op updates" where the user sees 0 records updated. If I update all records where username='stark', perhaps to set the "user banned" flag, and get back "9 records updated" and later find out that I missed a record because someone changed the department_id while my query was running -- how would I even know? How could I possibly rewrite my query to avoid that?

The reason I suggested throwing a serialization failure was because I thought that would be the easiest short-cut to the problem. I had imagined having a bit pattern that indicated such a move, which would actually be a pretty minor change. I would actually consider using a normal update bitmask with InvalidBlockId in the ctid to indicate the tuple was updated and the target of the chain isn't available. That may be something we'll need in the future for other cases too.

Throwing an error means the user has to retry their query, but that's at least something they can do. Even if they don't do it automatically, the ultimate user will probably just retry whatever operation errored out anyway. But at least their database isn't logically corrupted.

--
greg
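If a retryable error were raised as Greg suggests, the retry could be scripted along these lines. This is only a sketch with hypothetical table and column names; at READ COMMITTED a statement-level retry like this would suffice, while at higher isolation levels the whole transaction would normally be retried client-side:

DO $$
BEGIN
    LOOP
        BEGIN
            UPDATE accounts SET banned = true WHERE username = 'stark';
            EXIT;   -- success, stop retrying
        EXCEPTION WHEN serialization_failure THEN
            NULL;   -- a target row moved to another partition; run it again
        END;
    END LOOP;
END;
$$;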
On 24 February 2017 at 14:57, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I dislike an error. I'd say that making partition "just work" here is
> material for another patch. In this one an update of the partition key can
> be documented as shorthand for delete-returning-insert with all the
> limitations that go with that. If someone acceptably solves the ctid
> following logic later it can be committed - I'm assuming there would be no
> complaints to making things just work in a case where they only sorta
> worked.
Personally I don't think there's any hope that there will ever be
cross-table ctids links. Maybe one day there will be a major new table
storage format with very different capabilities than today but in the
current architecture it seems like an impossible leap.
How about making it work without a physical-token dynamic? For instance, let the server recognize the serialization error, but instead of returning it to the client, the server itself tries again.
I would expect everyone to come to terms with the basic idea that
partition key updates are always going to be a corner case. The user
defined the partition key and the docs should carefully explain to
them the impact of that definition. As long as that explanation gives
them something they can work with and manage the consequences of
that's going to be fine.
What I'm concerned about is that silently giving "wrong" answers in
regular queries -- not even ones doing the partition key updates -- is
something the user can't really manage. They have no way to rewrite
the query to avoid the problem if some other user or part of their
system is updating partition keys. They have no way to know the
problem is even occurring.
Just to spell it out -- it's not just "no-op updates" where the user
sees 0 records updated. If I update all records where
username='stark', perhaps to set the "user banned" flag and get back
"9 records updated" and later find out that I missed a record because
someone changed the department_id while my query was running -- how
would I even know? How could I possibly rewrite my query to avoid
that?
But my point is that this isn't a regression from current behavior. If I deleted one of those starks and re-inserted them with a different department_id that brand new record wouldn't be banned. In short, my take on this patch is that it is a performance optimization. Making the UPDATE command actually work as part of its implementation detail is a happy byproduct.
From the POV of an external observer it doesn't have to matter whether the update or delete-insert SQL was used. It would be nice if the UPDATE version could keep logical identity maintained but that is a feature enhancement.
Failing if the other session used the UPDATE SQL isn't wrong; and I'm not against it, I just don't believe that it is better than maintaining the status quo semantics.
That said my concurrency-fu is not that strong and I don't really have a practical reason to prefer one over the other - thus I fall back on maintaining internal consistency.
IIUC it is already possible, for those who care to do so, to get a serialization failure in this scenario by upgrading isolation to repeatable read.
David J.
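Following up on that last point, the serialization failure can be observed by upgrading the isolation level; a minimal sketch, reusing the hypothetical accounts table from the example above:

    -- Session B, started before session A commits its row-moving UPDATE:
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    UPDATE accounts SET banned = true WHERE username = 'stark';
    -- blocks on the row being moved; once session A commits, session B fails
    -- with a serialization error along the lines of:
    -- ERROR:  could not serialize access due to concurrent update
    -- and can simply retry the whole transaction.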
On Sat, Feb 25, 2017 at 11:41 PM, Greg Stark <stark@mit.edu> wrote:
> What I'm concerned about is that silently giving "wrong" answers in
> regular queries -- not even ones doing the partition key updates -- is
> something the user can't really manage. They have no way to rewrite
> the query to avoid the problem if some other user or part of their
> system is updating partition keys. They have no way to know the
> problem is even occurring.

That's a reasonable concern, but it's not like EvalPlanQual works
perfectly today and never causes any application-visible
inconsistencies that end up breaking things. As the documentation says:

----
Because of the above rules, it is possible for an updating command to
see an inconsistent snapshot: it can see the effects of concurrent
updating commands on the same rows it is trying to update, but it does
not see effects of those commands on other rows in the database. This
behavior makes Read Committed mode unsuitable for commands that involve
complex search conditions; however, it is just right for simpler cases.
----

Maybe I've just spent too long hanging out with Kevin Grittner, but
I've come to view our EvalPlanQual behavior as pretty rickety and
unreliable in general. For example, consider the fact that when I spent
over a year and approximately 1 gazillion email messages trying to
hammer out how join pushdown was going to handle EPQ rechecks, we
discovered that the FDW API wasn't actually handling those correctly
even for scans of single tables, hence commit
5fc4c26db5120bd90348b6ee3101fcddfdf54800. I'm not saying that time and
effort wasn't well-spent, but I wonder whether it's necessary to hold
partitioned tables to a higher standard than that to which the FDW
interface was held for the first 4.5 years of its life. Perhaps it is
good for us to do that, but I'm not 100% convinced. It seems like we
decide to worry about EvalPlanQual in some cases and not in others more
or less arbitrarily.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
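The inconsistency the quoted documentation paragraph describes can be reproduced with a sketch along the lines of the example in the MVCC chapter of the docs:

    CREATE TABLE website (hits int);
    INSERT INTO website VALUES (9), (10);

    -- Session A:
    BEGIN;
    UPDATE website SET hits = hits + 1;   -- rows become 10 and 11, not yet committed

    -- Session B (READ COMMITTED):
    DELETE FROM website WHERE hits = 10;  -- blocks on the rows locked by session A

    -- Session A:
    COMMIT;

    -- Session B reports DELETE 0: the row that was 10 in its snapshot is 11
    -- after the EvalPlanQual recheck and no longer qualifies, while the row
    -- that is now 10 did not match the snapshot value, even though a
    -- hits = 10 row existed both before and after the UPDATE.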
On 23 February 2017 at 16:02, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> 2. In the patch, as part of the row movement, ExecDelete() is called >> followed by ExecInsert(). This is done that way, because we want to >> have the ROW triggers on that (sub)partition executed. If a user has >> explicitly created DELETE and INSERT BR triggers for this partition, I >> think we should run those. While at the same time, another question >> is, what about UPDATE trigger on the same table ? Here again, one can >> argue that because this UPDATE has been transformed into a >> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But >> there can be a counter-argument. For e.g. if a user needs to make sure >> about logging updates of particular columns of a row, he will expect >> the logging to happen even when that row was transparently moved. In >> the patch, I have retained the firing of UPDATE BR trigger. > > What of UPDATE AR triggers? I think it does not make sense running after row triggers in case of row-movement. There is no update happened on that leaf partition. This reasoning can also apply to BR update triggers. But the reasons for having a BR trigger and AR triggers are quite different. Generally, a user needs to do some modifications to the row before getting the final NEW row into the database, and hence [s]he defines a BR trigger for that. And we can't just silently skip this step only because the final row went into some other partition; in fact the row-movement itself might depend on what the BR trigger did with the row. Whereas, AR triggers are typically written for doing some other operation once it is made sure the row is actually updated. In case of row-movement, it is not actually updated. > > As a comment on how row-movement is being handled in code, I wonder if it > could be be made to look similar structurally to the code in ExecInsert() > that handles ON CONFLICT DO UPDATE. That is, > > if (partition constraint fails) > { > /* row movement */ > } > else > { > /* ExecConstraints() */ > /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */ > } I guess this is what has been effectively done for row movement, no ? Looking at that, I found that in the current patch, if there is no row-movement happening, ExecPartitionCheck() effectively gets called twice : First time when ExecPartitionCheck() is explicitly called for row-movement-required check, and second time in ExecConstraints() call. May be there should be 2 separate functions ExecCheckConstraints() and ExecPartitionConstraints(), and also ExecCheckConstraints() that just calls both. This way we can call the appropriate functions() accordingly in row-movement case, and the other callers would continue to call ExecConstraints(). > > I see that ExecConstraint() won't get called on the source partition's > constraints if row movement occurs. Maybe, that's unnecessary because the > new row won't be inserted into that partition anyway. Yes I agree. > > ExecWithCheckOptions() for RLS update check does happen *before* row > movement though. Yes. I think we should do it anyways. > >> 3. In case of a concurrent update/delete, suppose session A has locked >> the row for deleting it. Now a session B has decided to update this >> row and that is going to cause row movement, which means it will >> delete it first. But when session A is finished deleting it, session B >> finds that it is already deleted. In such case, it should not go ahead >> with inserting a new row as part of the row movement. 
For that, I have >> added a new parameter 'already_delete' for ExecDelete(). > > Makes sense. Maybe: already_deleted -> concurrently_deleted. Right, concurrently_deleted sounds more accurate. In the next patch, I will change that. > >> Of course, this still won't completely solve the concurrency anomaly. >> In the above case, the UPDATE of Session B gets lost. May be, for a >> user that does not tolerate this, we can have a table-level option >> that disallows row movement, or will cause an error to be thrown for >> one of the concurrent session. > > Will this table-level option be specified for a partitioned table once or > for individual partitions? My opinion is, if decide to have table-level option, it should be on the root partition, to keep it simple. > >> 4. The ExecSetupPartitionTupleRouting() is re-used for routing the row >> that is to be moved. So in ExecInitModifyTable(), we call >> ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this >> only during execution time for the very first time we find that we >> need to do a row movement. I will think over that, but I am thinking >> it might complicate things, as compared to always doing the setup for >> UPDATE. WIll check on that. > > Hmm. ExecSetupPartitionTupleRouting(), which does significant amount of > setup work, is fine being called in ExecInitModifyTable() in the insert > case because there are often cases where that's a bulk-insert and hence > cost of the setup work is amortized. Updates, OTOH, are seldom done in a > bulk manner. So that might be an argument for doing it late only when > needed. Yes, agreed. > But that starts to sound less attractive when one realizes that > that will occur for every row that wants to move. If we manage to call ExecSetupPartitionTupleRouting() during execution phase only once for the very first time we find the update requires row movement, then we can re-use the info. One more thing I noticed is that, in case of update-returning, the ExecDelete() will also generate result of RETURNING, which we are discarding. So this is a waste. We should not even process RETURNING in ExecDelete() called for row-movement. The RETURNING should be processed only for ExecInsert(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2017/02/26 4:01, David G. Johnston wrote: > IIUC it is already possible, for those who care to do so, to get a > serialization failure in this scenario by upgrading isolation to repeatable > read. Maybe, this can be added as a note in the documentation. Thanks, Amit
On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > I think it does not make sense running after row triggers in case of > row-movement. There is no update happened on that leaf partition. This > reasoning can also apply to BR update triggers. But the reasons for > having a BR trigger and AR triggers are quite different. Generally, a > user needs to do some modifications to the row before getting the > final NEW row into the database, and hence [s]he defines a BR trigger > for that. And we can't just silently skip this step only because the > final row went into some other partition; in fact the row-movement > itself might depend on what the BR trigger did with the row. Whereas, > AR triggers are typically written for doing some other operation once > it is made sure the row is actually updated. In case of row-movement, > it is not actually updated. How about running the BR update triggers for the old partition and the AR update triggers for the new partition? It seems weird to run BR update triggers but not AR update triggers. Another option would be to run BR and AR delete triggers and then BR and AR insert triggers, emphasizing the choice to treat this update as a delete + insert, but (as Amit Kh. pointed out to me when we were in a room together this week) that precludes using the BEFORE trigger to modify the row. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2017/03/02 15:23, Amit Khandekar wrote: > On 23 February 2017 at 16:02, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >>> 2. In the patch, as part of the row movement, ExecDelete() is called >>> followed by ExecInsert(). This is done that way, because we want to >>> have the ROW triggers on that (sub)partition executed. If a user has >>> explicitly created DELETE and INSERT BR triggers for this partition, I >>> think we should run those. While at the same time, another question >>> is, what about UPDATE trigger on the same table ? Here again, one can >>> argue that because this UPDATE has been transformed into a >>> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But >>> there can be a counter-argument. For e.g. if a user needs to make sure >>> about logging updates of particular columns of a row, he will expect >>> the logging to happen even when that row was transparently moved. In >>> the patch, I have retained the firing of UPDATE BR trigger. >> >> What of UPDATE AR triggers? > > I think it does not make sense running after row triggers in case of > row-movement. There is no update happened on that leaf partition. This > reasoning can also apply to BR update triggers. But the reasons for > having a BR trigger and AR triggers are quite different. Generally, a > user needs to do some modifications to the row before getting the > final NEW row into the database, and hence [s]he defines a BR trigger > for that. And we can't just silently skip this step only because the > final row went into some other partition; in fact the row-movement > itself might depend on what the BR trigger did with the row. Whereas, > AR triggers are typically written for doing some other operation once > it is made sure the row is actually updated. In case of row-movement, > it is not actually updated. OK, so it'd be better to clarify in the documentation that that's the case. >> As a comment on how row-movement is being handled in code, I wonder if it >> could be be made to look similar structurally to the code in ExecInsert() >> that handles ON CONFLICT DO UPDATE. That is, >> >> if (partition constraint fails) >> { >> /* row movement */ >> } >> else >> { >> /* ExecConstraints() */ >> /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */ >> } > > I guess this is what has been effectively done for row movement, no ? Yes, although it seems nice how the formatting of the code in ExecInsert() makes it apparent that they are distinct code paths. OTOH, the additional diffs caused by the suggested formatting might confuse other reviewers. > Looking at that, I found that in the current patch, if there is no > row-movement happening, ExecPartitionCheck() effectively gets called > twice : First time when ExecPartitionCheck() is explicitly called for > row-movement-required check, and second time in ExecConstraints() > call. May be there should be 2 separate functions > ExecCheckConstraints() and ExecPartitionConstraints(), and also > ExecCheckConstraints() that just calls both. This way we can call the > appropriate functions() accordingly in row-movement case, and the > other callers would continue to call ExecConstraints(). One random idea: we could add a bool ri_PartitionCheckOK which is set to true after it is checked in ExecConstraints(). And modify the condition in ExecConstraints() as follows: if (resultRelInfo->ri_PartitionCheck && + !resultRelInfo->ri_PartitionCheckOK && !ExecPartitionCheck(resultRelInfo, slot, estate)) >>> 3. 
In case of a concurrent update/delete, suppose session A has locked >>> the row for deleting it. Now a session B has decided to update this >>> row and that is going to cause row movement, which means it will >>> delete it first. But when session A is finished deleting it, session B >>> finds that it is already deleted. In such case, it should not go ahead >>> with inserting a new row as part of the row movement. For that, I have >>> added a new parameter 'already_delete' for ExecDelete(). >> >> Makes sense. Maybe: already_deleted -> concurrently_deleted. > > Right, concurrently_deleted sounds more accurate. In the next patch, I > will change that. Okay, thanks. >>> Of course, this still won't completely solve the concurrency anomaly. >>> In the above case, the UPDATE of Session B gets lost. May be, for a >>> user that does not tolerate this, we can have a table-level option >>> that disallows row movement, or will cause an error to be thrown for >>> one of the concurrent session. >> >> Will this table-level option be specified for a partitioned table once or >> for individual partitions? > > My opinion is, if decide to have table-level option, it should be on > the root partition, to keep it simple. I see. >> But that starts to sound less attractive when one realizes that >> that will occur for every row that wants to move. > > If we manage to call ExecSetupPartitionTupleRouting() during execution > phase only once for the very first time we find the update requires > row movement, then we can re-use the info. That might work, too. But I guess we're going with initialization in ExecInitModifyTable(). > One more thing I noticed is that, in case of update-returning, the > ExecDelete() will also generate result of RETURNING, which we are > discarding. So this is a waste. We should not even process RETURNING > in ExecDelete() called for row-movement. The RETURNING should be > processed only for ExecInsert(). I wonder if it makes sense to have ExecDeleteInternal() and ExecInsertInternal(), which perform the core function of DELETE and INSERT, respectively. Such as running triggers, checking constraints, etc. The RETURNING part is controllable by the statement, so it will be handled by the ExecDelete() and ExecInsert(), like it is now. When called from ExecUpdate() as part of row-movement, they perform just the core part and leave the rest to be done by ExecUpdate() itself. Thanks, Amit
I haven't yet handled all points, but meanwhile, some of the important points are discussed below ... On 6 March 2017 at 15:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >>> But that starts to sound less attractive when one realizes that >>> that will occur for every row that wants to move. >> >> If we manage to call ExecSetupPartitionTupleRouting() during execution >> phase only once for the very first time we find the update requires >> row movement, then we can re-use the info. > > That might work, too. But I guess we're going with initialization in > ExecInitModifyTable(). I am more worried about this: even the UPDATEs that do not involve row movement would do the expensive setup. So do it only once when we find that we need to move the row. Something like this : ExecUpdate() { .... if (resultRelInfo->ri_PartitionCheck && !ExecPartitionCheck(resultRelInfo, slot, estate)) { bool already_deleted; ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate, &already_deleted, canSetTag); if (already_deleted) return NULL; else { /* If we haven't already built the state for INSERT * tuple routing, build it now */ if (!mtstate->mt_partition_dispatch_info) { ExecSetupPartitionTupleRouting( mtstate->resultRelInfo->ri_RelationDesc, &mtstate->mt_partition_dispatch_info, &mtstate->mt_partitions, &mtstate->mt_partition_tupconv_maps, &mtstate->mt_partition_tuple_slot, &mtstate->mt_num_dispatch, &mtstate->mt_num_partitions); } return ExecInsert(mtstate, slot, planSlot, NULL, ONCONFLICT_NONE, estate, false); } } ... } > >> One more thing I noticed is that, in case of update-returning, the >> ExecDelete() will also generate result of RETURNING, which we are >> discarding. So this is a waste. We should not even process RETURNING >> in ExecDelete() called for row-movement. The RETURNING should be >> processed only for ExecInsert(). > > I wonder if it makes sense to have ExecDeleteInternal() and > ExecInsertInternal(), which perform the core function of DELETE and > INSERT, respectively. Such as running triggers, checking constraints, > etc. The RETURNING part is controllable by the statement, so it will be > handled by the ExecDelete() and ExecInsert(), like it is now. > > When called from ExecUpdate() as part of row-movement, they perform just > the core part and leave the rest to be done by ExecUpdate() itself. Yes, if we decide to execute only the core insert/delete operations and skip the triggers, then there is a compelling reason to have something like ExecDeleteInternal() and ExecInsertInternal(). In fact, I was about to start doing the same, except for the below discussion ... On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> I think it does not make sense running after row triggers in case of >> row-movement. There is no update happened on that leaf partition. This >> reasoning can also apply to BR update triggers. But the reasons for >> having a BR trigger and AR triggers are quite different. Generally, a >> user needs to do some modifications to the row before getting the >> final NEW row into the database, and hence [s]he defines a BR trigger >> for that. And we can't just silently skip this step only because the >> final row went into some other partition; in fact the row-movement >> itself might depend on what the BR trigger did with the row. Whereas, >> AR triggers are typically written for doing some other operation once >> it is made sure the row is actually updated. 
In case of row-movement, >> it is not actually updated.
>
> How about running the BR update triggers for the old partition and the
> AR update triggers for the new partition? It seems weird to run BR
> update triggers but not AR update triggers. Another option would be
> to run BR and AR delete triggers and then BR and AR insert triggers,
> emphasizing the choice to treat this update as a delete + insert, but
> (as Amit Kh. pointed out to me when we were in a room together this
> week) that precludes using the BEFORE trigger to modify the row.

I checked the trigger behaviour in case of UPSERT. Here, when a
conflict is found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means the AR INSERT trigger will
not fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
and AR UPDATE triggers will be fired. So in short, when an INSERT
becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
and AR UPDATE also get fired. On the same lines, it makes sense in case
of an UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
the original table, and then the BR and AR DELETE/INSERT triggers on
the respective tables.

So the common policy can be:
Fire the BR trigger. It can be an INSERT/UPDATE/DELETE trigger
depending upon what the statement is.
If there is a change in the operation, all the triggers corresponding
to what the operation is converted to (UPDATE->DELETE+INSERT or
INSERT->UPDATE) would also be fired.
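A minimal sketch of the trigger policy described above; the table and trigger names are illustrative, and the behaviour assumes the patch is applied:

    CREATE TABLE p (a int, b text) PARTITION BY RANGE (a);
    CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);
    CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);

    CREATE FUNCTION trig_notice() RETURNS trigger LANGUAGE plpgsql AS $$
    BEGIN
        RAISE NOTICE '% % ON %', TG_WHEN, TG_OP, TG_TABLE_NAME;
        IF TG_OP = 'DELETE' THEN
            RETURN OLD;
        END IF;
        RETURN NEW;
    END $$;

    CREATE TRIGGER p1_br_upd BEFORE UPDATE ON p1 FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER p1_ar_upd AFTER  UPDATE ON p1 FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER p1_br_del BEFORE DELETE ON p1 FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER p1_ar_del AFTER  DELETE ON p1 FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER p2_br_ins BEFORE INSERT ON p2 FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER p2_ar_ins AFTER  INSERT ON p2 FOR EACH ROW EXECUTE PROCEDURE trig_notice();

    INSERT INTO p VALUES (5, 'x');
    UPDATE p SET a = 15 WHERE a = 5;
    -- Expected per the policy above: BEFORE UPDATE and BEFORE DELETE fire on
    -- p1, BEFORE INSERT fires on p2, and the AFTER DELETE (p1) and AFTER
    -- INSERT (p2) triggers fire at the end of the statement; the AFTER UPDATE
    -- trigger on p1 does not fire, since the UPDATE was carried out as a
    -- DELETE plus INSERT.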
On 17 March 2017 at 16:07, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 6 March 2017 at 15:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >>>> But that starts to sound less attractive when one realizes that >>>> that will occur for every row that wants to move. >>> >>> If we manage to call ExecSetupPartitionTupleRouting() during execution >>> phase only once for the very first time we find the update requires >>> row movement, then we can re-use the info. >> >> That might work, too. But I guess we're going with initialization in >> ExecInitModifyTable(). > > I am more worried about this: even the UPDATEs that do not involve row > movement would do the expensive setup. So do it only once when we find > that we need to move the row. Something like this : > ExecUpdate() > { > .... > if (resultRelInfo->ri_PartitionCheck && > !ExecPartitionCheck(resultRelInfo, slot, estate)) > { > bool already_deleted; > > ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate, > &already_deleted, canSetTag); > > if (already_deleted) > return NULL; > else > { > /* If we haven't already built the state for INSERT > * tuple routing, build it now */ > if (!mtstate->mt_partition_dispatch_info) > { > ExecSetupPartitionTupleRouting( > mtstate->resultRelInfo->ri_RelationDesc, > &mtstate->mt_partition_dispatch_info, > &mtstate->mt_partitions, > &mtstate->mt_partition_tupconv_maps, > &mtstate->mt_partition_tuple_slot, > &mtstate->mt_num_dispatch, > &mtstate->mt_num_partitions); > } > > return ExecInsert(mtstate, slot, planSlot, NULL, > ONCONFLICT_NONE, estate, false); > } > } > ... > } Attached is v2 patch which implements the above optimization. Now, for UPDATE, ExecSetupPartitionTupleRouting() will be called only if row movement is needed. We have to open an extra relation for the root partition, and keep it opened and its handle stored in mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes this if it is different from node->resultRelInfo->ri_RelationDesc. If it is same as node->resultRelInfo, it should not be closed because it gets closed as part of ExecEndPlan(). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi Amit, Thanks for the updated patch. On 2017/03/23 3:09, Amit Khandekar wrote: > Attached is v2 patch which implements the above optimization. Would it be better to have at least some new tests? Also, there are a few places in the documentation mentioning that such updates cause error, which will need to be updated. Perhaps also add some explanatory notes about the mechanism (delete+insert), trigger behavior, caveats, etc. There were some points discussed upthread that could be mentioned in the documentation. @@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid, HeapUpdateFailureData hufd; TupleTableSlot *slot = NULL; + if (already_deleted) + *already_deleted = false; + concurrently_deleted? @@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid, } else { - LockTupleMode lockmode; + LockTupleMode lockmode; Useless hunk. + if (!mtstate->mt_partition_dispatch_info) + { The if (pointer == NULL) style is better perhaps. + /* root table RT index is at the head of partitioned_rels */ + if (node->partitioned_rels) + { + Index root_rti; + Oid root_oid; + + root_rti = linitial_int(node->partitioned_rels); + root_oid = getrelid(root_rti, estate->es_range_table); + root_rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ + } + else + root_rel = mtstate->resultRelInfo->ri_RelationDesc; Some explanatory comments here might be good, for example, explain in what situations node->partitioned_rels would not have been set and/or vice versa. > Now, for > UPDATE, ExecSetupPartitionTupleRouting() will be called only if row > movement is needed. > > We have to open an extra relation for the root partition, and keep it > opened and its handle stored in > mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes > this if it is different from node->resultRelInfo->ri_RelationDesc. If > it is same as node->resultRelInfo, it should not be closed because it > gets closed as part of ExecEndPlan(). I guess you're referring to the following hunk. Some comments: @@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node) * Close all the partitioned tables, leaf partitions,and their indices * * Remember node->mt_partition_dispatch_info[0] corresponds to the root - * partitioned table, which we must not try to close, because it is the - * main target table of the query that will be closed by ExecEndPlan(). - * Also, tupslot is NULL for the root partitioned table. + * partitioned table, which should not be closed if it is the main target + * table of the query, which will be closed by ExecEndPlan(). The last part could be written as: because it will be closed by ExecEndPlan(). Also, tupslot + * is NULL for the root partitioned table. */ + if (node->mt_num_dispatch > 0) + { + Relation root_partition; root_relation? + + root_partition = node->mt_partition_dispatch_info[0]->reldesc; + if (root_partition != node->resultRelInfo->ri_RelationDesc) + heap_close(root_partition, NoLock); + } It might be a good idea to Assert inside the if block above that node->operation != CMD_INSERT. Perhaps, also reflect that in the comment above so that it's clearer. I will set the patch to Waiting on Author. Thanks, Amit
Thanks Amit for your review comments. I am yet to handle all of your
comments, but meanwhile, attached is an updated patch that handles
RETURNING.

Earlier it was not working because ExecInsert() did not return any
RETURNING clause. This is because the setup needed to create RETURNING
projection info for leaf partitions is done in ExecInitModifyTable()
only in case of INSERT. But because it is an UPDATE operation, we have
to do this explicitly as a one-time operation when it is determined
that row-movement is required. This is similar to how we do one-time
setup of mt_partition_dispatch_info. So in the patch, I have moved this
code into a new function ExecInitPartitionReturningProjection(), and
now this is called in ExecInitModifyTable() as well as during row
movement for ExecInsert() processing the returning clause.

Basically we need to do all that is done in ExecInitModifyTable() for
INSERT. There are a couple of other things that I suspect might need to
be done as part of the missing initialization for ExecInsert() during
row-movement:
1. Junk filter handling
2. WITH CHECK OPTION

Yet, ExecDelete() during row-movement is still returning the RETURNING
result redundantly, which I am yet to handle.

On 23 March 2017 at 07:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> Thanks for the updated patch.
>
> On 2017/03/23 3:09, Amit Khandekar wrote:
>> Attached is v2 patch which implements the above optimization.
>
> Would it be better to have at least some new tests? Also, there are a few
> places in the documentation mentioning that such updates cause error,
> which will need to be updated. Perhaps also add some explanatory notes
> about the mechanism (delete+insert), trigger behavior, caveats, etc.
> There were some points discussed upthread that could be mentioned in the
> documentation.

Yeah, agreed. Will do this in the subsequent patch.

>
> @@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
>      HeapUpdateFailureData hufd;
>      TupleTableSlot *slot = NULL;
>
> +    if (already_deleted)
> +        *already_deleted = false;
> +
>
> concurrently_deleted?

Done.

>
> @@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
>      }
>      else
>      {
> -    LockTupleMode lockmode;
> +    LockTupleMode lockmode;
>
> Useless hunk.

Removed.

I am yet to handle your other comments; still working on them, but till
then, attached is the updated patch.
Attachment
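To make the RETURNING point concrete, here is a hypothetical example reusing the p/p1/p2 tables from the trigger sketch above; with this patch the projected row should be the one inserted into the destination partition:

    INSERT INTO p VALUES (5, 'y');
    UPDATE p SET a = 15 WHERE a = 5 RETURNING tableoid::regclass AS part, *;
    --  part | a  | b
    -- ------+----+---
    --  p2   | 15 | y
    -- (1 row)
    -- i.e. the RETURNING list is evaluated once, for the row as inserted into
    -- the destination partition, rather than a second (redundant) time for
    -- the DELETE from the source partition.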
On 25 March 2017 at 01:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: I am yet to handle all of your comments, but meanwhile , attached is > an updated patch, that handles RETURNING. > > Earlier it was not working because ExecInsert() did not return any > RETURNING clause. This is because the setup needed to create RETURNIG > projection info for leaf partitions is done in ExecInitModifyTable() > only in case of INSERT. But because it is an UPDATE operation, we have > to do this explicitly as a one-time operation when it is determined > that row-movement is required. This is similar to how we do one-time > setup of mt_partition_dispatch_info. So in the patch, I have moved > this code into a new function ExecInitPartitionReturningProjection(), > and now this is called in ExecInitModifyTable() as well as during row > movement for ExecInsert() processing the returning clause. > Basically we need to do all that is done in ExecInitModifyTable() for > INSERT. There are a couple of other things that I suspect that might > need to be done as part of the missing initialization for Execinsert() > during row-movement : > 1. Junk filter handling > 2. WITH CHECK OPTION Attached is an another updated patch v4 which does WITH-CHECK-OPTION related initialization. So we now have below two function calls during row movement : /* Build WITH CHECK OPTION constraints for leaf partitions */ ExecInitPartitionWithCheckOptions(mtstate, root_rel); /* Build a projection for each leaf partition rel. */ ExecInitPartitionReturningProjection(mtstate, root_rel); And these functions are now re-used at two places : In ExecInitModifyTable() and in row-movement code. Basically whatever was not being initialized in ExecInitModifyTable() is now done in row-movement code. I have added relevant scenarios in sql/update.sql. I checked the junk filter handling. I think there isn't anything that needs to be done, because for INSERT, all that is needed is ExecCheckPlanOutput(). And this function is anyway called even in ExecInitModifyTable() even for UPDATE, so we don't have to initialize anything additional. > Yet, ExecDelete() during row-movement is still returning the RETURNING > result redundantly, which I am yet to handle this. Done above. Now we have a new parameter in ExecDelete() which tells whether to skip RETURNING. On 23 March 2017 at 07:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Would it be better to have at least some new tests? Added some more scenarios in update.sql. Also have included scenarios for WITH-CHECK-OPTION for updatable views. > Also, there are a few places in the documentation mentioning that such updates cause error, > which will need to be updated. Perhaps also add some explanatory notes > about the mechanism (delete+insert), trigger behavior, caveats, etc. > There were some points discussed upthread that could be mentioned in the > documentation. Yeah, I agree. Documentation needs some important changes. I am still working on them. > + if (!mtstate->mt_partition_dispatch_info) > + { > > The if (pointer == NULL) style is better perhaps. 
> > + /* root table RT index is at the head of partitioned_rels */ > + if (node->partitioned_rels) > + { > + Index root_rti; > + Oid root_oid; > + > + root_rti = linitial_int(node->partitioned_rels); > + root_oid = getrelid(root_rti, estate->es_range_table); > + root_rel = heap_open(root_oid, NoLock); /* locked by > InitPlan */ > + } > + else > + root_rel = mtstate->resultRelInfo->ri_RelationDesc; > > Some explanatory comments here might be good, for example, explain in what > situations node->partitioned_rels would not have been set and/or vice versa. Added some more comments in the relevant if conditions. > >> Now, for >> UPDATE, ExecSetupPartitionTupleRouting() will be called only if row >> movement is needed. >> >> We have to open an extra relation for the root partition, and keep it >> opened and its handle stored in >> mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes >> this if it is different from node->resultRelInfo->ri_RelationDesc. If >> it is same as node->resultRelInfo, it should not be closed because it >> gets closed as part of ExecEndPlan(). > > I guess you're referring to the following hunk. Some comments: > > @@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node) > * Close all the partitioned tables, leaf partitions, and their indices > * > * Remember node->mt_partition_dispatch_info[0] corresponds to the root > - * partitioned table, which we must not try to close, because it is the > - * main target table of the query that will be closed by ExecEndPlan(). > - * Also, tupslot is NULL for the root partitioned table. > + * partitioned table, which should not be closed if it is the main target > + * table of the query, which will be closed by ExecEndPlan(). > > The last part could be written as: because it will be closed by ExecEndPlan(). Actually I later realized that the relation is not required to be kept open until ExecEndmodifyTable(). So I reverted the above changes. Now it is immediately closed once all the row-movement-related setup is done. > > Also, tupslot > + * is NULL for the root partitioned table. > */ > + if (node->mt_num_dispatch > 0) > + { > + Relation root_partition; > > root_relation? > > + > + root_partition = node->mt_partition_dispatch_info[0]->reldesc; > + if (root_partition != node->resultRelInfo->ri_RelationDesc) > + heap_close(root_partition, NoLock); > + } > > It might be a good idea to Assert inside the if block above that > node->operation != CMD_INSERT. Perhaps, also reflect that in the comment > above so that it's clearer. This does not apply now since I reverted as mentioned above. > >> Looking at that, I found that in the current patch, if there is no >> row-movement happening, ExecPartitionCheck() effectively gets called >> twice : First time when ExecPartitionCheck() is explicitly called for >> row-movement-required check, and second time in ExecConstraints() >> call. May be there should be 2 separate functions >> ExecCheckConstraints() and ExecPartitionConstraints(), and also >> ExecCheckConstraints() that just calls both. This way we can call the >> appropriate functions() accordingly in row-movement case, and the >> other callers would continue to call ExecConstraints(). > > One random idea: we could add a bool ri_PartitionCheckOK which is set to > true after it is checked in ExecConstraints(). 
And modify the condition > in ExecConstraints() as follows: > > if (resultRelInfo->ri_PartitionCheck && >+ !resultRelInfo->ri_PartitionCheckOK && > !ExecPartitionCheck(resultRelInfo, slot, estate)) I have taken out the part in ExecConstraints where it forms and emits partition constraint error message, and put in new function ExecPartitionCheckEmitError(), and this is called in ExecConstraints() as well as in ExecUpdate() when it finds that it is not a partitioned table. This happens when the UPDATE has been run on a leaf partition, and when ExecPartitionCheck() fails for the leaf partition. Here, we just need to emit the same error message that ExecConstraint() emits.
Attachment
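A rough example of the WITH CHECK OPTION case that the new ExecInitPartitionWithCheckOptions() call is meant to cover; the view and data are hypothetical, and the row-moving update assumes the patch:

    CREATE VIEW p_view AS
        SELECT * FROM p WHERE b <> 'hidden'
        WITH CHECK OPTION;

    INSERT INTO p VALUES (5, 'visible');

    -- An UPDATE through the view that both moves the row to another partition
    -- and makes it invisible through the view should still be rejected, which
    -- is why the destination partitions' check options need to be initialized:
    UPDATE p_view SET a = 15, b = 'hidden' WHERE a = 5;
    -- ERROR:  new row violates check option for view "p_view"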
On 27 March 2017 at 13:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Also, there are a few places in the documentation mentioning that such updates cause error, >> which will need to be updated. Perhaps also add some explanatory notes >> about the mechanism (delete+insert), trigger behavior, caveats, etc. >> There were some points discussed upthread that could be mentioned in the >> documentation. >> Yeah, I agree. Documentation needs some important changes. I am still >> working on them. Attached patch v5 has above required doc changes added. In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have removed the caveat of not being able to update partition key. And it is now replaced by the caveat where an update/delete operations can silently miss a row when there is a concurrent UPDATE of partition-key happening. UPDATE row movement behaviour is described in : Part VI "Reference => SQL Commands => UPDATE > On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote: >> How about running the BR update triggers for the old partition and the >> AR update triggers for the new partition? It seems weird to run BR >> update triggers but not AR update triggers. Another option would be >> to run BR and AR delete triggers and then BR and AR insert triggers, >> emphasizing the choice to treat this update as a delete + insert, but >> (as Amit Kh. pointed out to me when we were in a room together this >> week) that precludes using the BEFORE trigger to modify the row. > > I checked the trigger behaviour in case of UPSERT. Here, when there is > conflict found, ExecOnConflictUpdate() is called, and then the > function returns immediately, which means AR INSERT trigger will not > fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR > and AR UPDATE triggers will be fired. So in short, when an INSERT > becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE > and AR UPDATE also get fired. On the same lines, it makes sense in > case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on > the original table, and then the BR and AR DELETE/INSERT triggers on > the respective tables. > > So the common policy can be : > Fire the BR trigger. It can be INESRT/UPDATE/DELETE trigger depending > upon what the statement is. > If there is a change in the operation, according to what the operation > is converted to (UPDATE->DELETE+INSERT or INSERT->UPDATE), all the > respective triggers would be fired. The current patch already has the behaviour as per above policy. So I have included the description of this trigger related behaviour in the "Overview of Trigger Behavior" section of the docs. This has been derived from the way it is written for trigger behaviour for UPSERT in the preceding section.
Attachment
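The UPSERT precedent described above can be observed with a small sketch; it reuses the trig_notice() function from the earlier example, and the table and trigger names are illustrative:

    CREATE TABLE t (k int PRIMARY KEY, v text);
    CREATE TRIGGER t_br_ins BEFORE INSERT ON t FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER t_ar_ins AFTER  INSERT ON t FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER t_br_upd BEFORE UPDATE ON t FOR EACH ROW EXECUTE PROCEDURE trig_notice();
    CREATE TRIGGER t_ar_upd AFTER  UPDATE ON t FOR EACH ROW EXECUTE PROCEDURE trig_notice();

    INSERT INTO t VALUES (1, 'a');
    INSERT INTO t VALUES (1, 'b') ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
    -- For the conflicting INSERT the notices show BEFORE INSERT, then BEFORE
    -- UPDATE and AFTER UPDATE; AFTER INSERT does not fire, matching the
    -- INSERT-turned-UPDATE precedent described above.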
Hi Amit, Thanks for the updated patches. On 2017/03/28 19:12, Amit Khandekar wrote: > On 27 March 2017 at 13:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> Also, there are a few places in the documentation mentioning that such updates cause error, >>> which will need to be updated. Perhaps also add some explanatory notes >>> about the mechanism (delete+insert), trigger behavior, caveats, etc. >>> There were some points discussed upthread that could be mentioned in the >>> documentation. >>> Yeah, I agree. Documentation needs some important changes. I am still >>> working on them. > > Attached patch v5 has above required doc changes added. > > In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have > removed the caveat of not being able to update partition key. And it > is now replaced by the caveat where an update/delete operations can > silently miss a row when there is a concurrent UPDATE of partition-key > happening. Hmm, how about just removing the "partition-changing updates are disallowed" caveat from the list on the 5.11 Partitioning page and explain the concurrency-related caveats on the UPDATE reference page? > UPDATE row movement behaviour is described in : > Part VI "Reference => SQL Commands => UPDATE > >> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote: >>> How about running the BR update triggers for the old partition and the >>> AR update triggers for the new partition? It seems weird to run BR >>> update triggers but not AR update triggers. Another option would be >>> to run BR and AR delete triggers and then BR and AR insert triggers, >>> emphasizing the choice to treat this update as a delete + insert, but >>> (as Amit Kh. pointed out to me when we were in a room together this >>> week) that precludes using the BEFORE trigger to modify the row. >> >> I checked the trigger behaviour in case of UPSERT. Here, when there is >> conflict found, ExecOnConflictUpdate() is called, and then the >> function returns immediately, which means AR INSERT trigger will not >> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR >> and AR UPDATE triggers will be fired. So in short, when an INSERT >> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE >> and AR UPDATE also get fired. On the same lines, it makes sense in >> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on >> the original table, and then the BR and AR DELETE/INSERT triggers on >> the respective tables. >> >> So the common policy can be : >> Fire the BR trigger. It can be INESRT/UPDATE/DELETE trigger depending >> upon what the statement is. >> If there is a change in the operation, according to what the operation >> is converted to (UPDATE->DELETE+INSERT or INSERT->UPDATE), all the >> respective triggers would be fired. > > The current patch already has the behaviour as per above policy. So I > have included the description of this trigger related behaviour in the > "Overview of Trigger Behavior" section of the docs. This has been > derived from the way it is written for trigger behaviour for UPSERT in > the preceding section. I tested how various row-level triggers behave and it all seems to work as you have described in your various messages, which the latest patch also documents. Some comments on the patch itself: - An <command>UPDATE</> that causes a row to move from one partition to - another fails, because the new value of the row fails to satisfy the - implicit partition constraint of the original partition. 
This might - change in future releases. + An <command>UPDATE</> causes a row to move from one partition to another + if the new value of the row fails to satisfy the implicit partition <snip> As mentioned above, we can simply remove this item from the list of caveats on ddl.sgml. The new text can be moved to the Notes portion of the UPDATE reference page. + If an <command>UPDATE</command> on a partitioned table causes a row to + move to another partition, it is possible that all row-level + <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level + <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command> + triggers are applied on the respective partitions in a way that is apparent + from the final state of the updated row. How about dropping "it is possible that" from this sentence? + <command>UPDATE</command> is done by doing a <command>DELETE</command> on How about: s/is done/is performed/g + triggers are not applied because the <command>UPDATE</command> is converted + to a <command>DELETE</command> and <command>UPDATE</command>. I think you meant DELETE and INSERT. + if (resultRelInfo->ri_PartitionCheck && + !ExecPartitionCheck(resultRelInfo, slot, estate)) + { How about a one-line comment what this block of code does? - * Check the constraints of the tuple. Note that we pass the same + * Check the constraints of the tuple. Note that we pass the same I think that this hunk is not necessary. (I've heard that two spaces after a sentence-ending period is not a problem [1].) + * We have already run partition constraints above, so skip them below. How about: s/run/checked the/g? @@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node) heap_close(pd->reldesc, NoLock); ExecDropSingleTupleTableSlot(pd->tupslot); } + for (i = 0; i < node->mt_num_partitions; i++) { ResultRelInfo *resultRelInfo = node->mt_partitions + i; Needless hunk. Overall, I think the patch looks good now. Thanks again for working on it. Thanks, Amit [1] https://www.python.org/dev/peps/pep-0008/#comments
For some reason, my reply got sent to only Amit Langote instead of reply-to-all. Below is the mail reply. Thanks Amit Langote for bringing this to my notice. On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/03/28 19:12, Amit Khandekar wrote: >>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have >>> removed the caveat of not being able to update partition key. And it >>> is now replaced by the caveat where an update/delete operations can >>> silently miss a row when there is a concurrent UPDATE of partition-key >>> happening. >> >> Hmm, how about just removing the "partition-changing updates are >> disallowed" caveat from the list on the 5.11 Partitioning page and explain >> the concurrency-related caveats on the UPDATE reference page? > > IMHO this caveat is better placed in Partitioning chapter to emphasize > that it is a drawback specifically in presence of partitioning. > >> + If an <command>UPDATE</command> on a partitioned table causes a row to >> + move to another partition, it is possible that all row-level >> + <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level >> + <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command> >> + triggers are applied on the respective partitions in a way that is >> apparent >> + from the final state of the updated row. >> >> How about dropping "it is possible that" from this sentence? > > What the statement means is : "It is true that all triggers are > applied on the respective partitions; but it is possible that they are > applied in a way that is apparent from final state of the updated > row". So "possible" applies to "in a way that is apparent..". It > means, the user should be aware that all the triggers can change the > row and so the final row will be affected by all those triggers. > Actually, we have a similar statement for UPSERT involved with > triggers in the preceding section. I have taken the statement from > there. > >> >> + <command>UPDATE</command> is done by doing a <command>DELETE</command> on >> >> How about: s/is done/is performed/g > > Done. > >> >> + triggers are not applied because the <command>UPDATE</command> is >> converted >> + to a <command>DELETE</command> and <command>UPDATE</command>. >> >> I think you meant DELETE and INSERT. > > Oops. Corrected. > >> >> + if (resultRelInfo->ri_PartitionCheck && >> + !ExecPartitionCheck(resultRelInfo, slot, estate)) >> + { >> >> How about a one-line comment what this block of code does? > > Yes, this was needed. Added a comment. > >> >> - * Check the constraints of the tuple. Note that we pass the same >> + * Check the constraints of the tuple. Note that we pass the same >> >> I think that this hunk is not necessary. (I've heard that two spaces >> after a sentence-ending period is not a problem [1].) > > Actually I accidentally removed one space, thinking that it was one of > my own comments. Reverted back this change, since it is a needless > hunk. > >> >> + * We have already run partition constraints above, so skip them below. >> >> How about: s/run/checked the/g? > > Done. > >> @@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node) >> heap_close(pd->reldesc, NoLock); >> ExecDropSingleTupleTableSlot(pd->tupslot); >> } >> + >> for (i = 0; i < node->mt_num_partitions; i++) >> { >> ResultRelInfo *resultRelInfo = node->mt_partitions + i; >> >> Needless hunk. > > Right. Removed. 
> >> >> Overall, I think the patch looks good now. Thanks again for working on it. > > Thanks Amit for your efforts in reviewing the patch. Attached is v6 > patch that contains above points handled. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
Hi Amit, Thanks for updating the patch. Since ddl.sgml got updated on Saturday, patch needs a rebase. > On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> On 2017/03/28 19:12, Amit Khandekar wrote: >>>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have >>>> removed the caveat of not being able to update partition key. And it >>>> is now replaced by the caveat where an update/delete operations can >>>> silently miss a row when there is a concurrent UPDATE of partition-key >>>> happening. >>> >>> Hmm, how about just removing the "partition-changing updates are >>> disallowed" caveat from the list on the 5.11 Partitioning page and explain >>> the concurrency-related caveats on the UPDATE reference page? >> >> IMHO this caveat is better placed in Partitioning chapter to emphasize >> that it is a drawback specifically in presence of partitioning. I mean we fixed things for declarative partitioning so it's no longer a caveat like it is for partitioning implemented using inheritance (in that the former doesn't require user-defined triggers to implement row-movement). Seeing the first sentence, that is: An <command>UPDATE</> causes a row to move from one partition to another if the new value of the row fails to satisfy the implicit partition constraint of the original partition but there is another partition which can fit this row. which clearly seems to suggest that row-movement, if required, is handled by the system. So it's not clear why it's in this list. If we want to describe the limitations of the current implementation, we'll need to rephrase it a bit. How about something like: For an <command>UPDATE</> that causes a row to move from one partition to another due the partition key being updated, the following caveats exist: <a brief description of the possibility of surprising results in the presence of concurrent manipulation of the row in question> >>> + If an <command>UPDATE</command> on a partitioned table causes a row to >>> + move to another partition, it is possible that all row-level >>> + <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level >>> + <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command> >>> + triggers are applied on the respective partitions in a way that is >>> apparent >>> + from the final state of the updated row. >>> >>> How about dropping "it is possible that" from this sentence? >> >> What the statement means is : "It is true that all triggers are >> applied on the respective partitions; but it is possible that they are >> applied in a way that is apparent from final state of the updated >> row". So "possible" applies to "in a way that is apparent..". It >> means, the user should be aware that all the triggers can change the >> row and so the final row will be affected by all those triggers. >> Actually, we have a similar statement for UPSERT involved with >> triggers in the preceding section. I have taken the statement from >> there. I think where it appears in that sentence made me think it could be confusing to some. How about reordering sentences in that paragraph so that the whole paragraphs reads as follows: If an UPDATE on a partitioned table causes a row to move to another partition, it will be performed as a DELETE from the original partition followed by INSERT into the new partition. 
In this case, all row-level BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired on the original partition. Then all row-level BEFORE INSERT triggers are fired on the destination partition. The possibility of surprising outcomes should be considered when all these triggers affect the row being moved. As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT triggers are applied; but AFTER UPDATE triggers are not applied because the UPDATE has been converted to a DELETE and INSERT. None of the DELETE and INSERT statement-level triggers are fired, even if row movement occurs; only the UPDATE triggers of the target table used in the UPDATE statement will be fired. Finally, I forgot to mention during the last review that the new parameter 'returning' to ExecDelete() could be called 'process_returning'. Thanks, Amit
On 3 April 2017 at 17:13, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Amit, > > Thanks for updating the patch. Since ddl.sgml got updated on Saturday, > patch needs a rebase. Rebased now. > >> On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>> On 2017/03/28 19:12, Amit Khandekar wrote: >>>>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have >>>>> removed the caveat of not being able to update partition key. And it >>>>> is now replaced by the caveat where an update/delete operations can >>>>> silently miss a row when there is a concurrent UPDATE of partition-key >>>>> happening. >>>> >>>> Hmm, how about just removing the "partition-changing updates are >>>> disallowed" caveat from the list on the 5.11 Partitioning page and explain >>>> the concurrency-related caveats on the UPDATE reference page? >>> >>> IMHO this caveat is better placed in Partitioning chapter to emphasize >>> that it is a drawback specifically in presence of partitioning. > > I mean we fixed things for declarative partitioning so it's no longer a > caveat like it is for partitioning implemented using inheritance (in that > the former doesn't require user-defined triggers to implement > row-movement). Seeing the first sentence, that is: > > An <command>UPDATE</> causes a row to move from one partition to another > if the new value of the row fails to satisfy the implicit partition > constraint of the original partition but there is another partition which > can fit this row. > > which clearly seems to suggest that row-movement, if required, is handled > by the system. So it's not clear why it's in this list. If we want to > describe the limitations of the current implementation, we'll need to > rephrase it a bit. Yes I agree. > How about something like: > For an <command>UPDATE</> that causes a row to move from one partition to > another due the partition key being updated, the following caveats exist: > <a brief description of the possibility of surprising results in the > presence of concurrent manipulation of the row in question> Now with the slightly changed doc structuring for partitioning in latest master, I have described in the end of section "5.10.2. Declarative Partitioning" this note : --- "Updating the partition key of a row might cause it to be moved into a different partition where this row satisfies its partition constraint." --- And then in the Limitations section, I have replaced the earlier can't-update-partition-key limitation with this new limitation as below : "When an UPDATE causes a row to move from one partition to another, there is a chance that another concurrent UPDATE or DELETE misses this row. Suppose, during the row movement, the row is still visible for the concurrent session, and it is about to do an UPDATE or DELETE operation on the same row. This DML operation can silently miss this row if the row now gets deleted from the partition by the first session as part of its UPDATE row movement. In such case, the concurrent UPDATE/DELETE, being unaware of the row movement, interprets that the row has just been deleted so there is nothing to be done for this row. Whereas, in the usual case where the table is not partitioned, or where there is no row movement, the second session would have identified the newly updated row and carried UPDATE/DELETE on this new row version." 
--- Further, in the Notes section of update.sgml, I have kept a link to the above limitations section like this : "In the case of a partitioned table, updating a row might cause it to no longer satisfy the partition constraint of the containing partition. In that case, if there is some other partition in the partition tree for which this row satisfies its partition constraint, then the row is moved to that partition. If there isn't such a partition, an error will occur. The error will also occur when updating a partition directly. Behind the scenes, the row movement is actually a DELETE and INSERT operation. However, there is a possibility that a concurrent UPDATE or DELETE on the same row may miss this row. For details see the section Section 5.10.2.3." > >>>> + If an <command>UPDATE</command> on a partitioned table causes a row to >>>> + move to another partition, it is possible that all row-level >>>> + <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level >>>> + <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command> >>>> + triggers are applied on the respective partitions in a way that is >>>> apparent >>>> + from the final state of the updated row. >>>> >>>> How about dropping "it is possible that" from this sentence? >>> >>> What the statement means is : "It is true that all triggers are >>> applied on the respective partitions; but it is possible that they are >>> applied in a way that is apparent from final state of the updated >>> row". So "possible" applies to "in a way that is apparent..". It >>> means, the user should be aware that all the triggers can change the >>> row and so the final row will be affected by all those triggers. >>> Actually, we have a similar statement for UPSERT involved with >>> triggers in the preceding section. I have taken the statement from >>> there. > > I think where it appears in that sentence made me think it could be > confusing to some. How about reordering sentences in that paragraph so > that the whole paragraphs reads as follows: > > If an UPDATE on a partitioned table causes a row to move to another > partition, it will be performed as a DELETE from the original partition > followed by INSERT into the new partition. In this case, all row-level > BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired > on the original partition. Then all row-level BEFORE INSERT triggers are > fired on the destination partition. The possibility of surprising outcomes > should be considered when all these triggers affect the row being moved. > As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT > triggers are applied; but AFTER UPDATE triggers are not applied because > the UPDATE has been converted to a DELETE and INSERT. None of the DELETE > and INSERT statement-level triggers are fired, even if row movement > occurs; only the UPDATE triggers of the target table used in the UPDATE > statement will be fired. Yeah, most of the above makes sense to me. I have kept the phrase "as far as statement-level triggers are concerned". > > Finally, I forgot to mention during the last review that the new parameter > 'returning' to ExecDelete() could be called 'process_returning'. Done, thanks. Attached updated patch v7 has the above changes. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Hi Amit, On 2017/04/04 20:11, Amit Khandekar wrote: > On 3 April 2017 at 17:13, Amit Langote wrote: >>>> On 31 March 2017 at 14:04, Amit Langote wrote: >> How about something like: >> For an <command>UPDATE</> that causes a row to move from one partition to >> another due the partition key being updated, the following caveats exist: >> <a brief description of the possibility of surprising results in the >> presence of concurrent manipulation of the row in question> > > Now with the slightly changed doc structuring for partitioning in > latest master, I have described in the end of section "5.10.2. > Declarative Partitioning" this note : > > --- > > "Updating the partition key of a row might cause it to be moved into a > different partition where this row satisfies its partition > constraint." > > --- > > And then in the Limitations section, I have replaced the earlier > can't-update-partition-key limitation with this new limitation as > below : > > "When an UPDATE causes a row to move from one partition to another, > there is a chance that another concurrent UPDATE or DELETE misses this > row. Suppose, during the row movement, the row is still visible for > the concurrent session, and it is about to do an UPDATE or DELETE > operation on the same row. This DML operation can silently miss this > row if the row now gets deleted from the partition by the first > session as part of its UPDATE row movement. In such case, the > concurrent UPDATE/DELETE, being unaware of the row movement, > interprets that the row has just been deleted so there is nothing to > be done for this row. Whereas, in the usual case where the table is > not partitioned, or where there is no row movement, the second session > would have identified the newly updated row and carried UPDATE/DELETE > on this new row version." > > --- OK. > Further, in the Notes section of update.sgml, I have kept a link to > the above limitations section like this : > > "In the case of a partitioned table, updating a row might cause it to > no longer satisfy the partition constraint of the containing > partition. In that case, if there is some other partition in the > partition tree for which this row satisfies its partition constraint, > then the row is moved to that partition. If there isn't such a > partition, an error will occur. The error will also occur when > updating a partition directly. Behind the scenes, the row movement is > actually a DELETE and INSERT operation. However, there is a > possibility that a concurrent UPDATE or DELETE on the same row may > miss this row. For details see the section Section 5.10.2.3." OK, too. It seems to me that the details in 5.10.2.3 provide more or less the same information as "concurrent UPDATE or DELETE looking at the moved row will miss this row", but maybe that's fine. >> If an UPDATE on a partitioned table causes a row to move to another >> partition, it will be performed as a DELETE from the original partition >> followed by INSERT into the new partition. In this case, all row-level >> BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired >> on the original partition. Then all row-level BEFORE INSERT triggers are >> fired on the destination partition. The possibility of surprising outcomes >> should be considered when all these triggers affect the row being moved. >> As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT >> triggers are applied; but AFTER UPDATE triggers are not applied because >> the UPDATE has been converted to a DELETE and INSERT. 
None of the DELETE >> and INSERT statement-level triggers are fired, even if row movement >> occurs; only the UPDATE triggers of the target table used in the UPDATE >> statement will be fired. > > Yeah, most of the above makes sense to me. I have kept the phrase "as > far as statement-level triggers are concerned". OK, sure. >> Finally, I forgot to mention during the last review that the new parameter >> 'returning' to ExecDelete() could be called 'process_returning'. > > Done, thanks. > > Attached updated patch v7 has the above changes. Marked as ready for committer. Thanks, Amit
On Wed, Apr 5, 2017 at 5:54 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Marked as ready for committer. Andres seems to have changed the status of this patch to "Needs review" and then, 30 seconds later, to "Waiting on author", but there's no actual email on the thread explaining what his concerns were. I'm going to set this back to "Ready for Committer" and push it out to the next CommitFest. I think this would be a great feature, but I think it's not entirely clear that we have consensus on the design, so let's revisit it for next release. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-04-07 13:55:51 -0400, Robert Haas wrote: > On Wed, Apr 5, 2017 at 5:54 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > Marked as ready for committer. > > Andres seems to have changed the status of this patch to "Needs > review" and then, 30 seconds later, to "Waiting on author" > there's no actual email on the thread explaining what his concerns > were. I'm going to set this back to "Ready for Committer" and push it > out to the next CommitFest. I think this would be a great feature, > but I think it's not entirely clear that we have consensus on the > design, so let's revisit it for next release. I was kind of looking for the appropriate status of "not entirely clear that we have consensus on the design" - which isn't really ready-for-committer, but no waiting-on-author either... Greetings, Andres Freund
On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached updated patch v7 has the above changes. This no longer applies. Please rebase. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2 May 2017 at 18:17, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Attached updated patch v7 has the above changes. > > This no longer applies. Please rebase. Thanks Robert for informing about this. My patch has a separate function for emitting error message when a partition constraint fails. And, the recent commit c0a8ae7be3 has changes to correct the way the tuple is formed for displaying in the error message. Hence there were some code-level conflicts. Attached is the rebased patch, which resolves the above conflicts. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> I think it does not make sense running after row triggers in case of >>> row-movement. There is no update happened on that leaf partition. This >>> reasoning can also apply to BR update triggers. But the reasons for >>> having a BR trigger and AR triggers are quite different. Generally, a >>> user needs to do some modifications to the row before getting the >>> final NEW row into the database, and hence [s]he defines a BR trigger >>> for that. And we can't just silently skip this step only because the >>> final row went into some other partition; in fact the row-movement >>> itself might depend on what the BR trigger did with the row. Whereas, >>> AR triggers are typically written for doing some other operation once >>> it is made sure the row is actually updated. In case of row-movement, >>> it is not actually updated. >> >> How about running the BR update triggers for the old partition and the >> AR update triggers for the new partition? It seems weird to run BR >> update triggers but not AR update triggers. Another option would be >> to run BR and AR delete triggers and then BR and AR insert triggers, >> emphasizing the choice to treat this update as a delete + insert, but >> (as Amit Kh. pointed out to me when we were in a room together this >> week) that precludes using the BEFORE trigger to modify the row. >> I also find the current behavior with respect to triggers quite odd. The two points that appears odd are (a) Executing both before row update and delete triggers on original partition sounds quite odd. (b) It seems inconsistent to consider behavior for row and statement triggers differently > > I checked the trigger behaviour in case of UPSERT. Here, when there is > conflict found, ExecOnConflictUpdate() is called, and then the > function returns immediately, which means AR INSERT trigger will not > fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR > and AR UPDATE triggers will be fired. So in short, when an INSERT > becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE > and AR UPDATE also get fired. On the same lines, it makes sense in > case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on > the original table, and then the BR and AR DELETE/INSERT triggers on > the respective tables. > I am not sure if it is good idea to compare it with "Insert On Conflict Do Update", but even if we want that way, I think Insert On Conflict is consistent in statement level triggers which means it will fire both Insert and Update statement level triggres (as per below note in docs) whereas the documentation in the patch indicates that this patch will only fire Update statement level triggers which is odd. Note in docs about Insert On Conflict "Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both INSERT and UPDATE statement level trigger will be fired. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 3, 2017 at 11:22 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 2 May 2017 at 18:17, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> Attached updated patch v7 has the above changes. >> > > Attached is the rebased patch, which resolves the above conflicts. > Few comments: 1. Operating directly on partition doesn't allow update to move row. Refer below example: create table t1(c1 int) partition by range(c1); create table t1_part_1 partition of t1 for values from (1) to (100); create table t1_part_2 partition of t1 for values from (100) to (200); insert into t1 values(generate_series(1,11)); insert into t1 values(generate_series(110,120)); postgres=# update t1_part_1 set c1=122 where c1=11; ERROR: new row for relation "t1_part_1" violates partition constraint DETAIL: Failing row contains (122). 2. - +static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, + Relation root_rel); Spurious line delete. 3. + longer satisfy the partition constraint of the containing partition. In that + case, if there is some other partition in the partition tree for which this + row satisfies its partition constraint, then the row is moved to that + partition. If there isn't such a partition, an error will occur. Doesn't this error case indicate that this needs to be integrated with Default partition patch of Rahila or that patch needs to take care this error case? Basically, if there is no matching partition, then move it to default partition. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, May 11, 2017 at 7:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Few comments: > 1. > Operating directly on partition doesn't allow update to move row. > Refer below example: > create table t1(c1 int) partition by range(c1); > create table t1_part_1 partition of t1 for values from (1) to (100); > create table t1_part_2 partition of t1 for values from (100) to (200); > insert into t1 values(generate_series(1,11)); > insert into t1 values(generate_series(110,120)); > > postgres=# update t1_part_1 set c1=122 where c1=11; > ERROR: new row for relation "t1_part_1" violates partition constraint > DETAIL: Failing row contains (122). I think that's correct behavior. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>> I think it does not make sense running after row triggers in case of >>>> row-movement. There is no update happened on that leaf partition. This >>>> reasoning can also apply to BR update triggers. But the reasons for >>>> having a BR trigger and AR triggers are quite different. Generally, a >>>> user needs to do some modifications to the row before getting the >>>> final NEW row into the database, and hence [s]he defines a BR trigger >>>> for that. And we can't just silently skip this step only because the >>>> final row went into some other partition; in fact the row-movement >>>> itself might depend on what the BR trigger did with the row. Whereas, >>>> AR triggers are typically written for doing some other operation once >>>> it is made sure the row is actually updated. In case of row-movement, >>>> it is not actually updated. >>> >>> How about running the BR update triggers for the old partition and the >>> AR update triggers for the new partition? It seems weird to run BR >>> update triggers but not AR update triggers. Another option would be >>> to run BR and AR delete triggers and then BR and AR insert triggers, >>> emphasizing the choice to treat this update as a delete + insert, but >>> (as Amit Kh. pointed out to me when we were in a room together this >>> week) that precludes using the BEFORE trigger to modify the row. >>> > > I also find the current behavior with respect to triggers quite odd. > The two points that appears odd are (a) Executing both before row > update and delete triggers on original partition sounds quite odd. Note that *before* trigger gets fired *before* the update happens. The actual update may not even happen, depending upon what the trigger does. And then in our case, the update does not happen; not just that, it is transformed into delete-insert. So then we should fire before-delete trigger. > (b) It seems inconsistent to consider behavior for row and statement > triggers differently I am not sure whether we should compare row and statement triggers. Statement triggers are anyway fired only per-statement, depending upon whether it is update or insert or delete. It has nothing to do with how the rows are modified. > >> >> I checked the trigger behaviour in case of UPSERT. Here, when there is >> conflict found, ExecOnConflictUpdate() is called, and then the >> function returns immediately, which means AR INSERT trigger will not >> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR >> and AR UPDATE triggers will be fired. So in short, when an INSERT >> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE >> and AR UPDATE also get fired. On the same lines, it makes sense in >> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on >> the original table, and then the BR and AR DELETE/INSERT triggers on >> the respective tables. 
>> > > I am not sure if it is good idea to compare it with "Insert On > Conflict Do Update", but even if we want that way, I think Insert On > Conflict is consistent in statement level triggers which means it will > fire both Insert and Update statement level triggres (as per below > note in docs) whereas the documentation in the patch indicates that > this patch will only fire Update statement level triggers which is > odd > > Note in docs about Insert On Conflict > "Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both > INSERT and UPDATE statement level trigger will be fired. I guess the reason this behaviour is kept for UPSERT, is because the statement itself suggests : insert/update. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
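As a concrete reference for the UPSERT comparison above, a small sketch of the existing ON CONFLICT DO UPDATE trigger behaviour being cited; the table and trigger names are made up:

CREATE TABLE uc (k int PRIMARY KEY, v int);

CREATE FUNCTION note_trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
    RETURN COALESCE(NEW, OLD);
END;
$$;

CREATE TRIGGER uc_br BEFORE INSERT OR UPDATE ON uc
    FOR EACH ROW EXECUTE PROCEDURE note_trig();
CREATE TRIGGER uc_ar AFTER INSERT OR UPDATE ON uc
    FOR EACH ROW EXECUTE PROCEDURE note_trig();

INSERT INTO uc VALUES (1, 0);
INSERT INTO uc VALUES (1, 1) ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
-- The second INSERT raises: BEFORE INSERT, then BEFORE UPDATE and
-- AFTER UPDATE; no AFTER INSERT, matching the description above.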
On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote: > Few comments: > 1. > Operating directly on partition doesn't allow update to move row. > Refer below example: > create table t1(c1 int) partition by range(c1); > create table t1_part_1 partition of t1 for values from (1) to (100); > create table t1_part_2 partition of t1 for values from (100) to (200); > insert into t1 values(generate_series(1,11)); > insert into t1 values(generate_series(110,120)); > > postgres=# update t1_part_1 set c1=122 where c1=11; > ERROR: new row for relation "t1_part_1" violates partition constraint > DETAIL: Failing row contains (122). Yes, as Robert said, this is expected behaviour. We move the row only within the partition subtree that has the update table as its root. In this case, it's the leaf partition. > > 3. > + longer satisfy the partition constraint of the containing partition. In that > + case, if there is some other partition in the partition tree for which this > + row satisfies its partition constraint, then the row is moved to that > + partition. If there isn't such a partition, an error will occur. > > Doesn't this error case indicate that this needs to be integrated with > Default partition patch of Rahila or that patch needs to take care > this error case? > Basically, if there is no matching partition, then move it to default partition. Will have a look on this. Thanks for pointing this out. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote: >>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>>> I think it does not make sense running after row triggers in case of >>>>> row-movement. There is no update happened on that leaf partition. This >>>>> reasoning can also apply to BR update triggers. But the reasons for >>>>> having a BR trigger and AR triggers are quite different. Generally, a >>>>> user needs to do some modifications to the row before getting the >>>>> final NEW row into the database, and hence [s]he defines a BR trigger >>>>> for that. And we can't just silently skip this step only because the >>>>> final row went into some other partition; in fact the row-movement >>>>> itself might depend on what the BR trigger did with the row. Whereas, >>>>> AR triggers are typically written for doing some other operation once >>>>> it is made sure the row is actually updated. In case of row-movement, >>>>> it is not actually updated. >>>> >>>> How about running the BR update triggers for the old partition and the >>>> AR update triggers for the new partition? It seems weird to run BR >>>> update triggers but not AR update triggers. Another option would be >>>> to run BR and AR delete triggers and then BR and AR insert triggers, >>>> emphasizing the choice to treat this update as a delete + insert, but >>>> (as Amit Kh. pointed out to me when we were in a room together this >>>> week) that precludes using the BEFORE trigger to modify the row. >>>> >> >> I also find the current behavior with respect to triggers quite odd. >> The two points that appears odd are (a) Executing both before row >> update and delete triggers on original partition sounds quite odd. > Note that *before* trigger gets fired *before* the update happens. The > actual update may not even happen, depending upon what the trigger > does. And then in our case, the update does not happen; not just that, > it is transformed into delete-insert. So then we should fire > before-delete trigger. > Sure, I am aware of that point, but it doesn't seem obvious that both update and delete BR triggers get fired for original partition. As a developer, it might be obvious to you that as you have used delete and insert interface, it is okay that corresponding BR/AR triggers get fired, however, it is not so obvious for others, rather it appears quite odd. If we try to compare it with the non-partitioned update, there also it is internally a delete and insert operation, but we don't fire triggers for those. >> (b) It seems inconsistent to consider behavior for row and statement >> triggers differently > > I am not sure whether we should compare row and statement triggers. > Statement triggers are anyway fired only per-statement, depending upon > whether it is update or insert or delete. It has nothing to do with > how the rows are modified. > Okay. The broader point I was trying to convey was that the way this patch defines the behavior of triggers doesn't sound good to me. It appears to me that in this thread multiple people have raised points around trigger behavior and you should try to consider those. 
Apart from the options, Robert has suggested, another option could be that we allow firing BR-AR update triggers for original partition and BR-AR insert triggers for the new partition. In this case, one can argue that we have not actually updated the row in the original partition, so there is no need to fire AR update triggers, but I feel that is what we do for non-partitioned table update and it should be okay here as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Few comments: >> 1. >> Operating directly on partition doesn't allow update to move row. >> Refer below example: >> create table t1(c1 int) partition by range(c1); >> create table t1_part_1 partition of t1 for values from (1) to (100); >> create table t1_part_2 partition of t1 for values from (100) to (200); >> insert into t1 values(generate_series(1,11)); >> insert into t1 values(generate_series(110,120)); >> >> postgres=# update t1_part_1 set c1=122 where c1=11; >> ERROR: new row for relation "t1_part_1" violates partition constraint >> DETAIL: Failing row contains (122). > > Yes, as Robert said, this is expected behaviour. We move the row only > within the partition subtree that has the update table as its root. In > this case, it's the leaf partition. > Okay, but what is the technical reason behind it? Is it because the current design doesn't support it or is it because of something very fundamental to partitions? Is it because we can't find root partition from leaf partition? + is_partitioned_table = + root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE; + + if (is_partitioned_table) + ExecSetupPartitionTupleRouting( + root_rel, + /* Build WITH CHECK OPTION constraints for leaf partitions */ + ExecInitPartitionWithCheckOptions(mtstate, root_rel); + /* Build a projection for each leaf partition rel. */ + ExecInitPartitionReturningProjection(mtstate, root_rel); .. + /* It's not a partitioned table after all; error out. */ + ExecPartitionCheckEmitError(resultRelInfo, slot, estate); When we are anyway going to give error if table is not a partitioned table, then isn't it better to give it early when we first identify that. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Few comments: >>> 1. >>> Operating directly on partition doesn't allow update to move row. >>> Refer below example: >>> create table t1(c1 int) partition by range(c1); >>> create table t1_part_1 partition of t1 for values from (1) to (100); >>> create table t1_part_2 partition of t1 for values from (100) to (200); >>> insert into t1 values(generate_series(1,11)); >>> insert into t1 values(generate_series(110,120)); >>> >>> postgres=# update t1_part_1 set c1=122 where c1=11; >>> ERROR: new row for relation "t1_part_1" violates partition constraint >>> DETAIL: Failing row contains (122). >> >> Yes, as Robert said, this is expected behaviour. We move the row only >> within the partition subtree that has the update table as its root. In >> this case, it's the leaf partition. >> > > Okay, but what is the technical reason behind it? Is it because the > current design doesn't support it or is it because of something very > fundamental to partitions? > One plausible theory is that as Select's on partitions just returns the rows of that partition, the update should also behave in same way. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12 May 2017 at 08:30, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>>>> I think it does not make sense running after row triggers in case of >>>>>> row-movement. There is no update happened on that leaf partition. This >>>>>> reasoning can also apply to BR update triggers. But the reasons for >>>>>> having a BR trigger and AR triggers are quite different. Generally, a >>>>>> user needs to do some modifications to the row before getting the >>>>>> final NEW row into the database, and hence [s]he defines a BR trigger >>>>>> for that. And we can't just silently skip this step only because the >>>>>> final row went into some other partition; in fact the row-movement >>>>>> itself might depend on what the BR trigger did with the row. Whereas, >>>>>> AR triggers are typically written for doing some other operation once >>>>>> it is made sure the row is actually updated. In case of row-movement, >>>>>> it is not actually updated. >>>>> >>>>> How about running the BR update triggers for the old partition and the >>>>> AR update triggers for the new partition? It seems weird to run BR >>>>> update triggers but not AR update triggers. Another option would be >>>>> to run BR and AR delete triggers and then BR and AR insert triggers, >>>>> emphasizing the choice to treat this update as a delete + insert, but >>>>> (as Amit Kh. pointed out to me when we were in a room together this >>>>> week) that precludes using the BEFORE trigger to modify the row. >>>>> >>> >>> I also find the current behavior with respect to triggers quite odd. >>> The two points that appears odd are (a) Executing both before row >>> update and delete triggers on original partition sounds quite odd. >> Note that *before* trigger gets fired *before* the update happens. The >> actual update may not even happen, depending upon what the trigger >> does. And then in our case, the update does not happen; not just that, >> it is transformed into delete-insert. So then we should fire >> before-delete trigger. >> > > Sure, I am aware of that point, but it doesn't seem obvious that both > update and delete BR triggers get fired for original partition. As a > developer, it might be obvious to you that as you have used delete and > insert interface, it is okay that corresponding BR/AR triggers get > fired, however, it is not so obvious for others, rather it appears > quite odd. I agree that it seems a bit odd that we are firing both update and delete triggers on the same partition. But if you look at the perspective that the update=>delete+insert is a user-aware operation, it does not seem that odd. > If we try to compare it with the non-partitioned update, > there also it is internally a delete and insert operation, but we > don't fire triggers for those. For a non-partitioned table, the delete+insert is internal, whereas for partitioned table, it is completely visible to the user. > >>> (b) It seems inconsistent to consider behavior for row and statement >>> triggers differently >> >> I am not sure whether we should compare row and statement triggers. 
>> Statement triggers are anyway fired only per-statement, depending upon >> whether it is update or insert or delete. It has nothing to do with >> how the rows are modified. >> > > Okay. The broader point I was trying to convey was that the way this > patch defines the behavior of triggers doesn't sound good to me. It > appears to me that in this thread multiple people have raised points > around trigger behavior and you should try to consider those. I understand that there is no single solution which will provide completely intuitive trigger behaviour. Skipping BR delete trigger should be fine. But then for consistency, we should skip BR insert trigger as well, the theory being that the delete+insert are not fired by the user so we should not fire them. But I feel both should be fired to avoid any consequences unexpected to the user who has installed those triggers. The only specific concern of yours is that of firing *both* update as well as insert triggers on the same table, right ? My explanation for this was : we have done this before for UPSERT, and we had documented the same. We can do it here also. > Apart from the options, Robert has suggested, another option could be that > we allow firing BR-AR update triggers for original partition and BR-AR > insert triggers for the new partition. In this case, one can argue > that we have not actually updated the row in the original partition, > so there is no need to fire AR update triggers, Yes that's what I think. If there is no update happened, then AR update trigger should not be executed. AR triggers are only for scenarios where it is guaranteed that the DML operation has happened when the trigger is being executed. > but I feel that is what we do for non-partitioned table update and it should be okay here > as well. I don't think so. For e.g. if a BR trigger returns NULL, the update does not happen, and then the AR trigger does not fire as well. Do you see any other scenarios for non-partitioned tables, where AR triggers do fire when the update does not happen ? Overall, I am also open to skipping both insert+delete BR trigger, but I am trying to convince above that this might not be as odd as it sounds, especially if we document this clearly why we have done. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
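For reference, a minimal sketch of the "BR trigger returns NULL" case mentioned above, with hypothetical names: the row-level BEFORE trigger suppresses the UPDATE, so no AFTER UPDATE row trigger fires either.

CREATE TABLE skip_demo (a int);
INSERT INTO skip_demo VALUES (1);

CREATE FUNCTION skip_update() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RETURN NULL;    -- returning NULL skips the operation for this row
END;
$$;

CREATE TRIGGER br_skip BEFORE UPDATE ON skip_demo
    FOR EACH ROW EXECUTE PROCEDURE skip_update();

UPDATE skip_demo SET a = 2;    -- reports UPDATE 0; no AFTER UPDATE trigger runs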
On 12 May 2017 at 10:01, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> Few comments: >>>> 1. >>>> Operating directly on partition doesn't allow update to move row. >>>> Refer below example: >>>> create table t1(c1 int) partition by range(c1); >>>> create table t1_part_1 partition of t1 for values from (1) to (100); >>>> create table t1_part_2 partition of t1 for values from (100) to (200); >>>> insert into t1 values(generate_series(1,11)); >>>> insert into t1 values(generate_series(110,120)); >>>> >>>> postgres=# update t1_part_1 set c1=122 where c1=11; >>>> ERROR: new row for relation "t1_part_1" violates partition constraint >>>> DETAIL: Failing row contains (122). >>> >>> Yes, as Robert said, this is expected behaviour. We move the row only >>> within the partition subtree that has the update table as its root. In >>> this case, it's the leaf partition. >>> >> >> Okay, but what is the technical reason behind it? Is it because the >> current design doesn't support it or is it because of something very >> fundamental to partitions? No, we can do that if decide to update some table outside the partition subtree. The reason is more of semantics. I think the user who is running UPDATE for a partitioned table, should not be necessarily aware of the structure of the complete partition tree outside of the current subtree. It is always safe to return error instead of moving the data outside of the subtree silently. >> > > One plausible theory is that as Select's on partitions just returns > the rows of that partition, the update should also behave in same way. Yes , right. Or even inserts fail if we try to insert data that does not fit into the current subtree. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
>> 3. >> + longer satisfy the partition constraint of the containing partition. In that >> + case, if there is some other partition in the partition tree for which this >> + row satisfies its partition constraint, then the row is moved to that >> + partition. If there isn't such a partition, an error will occur. >> >> Doesn't this error case indicate that this needs to be integrated with >> Default partition patch of Rahila or that patch needs to take care >> this error case? >> Basically, if there is no matching partition, then move it to default partition. > > Will have a look on this. Thanks for pointing this out. I tried update row movement with both my patch and default-partition patch applied. And it looks like it works as expected : 1. When an update changes the partitioned key such that the row does not fit into any of the non-default partitions, the row is moved to the default partition. 2. If the row does fit into a non-default partition, the row moves into that partition. 3. If a row from a default partition is updated such that it fits into any of the non-default partition, it moves into that partition. I think we can debate on whether the row should stay in the default partition or move. I think it should be moved, since now the row has a suitable partition. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
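To make the cases above concrete, a rough sketch of the kind of test involved; the DEFAULT partition syntax is the one proposed in that separate patch, so treat it as an assumption, and the table names are made up:

CREATE TABLE dp (a int) PARTITION BY RANGE (a);
CREATE TABLE dp1 PARTITION OF dp FOR VALUES FROM (1) TO (100);
CREATE TABLE dp_def PARTITION OF dp DEFAULT;    -- proposed default-partition syntax

INSERT INTO dp VALUES (50);
UPDATE dp SET a = 500 WHERE a = 50;   -- case 1: no non-default partition fits, row moves to dp_def
UPDATE dp SET a = 60  WHERE a = 500;  -- case 3: fits dp1 again, row moves out of the default partition

(Case 2 is just the ordinary row movement between non-default partitions shown earlier in the thread.)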
On Fri, Feb 24, 2017 at 3:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> It is of course very good that we have something ready for this >> release and can make a choice of what to do. >> >> Thoughts >> >> 1. Reuse the tuple state HEAP_MOVED_OFF which IIRC represent exactly >> almost exactly the same thing. An UPDATE which gets to a >> HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition >> metadata, or I might be persuaded that in-this-release it is >> acceptable to fail when this occurs with an ERROR and a retryable >> SQLCODE, since the UPDATE will succeed on next execution. > > I've got my doubts about whether we can make that bit work that way, > considering that we still support pg_upgrade (possibly in multiple > steps) from old releases that had VACUUM FULL. We really ought to put > some work into reclaiming those old bits, but there's probably no time > for that in v10. > I agree with you that it might not be straightforward to make it work, but now that earliest it can go is v11, do we want to try doing something other than just documenting it. What I could read from this e-mail thread is that you are intending towards just documenting it for the first cut of this feature. However, both Greg and Simon are of opinion that we should do something about this and even patch Author (Amit Khandekar) has shown some inclination to do something about this point (return error to the user in some way), so I think we can't ignore this point. I think now that we have some more time, it is better to try something based on a couple of ideas floating in this thread to address this point and see if we can come up with something doable without a big architecture change. What is your take on this point now? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, May 12, 2017 at 10:49 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 12 May 2017 at 08:30, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > >> If we try to compare it with the non-partitioned update, >> there also it is internally a delete and insert operation, but we >> don't fire triggers for those. > > For a non-partitioned table, the delete+insert is internal, whereas > for partitioned table, it is completely visible to the user. > If the user has executed an update on root table, then it is transparent. I think we can consider it user visible only in case if there is some user visible syntax like "Update ... Move Row If Constraint Not Satisfied" >> >>>> (b) It seems inconsistent to consider behavior for row and statement >>>> triggers differently >>> >>> I am not sure whether we should compare row and statement triggers. >>> Statement triggers are anyway fired only per-statement, depending upon >>> whether it is update or insert or delete. It has nothing to do with >>> how the rows are modified. >>> >> >> Okay. The broader point I was trying to convey was that the way this >> patch defines the behavior of triggers doesn't sound good to me. It >> appears to me that in this thread multiple people have raised points >> around trigger behavior and you should try to consider those. > > I understand that there is no single solution which will provide > completely intuitive trigger behaviour. Skipping BR delete trigger > should be fine. But then for consistency, we should skip BR insert > trigger as well, the theory being that the delete+insert are not fired > by the user so we should not fire them. But I feel both should be > fired to avoid any consequences unexpected to the user who has > installed those triggers. > > The only specific concern of yours is that of firing *both* update as > well as insert triggers on the same table, right ? My explanation for > this was : we have done this before for UPSERT, and we had documented > the same. We can do it here also. > >> Apart from the options, Robert has suggested, another option could be that >> we allow firing BR-AR update triggers for original partition and BR-AR >> insert triggers for the new partition. In this case, one can argue >> that we have not actually updated the row in the original partition, >> so there is no need to fire AR update triggers, > > Yes that's what I think. If there is no update happened, then AR > update trigger should not be executed. AR triggers are only for > scenarios where it is guaranteed that the DML operation has happened > when the trigger is being executed. > >> but I feel that is what we do for non-partitioned table update and it should be okay here >> as well. > > I don't think so. For e.g. if a BR trigger returns NULL, the update > does not happen, and then the AR trigger does not fire as well. Do you > see any other scenarios for non-partitioned tables, where AR triggers > do fire when the update does not happen ? > No, but here also it can be considered as an update for original partition. > > Overall, I am also open to skipping both insert+delete BR trigger, > I think it might be better to summarize all the options discussed including what the patch has and see what most people consider as sensible. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12 May 2017 at 14:56, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think it might be better to summarize all the options discussed > including what the patch has and see what most people consider as > sensible. Yes, makes sense. Here are the options that were discussed so far for ROW triggers : Option 1 : (the patch follows this option) ---------- BR Update trigger for source partition. BR,AR Delete trigger for source partition. BR,AR Insert trigger for destination partition. No AR Update trigger. Rationale : BR Update trigger should be fired because that trigger can even modify the rows, and that can even result in partition key update even though the UPDATE statement is not updating the partition key. Also, fire the delete/insert triggers on respective partitions since the rows are about to be deleted/inserted. AR update trigger should not be fired because that required an actual update to have happened. Option 2 ---------- BR Update trigger for source partition. AR Update trigger on destination partition. No insert/delete triggers. Rationale : Since it's an UPDATE statement, only update triggers should be fired. The update ends up moving the row into another partition, so AR Update trigger should be fired on this partition rather than the original partition. Option 3 -------- BR, AR delete triggers on source partition BR, AR insert triggers on destination partition. Rationale : Since the update is converted to delete+insert, just skip the update triggers completely. Option 4 -------- BR-AR update triggers for source partition BR-AR insert triggers for destination partition Rationale : Since it is an update statement, both BR and AR UPDATE trigger should be fired on original partition. Since update is converted to delete+insert, the corresponding triggers should be fired, but since we already are firing UPDATE trigger on original partition, skip delete triggers, otherwise both UPDATE and DELETE triggers would get fired on the same partition. ---------------- For statement triggers, I think it should be based on the documentation recently checked in for partitions in general. + A statement that targets a parent table in a inheritance or partitioning + hierarchy does not cause the statement-level triggers of affected child + tables to be fired; only the parent table's statement-level triggers are + fired. However, row-level triggers of any affected child tables will be + fired. Based on that, for row movement as well, the trigger should be fired only for the table referred in the UPDATE statement, and not for any child tables, or for any partitions to which the rows were moved. The doc in this row-movement patch also matches with this behaviour.
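As an illustration of Option 1 (the behaviour the current patch implements), a small self-contained sketch with made-up names; the trailing comments list the row triggers expected to fire when the UPDATE moves a row:

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (100);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (100) TO (200);

CREATE FUNCTION log_row_trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
    IF TG_OP = 'DELETE' THEN RETURN OLD; END IF;
    RETURN NEW;
END;
$$;

-- the same row trigger installed on both leaf partitions, BEFORE and AFTER
CREATE TRIGGER p1_br BEFORE INSERT OR UPDATE OR DELETE ON p1
    FOR EACH ROW EXECUTE PROCEDURE log_row_trig();
CREATE TRIGGER p1_ar AFTER INSERT OR UPDATE OR DELETE ON p1
    FOR EACH ROW EXECUTE PROCEDURE log_row_trig();
CREATE TRIGGER p2_br BEFORE INSERT OR UPDATE OR DELETE ON p2
    FOR EACH ROW EXECUTE PROCEDURE log_row_trig();
CREATE TRIGGER p2_ar AFTER INSERT OR UPDATE OR DELETE ON p2
    FOR EACH ROW EXECUTE PROCEDURE log_row_trig();

INSERT INTO p VALUES (50);
UPDATE p SET a = 150 WHERE a = 50;    -- row moves from p1 to p2
-- With Option 1, the notices expected are:
--   BEFORE UPDATE on p1, BEFORE DELETE on p1, AFTER DELETE on p1,
--   BEFORE INSERT on p2, AFTER INSERT on p2
-- and no AFTER UPDATE anywhere; the other options differ only in which of
-- these fire.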
On Fri, May 12, 2017 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I agree with you that it might not be straightforward to make it work, > but now that earliest it can go is v11, do we want to try doing > something other than just documenting it. What I could read from this > e-mail thread is that you are intending towards just documenting it > for the first cut of this feature. However, both Greg and Simon are of > opinion that we should do something about this and even patch Author > (Amit Khandekar) has shown some inclination to do something about this > point (return error to the user in some way), so I think we can't > ignore this point. > > I think now that we have some more time, it is better to try something > based on a couple of ideas floating in this thread to address this > point and see if we can come up with something doable without a big > architecture change. > > What is your take on this point now? I still don't think it's worth spending a bit on this, especially not with WARM probably gobbling up multiple bits. Reclaiming the bits seems like a good idea, but spending one on this still seems to me like it's probably not the best use of our increasingly-limited supply of infomask bits. Now, Simon and Greg may still feel otherwise, of course. I could get behind providing an option to turn this behavior on and off at the level of the partitioned table. That would use a reloption rather than an infomask bit, so no scarce resource is being consumed. I suspect that most people don't update the partition keys at all (so they don't care either way) and the ones who do are probably either depending on EPQ (in which case they most likely want to just disallow all UPDATE-row-movement) or not (in which case they again don't care). If I understand correctly, the only people who will benefit from consuming an infomask bit are the people who update their partition keys AND depend on EPQ BUT only for non-key updates AND need the system to make sure that they don't accidentally rely on it for the case of an EPQ update. That seems (to me, anyway) like it's got to be a really small percentage of actual users, but I just work here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
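Purely for illustration of the table-level switch idea; no such reloption exists in the posted patch and the name below is made up:

ALTER TABLE p SET (enable_row_movement = off);   -- hypothetical reloption disabling cross-partition UPDATEs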
On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Option 3 > -------- > > BR, AR delete triggers on source partition > BR, AR insert triggers on destination partition. > > Rationale : > Since the update is converted to delete+insert, just skip the update > triggers completely. +1 to option3 Generally, BR triggers are used for updating the ROW value and AR triggers to VALIDATE the row or to modify some other tables. So it seems that we can fire the triggers what is actual operation is happening at the partition level. For source partition, it's only the delete operation (no update happened) so we fire delete triggers and for the destination only insert operations so fire only inserts triggers. That will keep the things simple. And, it will also be in sync with the actual partition level delete/insert operations. We may argue that user might have declared only update triggers and as he has executed the update operation he may expect those triggers to get fired. But, I think this behaviour can be documented with the proper logic that if the user is updating the partition key then he must be ready with the Delete/Insert triggers also, he can not rely only upon update level triggers. Earlier I thought that option1 is better but later I think that this can complicate the situation as we are firing first BR update then BR delete and can change the row multiple time and defining such behaviour can be complicated. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Option 3 >> -------- >> >> BR, AR delete triggers on source partition >> BR, AR insert triggers on destination partition. >> >> Rationale : >> Since the update is converted to delete+insert, just skip the update >> triggers completely. > > +1 to option3 > .. > Earlier I thought that option1 is better but later I think that this > can complicate the situation as we are firing first BR update then BR > delete and can change the row multiple time and defining such > behaviour can be complicated. > If we have to go by this theory, then the option you have preferred will still execute BR triggers for both delete and insert, so input row can still be changed twice. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, May 15, 2017 at 5:28 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, May 12, 2017 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I agree with you that it might not be straightforward to make it work, >> but now that earliest it can go is v11, do we want to try doing >> something other than just documenting it. What I could read from this >> e-mail thread is that you are intending towards just documenting it >> for the first cut of this feature. However, both Greg and Simon are of >> opinion that we should do something about this and even patch Author >> (Amit Khandekar) has shown some inclination to do something about this >> point (return error to the user in some way), so I think we can't >> ignore this point. >> >> I think now that we have some more time, it is better to try something >> based on a couple of ideas floating in this thread to address this >> point and see if we can come up with something doable without a big >> architecture change. >> >> What is your take on this point now? > > I still don't think it's worth spending a bit on this, especially not > with WARM probably gobbling up multiple bits. Reclaiming the bits > seems like a good idea, but spending one on this still seems to me > like it's probably not the best use of our increasingly-limited supply > of infomask bits. > I think we can do this even without using an additional infomask bit. As suggested by Greg up thread, we can set InvalidBlockId in ctid to indicate such an update. > Now, Simon and Greg may still feel otherwise, of > course. > > I could get behind providing an option to turn this behavior on and > off at the level of the partitioned table. > Sure that sounds like a viable option and we can set the default value as false. However, it might be better if we can detect the same internally without big changes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Earlier I thought that option1 is better but later I think that this >> can complicate the situation as we are firing first BR update then BR >> delete and can change the row multiple time and defining such >> behaviour can be complicated. >> > > If we have to go by this theory, then the option you have preferred > will still execute BR triggers for both delete and insert, so input > row can still be changed twice. Yeah, right as per my theory above Option3 have the same problem. But after putting some more thought I realised that only for "Before Update" or the "Before Insert" trigger row can be changed. Correct me if I am assuming something wrong? So now again option3 will make more sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think we can do this even without using an additional infomask bit. > As suggested by Greg up thread, we can set InvalidBlockId in ctid to > indicate such an update. Hmm. How would that work? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Option 3
> > --------
> >
> > BR, AR delete triggers on source partition
> > BR, AR insert triggers on destination partition.
> >
> > Rationale :
> > Since the update is converted to delete+insert, just skip the update
> > triggers completely.
>
> +1 to option3
>
> Generally, BR triggers are used for updating the ROW value and AR
> triggers to VALIDATE the row or to modify some other tables. So it
> seems that we can fire the triggers what is actual operation is
> happening at the partition level.
>
> For source partition, it's only the delete operation (no update
> happened) so we fire delete triggers and for the destination only
> insert operations so fire only inserts triggers. That will keep the
> things simple. And, it will also be in sync with the actual partition
> level delete/insert operations.
>
> We may argue that user might have declared only update triggers and as
> he has executed the update operation he may expect those triggers to
> get fired. But, I think this behaviour can be documented with the
> proper logic that if the user is updating the partition key then he
> must be ready with the Delete/Insert triggers also, he can not rely
> only upon update level triggers.
Right, that is even my concern. That user might had declared only update
triggers and when user executing UPDATE its expect it to get call - but
with option 3 its not happening.
In term of consistency option 1 looks better. Its doing the same what
its been implemented for the UPSERT - so that user might be already
aware of trigger behaviour. Plus if we document the behaviour then it
sounds correct -
- Original command was UPDATE so BR update
- Later found that its ROW movement - so BR delete followed by AR delete
- Then Insert in new partition - so BR INSERT followed by AR Insert.
But again I am not quite sure how good it will be to compare the partition
behaviour with the UPSERT.
> Earlier I thought that option1 is better but later I think that this
> can complicate the situation as we are firing first BR update then BR
> delete and can change the row multiple time and defining such
> behaviour can be complicated.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Rushabh Lathia
On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think we can do this even without using an additional infomask bit. >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to >> indicate such an update. > > Hmm. How would that work? > We can pass a flag say row_moved (or require_row_movement) to heap_delete which will in turn set InvalidBlockId in ctid instead of setting it to self. Then the ExecUpdate needs to check for the same and return an error when heap_update is not successful (result != HeapTupleMayBeUpdated). Can you explain what difficulty are you envisioning? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
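For concreteness, a rough C sketch of the mechanism described above; the row_moved parameter is hypothetical and this is not what the posted patch does today. In heap_delete(), instead of making the deleted tuple's ctid point at itself, the block number could be set to InvalidBlockNumber, and callers such as ExecUpdate()/ExecDelete() could then check for that marker when the tuple turns out to have been concurrently updated:

/* in heap_delete(), where the deleted tuple's ctid is set */
if (row_moved)
    BlockIdSet(&(tp.t_data->t_ctid.ip_blkid), InvalidBlockNumber);
else
    tp.t_data->t_ctid = tp.t_self;    /* existing behaviour: point at itself */

/* in the caller, after heap_update()/heap_delete() returns HeapTupleUpdated */
if (!BlockNumberIsValid(BlockIdGetBlockNumber(&hufd.ctid.ip_blkid)))
    ereport(ERROR,
            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
             errmsg("tuple to be updated was moved to another partition")));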
On 17 May 2017 at 17:29, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > > On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> >> wrote: >> > Option 3 >> > -------- >> > >> > BR, AR delete triggers on source partition >> > BR, AR insert triggers on destination partition. >> > >> > Rationale : >> > Since the update is converted to delete+insert, just skip the update >> > triggers completely. >> >> +1 to option3 >> Generally, BR triggers are used for updating the ROW value and AR >> triggers to VALIDATE the row or to modify some other tables. So it >> seems that we can fire the triggers what is actual operation is >> happening at the partition level. >> >> For source partition, it's only the delete operation (no update >> happened) so we fire delete triggers and for the destination only >> insert operations so fire only inserts triggers. That will keep the >> things simple. And, it will also be in sync with the actual partition >> level delete/insert operations. >> >> We may argue that user might have declared only update triggers and as >> he has executed the update operation he may expect those triggers to >> get fired. But, I think this behaviour can be documented with the >> proper logic that if the user is updating the partition key then he >> must be ready with the Delete/Insert triggers also, he can not rely >> only upon update level triggers. >> > > Right, that is even my concern. That user might had declared only update > triggers and when user executing UPDATE its expect it to get call - but > with option 3 its not happening. Yes that's the issue with option 3. A user wants to make sure update triggers run, and here we are skipping the BEFORE update triggers. And user might even modify rows. Now regarding the AR update triggers .... The user might be more concerned with the non-partition-key columns, and the UPDATE of partition key typically would update only the partition key and not the other column. So for typical case, it makes sense to skip the UPDATE AR trigger. But if the UPDATE contains both partition key as well as other column updates, it makes sense to fire AR UPDATE trigger. One thing we can do is restrict an UPDATE to have both partition key and non-partition key column updates. So this way we can always skip the AR update triggers for row-movement updates, unless may be fire AR UPDATE triggers *only* if they are created using "BEFORE UPDATE OF <column_name>" and the column is the partition key. Between skipping delete-insert triggers versus skipping update triggers, I would go for skipping delete-insert triggers. I think we cannot skip BR update triggers because that would be a correctness issue. From user-perspective, I think the user would like to install a trigger that would fire if any of the child tables get modified. But because there is no provision to install a common trigger, the user has to install the same trigger on every child table. In that sense, it might not matter whether we fire AR UPDATE trigger on old partition or new partition. > > In term of consistency option 1 looks better. Its doing the same what > its been implemented for the UPSERT - so that user might be already > aware of trigger behaviour. 
Plus if we document the behaviour then it > sounds correct - > > - Original command was UPDATE so BR update > - Later found that its ROW movement - so BR delete followed by AR delete > - Then Insert in new partition - so BR INSERT followed by AR Insert. > > But again I am not quite sure how good it will be to compare the partition > behaviour with the UPSERT. > > > >> >> Earlier I thought that option1 is better but later I think that this >> can complicate the situation as we are firing first BR update then BR >> delete and can change the row multiple time and defining such >> behaviour can be complicated. >> >> -- >> Regards, >> Dilip Kumar >> EnterpriseDB: http://www.enterprisedb.com >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers > > > > > -- > Rushabh Lathia -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Earlier I thought that option1 is better but later I think that this >>> can complicate the situation as we are firing first BR update then BR >>> delete and can change the row multiple time and defining such >>> behaviour can be complicated. >>> >> >> If we have to go by this theory, then the option you have preferred >> will still execute BR triggers for both delete and insert, so input >> row can still be changed twice. > > Yeah, right as per my theory above Option3 have the same problem. > > But after putting some more thought I realised that only for "Before > Update" or the "Before Insert" trigger row can be changed. > Before Row Delete triggers can suppress the delete operation itself which is kind of unintended in this case. I think without the user being aware it doesn't seem advisable to execute multiple BR triggers. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 18 May 2017 at 16:52, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> Earlier I thought that option1 is better but later I think that this >>>> can complicate the situation as we are firing first BR update then BR >>>> delete and can change the row multiple time and defining such >>>> behaviour can be complicated. >>>> >>> >>> If we have to go by this theory, then the option you have preferred >>> will still execute BR triggers for both delete and insert, so input >>> row can still be changed twice. >> >> Yeah, right as per my theory above Option3 have the same problem. >> >> But after putting some more thought I realised that only for "Before >> Update" or the "Before Insert" trigger row can be changed. >> > > Before Row Delete triggers can suppress the delete operation itself > which is kind of unintended in this case. I think without the user > being aware it doesn't seem advisable to execute multiple BR triggers. By now, majority of the opinions have shown that they do not favour two triggers getting fired on a single update. Amit, do you consider option 2 as a valid option ? That is, fire only UPDATE triggers. BR on source partition, and AR on destination partition. Do you agree that firing BR update trigger is essential since it can modify the row and even prevent the update from happening ? Also, since a user does not have a provision to install a common UPDATE row trigger, (s)he installs it on each of the leaf partitions. And then when an update causes row movement, using option 3 would end up not firing update triggers on any of the partitions. So, I prefer option 2 over option 3 , i.e. make sure to fire BR and AR update triggers. Actually option 2 is what Robert had proposed in the beginning. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 12 May 2017 at 09:27, Amit Kapila <amit.kapila16@gmail.com> wrote: > > + is_partitioned_table = > + root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE; > + > + if (is_partitioned_table) > + ExecSetupPartitionTupleRouting( > + root_rel, > + /* Build WITH CHECK OPTION constraints for leaf partitions */ > + ExecInitPartitionWithCheckOptions(mtstate, root_rel); > + /* Build a projection for each leaf partition rel. */ > + ExecInitPartitionReturningProjection(mtstate, root_rel); > .. > + /* It's not a partitioned table after all; error out. */ > + ExecPartitionCheckEmitError(resultRelInfo, slot, estate); > > When we are anyway going to give error if table is not a partitioned > table, then isn't it better to give it early when we first identify > that. Yeah that's right, fixed. Moved the partitioned table check early. This also showed that there is no need for is_partitioned_table variable. Accordingly adjusted the code. > - > +static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, > + Relation root_rel); > Spurious line delete. Done. Also rebased the patch over latest code. Attached v8 patch. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, May 24, 2017 at 2:47 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 18 May 2017 at 16:52, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: >>> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>> Earlier I thought that option1 is better but later I think that this >>>>> can complicate the situation as we are firing first BR update then BR >>>>> delete and can change the row multiple time and defining such >>>>> behaviour can be complicated. >>>>> >>>> >>>> If we have to go by this theory, then the option you have preferred >>>> will still execute BR triggers for both delete and insert, so input >>>> row can still be changed twice. >>> >>> Yeah, right as per my theory above Option3 have the same problem. >>> >>> But after putting some more thought I realised that only for "Before >>> Update" or the "Before Insert" trigger row can be changed. >>> >> >> Before Row Delete triggers can suppress the delete operation itself >> which is kind of unintended in this case. I think without the user >> being aware it doesn't seem advisable to execute multiple BR triggers. > > By now, majority of the opinions have shown that they do not favour > two triggers getting fired on a single update. Amit, do you consider > option 2 as a valid option ? > Sounds sensible to me. > That is, fire only UPDATE triggers. BR on > source partition, and AR on destination partition. Do you agree that > firing BR update trigger is essential since it can modify the row and > even prevent the update from happening ? > Agreed. Apart from above, there is one open issue [1] related to generating an error for concurrent delete of row for which I have mentioned some way of getting it done, do you want to try that option and see if you face any issue in making the progress on that lines? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, May 24, 2017 at 2:47 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> >> By now, majority of the opinions have shown that they do not favour >> two triggers getting fired on a single update. Amit, do you consider >> option 2 as a valid option ? >> > > Sounds sensible to me. > >> That is, fire only UPDATE triggers. BR on >> source partition, and AR on destination partition. Do you agree that >> firing BR update trigger is essential since it can modify the row and >> even prevent the update from happening ? >> > > Agreed. > > Apart from above, there is one open issue [1] > Forget to mention the link, doing it now. [1] - https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 24 May 2017 at 20:16, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Apart from above, there is one open issue [1] >> > > Forget to mention the link, doing it now. > > [1] - https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com I am not sure right now whether making the t_ctid of such tuples to Invalid would be a right option, especially because I think there can be already some other meaning if t_ctid is not valid. But may be we can check this more. If we decide to error out using some way, I would be inclined towards considering re-using some combinations of infomask bits (like HEAP_MOVED_OFF as suggested upthread) rather than using invalid t_ctid value. But I think, we can also take step-by-step approach even for v11. If we agree that it is ok to silently do the updates as long as we document the behaviour, we can go ahead and do this, and then as a second step, implement error handling as a separate patch. If that patch does not materialize, we at least have the current behaviour documented. Ideally, I think we would have liked if we were somehow able to make the row-movement UPDATE itself abort if it finds any normal updates waiting for it to finish, rather than making the normal updates fail because a row-movement occurred . But I think we will have to live with it.
On Mon, May 29, 2017 at 11:20 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 24 May 2017 at 20:16, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Apart from above, there is one open issue [1] >>> >> >> Forget to mention the link, doing it now. >> >> [1] - https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com > > I am not sure right now whether making the t_ctid of such tuples to > Invalid would be a right option, especially because I think there can > be already some other meaning if t_ctid is not valid. > AFAIK, this is used to point to current tuple itself or newer version of a tuple or is used in speculative inserts (refer comments above HeapTupleHeaderData in htup_details.h). Can you mention what other meaning are you referring here for InvalidBlockId in t_ctid? > But may be we > can check this more. > > If we decide to error out using some way, I would be inclined towards > considering re-using some combinations of infomask bits (like > HEAP_MOVED_OFF as suggested upthread) rather than using invalid t_ctid > value. > > But I think, we can also take step-by-step approach even for v11. If > we agree that it is ok to silently do the updates as long as we > document the behaviour, we can go ahead and do this, and then as a > second step, implement error handling as a separate patch. If that > patch does not materialize, we at least have the current behaviour > documented. > I think that is sensible approach if we find the second step involves big or complicated changes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, May 29, 2017 at 5:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> But I think, we can also take step-by-step approach even for v11. If >> we agree that it is ok to silently do the updates as long as we >> document the behaviour, we can go ahead and do this, and then as a >> second step, implement error handling as a separate patch. If that >> patch does not materialize, we at least have the current behaviour >> documented. > > I think that is sensible approach if we find the second step involves > big or complicated changes. I think it is definitely a good idea to separate the two patches. UPDATE tuple routing without any special handling for the EPQ issue is just a partitioning feature. The proposed handling for the EPQ issue is an *on-disk format change*. That turns a patch which is subject only to routine bugs into one which can eat your data permanently -- so having the "can eat your data permanently" separated out for both review and commit seems only prudent. For me, it's not a matter of which patch is big or complicated, but rather a matter of one of them being a whole lot riskier than the other. Even UPDATE tuple routing could mess things up pretty seriously if we end up with tuples in the wrong partition, of course, but the other thing is still worse. In terms of a development plan, I think we would need to have both patches before either could be committed. I believe that everyone other than me who has expressed an opinion on this issue has said that it's unacceptable to just ignore the issue, so it doesn't sound like there will be much appetite for having #1 go into the tree without #2. I'm still really concerned about that approach because we do not have very much bit space left and WARM wants to use quite a bit of it. I think it's quite possible that we'll be sad in the future if we find that we can't implement feature XYZ because of the bit-space consumed by this feature. However, I don't have the only vote here and I'm not going to try to shove this into the tree over multiple objections (unless there are a lot more votes the other way, but so far there's no sign of that). Greg/Amit's idea of using the CTID field rather than an infomask bit seems like a possibly promising approach. Not everything that needs bit-space can use the CTID field, so using it is a little less likely to conflict with something else we want to do in the future than using a precious infomask bit. However, I'm worried about this: /* Make sure there is no forward chain link in t_ctid */ tp.t_data->t_ctid = tp.t_self; The comment does not say *why* we need to make sure that there is no forward chain link, but it implies that some code somewhere in the system does or at one time did depend on no forward link existing. Any such code that still exists will need to be updated. Anybody know what code that might be, exactly? The other potential issue I see here is that I know the WARM code also tries to use the bit-space in the CTID field; in particular, it uses the CTID field of the last tuple in a HOT chain to point back to the root of the chain. That seems like it could conflict with the usage proposed here, but I'm not totally sure. Has anyone investigated this issue? Regarding the trigger issue, I can't claim to have a terribly strong opinion on this. I think that practically anything we do here might upset somebody, but probably any halfway-reasonable thing we choose to do will be OK for most people. 
However, there seems to be a discrepancy between the approach that got the most votes and the one that is implemented by the v8 patch, so that seems like something to fix. For what it's worth, in the future, I imagine that we might allow adding a trigger to a partitioned table and having that cascade down to all descendant tables. In that world, firing the BR UPDATE trigger for the old partition and the AR UPDATE trigger for the new partition will look a lot like the behavior the user would expect on an unpartitioned table, which could be viewed as a good thing. On the other hand, it's still going to be a DELETE+INSERT under the hood for the foreseeable future, so firing the delete triggers and then the insert triggers is also defensible. Is there any big difference between these appraoches in terms of how much code is required to make this work? In terms of the approach taken by the patch itself, it seems surprising to me that the patch only calls ExecSetupPartitionTupleRouting when an update fails the partition constraint. Note that in the insert case, we call that function at the start of execution; calling it in the middle seems to involve additional hazards; for example, is it really safe to add additional ResultRelInfos midway through the operation? Is it safe to take more locks midway through the operation? It seems like it might be a lot safer to decide at the beginning of the operation whether this is needed -- we can skip it if none of the columns involved in the partition key (or partition key expressions) are mentioned in the update. (There's also the issue of triggers, but I'm not sure that it's sensible to allow a trigger on an individual partition to reroute an update to another partition; what if we get an infinite loop?) + if (concurrently_deleted) + return NULL; I don't understand the motivation for this change, and there are no comments explaining it that I can see. Perhaps the concurrency-related (i.e. EPQ) behavior here could be tested via the isolation tester. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1 June 2017 at 03:25, Robert Haas <robertmhaas@gmail.com> wrote: > Greg/Amit's idea of using the CTID field rather than an infomask bit > seems like a possibly promising approach. Not everything that needs > bit-space can use the CTID field, so using it is a little less likely > to conflict with something else we want to do in the future than using > a precious infomask bit. However, I'm worried about this: > > /* Make sure there is no forward chain link in t_ctid */ > tp.t_data->t_ctid = tp.t_self; > > The comment does not say *why* we need to make sure that there is no > forward chain link, but it implies that some code somewhere in the > system does or at one time did depend on no forward link existing. > Any such code that still exists will need to be updated. Anybody know > what code that might be, exactly? I am going to have a look overall at this approach, and about code somewhere else which might be assuming that t_ctid cannot be Invalid. > Regarding the trigger issue, I can't claim to have a terribly strong > opinion on this. I think that practically anything we do here might > upset somebody, but probably any halfway-reasonable thing we choose to > do will be OK for most people. However, there seems to be a > discrepancy between the approach that got the most votes and the one > that is implemented by the v8 patch, so that seems like something to > fix. Yes, I have started working on updating the patch to use that approach (BR and AR update triggers on source and destination partition respectively, instead of delete+insert) The approach taken by the patch (BR update + delete+insert triggers) didn't require any changes in the way ExecDelete() and ExecInsert() were called. Now we would require to skip the delete/insert triggers, so some flags need to be passed to these functions, or else have stripped down versions of ExecDelete() and ExecInsert() which don't do other things like RETURNING handling and firing triggers. > > For what it's worth, in the future, I imagine that we might allow > adding a trigger to a partitioned table and having that cascade down > to all descendant tables. In that world, firing the BR UPDATE trigger > for the old partition and the AR UPDATE trigger for the new partition > will look a lot like the behavior the user would expect on an > unpartitioned table, which could be viewed as a good thing. On the > other hand, it's still going to be a DELETE+INSERT under the hood for > the foreseeable future, so firing the delete triggers and then the > insert triggers is also defensible. Ok, I was assuming that there won't be any plans to support triggers on a partitioned table, but yes, I had imagined how the behaviour would be in this world. Currently, users who want to have triggers on a table that happens to be a partitioned table, have to install the same trigger on each of the leaf partitions, since there is no other choice. But we would never know whether a trigger on a leaf partition was actually meant to be specifically on that individual partition or it was actually meant to be a trigger on a root partitioned table. Hence there is the difficulty of deciding the right behaviour in case of triggers with row movement. If we have an AR UPDATE trigger on root table, then during row movement, it does not matter whether we fire the trigger on source or destination, because it is the same single trigger cascaded on both the partitions. 
If there is a trigger installed specifically on a leaf partition, then we know that it should not be fired on other partitions since it is specifically made for this one. And same applies for delete and insert triggers: If installed on parent, don't involve them in row-movement; only fire them if installed on leaf partitions regardless of whether it was an internally generated delete+insert due to row-movement). Similarly we can think about BR triggers. Of courses, DBAs should be aware of triggers that are already installed in the table ancestors before installing a new one on a child table. Overall, it becomes much clearer what to do if we decide to allow triggers on partitioned tables. > Is there any big difference between these appraoches in terms > of how much code is required to make this work? You mean if we allow triggers on partitioned tables ? I think we would have to keep some flag in the trigger data (or somewhere else) that the trigger actually belongs to upper partitioned table, and so for delete+insert, don't fire such trigger. Other than that, we don't have to decide in any unique way which trigger to fire on which table. > > In terms of the approach taken by the patch itself, it seems > surprising to me that the patch only calls > ExecSetupPartitionTupleRouting when an update fails the partition > constraint. Note that in the insert case, we call that function at > the start of execution; > calling it in the middle seems to involve additional hazards; > for example, is it really safe to add additional > ResultRelInfos midway through the operation? I thought since the additional ResultRelInfos go into mtstate->mt_partitions which is independent of estate->es_result_relations, that should be safe. > Is it safe to take more locks midway through the operation? I can imagine some rows already updated, when other tasks like ALTER TABLE or CREATE INDEX happen on other partitions which are still unlocked, and then for row movement we try to lock these other partitions and wait for the DDL tasks to complete. But I didn't see any particular issues with that. But correct me if you suspect a possible issue. One issue can be if we were able to modify the table attributes, but I believe we cannot do that for inherited columns. > It seems like it might be a lot > safer to decide at the beginning of the operation whether this is > needed -- we can skip it if none of the columns involved in the > partition key (or partition key expressions) are mentioned in the > update. > (There's also the issue of triggers, The reason I thought it cannot be done at the start of the execution, is because even if we know that update is not modifying the partition key column, we are not certain that the final NEW row has its partition key column unchanged, because of triggers. I understand it might be weird for a user requiring to modify a partition key value, but if a user does that, it will result in crash because we won't have the partition routing setup, thinking that there is no partition key column in the UPDATE. And we also cannot unconditionally setup the partition routing on all updates, for performance reasons. > I'm not sure that it's sensible to allow a trigger on an > individual partition to reroute an update to another partition > what if we get an infinite loop?) You mean, if the other table has another trigger that will again route to the original partition ? But this infinite loop problem could occur even for 2 normal tables ? 
> > + if (concurrently_deleted) > + return NULL; > > I don't understand the motivation for this change, and there are no > comments explaining it that I can see. Yeah, the comments are missing; I thought they were there in ExecDelete(), but they are not. If a concurrent delete already deleted the row, we should not bother about moving the row, hence the above code. > Perhaps the concurrency-related (i.e. EPQ) behavior here could be > tested via the isolation tester. Will check. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Regarding the trigger issue, I can't claim to have a terribly strong >> opinion on this. I think that practically anything we do here might >> upset somebody, but probably any halfway-reasonable thing we choose to >> do will be OK for most people. However, there seems to be a >> discrepancy between the approach that got the most votes and the one >> that is implemented by the v8 patch, so that seems like something to >> fix. > > Yes, I have started working on updating the patch to use that approach > (BR and AR update triggers on source and destination partition > respectively, instead of delete+insert) The approach taken by the > patch (BR update + delete+insert triggers) didn't require any changes > in the way ExecDelete() and ExecInsert() were called. Now we would > require to skip the delete/insert triggers, so some flags need to be > passed to these functions, or else have stripped down versions of > ExecDelete() and ExecInsert() which don't do other things like > RETURNING handling and firing triggers. See, that strikes me as a pretty good argument for firing the DELETE+INSERT triggers... I'm not wedded to that approach, but "what makes the code simplest?" is not a bad tiebreak, other things being equal. >> In terms of the approach taken by the patch itself, it seems >> surprising to me that the patch only calls >> ExecSetupPartitionTupleRouting when an update fails the partition >> constraint. Note that in the insert case, we call that function at >> the start of execution; > >> calling it in the middle seems to involve additional hazards; >> for example, is it really safe to add additional >> ResultRelInfos midway through the operation? > > I thought since the additional ResultRelInfos go into > mtstate->mt_partitions which is independent of > estate->es_result_relations, that should be safe. I don't know. That sounds scary to me, but it might be OK. Probably needs more study. >> Is it safe to take more locks midway through the operation? > > I can imagine some rows already updated, when other tasks like ALTER > TABLE or CREATE INDEX happen on other partitions which are still > unlocked, and then for row movement we try to lock these other > partitions and wait for the DDL tasks to complete. But I didn't see > any particular issues with that. But correct me if you suspect a > possible issue. One issue can be if we were able to modify the table > attributes, but I believe we cannot do that for inherited columns. It's just that it's very unlike what we do anywhere else. I don't have a real specific idea in mind about what might totally break, but at a minimum it could certainly cause behavior that can't happen today. Today, if you run a query on some tables, it will block waiting for any locks at the beginning of the query, and the query won't begin executing until it has all of the locks. With this approach, you might block midway through; you might even deadlock midway through. Maybe that's not overtly broken, but it's at least got the possibility of being surprising. Now, I'd actually kind of like to have behavior like this for other cases, too. If we're inserting one row, can't we just lock the one partition into which it needs to get inserted, rather than all of them? But I'm wary of introducing such behavior incidentally in a patch whose main goal is to allow UPDATE row movement. Figuring out what could go wrong and fixing it seems like a substantial project all of its own. 
> The reason I thought it cannot be done at the start of the execution, > is because even if we know that update is not modifying the partition > key column, we are not certain that the final NEW row has its > partition key column unchanged, because of triggers. I understand it > might be weird for a user requiring to modify a partition key value, > but if a user does that, it will result in crash because we won't have > the partition routing setup, thinking that there is no partition key > column in the UPDATE. I think we could avoid that issue. Suppose we select the target partition based only on the original NEW tuple. If a trigger on that partition subsequently modifies the tuple so that it no longer satisfies the partition constraint for that partition, just let it ERROR out normally. Actually, it seems like that's probably the *easiest* behavior to implement. Otherwise, you might fire triggers, discover that you need to re-route the tuple, and then ... fire triggers again on the new partition, which might reroute it again? >> I'm not sure that it's sensible to allow a trigger on an >> individual partition to reroute an update to another partition >> what if we get an infinite loop?) > > You mean, if the other table has another trigger that will again route > to the original partition ? But this infinite loop problem could occur > even for 2 normal tables ? How? For a normal trigger, nothing it does can change which table is targeted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> Regarding the trigger issue, I can't claim to have a terribly strong >>> opinion on this. I think that practically anything we do here might >>> upset somebody, but probably any halfway-reasonable thing we choose to >>> do will be OK for most people. However, there seems to be a >>> discrepancy between the approach that got the most votes and the one >>> that is implemented by the v8 patch, so that seems like something to >>> fix. >> >> Yes, I have started working on updating the patch to use that approach >> (BR and AR update triggers on source and destination partition >> respectively, instead of delete+insert) The approach taken by the >> patch (BR update + delete+insert triggers) didn't require any changes >> in the way ExecDelete() and ExecInsert() were called. Now we would >> require to skip the delete/insert triggers, so some flags need to be >> passed to these functions, or else have stripped down versions of >> ExecDelete() and ExecInsert() which don't do other things like >> RETURNING handling and firing triggers. > > See, that strikes me as a pretty good argument for firing the > DELETE+INSERT triggers... > > I'm not wedded to that approach, but "what makes the code simplest?" > is not a bad tiebreak, other things being equal. Yes, that sounds good to me. But I think we want to wait for other's opinion because it is quite understandable that two triggers firing on the same partition sounds odd. > >>> In terms of the approach taken by the patch itself, it seems >>> surprising to me that the patch only calls >>> ExecSetupPartitionTupleRouting when an update fails the partition >>> constraint. Note that in the insert case, we call that function at >>> the start of execution; >> >>> calling it in the middle seems to involve additional hazards; >>> for example, is it really safe to add additional >>> ResultRelInfos midway through the operation? >> >> I thought since the additional ResultRelInfos go into >> mtstate->mt_partitions which is independent of >> estate->es_result_relations, that should be safe. > > I don't know. That sounds scary to me, but it might be OK. Probably > needs more study. > >>> Is it safe to take more locks midway through the operation? >> >> I can imagine some rows already updated, when other tasks like ALTER >> TABLE or CREATE INDEX happen on other partitions which are still >> unlocked, and then for row movement we try to lock these other >> partitions and wait for the DDL tasks to complete. But I didn't see >> any particular issues with that. But correct me if you suspect a >> possible issue. One issue can be if we were able to modify the table >> attributes, but I believe we cannot do that for inherited columns. > > It's just that it's very unlike what we do anywhere else. I don't > have a real specific idea in mind about what might totally break, but > at a minimum it could certainly cause behavior that can't happen > today. Today, if you run a query on some tables, it will block > waiting for any locks at the beginning of the query, and the query > won't begin executing until it has all of the locks. With this > approach, you might block midway through; you might even deadlock > midway through. Maybe that's not overtly broken, but it's at least > got the possibility of being surprising. > > Now, I'd actually kind of like to have behavior like this for other > cases, too. 
If we're inserting one row, can't we just lock the one > partition into which it needs to get inserted, rather than all of > them? But I'm wary of introducing such behavior incidentally in a > patch whose main goal is to allow UPDATE row movement. Figuring out > what could go wrong and fixing it seems like a substantial project all > of its own. Yes, I agree it makes sense trying to avoid introducing something we haven't tried before, in this patch, as far as possible. > >> The reason I thought it cannot be done at the start of the execution, >> is because even if we know that update is not modifying the partition >> key column, we are not certain that the final NEW row has its >> partition key column unchanged, because of triggers. I understand it >> might be weird for a user requiring to modify a partition key value, >> but if a user does that, it will result in crash because we won't have >> the partition routing setup, thinking that there is no partition key >> column in the UPDATE. > > I think we could avoid that issue. Suppose we select the target > partition based only on the original NEW tuple. If a trigger on that > partition subsequently modifies the tuple so that it no longer > satisfies the partition constraint for that partition, just let it > ERROR out normally. Ok, so you are saying, don't allow a partition trigger to initiate the row movement. I think we should keep this as a documented restriction. Actually it would be unfortunate that we would have to keep this restriction only because of an implementation issue. So, according to that, below would be the logic : Run partition constraint check on the original NEW row. If it succeeds : { Fire BR UPDATE trigger on the original partition. Run partition constraint check again with the modified NEW row (may be do this only if the trigger modified the partition key) If it fails, abort. Else proceed with the usual local update. } else { Fire BR UPDATE trigger on original partition. Find the right partition for the modified NEW row. If it is the same partition, proceed with the usual local update. else do the row movement. } > Actually, it seems like that's probably the > *easiest* behavior to implement. Otherwise, you might fire triggers, > discover that you need to re-route the tuple, and then ... fire > triggers again on the new partition, which might reroute it again? Why would update BR trigger fire on the new partition ? On the new partition, only BR INSERT trigger would fire if at all we decide to fire delete+insert triggers. And insert trigger would not again cause the tuple to be re-routed because it's an insert. > >>> I'm not sure that it's sensible to allow a trigger on an >>> individual partition to reroute an update to another partition >>> what if we get an infinite loop?) >> >> You mean, if the other table has another trigger that will again route >> to the original partition ? But this infinite loop problem could occur >> even for 2 normal tables ? > > How? For a normal trigger, nothing it does can change which table is targeted. I thought you were considering the possibility that on the new partition, the trigger function itself is running another update stmt, which is also possible for normal tables. But now I think you are saying, the row that is being inserted into the new partition might get again modified by the INSERT trigger on the new partition, which might in turn cause it to fail the new partition constraint.
But in that case, it will not cause another row movement, because in the new partition, it's an INSERT, not an UPDATE, so the operation would end there, aborted. But correct me if you were thinking of a different scenario that can cause an infinite loop. -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
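To make this flow concrete, here is a rough C-style sketch of what the row-movement branch of ExecUpdate() could look like after the BR UPDATE trigger has fired. This is only an outline of the logic discussed in this sub-thread, not code from the patch: tuple_fits_in_partition() and route_and_insert() are placeholder helpers, the extra ExecDelete() arguments are assumed, and RETURNING, EvalPlanQual and WITH CHECK OPTION handling are omitted.

    /* 'slot' has already been through ExecBRUpdateTriggers() at this point. */
    if (resultRelInfo->ri_PartitionCheck &&
        !tuple_fits_in_partition(resultRelInfo, slot, estate))
    {
        bool        concurrently_deleted = false;

        /*
         * The (possibly trigger-modified) NEW tuple no longer satisfies this
         * partition's constraint: delete it from the source partition, but
         * give up if some other session deleted the row concurrently.
         */
        ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
                   canSetTag, &concurrently_deleted);
        if (concurrently_deleted)
            return NULL;

        /* Route the NEW tuple to the right leaf partition and insert it there. */
        return route_and_insert(mtstate, slot, planSlot, estate);
    }

    /* The tuple still belongs to this partition: ordinary in-place update. */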
On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>> Regarding the trigger issue, I can't claim to have a terribly strong >>>> opinion on this. I think that practically anything we do here might >>>> upset somebody, but probably any halfway-reasonable thing we choose to >>>> do will be OK for most people. However, there seems to be a >>>> discrepancy between the approach that got the most votes and the one >>>> that is implemented by the v8 patch, so that seems like something to >>>> fix. >>> >>> Yes, I have started working on updating the patch to use that approach >>> (BR and AR update triggers on source and destination partition >>> respectively, instead of delete+insert) The approach taken by the >>> patch (BR update + delete+insert triggers) didn't require any changes >>> in the way ExecDelete() and ExecInsert() were called. Now we would >>> require to skip the delete/insert triggers, so some flags need to be >>> passed to these functions, >>> I thought you already need to pass an additional flag for special handling of ctid in Delete case. For Insert, a new flag needs to be passed and need to have a check for that in few places. > or else have stripped down versions of >>> ExecDelete() and ExecInsert() which don't do other things like >>> RETURNING handling and firing triggers. >> >> See, that strikes me as a pretty good argument for firing the >> DELETE+INSERT triggers... >> >> I'm not wedded to that approach, but "what makes the code simplest?" >> is not a bad tiebreak, other things being equal. > > Yes, that sounds good to me. > I am okay if we want to go ahead with firing BR UPDATE + DELETE + INSERT triggers for an Update statement (when row movement happens) on the argument of code simplicity, but it sounds slightly odd behavior. > But I think we want to wait for other's > opinion because it is quite understandable that two triggers firing on > the same partition sounds odd. > Yeah, but I think we have to rely on docs in this case as behavior is not intuitive. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 1, 2017 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 29, 2017 at 5:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> But I think, we can also take step-by-step approach even for v11. If >>> we agree that it is ok to silently do the updates as long as we >>> document the behaviour, we can go ahead and do this, and then as a >>> second step, implement error handling as a separate patch. If that >>> patch does not materialize, we at least have the current behaviour >>> documented. >> >> I think that is sensible approach if we find the second step involves >> big or complicated changes. > > I think it is definitely a good idea to separate the two patches. > UPDATE tuple routing without any special handling for the EPQ issue is > just a partitioning feature. The proposed handling for the EPQ issue > is an *on-disk format change*. That turns a patch which is subject > only to routine bugs into one which can eat your data permanently -- > so having the "can eat your data permanently" separated out for both > review and commit seems only prudent. For me, it's not a matter of > which patch is big or complicated, but rather a matter of one of them > being a whole lot riskier than the other. Even UPDATE tuple routing > could mess things up pretty seriously if we end up with tuples in the > wrong partition, of course, but the other thing is still worse. > > In terms of a development plan, I think we would need to have both > patches before either could be committed. I believe that everyone > other than me who has expressed an opinion on this issue has said that > it's unacceptable to just ignore the issue, so it doesn't sound like > there will be much appetite for having #1 go into the tree without #2. > I'm still really concerned about that approach because we do not have > very much bit space left and WARM wants to use quite a bit of it. I > think it's quite possible that we'll be sad in the future if we find > that we can't implement feature XYZ because of the bit-space consumed > by this feature. However, I don't have the only vote here and I'm not > going to try to shove this into the tree over multiple objections > (unless there are a lot more votes the other way, but so far there's > no sign of that). > > Greg/Amit's idea of using the CTID field rather than an infomask bit > seems like a possibly promising approach. Not everything that needs > bit-space can use the CTID field, so using it is a little less likely > to conflict with something else we want to do in the future than using > a precious infomask bit. However, I'm worried about this: > > /* Make sure there is no forward chain link in t_ctid */ > tp.t_data->t_ctid = tp.t_self; > > The comment does not say *why* we need to make sure that there is no > forward chain link, but it implies that some code somewhere in the > system does or at one time did depend on no forward link existing. > I think it is to ensure that EvalPlanQual mechanism gets invoked in the right case. The visibility routine will return HeapTupleUpdated both when the tuple is deleted or updated (updated - has a newer version of the tuple), so we use ctid to decide if we need to follow the tuple chain for a newer version of the tuple. > Any such code that still exists will need to be updated. > Yeah. 
> The other potential issue I see here is that I know the WARM code also > tries to use the bit-space in the CTID field; in particular, it uses > the CTID field of the last tuple in a HOT chain to point back to the > root of the chain. That seems like it could conflict with the usage > proposed here, but I'm not totally sure. > The proposed change in WARM tuple patch uses ip_posid field of CTID and we are planning to use ip_blkid field. Here is the relevant text and code from WARM tuple patch: "Store the root line pointer of the WARM chain in the t_ctid.ip_posid field of the last tuple in the chain and mark the tuple header with HEAP_TUPLE_LATEST flag to record that fact." +#define HeapTupleHeaderSetHeapLatest(tup, offnum) \ +do { \ + AssertMacro(OffsetNumberIsValid(offnum)); \ + (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \ + ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \ +} while (0) For further details, refer patch 0001-Track-root-line-pointer-v23_v26 in the below e-mail: https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 5 June 2017 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>>> Regarding the trigger issue, I can't claim to have a terribly strong >>>>> opinion on this. I think that practically anything we do here might >>>>> upset somebody, but probably any halfway-reasonable thing we choose to >>>>> do will be OK for most people. However, there seems to be a >>>>> discrepancy between the approach that got the most votes and the one >>>>> that is implemented by the v8 patch, so that seems like something to >>>>> fix. >>>> >>>> Yes, I have started working on updating the patch to use that approach >>>> (BR and AR update triggers on source and destination partition >>>> respectively, instead of delete+insert) The approach taken by the >>>> patch (BR update + delete+insert triggers) didn't require any changes >>>> in the way ExecDelete() and ExecInsert() were called. Now we would >>>> require to skip the delete/insert triggers, so some flags need to be >>>> passed to these functions, >>>> > > I thought you already need to pass an additional flag for special > handling of ctid in Delete case. Yeah that was unavoidable. > For Insert, a new flag needs to be > passed and need to have a check for that in few places. For skipping delete and insert trigger, we need to include still another flag, and checks in both ExecDelete() and ExecInsert() for skipping both BR and AR trigger, and then in ExecUpdate(), again a call to ExecARUpdateTriggers() before quitting. > >> or else have stripped down versions of >>>> ExecDelete() and ExecInsert() which don't do other things like >>>> RETURNING handling and firing triggers. >>> >>> See, that strikes me as a pretty good argument for firing the >>> DELETE+INSERT triggers... >>> >>> I'm not wedded to that approach, but "what makes the code simplest?" >>> is not a bad tiebreak, other things being equal. >> >> Yes, that sounds good to me. >> > > I am okay if we want to go ahead with firing BR UPDATE + DELETE + > INSERT triggers for an Update statement (when row movement happens) on > the argument of code simplicity, but it sounds slightly odd behavior. Ok. Will keep this behaviour that is already present in the patch. I myself also feel that code simplicity can be used as a tie-breaker if a single behaviour cannot be agreed upon that completely satisfies all aspects. > >> But I think we want to wait for other's >> opinion because it is quite understandable that two triggers firing on >> the same partition sounds odd. >> > > Yeah, but I think we have to rely on docs in this case as behavior is > not intuitive. Agreed. The doc changes in the patch already has explained in detail this behaviour. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
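For illustration only, the extra plumbing that skipping the DELETE/INSERT triggers would need might look roughly like this; the flag and parameter names here are invented for the sketch and are not taken from the patch.

    /*
     * Hypothetical signatures: an extra flag would tell ExecDelete()/ExecInsert()
     * that they are being called as the two halves of a row movement, so that
     * they skip their own BR/AR DELETE/INSERT triggers and RETURNING handling,
     * leaving ExecUpdate() to call ExecARUpdateTriggers() itself at the end.
     */
    static TupleTableSlot *ExecInsert(ModifyTableState *mtstate,
                                      TupleTableSlot *slot,
                                      TupleTableSlot *planSlot,
                                      EState *estate,
                                      bool canSetTag,
                                      bool part_row_movement);

    static TupleTableSlot *ExecDelete(ItemPointer tupleid,
                                      HeapTuple oldtuple,
                                      TupleTableSlot *planSlot,
                                      EPQState *epqstate,
                                      EState *estate,
                                      bool canSetTag,
                                      bool part_row_movement,
                                      bool *concurrently_deleted);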
On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > So, according to that, below would be the logic : > > Run partition constraint check on the original NEW row. > If it succeeds : > { > Fire BR UPDATE trigger on the original partition. > Run partition constraint check again with the modified NEW row > (may be do this only if the trigger modified the partition key) > If it fails, > abort. > Else > proceed with the usual local update. > } > else > { > Fire BR UPDATE trigger on original partition. > Find the right partition for the modified NEW row. > If it is the same partition, > proceed with the usual local update. > else > do the row movement. > } Sure, that sounds about right, although the "Fire BR UPDATE trigger on the original partition." is the same in both branches, so I'm not quite sure why you have that in the "if" block. >> Actually, it seems like that's probably the >> *easiest* behavior to implement. Otherwise, you might fire triggers, >> discover that you need to re-route the tuple, and then ... fire >> triggers again on the new partition, which might reroute it again? > > Why would update BR trigger fire on the new partition ? On the new > partition, only BR INSERT trigger would fire if at all we decide to > fire delete+insert triggers. And insert trigger would not again cause > the tuple to be re-routed because it's an insert. OK, sure, that makes sense. I guess it's really the insert case that I was worried about -- if we have a BEFORE ROW INSERT trigger and it changes the tuple and we reroute it, I think we'd have to fire the BEFORE ROW INSERT on the new partition, which might change the tuple again and cause yet another reroute, and in this worst case this is an infinite loop. But it sounds like we're going to fix that problem -- I think correctly -- by only ever allowing the tuple to be routed once. If some trigger tries to make a change the tuple after that such that re-routing is required, they get an error. And what you are describing here seems like it will be fine. > But now I think you are saying, the row that is being inserted into > the new partition might get again modified by the INSERT trigger on > the new partition, which might in turn cause it to fail the new > partition constraint. But in that case, it will not cause another row > movement, because in the new partition, it's an INSERT, not an UPDATE, > so the operation would end there, aborted. Yeah, that's what I was worried about. I didn't want a row movement to be able to trigger another row movement and so on ad infinitum. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Greg/Amit's idea of using the CTID field rather than an infomask bit >> seems like a possibly promising approach. Not everything that needs >> bit-space can use the CTID field, so using it is a little less likely >> to conflict with something else we want to do in the future than using >> a precious infomask bit. However, I'm worried about this: >> >> /* Make sure there is no forward chain link in t_ctid */ >> tp.t_data->t_ctid = tp.t_self; >> >> The comment does not say *why* we need to make sure that there is no >> forward chain link, but it implies that some code somewhere in the >> system does or at one time did depend on no forward link existing. > > I think it is to ensure that EvalPlanQual mechanism gets invoked in > the right case. The visibility routine will return HeapTupleUpdated > both when the tuple is deleted or updated (updated - has a newer > version of the tuple), so we use ctid to decide if we need to follow > the tuple chain for a newer version of the tuple. That would explain why need to make sure that there *is* a forward chain link in t_ctid for an update, but it doesn't explain why we need to make sure that there *isn't* a forward link for delete. > The proposed change in WARM tuple patch uses ip_posid field of CTID > and we are planning to use ip_blkid field. Here is the relevant text > and code from WARM tuple patch: > > "Store the root line pointer of the WARM chain in the t_ctid.ip_posid > field of the last tuple in the chain and mark the tuple header with > HEAP_TUPLE_LATEST flag to record that fact." > > +#define HeapTupleHeaderSetHeapLatest(tup, offnum) \ > +do { \ > + AssertMacro(OffsetNumberIsValid(offnum)); \ > + (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \ > + ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \ > +} while (0) > > For further details, refer patch 0001-Track-root-line-pointer-v23_v26 > in the below e-mail: > https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 6, 2017 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Greg/Amit's idea of using the CTID field rather than an infomask bit >>> seems like a possibly promising approach. Not everything that needs >>> bit-space can use the CTID field, so using it is a little less likely >>> to conflict with something else we want to do in the future than using >>> a precious infomask bit. However, I'm worried about this: >>> >>> /* Make sure there is no forward chain link in t_ctid */ >>> tp.t_data->t_ctid = tp.t_self; >>> >>> The comment does not say *why* we need to make sure that there is no >>> forward chain link, but it implies that some code somewhere in the >>> system does or at one time did depend on no forward link existing. >> >> I think it is to ensure that EvalPlanQual mechanism gets invoked in >> the right case. The visibility routine will return HeapTupleUpdated >> both when the tuple is deleted or updated (updated - has a newer >> version of the tuple), so we use ctid to decide if we need to follow >> the tuple chain for a newer version of the tuple. > > That would explain why need to make sure that there *is* a forward > chain link in t_ctid for an update, but it doesn't explain why we need > to make sure that there *isn't* a forward link for delete. > As far as I understand, it is to ensure that for deleted rows, nothing more needs to be done. For example, see the below check in ExecUpdate/ExecDelete. if (!ItemPointerEquals(tupleid, &hufd.ctid)) { .. } .. Also a similar check in ExecLockRows. Now for deleted rows, if the t_ctid wouldn't point to itself, then in the mentioned functions, we were not in a position to conclude that the row is deleted. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
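For reference, the check being referred to looks roughly like this today (simplified from the HeapTupleUpdated case in ExecDelete(); ExecUpdate() and ExecLockRows() use the same pattern):

    case HeapTupleUpdated:
        if (IsolationUsesXactSnapshot())
            ereport(ERROR,
                    (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                     errmsg("could not serialize access due to concurrent update")));
        if (!ItemPointerEquals(tupleid, &hufd.ctid))
        {
            /* t_ctid points to a newer version: recheck it with EvalPlanQual. */
            TupleTableSlot *epqslot;

            epqslot = EvalPlanQual(estate, epqstate, resultRelationDesc,
                                   resultRelInfo->ri_RangeTableIndex,
                                   LockTupleExclusive, &hufd.ctid, hufd.xmax);
            if (!TupIsNull(epqslot))
            {
                *tupleid = hufd.ctid;
                goto ldelete;       /* retry the delete against the new version */
            }
        }
        /* t_ctid points at the tuple itself, so the row was deleted: done. */
        return NULL;

Once a row moved to another partition no longer has t_ctid pointing at itself, each of these call sites needs some other way to recognize a deleted-and-moved row, which is what the following messages discuss.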
On 6 June 2017 at 23:52, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> So, according to that, below would be the logic : >> >> Run partition constraint check on the original NEW row. >> If it succeeds : >> { >> Fire BR UPDATE trigger on the original partition. >> Run partition constraint check again with the modified NEW row >> (may be do this only if the trigger modified the partition key) >> If it fails, >> abort. >> Else >> proceed with the usual local update. >> } >> else >> { >> Fire BR UPDATE trigger on original partition. >> Find the right partition for the modified NEW row. >> If it is the same partition, >> proceed with the usual local update. >> else >> do the row movement. >> } > > Sure, that sounds about right, although the "Fire BR UPDATE trigger on > the original partition." is the same in both branches, so I'm not > quite sure why you have that in the "if" block. Actually after coding this logic, it looks a bit different. See ExecUpdate() in the attached file trigger_related_changes.patch ---- Now that we are making sure trigger won't change the partition of the tuple, next thing we need to do is, make sure the tuple routing setup is done *only* if the UPDATE is modifying partition keys. Otherwise, this will degrade normal update performance. Below is the logic I am implementing for determining whether the UPDATE is modifying partition keys. In ExecInitModifyTable() ... Call GetUpdatedColumns(mtstate->rootResultRelInfo, estate) to get updated_columns. For each of the updated_columns : { Check if the column is part of partition key quals of any of the relations in mtstate->resultRelInfo[] array. /* * mtstate->resultRelInfo[] contains exactly those leaf partitions * which qualify the update quals. */ If (it is part of partition key quals of at least one of the relations) { Do ExecSetupPartitionTupleRouting() for the root partition. break; } } Few things need to be considered : Use Relation->rd_partcheck to get partition check quals of each of the relations in mtstate->resultRelInfo[]. The Relation->rd_partcheck of the leaf partitions would include the ancestors' partition quals as well. So we are good: we don't have to explicitly get the upper partition constraints. Note that an UPDATE can modify a column which is not used in a partition constraint expressions of any of the partitions or partitioned tables in the subtree, but that column may have been used in partition constraint of a partitioned table belonging to upper subtree. All of the relations in mtstate->resultRelInfo are already open. So we don't need to re-open any more relations to get the partition quals. The column bitmap set returned by GetUpdatedColumns() refer to attribute numbers w.r.t. to the root partition. And the mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So we need to do something similar to map_partition_varattnos() to change the updated columns attnos to the leaf partitions and walk down the partition constraint expressions to find if the attnos are present there. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
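A rough sketch of the check described in the previous message might look like the following. This is only an outline, not the patch: IsPartitionKeyUpdate() is just the name used in this discussion, GetUpdatedColumns() is assumed to be the existing macro in nodeModifyTable.c, and the attribute-number mapping question raised above is glossed over.

    /*
     * Sketch only: decide at ExecInitModifyTable() time whether this UPDATE
     * can change the partition key of any affected leaf partition, so that
     * tuple routing is set up only when it may actually be needed.
     */
    static bool
    IsPartitionKeyUpdate(ModifyTableState *mtstate, EState *estate)
    {
        int     i;

        for (i = 0; i < mtstate->mt_nplans; i++)
        {
            ResultRelInfo *rri = &mtstate->resultRelInfo[i];
            Bitmapset  *updatedCols = GetUpdatedColumns(rri, estate);
            List       *partqual = RelationGetPartitionQual(rri->ri_RelationDesc);
            Bitmapset  *partattrs = NULL;

            /*
             * Collect the attributes referenced by this leaf partition's
             * partition constraint (which includes the ancestors' constraints)
             * and see whether any of them is assigned to.  Both bitmapsets use
             * attnos offset by FirstLowInvalidHeapAttributeNumber.
             */
            pull_varattnos((Node *) partqual, 1, &partattrs);
            if (bms_overlap(updatedCols, partattrs))
                return true;
        }
        return false;
    }

ExecInitModifyTable() would then call ExecSetupPartitionTupleRouting() for the root table only when the operation is an UPDATE and IsPartitionKeyUpdate() returns true.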
On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > The column bitmap set returned by GetUpdatedColumns() refer to > attribute numbers w.r.t. to the root partition. And the > mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So > we need to do something similar to map_partition_varattnos() to change > the updated columns attnos to the leaf partitions I was wrong about this. Each of the mtstate->resultRelInfo[] has its own corresponding RangeTblEntry with its own updatedCols having attnos accordingly adjusted to refer its own table attributes. So we don't have to do the mapping; we need to get modifedCols separately for each of the ResultRelInfo, rather than the root relinfo. > and walk down the > partition constraint expressions to find if the attnos are present > there. But this we will need to do. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > As far as I understand, it is to ensure that for deleted rows, nothing > more needs to be done. For example, see the below check in > ExecUpdate/ExecDelete. > if (!ItemPointerEquals(tupleid, &hufd.ctid)) > { > .. > } > .. > > Also a similar check in ExecLockRows. Now for deleted rows, if the > t_ctid wouldn't point to itself, then in the mentioned functions, we > were not in a position to conclude that the row is deleted. Right, so we would have to find all such checks and change them to use some other method to conclude that the row is deleted. What method would we use? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> As far as I understand, it is to ensure that for deleted rows, nothing >> more needs to be done. For example, see the below check in >> ExecUpdate/ExecDelete. >> if (!ItemPointerEquals(tupleid, &hufd.ctid)) >> { >> .. >> } >> .. >> >> Also a similar check in ExecLockRows. Now for deleted rows, if the >> t_ctid wouldn't point to itself, then in the mentioned functions, we >> were not in a position to conclude that the row is deleted. > > Right, so we would have to find all such checks and change them to use > some other method to conclude that the row is deleted. What method > would we use? > I think before doing above check we can simply check if ctid.ip_blkid contains InvalidBlockNumber, then return an error. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
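A minimal sketch of the check being suggested, assuming the row-movement patch marks a moved-away tuple by storing InvalidBlockNumber in t_ctid's block number; the error wording and exact placement are illustrative only.

/* Hypothetical check, placed before the existing self-pointer test. */
if (!BlockNumberIsValid(BlockIdGetBlockNumber(&hufd.ctid.ip_blkid)))
    ereport(ERROR,
            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
             errmsg("tuple to be updated was already moved to another partition due to concurrent update")));

if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
    /* existing handling of a concurrently updated (not deleted) row */
}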
On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> As far as I understand, it is to ensure that for deleted rows, nothing >>> more needs to be done. For example, see the below check in >>> ExecUpdate/ExecDelete. >>> if (!ItemPointerEquals(tupleid, &hufd.ctid)) >>> { >>> .. >>> } >>> .. >>> >>> Also a similar check in ExecLockRows. Now for deleted rows, if the >>> t_ctid wouldn't point to itself, then in the mentioned functions, we >>> were not in a position to conclude that the row is deleted. >> >> Right, so we would have to find all such checks and change them to use >> some other method to conclude that the row is deleted. What method >> would we use? > > I think before doing above check we can simply check if ctid.ip_blkid > contains InvalidBlockNumber, then return an error. Hmm, OK. That case never happens today? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7 June 2017 at 20:19, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> The column bitmap set returned by GetUpdatedColumns() refer to >> attribute numbers w.r.t. to the root partition. And the >> mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So >> we need to do something similar to map_partition_varattnos() to change >> the updated columns attnos to the leaf partitions > > I was wrong about this. Each of the mtstate->resultRelInfo[] has its > own corresponding RangeTblEntry with its own updatedCols having attnos > accordingly adjusted to refer its own table attributes. So we don't > have to do the mapping; we need to get modifedCols separately for each > of the ResultRelInfo, rather than the root relinfo. > >> and walk down the >> partition constraint expressions to find if the attnos are present >> there. > > But this we will need to do. Attached is v9 patch. This covers the two parts discussed upthread : 1. Prevent triggers from causing the row movement. 2. Setup the tuple routing in ExecInitModifyTable(), but only if a partition key is modified. Check new function IsPartitionKeyUpdate(). Have rebased the patch to consider changes done in commit 15ce775faa428dc9 to prevent triggers from violating partition constraints. There, for the call to ExecFindPartition() in ExecInsert, we need to fetch the mtstate->rootResultRelInfo in case the operation is part of update row movement. This is because the root partition is not available in the resultRelInfo for UPDATE. Added many more test scenarios in update.sql that cover the above. I am yet to test the concurrency part using isolation tester. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
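To illustrate the rootResultRelInfo adjustment mentioned above (the exact placement in ExecInsert() and the surrounding variable names are assumptions; only mtstate->rootResultRelInfo and ExecFindPartition() are taken from the discussion):

    ResultRelInfo *rootRelInfo = resultRelInfo;
    int         leaf_part_index;

    /*
     * Sketch only: when this insert is part of UPDATE row movement, the
     * per-subplan result relation is a leaf partition, not the root
     * partitioned table, so tuple routing must start from the separately
     * stored root ResultRelInfo.
     */
    if (mtstate->operation == CMD_UPDATE && mtstate->rootResultRelInfo != NULL)
        rootRelInfo = mtstate->rootResultRelInfo;

    leaf_part_index = ExecFindPartition(rootRelInfo,
                                        mtstate->mt_partition_dispatch_info,
                                        slot, estate);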
On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> As far as I understand, it is to ensure that for deleted rows, nothing >>>> more needs to be done. For example, see the below check in >>>> ExecUpdate/ExecDelete. >>>> if (!ItemPointerEquals(tupleid, &hufd.ctid)) >>>> { >>>> .. >>>> } >>>> .. >>>> >>>> Also a similar check in ExecLockRows. Now for deleted rows, if the >>>> t_ctid wouldn't point to itself, then in the mentioned functions, we >>>> were not in a position to conclude that the row is deleted. >>> >>> Right, so we would have to find all such checks and change them to use >>> some other method to conclude that the row is deleted. What method >>> would we use? >> >> I think before doing above check we can simply check if ctid.ip_blkid >> contains InvalidBlockNumber, then return an error. > > Hmm, OK. That case never happens today? > As per my understanding that case doesn't exist. I will verify again once the patch is available. I can take a crack at it if Amit Khandekar is busy with something else or is not comfortable in this area. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 9 June 2017 at 19:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>> As far as I understand, it is to ensure that for deleted rows, nothing >>>>> more needs to be done. For example, see the below check in >>>>> ExecUpdate/ExecDelete. >>>>> if (!ItemPointerEquals(tupleid, &hufd.ctid)) >>>>> { >>>>> .. >>>>> } >>>>> .. >>>>> >>>>> Also a similar check in ExecLockRows. Now for deleted rows, if the >>>>> t_ctid wouldn't point to itself, then in the mentioned functions, we >>>>> were not in a position to conclude that the row is deleted. >>>> >>>> Right, so we would have to find all such checks and change them to use >>>> some other method to conclude that the row is deleted. What method >>>> would we use? >>> >>> I think before doing above check we can simply check if ctid.ip_blkid >>> contains InvalidBlockNumber, then return an error. >> >> Hmm, OK. That case never happens today? >> > > As per my understanding that case doesn't exist. I will verify again > once the patch is available. I can take a crack at it if Amit > Khandekar is busy with something else or is not comfortable in this > area. Amit, I was going to have a look at this, once I finish with the other part. I was busy on getting that done first. But your comments/help are always welcome. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Fri, Jun 9, 2017 at 7:48 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 9 June 2017 at 19:10, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> I think before doing above check we can simply check if ctid.ip_blkid >>>> contains InvalidBlockNumber, then return an error. >>> >>> Hmm, OK. That case never happens today? >>> >> >> As per my understanding that case doesn't exist. I will verify again >> once the patch is available. I can take a crack at it if Amit >> Khandekar is busy with something else or is not comfortable in this >> area. > > Amit, I was going to have a look at this, once I finish with the other > part. > Sure, will wait for your patch to be available. I can help by reviewing the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
While rebasing my patch over the recent commit below, I realized that a similar issue exists for the update-tuple-routing patch as well : commit 78a030a441966d91bc7e932ef84da39c3ea7d970 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Mon Jun 12 23:29:44 2017 -0400 Fix confusion about number of subplans in partitioned INSERT setup. The above issue was about incorrectly using 'i' in mtstate->mt_plans[i] while handling WITH CHECK OPTIONS in ExecInitModifyTable(), where 'i' was actually meant to refer to positions in mtstate->mt_partitions[] (whose length is mtstate->mt_num_partitions). For INSERT, there is only a single plan element in the mtstate->mt_plans[] array. Similarly, for update-tuple routing, we cannot use mtstate->mt_plans[i], because 'i' refers to a position in mtstate->mt_partitions[], whereas mtstate->mt_plans is not at all in the order of mtstate->mt_partitions; in fact mt_plans contains only the plans for the partitions left after pruning, so it can well be a small subset of the total partitions. I am working on an updated patch to fix the above.
On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > While rebasing my patch for the below recent commit, I realized that a > similar issue exists for the uptate-tuple-routing patch as well : > > commit 78a030a441966d91bc7e932ef84da39c3ea7d970 > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: Mon Jun 12 23:29:44 2017 -0400 > > Fix confusion about number of subplans in partitioned INSERT setup. > > The above issue was about incorrectly using 'i' in > mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in > ExecInitModifyTable(), where 'i' was actually meant to refer to the > positions in mtstate->mt_num_partitions. Actually for INSERT, there is > only a single plan element in mtstate->mt_plans[] array. > > Similarly, for update-tuple routing, we cannot use > mtstate->mt_plans[i], because 'i' refers to position in > mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in > order of mtstate->mt_partitions; in fact mt_plans has only the plans > that are to be scanned on pruned partitions; so it can well be a small > subset of total partitions. > > I am working on an updated patch to fix the above. Attached patch v10 fixes the above. In the existing code, where it builds WCO constraints for each leaf partition; with the patch, that code now is applicable to row-movement-updates as well. So the assertions in the code are now updated to allow the same. Secondly, the mapping for each of the leaf partitions was constructed using the root partition attributes. Now in the patch, the mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as reference. So effectively, map_partition_varattnos() now represents not just parent-to-partition mapping, but rather, mapping between any two partitions/partitioned_tables. It's done this way, so that we can have a common WCO building code for inserts as well as updates. For e.g. for inserts, the first (and only) WCO belongs to node->nominalRelation so nominalRelation is used for map_partition_varattnos(), whereas for updates, first WCO belongs to the first resultRelInfo which is not same as nominalRelation. So in the patch, in both cases, we use the first resultRelInfo and the WCO of the first resultRelInfo for map_partition_varattnos(). Similar thing is done for Returning expressions. --------- Another change in the patch is : for ExecInitQual() for WCO quals, mtstate->ps is used as parent, rather than first plan. For updates, first plan does not belong to the parent partition. In fact, I think in all cases, we should use mtstate->ps as the parent. mtstate->mt_plans[0] don't look like they should be considered parent of these expressions. May be it does not matter to which parent we link these quals, because there is no ReScan for ExecModifyTable(). Note that for RETURNING projection expressions, we do use mtstate->ps. -------- There is another issue I discovered. The row-movement works fine if the destination leaf partition has different attribute ordering than the root : the existing insert-tuple-routing mapping handles that. But if the source partition has different ordering w.r.t. the root, it has a problem : there is no mapping in the opposite direction, i.e. from the leaf to root. And we require that because the tuple of source leaf partition needs to be converted to root partition tuple descriptor, since ExecFindPartition() starts with root. To fix this, I have introduced another mapping array mtstate->mt_resultrel_maps[]. This corresponds to the mtstate->resultRelInfo[]. 
We don't require per-leaf-partition mapping, because the update result relations are pruned subset of the total leaf partitions. So in ExecInsert, before calling ExecFindPartition(), we need to convert the leaf partition tuple to root using this reverse mapping. Since we need to convert the tuple here, and again after ExecFindPartition() for the found leaf partition, I have replaced the common code by new function ConvertPartitionTupleSlot(). ------- Used a new flag is_partitionkey_update in ExecInitModifyTable(), which can be re-used in subsequent sections , rather than again calling IsPartitionKeyUpdate() function again. ------- Some more test scenarios added that cover above changes. Basically partitions that have different tuple descriptors than parents. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
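A sketch of what such a helper might look like; the exact signature of ConvertPartitionTupleSlot() in the patch may differ from what is shown here.

/*
 * Sketch only: common conversion step used both before ExecFindPartition()
 * (leaf-to-root, via the new mt_resultrel_maps) and after it (root-to-leaf,
 * via the existing insert tuple-routing maps).
 */
static HeapTuple
ConvertPartitionTupleSlot(TupleConversionMap *map, HeapTuple tuple,
                          TupleTableSlot *new_slot, TupleTableSlot **p_slot)
{
    if (map == NULL)
        return tuple;           /* descriptors already match; nothing to do */

    tuple = do_convert_tuple(tuple, map);

    /* Switch to the slot whose tuple descriptor matches the converted tuple. */
    *p_slot = new_slot;
    ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);

    return tuple;
}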
When I tested partition-key-update on a partitioned table having no child partitions, it crashed. This is because there is an Assert(mtstate->mt_num_partitions > 0) for creating the partition-to-root map, which fails if there are no partitions under the partitioned table. Actually we should skp creating this map if there are no partitions under the partitioned table on which UPDATE is run. So the attached patch has this new change to fix it (and appropriate additional test case added) : --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -2006,15 +2006,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) * descriptor of a source partition does not match the root partition * descriptor. In such case we need to convert tuples to the root partition * tuple descriptor, because the search for destination partition starts - * from the root. + * from the root. Skip this setup if it's not a partition key update or if + * there are no partitions below this partitioned table. */ - if (is_partitionkey_update) + if (is_partitionkey_update && mtstate->mt_num_partitions > 0) { TupleConversionMap **tup_conv_maps; TupleDesc outdesc; - Assert(mtstate->mt_num_partitions > 0); - mtstate->mt_resultrel_maps = (TupleConversionMap **) palloc0(sizeof(TupleConversionMap*) * nplans); On 15 June 2017 at 23:06, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> While rebasing my patch for the below recent commit, I realized that a >> similar issue exists for the uptate-tuple-routing patch as well : >> >> commit 78a030a441966d91bc7e932ef84da39c3ea7d970 >> Author: Tom Lane <tgl@sss.pgh.pa.us> >> Date: Mon Jun 12 23:29:44 2017 -0400 >> >> Fix confusion about number of subplans in partitioned INSERT setup. >> >> The above issue was about incorrectly using 'i' in >> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in >> ExecInitModifyTable(), where 'i' was actually meant to refer to the >> positions in mtstate->mt_num_partitions. Actually for INSERT, there is >> only a single plan element in mtstate->mt_plans[] array. >> >> Similarly, for update-tuple routing, we cannot use >> mtstate->mt_plans[i], because 'i' refers to position in >> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in >> order of mtstate->mt_partitions; in fact mt_plans has only the plans >> that are to be scanned on pruned partitions; so it can well be a small >> subset of total partitions. >> >> I am working on an updated patch to fix the above. > > Attached patch v10 fixes the above. In the existing code, where it > builds WCO constraints for each leaf partition; with the patch, that > code now is applicable to row-movement-updates as well. So the > assertions in the code are now updated to allow the same. Secondly, > the mapping for each of the leaf partitions was constructed using the > root partition attributes. Now in the patch, the > mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as > reference. So effectively, map_partition_varattnos() now represents > not just parent-to-partition mapping, but rather, mapping between any > two partitions/partitioned_tables. It's done this way, so that we can > have a common WCO building code for inserts as well as updates. For > e.g. 
for inserts, the first (and only) WCO belongs to > node->nominalRelation so nominalRelation is used for > map_partition_varattnos(), whereas for updates, first WCO belongs to > the first resultRelInfo which is not same as nominalRelation. So in > the patch, in both cases, we use the first resultRelInfo and the WCO > of the first resultRelInfo for map_partition_varattnos(). > > Similar thing is done for Returning expressions. > > --------- > > Another change in the patch is : for ExecInitQual() for WCO quals, > mtstate->ps is used as parent, rather than first plan. For updates, > first plan does not belong to the parent partition. In fact, I think > in all cases, we should use mtstate->ps as the parent. > mtstate->mt_plans[0] don't look like they should be considered parent > of these expressions. May be it does not matter to which parent we > link these quals, because there is no ReScan for ExecModifyTable(). > > Note that for RETURNING projection expressions, we do use mtstate->ps. > > -------- > > There is another issue I discovered. The row-movement works fine if > the destination leaf partition has different attribute ordering than > the root : the existing insert-tuple-routing mapping handles that. But > if the source partition has different ordering w.r.t. the root, it has > a problem : there is no mapping in the opposite direction, i.e. from > the leaf to root. And we require that because the tuple of source leaf > partition needs to be converted to root partition tuple descriptor, > since ExecFindPartition() starts with root. > > To fix this, I have introduced another mapping array > mtstate->mt_resultrel_maps[]. This corresponds to the > mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping, > because the update result relations are pruned subset of the total > leaf partitions. > > So in ExecInsert, before calling ExecFindPartition(), we need to > convert the leaf partition tuple to root using this reverse mapping. > Since we need to convert the tuple here, and again after > ExecFindPartition() for the found leaf partition, I have replaced the > common code by new function ConvertPartitionTupleSlot(). > > ------- > > Used a new flag is_partitionkey_update in ExecInitModifyTable(), which > can be re-used in subsequent sections , rather than again calling > IsPartitionKeyUpdate() function again. > > ------- > > Some more test scenarios added that cover above changes. Basically > partitions that have different tuple descriptors than parents. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 16, 2017 at 5:36 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > There is another issue I discovered. The row-movement works fine if > the destination leaf partition has different attribute ordering than > the root : the existing insert-tuple-routing mapping handles that. But > if the source partition has different ordering w.r.t. the root, it has > a problem : there is no mapping in the opposite direction, i.e. from > the leaf to root. And we require that because the tuple of source leaf > partition needs to be converted to root partition tuple descriptor, > since ExecFindPartition() starts with root. > > To fix this, I have introduced another mapping array > mtstate->mt_resultrel_maps[]. This corresponds to the > mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping, > because the update result relations are pruned subset of the total > leaf partitions. Hi Amit & Amit, Just a thought: If I understand correctly this new array of tuple conversion maps is the same as mtstate->mt_transition_tupconv_maps in my patch transition-tuples-from-child-tables-v11.patch (hopefully soon to be committed to close a PG10 open item). In my patch I bounce transition tuples from child relations up to the named relation's triggers, and in this patch you bounce child tuples up to the named relation for rerouting, so the conversion requirement is the same. Perhaps we could consider refactoring to build a common struct member on demand for the row movement patch at some point in the future if it makes the code cleaner. -- Thomas Munro http://www.enterprisedb.com
On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached patch v10 fixes the above. In the existing code, where it > builds WCO constraints for each leaf partition; with the patch, that > code now is applicable to row-movement-updates as well. I guess I don't see why it should work like this. In the INSERT case, we must build withCheckOption objects for each partition because those partitions don't appear in the plan otherwise -- but in the UPDATE case, they're already there, so why do we need to build anything at all? Similarly for RETURNING projections. How are the things we need for those cases not already getting built, associated with the relevant resultRelInfos? Maybe there's a concern if some children got pruned - they could turn out later to be the children into which tuples need to be routed. But the patch makes no distinction between possibly-pruned children and any others. > There is another issue I discovered. The row-movement works fine if > the destination leaf partition has different attribute ordering than > the root : the existing insert-tuple-routing mapping handles that. But > if the source partition has different ordering w.r.t. the root, it has > a problem : there is no mapping in the opposite direction, i.e. from > the leaf to root. And we require that because the tuple of source leaf > partition needs to be converted to root partition tuple descriptor, > since ExecFindPartition() starts with root. Seems reasonable, but... > To fix this, I have introduced another mapping array > mtstate->mt_resultrel_maps[]. This corresponds to the > mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping, > because the update result relations are pruned subset of the total > leaf partitions. ... I don't understand how you can *not* need a per-leaf-partition mapping. I mean, maybe you only need the mapping for the *unpruned* leaf partitions but you certainly need a separate mapping for each one of those. It's possible to imagine driving the tuple routing off of just the partition key attributes, extracted from wherever they are inside the tuple at the current level, rather than converting to the root's tuple format. However, that's not totally straightforward because there could be multiple levels of partitioning throughout the tree and different attributes might be needed at different levels. Moreover, in most cases, the mappings are going to end up being no-ops because the column order will be the same, so it's probably not worth complicating the code to try to avoid a double conversion that usually won't happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 20 June 2017 at 03:42, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Just a thought: If I understand correctly this new array of tuple > conversion maps is the same as mtstate->mt_transition_tupconv_maps in > my patch transition-tuples-from-child-tables-v11.patch (hopefully soon > to be committed to close a PG10 open item). In my patch I bounce > transition tuples from child relations up to the named relation's > triggers, and in this patch you bounce child tuples up to the named > relation for rerouting, so the conversion requirement is the same. > Perhaps we could consider refactoring to build a common struct member > on demand for the row movement patch at some point in the future if it > makes the code cleaner. I agree; thanks for bringing this to my attention. The conversion maps in my patch and yours do sound like they are exactly same. And even in case where both update-row-movement and transition tables are playing together, the same map should serve the purpose of both. I will keep a watch on your patch, and check how I can adjust my patch so that I don't have to refactor the mapping. One difference I see is : in your patch, in ExecModifyTable() we jump the current map position for each successive subplan, whereas in my patch, in ExecInsert() we deduce the position of the right map to be fetched using the position of the current resultRelInfo in the mtstate->resultRelInfo[] array. I think your way is more consistent with the existing code. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
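For illustration, deducing the map position in ExecInsert() could look roughly like this; mt_resultrel_maps is the array added by the row-movement patch, and the surrounding variable names are assumptions.

    /*
     * Sketch only: the source partition's position in the per-subplan
     * array identifies its leaf-to-root conversion map.
     */
    int         map_index = resultRelInfo - mtstate->resultRelInfo;
    TupleConversionMap *tupconv_map;

    Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
    tupconv_map = mtstate->mt_resultrel_maps[map_index];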
On 20 June 2017 at 03:46, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Attached patch v10 fixes the above. In the existing code, where it >> builds WCO constraints for each leaf partition; with the patch, that >> code now is applicable to row-movement-updates as well. > > I guess I don't see why it should work like this. In the INSERT case, > we must build withCheckOption objects for each partition because those > partitions don't appear in the plan otherwise -- but in the UPDATE > case, they're already there, so why do we need to build anything at > all? Similarly for RETURNING projections. How are the things we need > for those cases not already getting built, associated with the > relevant resultRelInfos? Maybe there's a concern if some children got > pruned - they could turn out later to be the children into which > tuples need to be routed. But the patch makes no distinction > between possibly-pruned children and any others. Yes, only a subset of the partitions appear in the UPDATE subplans. I think typically for updates, a very small subset of the total leaf partitions will be there in the plans, others would get pruned. IMHO, it would not be worth having an optimization where it opens only those leaf partitions which are not already there in the subplans. Without the optimization, we are able to re-use the INSERT infrastructure without additional changes. > >> There is another issue I discovered. The row-movement works fine if >> the destination leaf partition has different attribute ordering than >> the root : the existing insert-tuple-routing mapping handles that. But >> if the source partition has different ordering w.r.t. the root, it has >> a problem : there is no mapping in the opposite direction, i.e. from >> the leaf to root. And we require that because the tuple of source leaf >> partition needs to be converted to root partition tuple descriptor, >> since ExecFindPartition() starts with root. > > Seems reasonable, but... > >> To fix this, I have introduced another mapping array >> mtstate->mt_resultrel_maps[]. This corresponds to the >> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping, >> because the update result relations are pruned subset of the total >> leaf partitions. > > ... I don't understand how you can *not* need a per-leaf-partition > mapping. I mean, maybe you only need the mapping for the *unpruned* > leaf partitions Yes, we need the mapping only for the unpruned leaf partitions, and those partitions are available in the per-subplan resultRelInfo's. > but you certainly need a separate mapping for each one of those. You mean *each* of the leaf partitions ? I didn't get why we would need it for each one. The tuple targeted for update belongs to one of the per-subplan resultInfos. And this tuple is to be routed to another leaf partition. So the reverse mapping is for conversion from the source resultRelinfo to the root partition. I am unable to figure out a scenario where we would require this reverse mapping for partitions on which UPDATE is *not* going to be executed. > > It's possible to imagine driving the tuple routing off of just the > partition key attributes, extracted from wherever they are inside the > tuple at the current level, rather than converting to the root's tuple > format. 
However, that's not totally straightforward because there > could be multiple levels of partitioning throughout the tree and > different attributes might be needed at different levels. Yes, the conversion anyway occurs at each of these levels even for insert, specifically because there can be different partition attributes each time. For update, its only one additional conversion. But yes, this new mapping would be required for this one single conversion. > Moreover, > in most cases, the mappings are going to end up being no-ops because > the column order will be the same, so it's probably not worth > complicating the code to try to avoid a double conversion that usually > won't happen. I agree. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> I guess I don't see why it should work like this. In the INSERT case, >> we must build withCheckOption objects for each partition because those >> partitions don't appear in the plan otherwise -- but in the UPDATE >> case, they're already there, so why do we need to build anything at >> all? Similarly for RETURNING projections. How are the things we need >> for those cases not already getting built, associated with the >> relevant resultRelInfos? Maybe there's a concern if some children got >> pruned - they could turn out later to be the children into which >> tuples need to be routed. But the patch makes no distinction >> between possibly-pruned children and any others. > > Yes, only a subset of the partitions appear in the UPDATE subplans. I > think typically for updates, a very small subset of the total leaf > partitions will be there in the plans, others would get pruned. IMHO, > it would not be worth having an optimization where it opens only those > leaf partitions which are not already there in the subplans. Without > the optimization, we are able to re-use the INSERT infrastructure > without additional changes. Well, that is possible, but certainly not guaranteed. I mean, somebody could do a whole-table UPDATE, or an UPDATE that hits a smattering of rows in every partition; e.g. the table is partitioned on order number, and you do UPDATE lineitem SET product_code = 'K372B' WHERE product_code = 'K372'. Leaving that aside, the point here is that you're rebuilding withCheckOptions and returningLists that have already been built in the planner. That's bad for two reasons. First, it's inefficient, especially if there are many partitions. Second, it will amount to a functional bug if you get a different answer than the planner did. Note this comment in the existing code: /* * Build WITH CHECK OPTION constraints for each leaf partition rel. Note * that we didn't build the withCheckOptionListfor each partition within * the planner, but simple translation of the varattnos for each partition * will suffice. This only occurs for the INSERT case; UPDATE/DELETE * cases are handled above. */ The comment "UPDATE/DELETE cases are handled above" is referring to the code that initializes the WCOs generated by the planner. You've modified the comment in your patch, but the associated code: your updated comment says that only "DELETEs and local UPDATES are handled above", but in reality, *all* updates are still handled above. And then they are handled again here. Similarly for returning lists. It's certainly not OK for the comment to be inaccurate, but I think it's also bad to redo the work which the planner has already done, even if it makes the patch smaller. Also, I feel like it's probably not correct to use the first result relation as the nominal relation for building WCOs and returning lists anyway. I mean, if the first result relation has a different column order than the parent relation, isn't this just broken? If it works for some reason, the comments don't explain what that reason is. >> ... I don't understand how you can *not* need a per-leaf-partition >> mapping. I mean, maybe you only need the mapping for the *unpruned* >> leaf partitions > > Yes, we need the mapping only for the unpruned leaf partitions, and > those partitions are available in the per-subplan resultRelInfo's. OK. >> but you certainly need a separate mapping for each one of those. > > You mean *each* of the leaf partitions ? 
I didn't get why we would > need it for each one. The tuple targeted for update belongs to one of > the per-subplan resultInfos. And this tuple is to be routed to another > leaf partition. So the reverse mapping is for conversion from the > source resultRelinfo to the root partition. I am unable to figure out > a scenario where we would require this reverse mapping for partitions > on which UPDATE is *not* going to be executed. I agree - the reverse mapping is only needed for the partitions in which UPDATE will be executed. Some other things: + * The row was already deleted by a concurrent DELETE. So we don't + * have anything to update. I find this explanation, and the surrounding comments, inadequate. It doesn't really explain why we're doing this. I think it should say something like this: For a normal UPDATE, the case where the tuple has been the subject of a concurrent UPDATE or DELETE would be handled by the EvalPlanQual machinery, but for an UPDATE that we've translated into a DELETE from this partition and an INSERT into some other partition, that's not available, because CTID chains can't span relation boundaries. We mimic the semantics to a limited extent by skipping the INSERT if the DELETE fails to find a tuple. This ensures that two concurrent attempts to UPDATE the same tuple at the same time can't turn one tuple into two, and that an UPDATE of a just-deleted tuple can't resurrect it. + bool partition_check_passed_with_trig_tuple; + + partition_check_passed = + (resultRelInfo->ri_PartitionCheck && + ExecPartitionCheck(resultRelInfo, slot, estate)); + + partition_check_passed_with_trig_tuple = + (resultRelInfo->ri_PartitionCheck && + ExecPartitionCheck(resultRelInfo, trig_slot, estate)); + if (partition_check_passed) + { + /* + * If it's the trigger that is causing partition constraint + * violation, abort. We don't want a trigger to cause tuple + * routing. + */ + if (!partition_check_passed_with_trig_tuple) + ExecPartitionCheckEmitError(resultRelInfo, + trig_slot, estate); + } + else + { + /* + * Partition constraint failed with original NEW tuple. But the + * trigger might even have modifed the tuple such that it fits + * back into the partition. So partition constraint check + * should be based on *final* NEW tuple. + */ + partition_check_passed = partition_check_passed_with_trig_tuple; + } Maybe I inadvertently gave the contrary impression in some prior review, but this logic doesn't seem right to me. I don't think there's any problem with a BR UPDATE trigger causing tuple routing. What I want to avoid is repeatedly rerouting the same tuple, but I don't think that could happen even without this guard. We've now fixed insert tuple routing so that a BR INSERT trigger can't cause the partition constraint to be violated (cf. commit 15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for update tuple routing to trigger additional BR UPDATE triggers. So I don't see the point of checking the constraints twice here. I think what you want to do is get rid of all the changes here and instead adjust the logic just before ExecConstraints() to invoke ExecPartitionCheck() on the post-trigger version of the tuple. Parenthetically, if we decided to keep this logic as you have it, the code that sets partition_check_passed and partition_check_passed_with_trig_tuple doesn't need to check resultRelInfo->ri_PartitionCheck because the surrounding "if" block already did. 
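A sketch of that suggested simplification, placed just before ExecConstraints() in ExecUpdate(); the surrounding structure is assumed rather than taken from the patch.

    /*
     * Sketch only: check the partition constraint once, on the final
     * (post-BR-trigger) tuple, and let a failure mean "route the tuple"
     * rather than "error out".
     */
    if (resultRelInfo->ri_PartitionCheck &&
        !ExecPartitionCheck(resultRelInfo, slot, estate))
    {
        /*
         * The tuple no longer satisfies this partition's constraint:
         * perform row movement, i.e. ExecDelete() from this partition
         * followed by ExecInsert() into the partition chosen by tuple
         * routing.
         */
    }
    else
    {
        /* Usual local update; ExecConstraints() runs as before. */
    }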
+ for (i = 0; i < num_rels; i++) + { + ResultRelInfo *resultRelInfo = &result_rels[i]; + Relation rel = resultRelInfo->ri_RelationDesc; + Bitmapset *expr_attrs = NULL; + + pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs); + + /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */ + if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate))) + return true; + } This seems like an awfully expensive way of performing this test. Under what circumstances could this be true for some result relations and false for others; or in other words, why do we have to loop over all of the result relations? It seems to me that the user has typed something like: UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ... AND thunk = ... If either thingy or whatsit is a partitioning column, UPDATE tuple routing might be needed - and it should be able to test that by a *single* comparison between the set of columns being updated and the partitioning columns, without needing to repeat for every partitions. Perhaps that test needs to be done at plan time and saved in the plan, rather than performed here -- or maybe it's easy enough to do it here. One problem is that, if BR UPDATE triggers are in fact allowed to cause tuple routing as I proposed above, the presence of a BR UPDATE trigger for any partition could necessitate UPDATE tuple routing for queries that wouldn't otherwise need it. But even if you end up inserting a test for that case, it can surely be a lot cheaper than this, since it only involves checking a boolean flag, not a bitmapset. It could be argue that we ought to prohibit BR UPDATE triggers from causing tuple routing so that we don't have to do this test at all, but I'm not sure that's a good trade-off. It seems to necessitate checking the partition constraint twice per tuple instead of once per tuple, which like a very heavy price. +#define GetUpdatedColumns(relinfo, estate) \ + (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols) I think this should be moved to a header file (and maybe turned into a static inline function) rather than copy-pasting the definition into a new file. - List *mapped_wcoList; + List *mappedWco; List *wcoExprs = NIL; ListCell *ll; - /* varno = node->nominalRelation */ - mapped_wcoList = map_partition_varattnos(wcoList, - node->nominalRelation, - partrel, rel); - foreach(ll, mapped_wcoList) + mappedWco = map_partition_varattnos(firstWco, firstVarno, + partrel, firstResultRel); + foreach(ll, mappedWco) { WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll)); ExprState *wcoExpr = ExecInitQual(castNode(List, wco->qual), - plan); + &mtstate->ps); wcoExprs = lappend(wcoExprs, wcoExpr); } - resultRelInfo->ri_WithCheckOptions = mapped_wcoList; + resultRelInfo->ri_WithCheckOptions = mappedWco; Renaming the variable looks fairly pointless, unless I'm missing something? Regarding the tests, it seems like you've got a test case where you update a sub-partition and it fails because the tuple would need to be moved out of a sub-tree, which is good. But I think it would also be good to have a case where you update a sub-partition and it succeeds in moving the tuple within the subtree. I don't see one like that presently; it seems all the others update the topmost root or the leaf. I also think it would be a good idea to make sub_parted's column order different from both list_parted and its own children, and maybe use a diversity of data types (e.g. int4, int8, text instead of making everything int). 
+select tableoid::regclass , * from list_parted where a = 2 order by 1; +update list_parted set b = c + a where a = 2; +select tableoid::regclass , * from list_parted where a = 2 order by 1; The extra space before the comma looks strange. Also, please make a habit of checking patches for whitespace errors using git diff --check. [rhaas pgsql]$ git diff --check src/backend/executor/nodeModifyTable.c:384: indent with spaces. + tuple, &slot); src/backend/executor/nodeModifyTable.c:1966: space before tab in indent. + IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans)); You will notice these kinds of things if you read the diff you are submitting before you press send, because git highlights them in bright red. That's a good practice for many other reasons, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/06/21 3:53, Robert Haas wrote: > On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> I guess I don't see why it should work like this. In the INSERT case, >>> we must build withCheckOption objects for each partition because those >>> partitions don't appear in the plan otherwise -- but in the UPDATE >>> case, they're already there, so why do we need to build anything at >>> all? Similarly for RETURNING projections. How are the things we need >>> for those cases not already getting built, associated with the >>> relevant resultRelInfos? Maybe there's a concern if some children got >>> pruned - they could turn out later to be the children into which >>> tuples need to be routed. But the patch makes no distinction >>> between possibly-pruned children and any others. >> >> Yes, only a subset of the partitions appear in the UPDATE subplans. I >> think typically for updates, a very small subset of the total leaf >> partitions will be there in the plans, others would get pruned. IMHO, >> it would not be worth having an optimization where it opens only those >> leaf partitions which are not already there in the subplans. Without >> the optimization, we are able to re-use the INSERT infrastructure >> without additional changes. > > Well, that is possible, but certainly not guaranteed. I mean, > somebody could do a whole-table UPDATE, or an UPDATE that hits a > smattering of rows in every partition; e.g. the table is partitioned > on order number, and you do UPDATE lineitem SET product_code = 'K372B' > WHERE product_code = 'K372'. > > Leaving that aside, the point here is that you're rebuilding > withCheckOptions and returningLists that have already been built in > the planner. That's bad for two reasons. First, it's inefficient, > especially if there are many partitions. Second, it will amount to a > functional bug if you get a different answer than the planner did. > Note this comment in the existing code: > > /* > * Build WITH CHECK OPTION constraints for each leaf partition rel. Note > * that we didn't build the withCheckOptionList for each partition within > * the planner, but simple translation of the varattnos for each partition > * will suffice. This only occurs for the INSERT case; UPDATE/DELETE > * cases are handled above. > */ > > The comment "UPDATE/DELETE cases are handled above" is referring to > the code that initializes the WCOs generated by the planner. You've > modified the comment in your patch, but the associated code: your > updated comment says that only "DELETEs and local UPDATES are handled > above", but in reality, *all* updates are still handled above. And > then they are handled again here. Similarly for returning lists. > It's certainly not OK for the comment to be inaccurate, but I think > it's also bad to redo the work which the planner has already done, > even if it makes the patch smaller. I guess this has to do with the UPDATE turning into DELETE+INSERT. So, it seems like WCOs are being initialized for the leaf partitions (ResultRelInfos in the mt_partitions array) that are in turn are initialized for the aforementioned INSERT. That's why the term "...local UPDATEs" in the new comment text. If that's true, I wonder if it makes sense to apply what would be WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into by calling ExecInsert()? > Also, I feel like it's probably not correct to use the first result > relation as the nominal relation for building WCOs and returning lists > anyway. 
I mean, if the first result relation has a different column > order than the parent relation, isn't this just broken? If it works > for some reason, the comments don't explain what that reason is. Yep, it's more appropriate to use ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That is, if answer to the question I raised above is positive. Thanks, Amit
On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:>> The comment "UPDATE/DELETE cases are handled above" is referring to >> the code that initializes the WCOs generated by the planner. You've >> modified the comment in your patch, but the associated code: your >> updated comment says that only "DELETEs and local UPDATES are handled >> above", but in reality, *all* updates are still handled above. And >> then they are handled again here. Similarly for returning lists. >> It's certainly not OK for the comment to be inaccurate, but I think >> it's also bad to redo the work which the planner has already done, >> even if it makes the patch smaller. > > I guess this has to do with the UPDATE turning into DELETE+INSERT. So, it > seems like WCOs are being initialized for the leaf partitions > (ResultRelInfos in the mt_partitions array) that are in turn are > initialized for the aforementioned INSERT. That's why the term "...local > UPDATEs" in the new comment text. > > If that's true, I wonder if it makes sense to apply what would be > WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into > by calling ExecInsert()? I think we probably should apply the insert policy, just as we're executing the insert trigger. >> Also, I feel like it's probably not correct to use the first result >> relation as the nominal relation for building WCOs and returning lists >> anyway. I mean, if the first result relation has a different column >> order than the parent relation, isn't this just broken? If it works >> for some reason, the comments don't explain what that reason is. > > Yep, it's more appropriate to use > ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That > is, if answer to the question I raised above is positive. The questions appear to me to be independent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21 June 2017 at 00:23, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> I guess I don't see why it should work like this. In the INSERT case, >>> we must build withCheckOption objects for each partition because those >>> partitions don't appear in the plan otherwise -- but in the UPDATE >>> case, they're already there, so why do we need to build anything at >>> all? Similarly for RETURNING projections. How are the things we need >>> for those cases not already getting built, associated with the >>> relevant resultRelInfos? Maybe there's a concern if some children got >>> pruned - they could turn out later to be the children into which >>> tuples need to be routed. But the patch makes no distinction >>> between possibly-pruned children and any others. >> >> Yes, only a subset of the partitions appear in the UPDATE subplans. I >> think typically for updates, a very small subset of the total leaf >> partitions will be there in the plans, others would get pruned. IMHO, >> it would not be worth having an optimization where it opens only those >> leaf partitions which are not already there in the subplans. Without >> the optimization, we are able to re-use the INSERT infrastructure >> without additional changes. > > Well, that is possible, but certainly not guaranteed. I mean, > somebody could do a whole-table UPDATE, or an UPDATE that hits a > smattering of rows in every partition; I am not saying that it's guaranteed to be a small subset. I am saying that it would be typically a small subset for update-of-partitioned-key case. Seems weird if a user causes an update-row-movement for multiple partitions at the same time. Generally it would be an administrative task where some/all of the rows of a partition need to have their partition key updated that cause them to change their partition, and so there would be probably a where clause that would narrow down the update to that particular partition, because without the where clause the update is anyway slower and it's redundant to scan all other partitions. But, point taken, that there can always be certain cases involving multiple table partition-key updates. > e.g. the table is partitioned on order number, and you do UPDATE > lineitem SET product_code = 'K372B' WHERE product_code = 'K372'. This query does not update order number, so here there is no partition-key-update. Are you thinking that the patch is generating the per-leaf-partition WCO expressions even for a update not involving a partition key ? > > Leaving that aside, the point here is that you're rebuilding > withCheckOptions and returningLists that have already been built in > the planner. That's bad for two reasons. First, it's inefficient, > especially if there are many partitions. Yeah, I agree that this becomes more and more redundant if the update involves more partitions. > Second, it will amount to a functional bug if you get a > different answer than the planner did. Actually, the per-leaf WCOs are meant to be executed on the destination partitions where the tuple is moved, while the WCOs belonging to the per-subplan resultRelInfo are meant for the resultRelinfo used for the UPDATE plans. So actually it should not matter whether they look same or different, because they are fired at different objects. Now these objects can happen to be the same relations though. 
But in any case, it's not clear to me how the mapped WCO and the planner's WCO would yield a different answer if they are both the same relation. I am possibly missing something. The planner has already generated the withCheckOptions for each of the resultRelInfo. And then we are using one of those to re-generate the WCO for a leaf partition by only adjusting the attnos. If there is already a WCO generated in the planner for that leaf partition (because that partition was present in mtstate->resultRelInfo), then the re-built WCO should be exactly look same as the earlier one, because they are the same relations, and so the attnos generated in them would be same since the Relation TupleDesc is the same. > Note this comment in the existing code: > > /* > * Build WITH CHECK OPTION constraints for each leaf partition rel. Note > * that we didn't build the withCheckOptionList for each partition within > * the planner, but simple translation of the varattnos for each partition > * will suffice. This only occurs for the INSERT case; UPDATE/DELETE > * cases are handled above. > */ > > The comment "UPDATE/DELETE cases are handled above" is referring to > the code that initializes the WCOs generated by the planner. You've > modified the comment in your patch, but the associated code: your > updated comment says that only "DELETEs and local UPDATES are handled > above", but in reality, *all* updates are still handled above. And Actually I meant, "above works for only local updates. For row-movement-updates, we need per-leaf partition WCOs, because when the row is inserted into target partition, that partition may be not be included in the above planner resultRelInfo, so we need WCOs for all partitions". I think this said comment should be sufficient if I add this in the code ? > then they are handled again here. > Similarly for returning lists. > It's certainly not OK for the comment to be inaccurate, but I think > it's also bad to redo the work which the planner has already done, > even if it makes the patch smaller. > > Also, I feel like it's probably not correct to use the first result > relation as the nominal relation for building WCOs and returning lists > anyway. I mean, if the first result relation has a different column > order than the parent relation, isn't this just broken? If it works > for some reason, the comments don't explain what that reason is. Not sure why parent relation should come into picture. As long as the first result relation belongs to one of the partitions in the whole partition tree, we should be able to use that to build WCOs of any other partitions, because they have a common set of attributes having the same name. So we are bound to find each of the attributes of first resultRelInfo in the other leaf partitions during attno mapping. > Some other things: > > + * The row was already deleted by a concurrent DELETE. So we don't > + * have anything to update. > > I find this explanation, and the surrounding comments, inadequate. It > doesn't really explain why we're doing this. I think it should say > something like this: For a normal UPDATE, the case where the tuple has > been the subject of a concurrent UPDATE or DELETE would be handled by > the EvalPlanQual machinery, but for an UPDATE that we've translated > into a DELETE from this partition and an INSERT into some other > partition, that's not available, because CTID chains can't span > relation boundaries. We mimic the semantics to a limited extent by > skipping the INSERT if the DELETE fails to find a tuple. 
This ensures > that two concurrent attempts to UPDATE the same tuple at the same time > can't turn one tuple into two, and that an UPDATE of a just-deleted > tuple can't resurrect it. Thanks, will put that comment in the next patch. > > + bool partition_check_passed_with_trig_tuple; > + > + partition_check_passed = > + (resultRelInfo->ri_PartitionCheck && > + ExecPartitionCheck(resultRelInfo, slot, estate)); > + > + partition_check_passed_with_trig_tuple = > + (resultRelInfo->ri_PartitionCheck && > + ExecPartitionCheck(resultRelInfo, trig_slot, estate)); > + if (partition_check_passed) > + { > + /* > + * If it's the trigger that is causing partition constraint > + * violation, abort. We don't want a trigger to cause tuple > + * routing. > + */ > + if (!partition_check_passed_with_trig_tuple) > + ExecPartitionCheckEmitError(resultRelInfo, > + trig_slot, estate); > + } > + else > + { > + /* > + * Partition constraint failed with original NEW tuple. But the > + * trigger might even have modifed the tuple such that it fits > + * back into the partition. So partition constraint check > + * should be based on *final* NEW tuple. > + */ > + partition_check_passed = > partition_check_passed_with_trig_tuple; > + } > > Maybe I inadvertently gave the contrary impression in some prior > review, but this logic doesn't seem right to me. I don't think > there's any problem with a BR UPDATE trigger causing tuple routing. > What I want to avoid is repeatedly rerouting the same tuple, but I > don't think that could happen even without this guard. We've now fixed > insert tuple routing so that a BR INSERT trigger can't cause the > partition constraint to be violated (cf. commit > 15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for > update tuple routing to trigger additional BR UPDATE triggers. So I > don't see the point of checking the constraints twice here. I think > what you want to do is get rid of all the changes here and instead > adjust the logic just before ExecConstraints() to invoke > ExecPartitionCheck() on the post-trigger version of the tuple. When I came up with this code, the intention was to make sure BR UPDATE trigger does not cause tuple routing. But yeah, I can't recall what made me think that the above changes would be needed to prevent BR UPDATE trigger from causing tuple routing. With the latest code, it indeed looks like we can get rid of these changes, and still prevent that. BTW, that code was not to avoid repeated re-routing. Above, you seem to say that there's no problem with BR UPDATE trigger causing the tuple routing. But, when none of the partition-key columns are used in UPDATE, we don't set up for update-tuple-routing, so with no partition-key update, tuple routing will not occur even if BR UPDATE trigger would have caused UPDATE tuple routing. This is one restriction we have to live with because we beforehand decide whether to do the tuple-routing setup based on the columns modified in the UPDATE query. > > Parenthetically, if we decided to keep this logic as you have it, the > code that sets partition_check_passed and > partition_check_passed_with_trig_tuple doesn't need to check > resultRelInfo->ri_PartitionCheck because the surrounding "if" block > already did. Yes. 
> > + for (i = 0; i < num_rels; i++) > + { > + ResultRelInfo *resultRelInfo = &result_rels[i]; > + Relation rel = resultRelInfo->ri_RelationDesc; > + Bitmapset *expr_attrs = NULL; > + > + pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs); > + > + /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */ > + if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate))) > + return true; > + } > > This seems like an awfully expensive way of performing this test. > Under what circumstances could this be true for some result relations > and false for others; One resultRelinfo can have no partition key column used in its quals, but the next resultRelinfo can have quite different quals, and these quals can have partition key referred. This is possible if the two of them have different parents that have different partition-key columns. > or in other words, why do we have to loop over all of the result > relations? It seems to me that the user has typed something like: > > UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ... > AND thunk = ... > > If either thingy or whatsit is a partitioning column, UPDATE tuple > routing might be needed So, in the above code, bms_overlap() would return true if either thingy or whatsit is a partitioning column. > - and it should be able to test that by a > *single* comparison between the set of columns being updated and the > partitioning columns, without needing to repeat for every partitions. If bms_overlap() returns true for the very first resultRelinfo, it will return immediately. But yes, if there are no relations using partition key, we will have to scan all of these relations. But again, note that these are pruned leaf partitions, they typically will not contain all the leaf partitions. > Perhaps that test needs to be done at plan time and saved in the plan, > rather than performed here -- or maybe it's easy enough to do it here. Hmm, it looks convenient here because mtstate->resultRelInfo gets set only here. > > One problem is that, if BR UPDATE triggers are in fact allowed to > cause tuple routing as I proposed above, the presence of a BR UPDATE > trigger for any partition could necessitate UPDATE tuple routing for > queries that wouldn't otherwise need it. You mean always setup update tuple routing if there's a BR UPDATE trigger ? Actually I was going for disallowing BR update trigger to initiate tuple routing, as I described above. > But even if you end up > inserting a test for that case, it can surely be a lot cheaper than > this, I didn't exactly get why the bitmap_overlap() test needs to be compared with the presence-of-trigger test. > since it only involves checking a boolean flag, not a bitmapset. > It could be argue that we ought to prohibit BR UPDATE triggers from > causing tuple routing so that we don't have to do this test at all, > but I'm not sure that's a good trade-off. > It seems to necessitate checking the partition constraint twice per > tuple instead of once per tuple, which like a very heavy price. I think I didn't quite understand this paragraph as a whole. Can you state the trade-off here again ? > > +#define GetUpdatedColumns(relinfo, estate) \ > + (rt_fetch((relinfo)->ri_RangeTableIndex, > (estate)->es_range_table)->updatedCols) > > I think this should be moved to a header file (and maybe turned into a > static inline function) rather than copy-pasting the definition into a > new file. Will do that. 
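For reference, moving the macro into a header as a static inline function is a small change; a sketch (the choice of header, e.g. executor/executor.h, is just an assumption here):

    #include "nodes/execnodes.h"
    #include "parser/parsetree.h"

    /* Columns the query updates for this result relation's range table entry. */
    static inline Bitmapset *
    GetUpdatedColumns(ResultRelInfo *relinfo, EState *estate)
    {
        return rt_fetch(relinfo->ri_RangeTableIndex,
                        estate->es_range_table)->updatedCols;
    }

This keeps a single definition usable from both nodeModifyTable.c and any other file that needs it.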
> > - List *mapped_wcoList; > + List *mappedWco; > List *wcoExprs = NIL; > ListCell *ll; > > - /* varno = node->nominalRelation */ > - mapped_wcoList = map_partition_varattnos(wcoList, > - node->nominalRelation, > - partrel, rel); > - foreach(ll, mapped_wcoList) > + mappedWco = map_partition_varattnos(firstWco, firstVarno, > + partrel, firstResultRel); > + foreach(ll, mappedWco) > { > WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll)); > ExprState *wcoExpr = ExecInitQual(castNode(List, wco->qual), > - plan); > + &mtstate->ps); > > wcoExprs = lappend(wcoExprs, wcoExpr); > } > > - resultRelInfo->ri_WithCheckOptions = mapped_wcoList; > + resultRelInfo->ri_WithCheckOptions = mappedWco; > > Renaming the variable looks fairly pointless, unless I'm missing something? We are converting from firstWco to mappedWco. So firstWco => mappedWco looks more natural pairing than firstWco => mapped_wcoList. And I renamed wcoList to firstWco because I wanted to emphasize that is the first WCO out of the node->withCheckOptionLists. In the existing code, it was only for INSERT; withCheckOptionLists was a single element list, so firstWco name didn't sound suitable, but with multiple elements, it is essential to have it named firstWco so as to emphasize that we take the first one irrespective of whether it is UPDATE or INSERT. > > Regarding the tests, it seems like you've got a test case where you > update a sub-partition and it fails because the tuple would need to be > moved out of a sub-tree, which is good. But I think it would also be > good to have a case where you update a sub-partition and it succeeds > in moving the tuple within the subtree. I don't see one like that > presently; it seems all the others update the topmost root or the > leaf. I also think it would be a good idea to make sub_parted's > column order different from both list_parted and its own children, and > maybe use a diversity of data types (e.g. int4, int8, text instead of > making everything int). > > +select tableoid::regclass , * from list_parted where a = 2 order by 1; > +update list_parted set b = c + a where a = 2; > +select tableoid::regclass , * from list_parted where a = 2 order by 1; > > The extra space before the comma looks strange. Will do the above changes, thanks. > > Also, please make a habit of checking patches for whitespace errors > using git diff --check. > > [rhaas pgsql]$ git diff --check > src/backend/executor/nodeModifyTable.c:384: indent with spaces. > + tuple, &slot); > src/backend/executor/nodeModifyTable.c:1966: space before tab in indent. > + IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans)); > > You will notice these kinds of things if you read the diff you are > submitting before you press send, because git highlights them in > bright red. That's a good practice for many other reasons, too. Yeah, somehow I think I missed these because I must have checked only the incremental diffs w.r.t. the earlier one where I must have introduced them. Your point is very much true that we should make it a habit to check complete patch with --check option, or apply it myself. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 21 June 2017 at 20:14, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> The comment "UPDATE/DELETE cases are handled above" is referring to >>> the code that initializes the WCOs generated by the planner. You've >>> modified the comment in your patch, but the associated code: your >>> updated comment says that only "DELETEs and local UPDATES are handled >>> above", but in reality, *all* updates are still handled above. And >>> then they are handled again here. Similarly for returning lists. >>> It's certainly not OK for the comment to be inaccurate, but I think >>> it's also bad to redo the work which the planner has already done, >>> even if it makes the patch smaller. >> >> I guess this has to do with the UPDATE turning into DELETE+INSERT. So, it >> seems like WCOs are being initialized for the leaf partitions >> (ResultRelInfos in the mt_partitions array) that are in turn are >> initialized for the aforementioned INSERT. That's why the term "...local >> UPDATEs" in the new comment text. >> >> If that's true, I wonder if it makes sense to apply what would be >> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into >> by calling ExecInsert()? > > I think we probably should apply the insert policy, just as we're > executing the insert trigger.

Yes, the RLS quals should execute during tuple routing according to whether it is an update or whether it has been converted to an insert. I think the tests don't quite test the insert part. Will check.

> >>> Also, I feel like it's probably not correct to use the first result >>> relation as the nominal relation for building WCOs and returning lists >>> anyway. I mean, if the first result relation has a different column >>> order than the parent relation, isn't this just broken? If it works >>> for some reason, the comments don't explain what that reason is. >> >> Yep, it's more appropriate to use >> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That >> is, if answer to the question I raised above is positive.

From what I had checked earlier when coding that part, rootResultRelInfo is NULL in case of inserts, unless something has changed in later commits. That's the reason I decided to use the first resultRelInfo. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, Jun 21, 2017 at 1:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> e.g. the table is partitioned on order number, and you do UPDATE >> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'. > > This query does not update order number, so here there is no > partition-key-update. Are you thinking that the patch is generating > the per-leaf-partition WCO expressions even for a update not involving > a partition key ? No, it just wasn't a great example. Sorry. >> Second, it will amount to a functional bug if you get a >> different answer than the planner did. > > Actually, the per-leaf WCOs are meant to be executed on the > destination partitions where the tuple is moved, while the WCOs > belonging to the per-subplan resultRelInfo are meant for the > resultRelinfo used for the UPDATE plans. So actually it should not > matter whether they look same or different, because they are fired at > different objects. Now these objects can happen to be the same > relations though. > > But in any case, it's not clear to me how the mapped WCO and the > planner's WCO would yield a different answer if they are both the same > relation. I am possibly missing something. The planner has already > generated the withCheckOptions for each of the resultRelInfo. And then > we are using one of those to re-generate the WCO for a leaf partition > by only adjusting the attnos. If there is already a WCO generated in > the planner for that leaf partition (because that partition was > present in mtstate->resultRelInfo), then the re-built WCO should be > exactly look same as the earlier one, because they are the same > relations, and so the attnos generated in them would be same since the > Relation TupleDesc is the same. If the planner's WCOs and mapped WCOs are always the same, then I think we should try to avoid generating both. If they can be different, but that's intentional and correct, then there's no substantive problem with the patch but the comments need to make it clear why we are generating both. > Actually I meant, "above works for only local updates. For > row-movement-updates, we need per-leaf partition WCOs, because when > the row is inserted into target partition, that partition may be not > be included in the above planner resultRelInfo, so we need WCOs for > all partitions". I think this said comment should be sufficient if I > add this in the code ? Let's not get too focused on updating the comment until we are in agreement about what the code ought to be doing. I'm not clear whether you accept the point that the patch needs to be changed to avoid generating the same WCOs and returning lists in both the planner and the executor. >> Also, I feel like it's probably not correct to use the first result >> relation as the nominal relation for building WCOs and returning lists >> anyway. I mean, if the first result relation has a different column >> order than the parent relation, isn't this just broken? If it works >> for some reason, the comments don't explain what that reason is. > > Not sure why parent relation should come into picture. As long as the > first result relation belongs to one of the partitions in the whole > partition tree, we should be able to use that to build WCOs of any > other partitions, because they have a common set of attributes having > the same name. So we are bound to find each of the attributes of first > resultRelInfo in the other leaf partitions during attno mapping. 
Well, at least for returning lists, we've got to generate the returning lists so that they all match the column order of the parent, not the parent's first child. Otherwise, for example, UPDATE parent_table ... RETURNING * will not work correctly. The tuples returned by the returning clause have to have the attribute order of parent_table, not the attribute order of parent_table's first child. I'm not sure whether WCOs have the same issue, but it's not clear to me why they wouldn't: they contain a qual which is an expression tree, and presumably there are Var nodes in there someplace, and if so, then they have varattnos that have to be right for the purpose for which they're going to be used. >> + for (i = 0; i < num_rels; i++) >> + { >> + ResultRelInfo *resultRelInfo = &result_rels[i]; >> + Relation rel = resultRelInfo->ri_RelationDesc; >> + Bitmapset *expr_attrs = NULL; >> + >> + pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs); >> + >> + /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */ >> + if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate))) >> + return true; >> + } >> >> This seems like an awfully expensive way of performing this test. >> Under what circumstances could this be true for some result relations >> and false for others; > > One resultRelinfo can have no partition key column used in its quals, > but the next resultRelinfo can have quite different quals, and these > quals can have partition key referred. This is possible if the two of > them have different parents that have different partition-key columns. Hmm, true. So if we have a table foo that is partitioned by list (a), and one of its children is a table bar that is partitioned by list (b), then we need to consider doing tuple-routing if either column a is modified, or if column b is modified for a partition which is a descendant of bar. But visiting that only requires looking at the partitioned table and those children that are also partitioned, not all of the leaf partitions as the patch does. >> - and it should be able to test that by a >> *single* comparison between the set of columns being updated and the >> partitioning columns, without needing to repeat for every partitions. > > If bms_overlap() returns true for the very first resultRelinfo, it > will return immediately. But yes, if there are no relations using > partition key, we will have to scan all of these relations. But again, > note that these are pruned leaf partitions, they typically will not > contain all the leaf partitions. But they might, and then this will be inefficient. Just because the patch doesn't waste many cycles in the case where most partitions are pruned doesn't mean that it's OK for it to waste cycles when few partitions are pruned. >> One problem is that, if BR UPDATE triggers are in fact allowed to >> cause tuple routing as I proposed above, the presence of a BR UPDATE >> trigger for any partition could necessitate UPDATE tuple routing for >> queries that wouldn't otherwise need it. > > You mean always setup update tuple routing if there's a BR UPDATE > trigger ? Yes. > Actually I was going for disallowing BR update trigger to > initiate tuple routing, as I described above. I know that! But as I said before, they requires evaluating every partition key constraint twice per tuple, which seems very expensive. I'm very doubtful that's a good approach. 
>> But even if you end up >> inserting a test for that case, it can surely be a lot cheaper than >> this, > > I didn't exactly get why the bitmap_overlap() test needs to be > compared with the presence-of-trigger test. My point was: If you always set up tuple routing when a BR UPDATE trigger is present, then you don't need to check the partition constraint twice per tuple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> Yep, it's more appropriate to use >>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That >>> is, if answer to the question I raised above is positive. > > From what I had checked earlier when coding that part, > rootResultRelInfo is NULL in case of inserts, unless something has > changed in later commits. That's the reason I decided to use the first > resultRelInfo. We're just going around in circles here. Saying that you decided to use the first child's resultRelInfo because you didn't have a resultRelInfo for the parent is an explanation of why you wrote the code the way you did, but that doesn't make it correct. I want to know why you think it's correct. I think it's probably wrong, because it seems to me that if the INSERT code needs to use the parent's ResultRelInfo rather than the first child's ResultRelInfo, the UPDATE code probably needs to do the same. Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of resultRelInfos for non-leaf partitions, and commit e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back for the topmost parent, because otherwise it didn't work correctly. If every partition in the hierarchy has a different attribute ordering, then it seems to me that it must surely matter which of those attribute orderings we pick. It's hard to imagine that we can pick *either* the parent's attribute ordering *or* that of the first child and nothing will be different - the attribute numbers inside the returning lists and WCOs we create have got to get used somehow, so surely it matters which attribute numbers we use, doesn't it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote: >>> Second, it will amount to a functional bug if you get a >>> different answer than the planner did. >> >> Actually, the per-leaf WCOs are meant to be executed on the >> destination partitions where the tuple is moved, while the WCOs >> belonging to the per-subplan resultRelInfo are meant for the >> resultRelinfo used for the UPDATE plans. So actually it should not >> matter whether they look same or different, because they are fired at >> different objects. Now these objects can happen to be the same >> relations though. >> >> But in any case, it's not clear to me how the mapped WCO and the >> planner's WCO would yield a different answer if they are both the same >> relation. I am possibly missing something. The planner has already >> generated the withCheckOptions for each of the resultRelInfo. And then >> we are using one of those to re-generate the WCO for a leaf partition >> by only adjusting the attnos. If there is already a WCO generated in >> the planner for that leaf partition (because that partition was >> present in mtstate->resultRelInfo), then the re-built WCO should be >> exactly look same as the earlier one, because they are the same >> relations, and so the attnos generated in them would be same since the >> Relation TupleDesc is the same. > > If the planner's WCOs and mapped WCOs are always the same, then I > think we should try to avoid generating both. If they can be > different, but that's intentional and correct, then there's no > substantive problem with the patch but the comments need to make it > clear why we are generating both. > >> Actually I meant, "above works for only local updates. For >> row-movement-updates, we need per-leaf partition WCOs, because when >> the row is inserted into target partition, that partition may be not >> be included in the above planner resultRelInfo, so we need WCOs for >> all partitions". I think this said comment should be sufficient if I >> add this in the code ? > > Let's not get too focused on updating the comment until we are in > agreement about what the code ought to be doing. I'm not clear > whether you accept the point that the patch needs to be changed to > avoid generating the same WCOs and returning lists in both the planner > and the executor. Yes, we can re-use the WCOs generated in the planner, as an optimization, since those we re-generate for the same relations will look exactly the same. The WCOs generated by planner (in inheritance_planner) are generated when (in adjust_appendrel_attrs()) we change attnos used in the query to refer to the child RTEs and this adjusts the attnos of the WCOs of the child RTEs. So the WCOs of subplan resultRelInfo are actually the parent table WCOs, but only the attnos changed. And in ExecInitModifyTable() we do the same thing for leaf partitions, although using different function map_variable_attnos(). > >>> Also, I feel like it's probably not correct to use the first result >>> relation as the nominal relation for building WCOs and returning lists >>> anyway. I mean, if the first result relation has a different column >>> order than the parent relation, isn't this just broken? If it works >>> for some reason, the comments don't explain what that reason is. One thing I didn't mention earlier about the WCOs, is that for child rels, we don't use the WCOs defined for the child rels. We only inherit the WCO expressions defined for the root rel. 
That's the reason they are the same expressions, with only the attnos changed to match the respective relation's tupledesc. If the WCOs of each of the subplan resultRelInfos were different, then it would definitely not be possible to use the first resultRelInfo to generate the other leaf partition WCOs, because the WCO defined for relation A is independent of that defined for relation B. So, since the WCOs of all the relations are actually those of the parent, we only need to adjust the attnos of any one of these resultRelInfos. For example, if the root rel WCO is defined as "col > 5" where col is the 4th column, the expression will look like "var_1.attno_4 > 5". And the WCO that is generated for a subplan resultRelInfo will look something like "var_n.attno_2 > 5" if col is the 2nd column in that table.

All of the above logic assumes that we never use the WCO defined for the child relation. At least that's how it looks from the way we generate WCOs in ExecInitModifyTable() for INSERTs, as well as from the code in inheritance_planner() for UPDATEs. At both these places, we never use the WCOs defined for child tables. So suppose we define the tables and their WCOs like this :

CREATE TABLE range_parted ( a text, b int, c int) partition by range (a, b);
ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
GRANT ALL ON range_parted TO PUBLIC ;
create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
create table part_c_1_100 (b int, c int, a text);
alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
create table part_c_100_200 (c int, a text, b int);
alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
GRANT ALL ON part_c_100_200 TO PUBLIC ;
ALTER TABLE part_c_100_200 ENABLE ROW LEVEL SECURITY;
create policy seeall ON part_c_100_200 as PERMISSIVE for SELECT using ( true);

insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
insert into part_c_100_200 (a, b, c) values ('b', 17, 105);

-- For the root table, allow updates only if NEW.c is an even number.
create policy pu on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);

-- For this table, allow updates only if NEW.c is divisible by 4.
create policy pu on part_c_100_200 for UPDATE USING (true) WITH CHECK (c % 4 = 0);

Now, if we try to update the child table using UPDATE on the root table, it will allow setting c to a number which would otherwise violate the WCO constraint of the child table if the query had been run on the child table directly :

postgres=# set role user1;
SET
postgres=> select tableoid::regclass, * from range_parted where b = 17;
    tableoid    | a | b  |  c
----------------+---+----+-----
 part_c_100_200 | b | 17 | 105

-- root table does not allow updating c to odd numbers
postgres=> update range_parted set c = 107 where a = 'b' and b = 17 ;
ERROR: new row violates row-level security policy for table "range_parted"

-- child table does not allow modifying it to 106 because it is not divisible by 4.
postgres=> update part_c_100_200 set c = 106 where a = 'b' and b = 17 ;
ERROR: new row violates row-level security policy for table "part_c_100_200"

-- But we can update it to 106 by running the update on the root table, because here the child table WCOs won't get used.
postgres=> update range_parted set c = 106 where a = 'b' and b = 17 ;
UPDATE 1
postgres=> select tableoid::regclass, * from range_parted where b = 17;
    tableoid    | a | b  |  c
----------------+---+----+-----
 part_c_100_200 | b | 17 | 106

The same applies for INSERTs. I hope this is expected behaviour. Initially I had found this weird, but then saw that it is consistent for both inserts as well as updates.

>> >> Not sure why parent relation should come into picture. As long as the >> first result relation belongs to one of the partitions in the whole >> partition tree, we should be able to use that to build WCOs of any >> other partitions, because they have a common set of attributes having >> the same name. So we are bound to find each of the attributes of first >> resultRelInfo in the other leaf partitions during attno mapping. > > Well, at least for returning lists, we've got to generate the > returning lists so that they all match the column order of the parent, > not the parent's first child. > Otherwise, for example, UPDATE > parent_table ... RETURNING * will not work correctly. The tuples > returned by the returning clause have to have the attribute order of > parent_table, not the attribute order of parent_table's first child. > I'm not sure whether WCOs have the same issue, but it's not clear to > me why they wouldn't: they contain a qual which is an expression tree, > and presumably there are Var nodes in there someplace, and if so, then > they have varattnos that have to be right for the purpose for which > they're going to be used.

So once we put the attnos right according to the child relation's tupdesc, the rest of the work of generating the final RETURNING expressions as per the root table's column order is taken care of by the returning projection, no ? This scenario is included in the update.sql regression test here :

-- ok (row movement, with subset of rows moved into different partition)
update range_parted set b = b - 6 where c > 116 returning a, b + c;

> >>> + for (i = 0; i < num_rels; i++) >>> + { >>> + ResultRelInfo *resultRelInfo = &result_rels[i]; >>> + Relation rel = resultRelInfo->ri_RelationDesc; >>> + Bitmapset *expr_attrs = NULL; >>> + >>> + pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs); >>> + >>> + /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */ >>> + if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate))) >>> + return true; >>> + } >>> >>> This seems like an awfully expensive way of performing this test. >>> Under what circumstances could this be true for some result relations >>> and false for others; >> >> One resultRelinfo can have no partition key column used in its quals, >> but the next resultRelinfo can have quite different quals, and these >> quals can have partition key referred. This is possible if the two of >> them have different parents that have different partition-key columns. > > Hmm, true. So if we have a table foo that is partitioned by list (a), > and one of its children is a table bar that is partitioned by list > (b), then we need to consider doing tuple-routing if either column a > is modified, or if column b is modified for a partition which is a > descendant of bar. But visiting that only requires looking at the > partitioned table and those children that are also partitioned, not > all of the leaf partitions as the patch does. >

Will give this a thought and get back on it, and on the remaining points.
On 26 June 2017 at 08:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Second, it will amount to a functional bug if you get a >>>> different answer than the planner did. >>> >>> Actually, the per-leaf WCOs are meant to be executed on the >>> destination partitions where the tuple is moved, while the WCOs >>> belonging to the per-subplan resultRelInfo are meant for the >>> resultRelinfo used for the UPDATE plans. So actually it should not >>> matter whether they look same or different, because they are fired at >>> different objects. Now these objects can happen to be the same >>> relations though. >>> >>> But in any case, it's not clear to me how the mapped WCO and the >>> planner's WCO would yield a different answer if they are both the same >>> relation. I am possibly missing something. The planner has already >>> generated the withCheckOptions for each of the resultRelInfo. And then >>> we are using one of those to re-generate the WCO for a leaf partition >>> by only adjusting the attnos. If there is already a WCO generated in >>> the planner for that leaf partition (because that partition was >>> present in mtstate->resultRelInfo), then the re-built WCO should be >>> exactly look same as the earlier one, because they are the same >>> relations, and so the attnos generated in them would be same since the >>> Relation TupleDesc is the same. >> >> If the planner's WCOs and mapped WCOs are always the same, then I >> think we should try to avoid generating both. If they can be >> different, but that's intentional and correct, then there's no >> substantive problem with the patch but the comments need to make it >> clear why we are generating both. >> >>> Actually I meant, "above works for only local updates. For >>> row-movement-updates, we need per-leaf partition WCOs, because when >>> the row is inserted into target partition, that partition may be not >>> be included in the above planner resultRelInfo, so we need WCOs for >>> all partitions". I think this said comment should be sufficient if I >>> add this in the code ? >> >> Let's not get too focused on updating the comment until we are in >> agreement about what the code ought to be doing. I'm not clear >> whether you accept the point that the patch needs to be changed to >> avoid generating the same WCOs and returning lists in both the planner >> and the executor. > > Yes, we can re-use the WCOs generated in the planner, as an > optimization, since those we re-generate for the same relations will > look exactly the same. The WCOs generated by planner (in > inheritance_planner) are generated when (in adjust_appendrel_attrs()) > we change attnos used in the query to refer to the child RTEs and this > adjusts the attnos of the WCOs of the child RTEs. So the WCOs of > subplan resultRelInfo are actually the parent table WCOs, but only the > attnos changed. And in ExecInitModifyTable() we do the same thing for > leaf partitions, although using different function > map_variable_attnos(). In attached patch v12, during UPDATE tuple routing setup, for each leaf partition, we now check if it is present already in one of the UPDATE per-subplan resultrels. If present, we re-use them rather than creating a new one and opening the table again. So the mtstate->mt_partitions is now an array of ResultRelInfo pointers. That pointer points to either the UPDATE per-subplan result rel, or a newly allocated ResultRelInfo. 
For each of the leaf partitions, we have to search through the per-subplan resultRelInfo oids to check if there is a match. To do this, I have created a temporary hash table which stores oids and the ResultRelInfo pointers of mtstate->resultRelInfo array, and which can be used to search the oid for each of the leaf partitions. This patch version has handled only the above discussion point. I will follow up with the other points separately. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
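For what it's worth, the kind of temporary OID-to-ResultRelInfo lookup table described above can be built with dynahash; a rough sketch follows (the struct, variable names, and hash table name are only illustrative, not necessarily what the v12 patch uses):

    typedef struct SubplanResultRelHashElem
    {
        Oid            relid;   /* hash key: OID of the result relation */
        ResultRelInfo *rri;     /* already-initialized per-subplan result rel */
    } SubplanResultRelHashElem;

    HASHCTL     ctl;
    HTAB       *htab;
    int         i;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(SubplanResultRelHashElem);
    ctl.hcxt = CurrentMemoryContext;
    htab = hash_create("update result rels", mtstate->mt_nplans, &ctl,
                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

    /* Populate the hash table with the UPDATE per-subplan result rels. */
    for (i = 0; i < mtstate->mt_nplans; i++)
    {
        ResultRelInfo *rri = &mtstate->resultRelInfo[i];
        Oid            relid = RelationGetRelid(rri->ri_RelationDesc);
        bool           found;
        SubplanResultRelHashElem *elem;

        elem = (SubplanResultRelHashElem *)
            hash_search(htab, &relid, HASH_ENTER, &found);
        elem->rri = rri;
    }

    /*
     * Then, for each leaf partition OID in the routing array, a HASH_FIND
     * lookup either returns the existing ResultRelInfo to reuse, or NULL,
     * in which case a new ResultRelInfo is created and the table opened.
     */

The table can be destroyed (or simply discarded with its memory context) once the mt_partitions array has been filled in.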
Attachment
On 22 June 2017 at 01:57, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>> Yep, it's more appropriate to use >>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That >>>> is, if answer to the question I raised above is positive. >> >> From what I had checked earlier when coding that part, >> rootResultRelInfo is NULL in case of inserts, unless something has >> changed in later commits. That's the reason I decided to use the first >> resultRelInfo. > > We're just going around in circles here. Saying that you decided to > use the first child's resultRelInfo because you didn't have a > resultRelInfo for the parent is an explanation of why you wrote the > code the way you did, but that doesn't make it correct. I want to > know why you think it's correct.

Yeah, that was just an FYI on how I decided to use the first resultRelInfo; it was not meant to explain why using the first resultRelInfo is correct. I have tried to explain that upthread.

> > I think it's probably wrong, because it seems to me that if the INSERT > code needs to use the parent's ResultRelInfo rather than the first > child's ResultRelInfo, the UPDATE code probably needs to do the same. > Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of > resultRelInfos for non-leaf partitions, and commit > e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back > for the topmost parent, because otherwise it didn't work correctly.

Regarding rootResultRelInfo, it would have been good if it were set for both inserts and updates, but it isn't set for inserts.

For inserts: in ExecInitModifyTable(), ModifyTableState->rootResultRelInfo remains NULL because ModifyTable->rootResultRelIndex is -1 :

/* If modifying a partitioned table, initialize the root table info */
if (node->rootResultRelIndex >= 0)
    mtstate->rootResultRelInfo = estate->es_root_result_relations +
        node->rootResultRelIndex;

ModifyTable->rootResultRelIndex is -1 because it does not get set, since ModifyTable->partitioned_rels is NULL :

/*
 * If the main target relation is a partitioned table, the
 * following list contains the RT indexes of partitioned child
 * relations including the root, which are not included in the
 * above list.  We also keep RT indexes of the roots
 * separately to be identified as such during the executor
 * initialization.
 */
if (splan->partitioned_rels != NIL)
{
    root->glob->nonleafResultRelations =
        list_concat(root->glob->nonleafResultRelations,
                    list_copy(splan->partitioned_rels));
    /* Remember where this root will be in the global list. */
    splan->rootResultRelIndex =
        list_length(root->glob->rootResultRelations);
    root->glob->rootResultRelations =
        lappend_int(root->glob->rootResultRelations,
                    linitial_int(splan->partitioned_rels));
}

ModifyTable->partitioned_rels is NULL because inheritance_planner() does not get called for INSERTs; instead, grouping_planner() gets called :

subquery_planner()
{
    /*
     * Do the main planning.  If we have an inherited target relation, that
     * needs special processing, else go straight to grouping_planner.
     */
    if (parse->resultRelation &&
        rt_fetch(parse->resultRelation, parse->rtable)->inh)
        inheritance_planner(root);
    else
        grouping_planner(root, false, tuple_fraction);
}

Above, inh is false in case of inserts.
Hi Amit, On 2017/06/28 20:43, Amit Khandekar wrote: > In attached patch v12 The patch no longer applies and fails to compile after the following commit was made yesterday: commit 501ed02cf6f4f60c3357775eb07578aebc912d3a Author: Andrew Gierth <rhodiumtoad@postgresql.org> Date: Wed Jun 28 18:55:03 2017 +0100 Fix transition tables for partition/inheritance. Thanks, Amit
On 29 June 2017 at 07:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Amit, > > On 2017/06/28 20:43, Amit Khandekar wrote: >> In attached patch v12 > > The patch no longer applies and fails to compile after the following > commit was made yesterday: > > commit 501ed02cf6f4f60c3357775eb07578aebc912d3a > Author: Andrew Gierth <rhodiumtoad@postgresql.org> > Date: Wed Jun 28 18:55:03 2017 +0100 > > Fix transition tables for partition/inheritance. Thanks for informing Amit. As Thomas mentioned upthread, the above commit already uses a tuple conversion mapping from leaf partition to root partitioned table (mt_transition_tupconv_maps), which serves the same purpose as that of the mapping used in the update-partition-key patch during update tuple routing (mt_resultrel_maps). We need to try to merge these two into a general-purpose mapping array such as mt_leaf_root_maps. I haven't done that in the rebased patch (attached), so currently it has both these mapping fields. For transition tables, this map is per-leaf-partition in case of inserts, whereas it is per-subplan result rel for updates. For update-tuple routing, the mapping is required to be per-subplan. Now, for update-row-movement in presence of transition tables, we would require both per-subplan mapping as well as per-leaf-partition mapping, which can't be done if we have a single mapping field, unless we have some way to identify which of the per-leaf partition mapping elements belong to per-subplan rels. So, it's not immediately possible to merge them. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote: >>> + for (i = 0; i < num_rels; i++) >>> + { >>> + ResultRelInfo *resultRelInfo = &result_rels[i]; >>> + Relation rel = resultRelInfo->ri_RelationDesc; >>> + Bitmapset *expr_attrs = NULL; >>> + >>> + pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs); >>> + >>> + /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */ >>> + if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate))) >>> + return true; >>> + } >>> >>> This seems like an awfully expensive way of performing this test. >>> Under what circumstances could this be true for some result relations >>> and false for others; >> >> One resultRelinfo can have no partition key column used in its quals, >> but the next resultRelinfo can have quite different quals, and these >> quals can have partition key referred. This is possible if the two of >> them have different parents that have different partition-key columns. > > Hmm, true. So if we have a table foo that is partitioned by list (a), > and one of its children is a table bar that is partitioned by list > (b), then we need to consider doing tuple-routing if either column a > is modified, or if column b is modified for a partition which is a > descendant of bar. But visiting that only requires looking at the > partitioned table and those children that are also partitioned, not > all of the leaf partitions as the patch does. The main concern is that the non-leaf partitions are not open (except root), so we would need to open them in order to get the partition key of the parents of update resultrels (or get only the partition key atts and exprs from pg_partitioned_table). There can be multiple approaches to finding partition key columns. Approach 1 : When there are a few update result rels and a large partition tree, we traverse from each of the result rels to their ancestors , and open their ancestors (get_partition_parent()) to get the partition key columns. For result rels having common parents, do this only once. Approach 2 : If there are only a few partitioned tables, and large number of update result rels, it would be easier to just open all the partitioned tables and form the partition key column bitmap out of all their partition keys. If the bitmap does not have updated columns, that's not a partition-key-update. So for typical non-partition-key updates, just opening the partitioned tables will suffice, and so that would not affect performance of normal updates. But if the bitmap has updated columns, we can't conclude that it's a partition-key-update, otherwise it would be false positive. We again need to further check whether the update result rels belong to ancestors that have updated partition keys. Approach 3 : In RelationData, in a new bitmap field (rd_partcheckattrs ?), store partition key attrs that are used in rd_partcheck . Populate this field during generate_partition_qual(). So to conclude, I think, we can do this : Scenario 1 : Only one partitioned table : the root; rest all are leaf partitions. In this case, it is definitely efficient to just check the root partition key, which will be sufficient. Scenario 2 : There are few non-leaf partitioned tables (3-4) : Open those tables, and follow 2nd approach above: If we don't find any updated partition-keys in any of them, well and good. If we do find, failover to approach 3 : For each of the update resultrels, use the new rd_partcheckattrs bitmap to know if it uses any of the updated columns. 
This would be faster than pulling up attrs from the quals like how it was done in the patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
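If we go with approach 3, the relcache side of it is fairly small. A rough sketch follows; rd_partcheckattrs is the proposed new field, so all the names here are tentative, and details such as switching to the relation's cache memory context before building the bitmap are omitted:

    /* In generate_partition_qual(), once rd_partcheck has been built: */
    rel->rd_partcheckattrs = NULL;
    pull_varattnos((Node *) rel->rd_partcheck, 1, &rel->rd_partcheckattrs);

    /* The per-result-rel test then reduces to a cheap bitmap overlap: */
    for (i = 0; i < num_rels; i++)
    {
        ResultRelInfo *resultRelInfo = &result_rels[i];
        Relation       rel = resultRelInfo->ri_RelationDesc;

        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
        if (bms_overlap(rel->rd_partcheckattrs,
                        GetUpdatedColumns(resultRelInfo, estate)))
            return true;
    }

This avoids re-running pull_varattnos() over every partition's rd_partcheck expression for every UPDATE statement.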
On Thu, Jun 29, 2017 at 3:52 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > So to conclude, I think, we can do this : > > Scenario 1 : > Only one partitioned table : the root; rest all are leaf partitions. > In this case, it is definitely efficient to just check the root > partition key, which will be sufficient. > > Scenario 2 : > There are few non-leaf partitioned tables (3-4) : > Open those tables, and follow 2nd approach above: If we don't find any > updated partition-keys in any of them, well and good. If we do find, > failover to approach 3 : For each of the update resultrels, use the > new rd_partcheckattrs bitmap to know if it uses any of the updated > columns. This would be faster than pulling up attrs from the quals > like how it was done in the patch. I think we should just have the planner figure out a list of which columns are partitioning columns either for the named relation or some descendent, and set a flag if that set of columns overlaps the set of columns updated. At execution time, update tuple routing is needed if either that flag is set or if some partition included in the plan has a BR UPDATE trigger. Attached is a draft patch implementing that approach. This could be made more more accurate. Suppose table foo is partitioned by a and some but not all of the partitions partitioned by b. If it so happens that, in a query which only updates b, constraint exclusion eliminates all of the partitions that are subpartitioned by b, it would be unnecessary to enable update tuple routing (unless BR UPDATE triggers are present) but this patch will not figure that out. I don't think that optimization is critical for the first version of this feature; there will be a limited number of users with asymmetrical subpartitioning setups, and if one of them has an idea how to improve this without hurting anything else, they are free to contribute a patch. Other optimizations are possible too, but I don't really see any of them as critical either. I don't think the approach of building a hash table to figure out which result rels have already been created is a good one. That too feels like something that the planner should be figuring out and the executor should just be implementing what the planner decided. I haven't figured out exactly how that should work yet, but it seems like it ought to be doable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
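With that approach, the executor-side decision collapses to a couple of cheap tests in ExecInitModifyTable(). A sketch of the idea is below; partColsUpdated stands for whatever flag the planner stores in the ModifyTable node, so the field name and the surrounding details are assumptions about the draft patch rather than a description of it:

    bool        update_tuple_routing_needed = node->partColsUpdated;
    int         i;

    /*
     * A BR UPDATE trigger on any of the subplan result rels might change
     * the partition key, so routing may be needed even if the query
     * itself doesn't touch the partitioning columns.
     */
    for (i = 0; i < nplans; i++)
    {
        ResultRelInfo *resultRelInfo = &mtstate->resultRelInfo[i];

        if (resultRelInfo->ri_TrigDesc &&
            resultRelInfo->ri_TrigDesc->trig_update_before_row)
        {
            update_tuple_routing_needed = true;
            break;
        }
    }

    if (operation == CMD_UPDATE && update_tuple_routing_needed)
    {
        /* set up partition tuple routing, per-leaf maps, etc. */
    }

So for an UPDATE that touches no partitioning column and has no BR UPDATE triggers, no routing setup is performed at all.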
Attachment
On Fri, Jun 30, 2017 at 12:01 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 29 June 2017 at 07:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Hi Amit, >> >> On 2017/06/28 20:43, Amit Khandekar wrote: >>> In attached patch v12 >> >> The patch no longer applies and fails to compile after the following >> commit was made yesterday: >> >> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a >> Author: Andrew Gierth <rhodiumtoad@postgresql.org> >> Date: Wed Jun 28 18:55:03 2017 +0100 >> >> Fix transition tables for partition/inheritance. > > Thanks for informing Amit. > > As Thomas mentioned upthread, the above commit already uses a tuple > conversion mapping from leaf partition to root partitioned table > (mt_transition_tupconv_maps), which serves the same purpose as that of > the mapping used in the update-partition-key patch during update tuple > routing (mt_resultrel_maps). > > We need to try to merge these two into a general-purpose mapping array > such as mt_leaf_root_maps. I haven't done that in the rebased patch > (attached), so currently it has both these mapping fields. > > For transition tables, this map is per-leaf-partition in case of > inserts, whereas it is per-subplan result rel for updates. For > update-tuple routing, the mapping is required to be per-subplan. Now, > for update-row-movement in presence of transition tables, we would > require both per-subplan mapping as well as per-leaf-partition > mapping, which can't be done if we have a single mapping field, unless > we have some way to identify which of the per-leaf partition mapping > elements belong to per-subplan rels. > > So, it's not immediately possible to merge them. Would make sense to have a set of functions with names like GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays m_convertors_{from,to}_by_{subplan,leaf} the first time they need them? -- Thomas Munro http://www.enterprisedb.com
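Something along the lines Thomas suggests might look like the sketch below. The accessor and array names come from his suggestion, and the arrays are assumed to be allocated lazily on first use; it also glosses over the fact, discussed upthread, that rootResultRelInfo may not be set for plain INSERTs:

    /*
     * Return the map that converts tuples of subplan result rel 'i' to the
     * root parent's rowtype, building the array on first use.  (A real
     * implementation would need to distinguish "not built yet" from "no
     * conversion needed", since convert_tuples_by_name() returns NULL in
     * the latter case.)
     */
    static TupleConversionMap *
    GetConvertorFromSubplan(ModifyTableState *mtstate, int i)
    {
        if (mtstate->m_convertors_from_by_subplan == NULL)
            mtstate->m_convertors_from_by_subplan = (TupleConversionMap **)
                palloc0(mtstate->mt_nplans * sizeof(TupleConversionMap *));

        if (mtstate->m_convertors_from_by_subplan[i] == NULL)
        {
            Relation    partrel = mtstate->resultRelInfo[i].ri_RelationDesc;
            Relation    rootrel = mtstate->rootResultRelInfo->ri_RelationDesc;

            mtstate->m_convertors_from_by_subplan[i] =
                convert_tuples_by_name(RelationGetDescr(partrel),
                                       RelationGetDescr(rootrel),
                                       gettext_noop("could not convert row type"));
        }

        return mtstate->m_convertors_from_by_subplan[i];
    }

Both transition-table capture and update row movement could then go through the same accessors, each paying only for the maps it actually needs.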
On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't think the approach of building a hash table to figure out > which result rels have already been created is a good one. That too > feels like something that the planner should be figuring out and the > executor should just be implementing what the planner decided. I > haven't figured out exactly how that should work yet, but it seems > like it ought to be doable. I was imagining when I wrote the above that the planner should somehow compute a list of relations that it has excluded so that the executor can skip building ResultRelInfos for exactly those relations, but on further study, that's not particularly easy to achieve and wouldn't really save anything anyway, because the list of OIDs is coming straight out of the partition descriptor, so it's pretty much free. However, I still think it would be a nifty idea if we could avoid needing the hash table to deduplicate. The reason we need that is, I think, that expand_inherited_rtentry() is going to expand the inheritance hierarchy in whatever order the scan(s) of pg_inherits return the descendant tables, whereas the partition descriptor is going to put them in a canonical order. But that seems like it wouldn't be too hard to fix: let's have expand_inherited_rtentry() expand the partitioned table in the same order that will be used by ExecSetupPartitionTupleRouting(). That seems pretty easy to do - just have expand_inherited_rtentry() notice that it's got a partitioned table and call RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to produce the list of OIDs. Then - I think - ExecSetupPartitionTupleRouting() doesn't need the hash table; it can just scan through the return value of ExecSetupPartitionTupleRouting() and the list of already-created ResultRelInfo structures in parallel - the order must be the same, but the latter can be missing some elements, so it can just create the missing ones. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
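If the planner expands the partition tree in that same canonical order, the executor-side merge Robert describes is just a linear walk over two consistently ordered lists; roughly, with hypothetical variable names and a hypothetical create_new_result_rel() helper standing in for the usual ResultRelInfo initialization:

    /*
     * leaf_oids: all leaf partition OIDs in partition-descriptor order.
     * subplan_rels: the ResultRelInfos the planner created, a subset of
     * leaf_oids in the same order.  Reuse an existing ResultRelInfo when
     * the OIDs line up; create a new one otherwise.
     */
    ListCell   *lc;
    int         next_subplan = 0;

    foreach(lc, leaf_oids)
    {
        Oid             leafoid = lfirst_oid(lc);
        ResultRelInfo  *leaf_rri;

        if (next_subplan < nplans &&
            RelationGetRelid(subplan_rels[next_subplan].ri_RelationDesc) == leafoid)
            leaf_rri = &subplan_rels[next_subplan++];   /* already initialized */
        else
            leaf_rri = create_new_result_rel(leafoid);  /* hypothetical helper */

        /* stash leaf_rri in the tuple-routing array */
    }

No hash table is needed because the subplan result rels can never appear out of order relative to the canonical list.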
On 2017/07/02 20:10, Robert Haas wrote: > On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't think the approach of building a hash table to figure out >> which result rels have already been created is a good one. That too >> feels like something that the planner should be figuring out and the >> executor should just be implementing what the planner decided. I >> haven't figured out exactly how that should work yet, but it seems >> like it ought to be doable. > > I was imagining when I wrote the above that the planner should somehow > compute a list of relations that it has excluded so that the executor > can skip building ResultRelInfos for exactly those relations, but on > further study, that's not particularly easy to achieve and wouldn't > really save anything anyway, because the list of OIDs is coming > straight out of the partition descriptor, so it's pretty much free. > However, I still think it would be a nifty idea if we could avoid > needing the hash table to deduplicate. The reason we need that is, I > think, that expand_inherited_rtentry() is going to expand the > inheritance hierarchy in whatever order the scan(s) of pg_inherits > return the descendant tables, whereas the partition descriptor is > going to put them in a canonical order. > > But that seems like it wouldn't be too hard to fix: let's have > expand_inherited_rtentry() expand the partitioned table in the same > order that will be used by ExecSetupPartitionTupleRouting(). That > seems pretty easy to do - just have expand_inherited_rtentry() notice > that it's got a partitioned table and call > RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to > produce the list of OIDs. Then - I think - > ExecSetupPartitionTupleRouting() doesn't need the hash table; it can > just scan through the return value of ExecSetupPartitionTupleRouting() > and the list of already-created ResultRelInfo structures in parallel - > the order must be the same, but the latter can be missing some > elements, so it can just create the missing ones. Interesting idea. If we are going to do this, I think we may need to modify RelationGetPartitionDispatchInfo() a bit or invent an alternative that does not do as much work. Currently, it assumes that it's only ever called by ExecSetupPartitionTupleRouting() and hence also generates PartitionDispatchInfo objects for partitioned child tables. We don't need that if called from within the planner. Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled with its usage within the executor, because there is this comment: /* * We keep the partitioned ones open until we're done using the * information being collected here(for example, see * ExecEndModifyTable). */ Thanks, Amit
On 2017/07/03 18:54, Amit Langote wrote: > On 2017/07/02 20:10, Robert Haas wrote: >> But that seems like it wouldn't be too hard to fix: let's have >> expand_inherited_rtentry() expand the partitioned table in the same >> order that will be used by ExecSetupPartitionTupleRouting(). That's really what I wanted when updating the patch for tuple-routing to foreign partitions. (I don't understand the issue discussed here, though.) >> That >> seems pretty easy to do - just have expand_inherited_rtentry() notice >> that it's got a partitioned table and call >> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >> produce the list of OIDs. Seems like a good idea. > Interesting idea. > > If we are going to do this, I think we may need to modify > RelationGetPartitionDispatchInfo() a bit or invent an alternative that > does not do as much work. Currently, it assumes that it's only ever > called by ExecSetupPartitionTupleRouting() and hence also generates > PartitionDispatchInfo objects for partitioned child tables. We don't need > that if called from within the planner. > > Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled > with its usage within the executor, because there is this comment: > > /* > * We keep the partitioned ones open until we're done using the > * information being collected here (for example, see > * ExecEndModifyTable). > */ Yeah, we need some refactoring work. Is anyone working on that? Best regards, Etsuro Fujita
On 2017/07/04 17:25, Etsuro Fujita wrote: > On 2017/07/03 18:54, Amit Langote wrote: >> On 2017/07/02 20:10, Robert Haas wrote: >>> That >>> seems pretty easy to do - just have expand_inherited_rtentry() notice >>> that it's got a partitioned table and call >>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >>> produce the list of OIDs. > Seems like a good idea. > >> Interesting idea. >> >> If we are going to do this, I think we may need to modify >> RelationGetPartitionDispatchInfo() a bit or invent an alternative that >> does not do as much work. Currently, it assumes that it's only ever >> called by ExecSetupPartitionTupleRouting() and hence also generates >> PartitionDispatchInfo objects for partitioned child tables. We don't need >> that if called from within the planner. >> >> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled >> with its usage within the executor, because there is this comment: >> >> /* >> * We keep the partitioned ones open until we're done using the >> * information being collected here (for example, see >> * ExecEndModifyTable). >> */ > > Yeah, we need some refactoring work. Is anyone working on that? I would like to take a shot at that if someone else hasn't already cooked up a patch. Working on making RelationGetPartitionDispatchInfo() a routine callable from both within the planner and the executor should be a worthwhile effort. Thanks, Amit
On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/07/04 17:25, Etsuro Fujita wrote: >> On 2017/07/03 18:54, Amit Langote wrote: >>> On 2017/07/02 20:10, Robert Haas wrote: >>>> That >>>> seems pretty easy to do - just have expand_inherited_rtentry() notice >>>> that it's got a partitioned table and call >>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >>>> produce the list of OIDs. >> Seems like a good idea. >> >>> Interesting idea. >>> >>> If we are going to do this, I think we may need to modify >>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that >>> does not do as much work. Currently, it assumes that it's only ever >>> called by ExecSetupPartitionTupleRouting() and hence also generates >>> PartitionDispatchInfo objects for partitioned child tables. We don't need >>> that if called from within the planner. >>> >>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled >>> with its usage within the executor, because there is this comment: >>> >>> /* >>> * We keep the partitioned ones open until we're done using the >>> * information being collected here (for example, see >>> * ExecEndModifyTable). >>> */ >> >> Yeah, we need some refactoring work. Is anyone working on that? > > I would like to take a shot at that if someone else hasn't already cooked > up a patch. Working on making RelationGetPartitionDispatchInfo() a > routine callable from both within the planner and the executor should be a > worthwhile effort. What I am currently working on is to see if we can call find_all_inheritors() or find_inheritance_children() instead of generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS(). Possibly we don't have to refactor it completely. find_inheritance_children() needs to return the oids in canonical order. So in find_inheritance_children () need to re-use part of RelationBuildPartitionDesc() where it generates those oids in that order. I am checking this part, and am going to come up with an approach based on findings. Also, need to investigate whether *always* sorting the oids in canonical order is going to be much expensive than the current sorting using oids. But I guess it won't be that expensive. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/07/04 17:25, Etsuro Fujita wrote: >>> On 2017/07/03 18:54, Amit Langote wrote: >>>> On 2017/07/02 20:10, Robert Haas wrote: >>>>> That >>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice >>>>> that it's got a partitioned table and call >>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >>>>> produce the list of OIDs. >>> Seems like a good idea. >>> >>>> Interesting idea. >>>> >>>> If we are going to do this, I think we may need to modify >>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that >>>> does not do as much work. Currently, it assumes that it's only ever >>>> called by ExecSetupPartitionTupleRouting() and hence also generates >>>> PartitionDispatchInfo objects for partitioned child tables. We don't need >>>> that if called from within the planner. >>>> >>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled >>>> with its usage within the executor, because there is this comment: >>>> >>>> /* >>>> * We keep the partitioned ones open until we're done using the >>>> * information being collected here (for example, see >>>> * ExecEndModifyTable). >>>> */ >>> >>> Yeah, we need some refactoring work. Is anyone working on that? >> >> I would like to take a shot at that if someone else hasn't already cooked >> up a patch. Working on making RelationGetPartitionDispatchInfo() a >> routine callable from both within the planner and the executor should be a >> worthwhile effort. > > What I am currently working on is to see if we can call > find_all_inheritors() or find_inheritance_children() instead of > generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS(). > Possibly we don't have to refactor it completely. > find_inheritance_children() needs to return the oids in canonical > order. So in find_inheritance_children () need to re-use part of > RelationBuildPartitionDesc() where it generates those oids in that > order. I am checking this part, and am going to come up with an > approach based on findings. The other approach is to make canonical ordering only in find_all_inheritors() by replacing call to find_inheritance_children() with the refactored/modified RelationGetPartitionDispatchInfo(). But that would mean that the callers of find_inheritance_children() would have one ordering, while the callers of find_all_inheritors() would have a different ordering; that brings up chances of deadlocks. That's why I think, we need to think about modifying the common function find_inheritance_children(), so that we would be consistent with the ordering. And then use find_inheritance_children() or find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes, there would be some modifications to RelationGetPartitionDispatchInfo(). > > Also, need to investigate whether *always* sorting the oids in > canonical order is going to be much expensive than the current sorting > using oids. But I guess it won't be that expensive. > > > -- > Thanks, > -Amit Khandekar > EnterpriseDB Corporation > The Postgres Database Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 4 July 2017 at 15:23, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> On 2017/07/04 17:25, Etsuro Fujita wrote: >>>> On 2017/07/03 18:54, Amit Langote wrote: >>>>> On 2017/07/02 20:10, Robert Haas wrote: >>>>>> That >>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice >>>>>> that it's got a partitioned table and call >>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >>>>>> produce the list of OIDs. >>>> Seems like a good idea. >>>> >>>>> Interesting idea. >>>>> >>>>> If we are going to do this, I think we may need to modify >>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that >>>>> does not do as much work. Currently, it assumes that it's only ever >>>>> called by ExecSetupPartitionTupleRouting() and hence also generates >>>>> PartitionDispatchInfo objects for partitioned child tables. We don't need >>>>> that if called from within the planner. >>>>> >>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled >>>>> with its usage within the executor, because there is this comment: >>>>> >>>>> /* >>>>> * We keep the partitioned ones open until we're done using the >>>>> * information being collected here (for example, see >>>>> * ExecEndModifyTable). >>>>> */ >>>> >>>> Yeah, we need some refactoring work. Is anyone working on that? >>> >>> I would like to take a shot at that if someone else hasn't already cooked >>> up a patch. Working on making RelationGetPartitionDispatchInfo() a >>> routine callable from both within the planner and the executor should be a >>> worthwhile effort. >> >> What I am currently working on is to see if we can call >> find_all_inheritors() or find_inheritance_children() instead of >> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS(). >> Possibly we don't have to refactor it completely. >> find_inheritance_children() needs to return the oids in canonical >> order. So in find_inheritance_children () need to re-use part of >> RelationBuildPartitionDesc() where it generates those oids in that >> order. I am checking this part, and am going to come up with an >> approach based on findings. > > The other approach is to make canonical ordering only in > find_all_inheritors() by replacing call to find_inheritance_children() > with the refactored/modified RelationGetPartitionDispatchInfo(). But > that would mean that the callers of find_inheritance_children() would > have one ordering, while the callers of find_all_inheritors() would > have a different ordering; that brings up chances of deadlocks. That's > why I think, we need to think about modifying the common function > find_inheritance_children(), so that we would be consistent with the > ordering. And then use find_inheritance_children() or > find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes, > there would be some modifications to > RelationGetPartitionDispatchInfo(). > >> >> Also, need to investigate whether *always* sorting the oids in >> canonical order is going to be much expensive than the current sorting >> using oids. But I guess it won't be that expensive. Like I mentioned upthread... in expand_inherited_rtentry(), if we replace find_all_inheritors() with something else that returns oids in canonical order, that will change the order in which children tables get locked, which increases the chance of deadlock. 
That is because the callers of find_all_inheritors() will then lock them in one order, while callers of expand_inherited_rtentry() will lock them in a different order. Even in the current code, I think there is a chance of deadlocks because RelationGetPartitionDispatchInfo() and find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table's children sorted in canonical order (i.e. using the partition bound values), we need to either use the partition bounds to sort the oids the way it is done in RelationBuildPartitionDesc(), or open the parent table and get its Relation->rd_partdesc->oids[], which are already sorted in canonical order. So if we generate oids this way in find_all_inheritors() and find_inheritance_children(), that will generate consistent ordering everywhere. But this method is quite expensive as compared to the way oids are generated and sorted using oid values in find_inheritance_children().

In both expand_inherited_rtentry() and RelationGetPartitionDispatchInfo(), each of the child tables is opened.

So, in both of these functions, what we can do is : call a new function partition_tree_walker() which does the following :
1. Lock the children using the existing order (i.e. sorted by oid values) using the same function find_all_inheritors(). Rename find_all_inheritors() to lock_all_inheritors(... , bool return_oids) which returns the oid list only if requested.
2. And then scan through each of the partitions in canonical order, by opening the parent table, then opening the partition descriptor oids, and then doing whatever needs to be done with that partition rel.

partition_tree_walker() will look something like this :

void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
                           void (*walker_func) (), void *context)
{
    Relation    parentrel;
    List       *rels_list;
    ListCell   *cell;

    (void) lock_all_inheritors(parentOid, lockmode,
                               false /* don't generate oids */);

    parentrel = heap_open(parentOid, NoLock);
    rels_list = append_rel_partition_oids(NIL, parentrel);

    /* Scan through all partitioned rels, and at the
     * same time append their children. */
    foreach(cell, rels_list)
    {
        /* Open partrel without locking; lock_all_inheritors() has locked it */
        Relation partrel = heap_open(lfirst_oid(cell), NoLock);

        /* Append the children of a partitioned rel to the same list
         * that we are iterating on */
        if (RelationGetPartitionDesc(partrel))
            rels_list = append_rel_partition_oids(rels_list, partrel);

        /*
         * Do whatever processing needs to be done on this partrel.
         * The walker function is free to either close the partrel
         * or keep it opened, but it needs to make sure the opened
         * ones are closed later
         */
        walker_func(partrel, context);
    }
}

List *append_rel_partition_oids(List *rel_list, Relation rel)
{
    int i;

    for (i = 0; i < rel->rd_partdesc->nparts; i++)
        rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);

    return rel_list;
}

So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be replaced by partition_tree_walker(parentOid, expand_rte_walker_func) where expand_rte_walker_func() will do all the work done in the for loop for each of the partition rels.

Similarly, in RelationGetPartitionDispatchInfo() the initial part where it uses APPEND_REL_PARTITION_OIDS() can be replaced by partition_tree_walker(rel, dispatch_info_walkerfunc) where dispatch_info_walkerfunc() will generate the oids, or maybe populate the complete PartitionDispatchData structure. The 'pd' variable can be passed as the context to partition_tree_walker(..., context).

Generating the resultrels in canonical order by opening the tables in the above way wouldn't be more expensive than the existing code, because even currently we anyway have to open all the tables in both of these functions.

-Amit Khandekar
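As a rough illustration (hypothetical, not actual patch code) of what a callback for this walker interface might look like, the sketch below only counts leaf partitions and closes each relation, which partition_tree_walker() leaves to the callback:

typedef struct CountLeafContext
{
    int     nleaves;
} CountLeafContext;

static void
count_leaf_walker(Relation partrel, void *context)
{
    CountLeafContext *cxt = (CountLeafContext *) context;

    /* no partition descriptor means this is a leaf partition */
    if (RelationGetPartitionDesc(partrel) == NULL)
        cxt->nleaves++;

    /* keep the lock taken by lock_all_inheritors(); just close the rel */
    heap_close(partrel, NoLock);
}

A call site would then do something like partition_tree_walker(parentOid, AccessShareLock, count_leaf_walker, &cxt).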
On 5 July 2017 at 15:12, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Like I mentioned upthread... in expand_inherited_rtentry(), if we > replace find_all_inheritors() with something else that returns oids in > canonical order, that will change the order in which children tables > get locked, which increases the chance of deadlock. Because, then the > callers of find_all_inheritors() will lock them in one order, while > callers of expand_inherited_rtentry() will lock them in a different > order. Even in the current code, I think there is a chance of > deadlocks because RelationGetPartitionDispatchInfo() and > find_all_inheritors() have different lock ordering. > > Now, to get the oids of a partitioned table children sorted by > canonical ordering, (i.e. using the partition bound values) we need to > either use the partition bounds to sort the oids like the way it is > done in RelationBuildPartitionDesc() or, open the parent table and get > it's Relation->rd_partdesc->oids[] which are already sorted in > canonical order. So if we generate oids using this way in > find_all_inheritors() and find_inheritance_children(), that will > generate consistent ordering everywhere. But this method is quite > expensive as compared to the way oids are generated and sorted using > oid values in find_inheritance_children(). > > In both expand_inherited_rtentry() and > RelationGetPartitionDispatchInfo(), each of the child tables are > opened. > > So, in both of these functions, what we can do is : call a new > function partition_tree_walker() which does following : > 1. Lock the children using the existing order (i.e. sorted by oid > values) using the same function find_all_inheritors(). Rename > find_all_inheritors() to lock_all_inheritors(... , bool return_oids) > which returns the oid list only if requested. > 2. And then scan through each of the partitions in canonical order, by > opening the parent table, then opening the partition descriptor oids, > and then doing whatever needs to be done with that partition rel. > > partition_tree_walker() will look something like this : > > void partition_tree_walker(Oid parentOid, LOCKMODE lockmode, > void (*walker_func) (), void *context) > { > Relation parentrel; > List *rels_list; > ListCell *cell; > > (void) lock_all_inheritors(parentOid, lockmode, > false /* don't generate oids */); > > parentrel = heap_open(parentOid, NoLock); > rels_list = append_rel_partition_oids(NIL, parentrel); > > /* Scan through all partitioned rels, and at the > * same time append their children. */ > foreach(cell, rels_list) > { > /* Open partrel without locking; lock_all_inheritors() has locked it */ > Relation partrel = heap_open(lfirst_oid(cell), NoLock); > > /* Append the children of a partitioned rel to the same list > * that we are iterating on */ > if (RelationGetPartitionDesc(partrel)) > rels_list = append_rel_partition_oids(rels_list, partrel); > > /* > * Do whatever processing needs to be done on this partel. 
> * The walker function is free to either close the partel > * or keep it opened, but it needs to make sure the opened > * ones are closed later > */ > walker_func(partrel, context); > } > } > > List *append_rel_partition_oids(List *rel_list, Relation rel) > { > int i; > for (i = 0; i < rel->rd_partdesc->nparts; i++) > rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]); > > return rel_list; > } > > > So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be > replaced by partition_tree_walker(parentOid, expand_rte_walker_func) > where expand_rte_walker_func() will do all the work done in the for > loop for each of the partition rels. > > Similarly, in RelationGetPartitionDispatchInfo() the initial part > where it uses APPEND_REL_PARTITION_OIDS() can be replaced by > partition_tree_walker(rel, dispatch_info_walkerfunc) where > dispatch_info_walkerfunc() will generate the oids, or may be populate > the complete PartitionDispatchData structure. 'pd' variable can be > passed as context to the partition_tree_walker(..., context) > > Generating the resultrels in canonical order by opening the tables > using the above way wouldn't be more expensive than the existing code, > because even currently we anyways have to open all the tables in both > of these functions. > Attached is a WIP patch (make_resultrels_ordered.patch) that generates the result rels in canonical order. This patch is kept separate from the update-partition-key patch, and can be applied on master branch. In this patch, rather than partition_tree_walker() called with a context, I have provided a function partition_walker_next() using which we iterate over all the partitions in canonical order. partition_walker_next() will take care of appending oids from partition descriptors. Now, to generate consistent oid ordering in RelationGetPartitionDispatchInfo() and expand_inherited_rtentry(), we could have very well skipped using the partition_walker API in expand_inherited_rtentry() and just had it iterate over the partition descriptors the way it is done in RelationGetPartitionDispatchInfo(). But I think it's better to have some common function to traverse the partition tree in consistent order, hence the usage of partition_walker_next() in both expand_inherited_rtentry() and RelationGetPartitionDispatchInfo(). In RelationGetPartitionDispatchInfo(), still, it only uses this function to generate partitioned table list. But even to generate partitioned tables in correct order, it is better to use partition_walker_next(), so that we make sure to finally generate consistent order of leaf oids. I considered the option where RelationGetPartitionDispatchInfo() would directly build the pd[] array over each iteration of partition_walker_next(). But that was turning out to be clumsy, because then we need to keep track of which pd[] element each of the oids would go into by having a current position of pd[]. Rather than this, it is best to keep building of pd array separate, as done in the existing code. Didn't do any renaming for find_all_inheritors(). Just called it in both the functions, and ignored the list returned. Like mentioned upthread, it is important to lock in this order so as to be consistent with the lock ordering in other places where find_inheritance_children() is called. Hence, called find_all_inheritors() in RelationGetPartitionDispatchInfo() as well. Note that this patch does not attempt to make RelationGetPartitionDispatchInfo() work in planner. 
That I think should be done once we finalise how to generate common oid ordering, and is not in the scope of this project. Once I merge this in the update-partition-key patch, in ExecSetupPartitionTupleRouting(), I will be able to search for the leaf partitions in this ordered resultrel list, without having to build a hash table of result rels the way it is currently done in the update-partition-key patch. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
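To make the intended usage concrete, a call site of this iterator-style interface might look roughly like the sketch below; the PartitionWalker type and the exact function signatures are assumptions based on the description above, not the actual patch code:

static void
process_partitions_in_bound_order(Relation rootrel)
{
    PartitionWalker walker;     /* assumed type, per the description above */
    Relation        partrel;

    partition_walker_init(&walker, rootrel);

    /* partitions come back in canonical (bound) order, NULL when done */
    for (partrel = partition_walker_next(&walker);
         partrel != NULL;
         partrel = partition_walker_next(&walker))
    {
        /* ... process one partition rel here ... */
    }
}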
On 13 July 2017 at 22:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached is a WIP patch (make_resultrels_ordered.patch) that generates > the result rels in canonical order. This patch is kept separate from > the update-partition-key patch, and can be applied on master branch.

The attached update-partition-key_v13.patch now contains the make_resultrels_ordered.patch changes. So now that the per-subplan result rels and the leaf partition oids that are generated for tuple routing are both known to have the same (canonical) order, in ExecSetupPartitionTupleRouting() we look for the per-subplan result rels without the need for a hash table. Instead of the hash table, we iterate over the leaf partition oids and at the same time keep shifting a position over the per-subplan resultrels whenever the resultrel at the position is found to be present in the leaf partitions list. Since the two lists are in the same order, we never have to rescan the portion of the lists that has already been scanned.

I considered whether the issue behind this recent commit might be relevant for update tuple-routing as well :

commit f81a91db4d1c2032632aa5df9fc14be24f5fe5ec
Author: Robert Haas <rhaas@postgresql.org>
Date: Mon Jul 17 21:29:45 2017 -0400

    Use a real RT index when setting up partition tuple routing.

Since we know that using a dummy 1 value for tuple routing result rels is not correct, I am checking another possibility : now in the latest patch, the tuple routing partitions would have a mix of a) existing update result-rels, and b) new partition resultrels. The 'b' resultrels would have the RT index of nominalRelation, but the existing 'a' resultrels would have their own different RT indexes. I suspect this might surface an issue similar to the one fixed by the above commit, e.g. with a WITH query having UPDATE subqueries doing tuple routing. Will check that.

This patch also has Robert's changes in the planner to decide whether to do update tuple routing.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
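The lock-step matching described above can be sketched roughly as follows (illustrative only; the function and variable names are assumptions, not the patch's code). Both lists are assumed to already be in canonical partition-bound order:

static void
match_leaves_to_subplans(List *leaf_part_oids, List *subplan_rel_oids)
{
    ListCell   *lc;
    ListCell   *sub = list_head(subplan_rel_oids);

    foreach(lc, leaf_part_oids)
    {
        Oid     leaf_oid = lfirst_oid(lc);

        if (sub != NULL && lfirst_oid(sub) == leaf_oid)
        {
            /* this leaf already has a per-subplan result rel: reuse it */
            sub = lnext(sub);
        }
        else
        {
            /* no per-subplan result rel for this leaf: build a new one */
        }
    }

    /* Because both lists share the same order, 'sub' only ever moves
     * forward, so the matching is a single linear pass. */
}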
On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.
I have applied the attached patch and got the below observation.
Observation : if a join produces multiple output rows for a given row to be modified, I am seeing that it updates the row and also inserts rows into the target table; hence after the update the total row count of the table has increased.
below are steps:
postgres=# create table part_upd (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_upd1 partition of part_upd for values from (minvalue) to (-10);
CREATE TABLE
postgres=# create table part_upd2 partition of part_upd for values from (-10) to (0);
CREATE TABLE
postgres=# create table part_upd3 partition of part_upd for values from (0) to (10);
CREATE TABLE
postgres=# create table part_upd4 partition of part_upd for values from (10) to (maxvalue);
CREATE TABLE
postgres=# insert into part_upd select i,i from generate_series(-30,30,3)i;
INSERT 0 21
postgres=# select count(*) from part_upd;
count
-------
21
(1 row)
postgres=#
postgres=# create table non_part_upd (a int);
CREATE TABLE
postgres=# insert into non_part_upd select i%2 from generate_series(-30,30,5)i;
INSERT 0 13
postgres=# update part_upd t1 set a = (t2.a+10) from non_part_upd t2 where t2.a = t1.b;
UPDATE 7
postgres=# select count(*) from part_upd;
count
-------
27
(1 row)
postgres=# select tableoid::regclass,* from part_upd;
tableoid | a | b
-----------+-----+-----
part_upd1 | -30 | -30
part_upd1 | -27 | -27
part_upd1 | -24 | -24
part_upd1 | -21 | -21
part_upd1 | -18 | -18
part_upd1 | -15 | -15
part_upd1 | -12 | -12
part_upd2 | -9 | -9
part_upd2 | -6 | -6
part_upd2 | -3 | -3
part_upd3 | 3 | 3
part_upd3 | 6 | 6
part_upd3 | 9 | 9
part_upd4 | 12 | 12
part_upd4 | 15 | 15
part_upd4 | 18 | 18
part_upd4 | 21 | 21
part_upd4 | 24 | 24
part_upd4 | 27 | 27
part_upd4 | 30 | 30
part_upd4 | 10 | 0
part_upd4 | 10 | 0
part_upd4 | 10 | 0
part_upd4 | 10 | 0
part_upd4 | 10 | 0
part_upd4 | 10 | 0
part_upd4 | 10 | 0
(27 rows)
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On 25 July 2017 at 15:02, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com> > wrote: >> >> >> Attached update-partition-key_v13.patch now contains this >> make_resultrels_ordered.patch changes. >> > > I have applied attach patch and got below observation. > > Observation : if join producing multiple output rows for a given row to be > modified. I am seeing here it is updating a row and also inserting rows in > target table. hence after update total count of table got incremented.

Thanks for catching this, Rajkumar.

So after the row to be updated has already been moved to another partition, when the next join output row corresponds to that same moved row, that row is now deleted, so ExecDelete()=>heap_delete() gets HeapTupleSelfUpdated, and this is not handled. So even when ExecDelete() finds that the row is already deleted, we still call ExecInsert(), so a new row is inserted.

In ExecDelete(), we should indicate that the row is already deleted. In the existing patch, there is a parameter concurrently_deleted for ExecDelete() which indicates that the row was concurrently deleted. I think we can use this parameter for both of these purposes so as to avoid ExecInsert() in both these scenarios. Will work on a patch.
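A minimal, self-contained illustration of the protocol being described (hypothetical names, plain C, not PostgreSQL code): the delete step reports whether it really removed the old row, and the insert half of the row movement runs only in that case.

#include <stdbool.h>
#include <stdio.h>

/* stands in for ExecDelete(): reports whether it really deleted the row */
static void
delete_step(bool row_already_gone, bool trigger_said_no, bool *delete_skipped)
{
    *delete_skipped = true;     /* assume nothing happened */

    if (row_already_gone || trigger_said_no)
        return;                 /* already deleted, or a BR trigger suppressed it */

    /* ... the actual delete would happen here ... */
    *delete_skipped = false;
}

int
main(void)
{
    bool    delete_skipped;

    /* second join output row for an already-moved row: the row is already gone */
    delete_step(true, false, &delete_skipped);

    if (delete_skipped)
        printf("skip the insert: the old row was not deleted by us\n");
    else
        printf("go ahead and insert the row into the new partition\n");
    return 0;
}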
On Tue, Jul 25, 2017 at 3:54 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
> wrote:
>>
>>
>> Attached update-partition-key_v13.patch now contains this
>> make_resultrels_ordered.patch changes.
>>
>
> I have applied attach patch and got below observation.
>
> Observation : if join producing multiple output rows for a given row to be
> modified. I am seeing here it is updating a row and also inserting rows in
> target table. hence after update total count of table got incremented.
Thanks for catching this Rajkumar.
So after the row to be updated is already moved to another partition,
when the next join output row corresponds to the same row which is
moved, that row is now deleted, so ExecDelete()=>heap_delete() gets
HeapTupleSelfUpdated, and this is not handled. So even when
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), so a new row is inserted. In ExecDelete(), we should
indicate that the row is already deleted. In the existing patch, there
is a parameter concurrenty_deleted for ExecDelete() which indicates
that the row is concurrently deleted. I think we can make this
parameter for both of these purposes so as to avoid ExecInsert() for
both these scenarios. Will work on a patch.
Thanks Amit.
Got one more observation : update ... returning is not working with a whole-row reference. Please take a look.
postgres=# create table part (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_p1 partition of part for values from (minvalue) to (0);
CREATE TABLE
postgres=# create table part_p2 partition of part for values from (0) to (maxvalue);
CREATE TABLE
postgres=# insert into part values (10,1);
INSERT 0 1
postgres=# insert into part values (20,2);
INSERT 0 1
postgres=# update part t1 set a = b returning t1;
ERROR: unexpected whole-row reference found in partition key
On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached is a WIP patch (make_resultrels_ordered.patch) that generates > the result rels in canonical order. This patch is kept separate from > the update-partition-key patch, and can be applied on master branch. Hmm, I like the approach you've taken here in general, but I think it needs cleanup. +typedef struct ParentChild This is a pretty generic name. Pick something more specific and informative. +static List *append_rel_partition_oids(List *rel_list, Relation rel); One could be forgiven for thinking that this function was just going to append OIDs, but it actually appends ParentChild structures, so I think the name needs work. +List *append_rel_partition_oids(List *rel_list, Relation rel) Style. Please pgindent your patches. +#ifdef DEBUG_PRINT_OIDS + print_oids(*leaf_part_oids); +#endif I'd just rip out this debug stuff once you've got this working, but if we keep it, it certainly can't have a name as generic as print_oids() when it's actually doing something with a list of ParentChild structures. Also, it prints names, not OIDs. And DEBUG_PRINT_OIDS is no good for the same reasons. + if (RelationGetPartitionDesc(rel)) + walker->rels_list = append_rel_partition_oids(walker->rels_list, rel); Every place that calls append_rel_partition_oids guards that call with if (RelationGetPartitionDesc(...)). It seems to me that it would be simpler to remove those tests and instead just replace the Assert(partdesc) inside that function with if (!partdesc) return; Is there any real benefit in this "walker" interface? It looks to me like it might be simpler to just change things around so that it returns a list of OIDs, like find_all_inheritors, but generated differently. Then if you want bound-ordering rather than OID-ordering, you just do this: list_free(inhOids); inhOids = get_partition_oids_in_bound_order(rel); That'd remove the need for some if/then logic as you've currently got in get_next_child(). + is_partitioned_resultrel = + (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE + && rti == parse->resultRelation); I suspect this isn't correct for a table that contains wCTEs, because there would in that case be multiple result relations. I think we should always expand in bound order rather than only when it's a result relation. I think for partition-wise join, we're going to want to do it this way for all relations in the query, or at least for all relations in the query that might possibly be able to participate in a partition-wise join. If there are multiple cases that are going to need this ordering, it's hard for me to accept the idea that it's worth the complexity of trying to keep track of when we expanded things in one order vs. another. There are other applications of having things in bound order too, like MergeAppend -> Append strength-reduction (which might not be legal anyway if there are list partitions with multiple, non-contiguous list bounds or if any NULL partition doesn't end up in the right place in the order, but there will be lots of cases where it can work). On another note, did you do anything about the suggestion Thomas made in http://postgr.es/m/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com ? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/07/26 6:07, Robert Haas wrote: > On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Attached is a WIP patch (make_resultrels_ordered.patch) that generates >> the result rels in canonical order. This patch is kept separate from >> the update-partition-key patch, and can be applied on master branch. > > I suspect this isn't correct for a table that contains wCTEs, because > there would in that case be multiple result relations. > > I think we should always expand in bound order rather than only when > it's a result relation. I think for partition-wise join, we're going > to want to do it this way for all relations in the query, or at least > for all relations in the query that might possibly be able to > participate in a partition-wise join. If there are multiple cases > that are going to need this ordering, it's hard for me to accept the > idea that it's worth the complexity of trying to keep track of when we > expanded things in one order vs. another. There are other > applications of having things in bound order too, like MergeAppend -> > Append strength-reduction (which might not be legal anyway if there > are list partitions with multiple, non-contiguous list bounds or if > any NULL partition doesn't end up in the right place in the order, but > there will be lots of cases where it can work). Sorry to be responding this late to the Amit's make_resultrel_ordered patch itself, but I agree that we should teach the planner to *always* expand partitioned tables in the partition bound order. When working on something else, I ended up writing a prerequisite patch that refactors RelationGetPartitionDispatchInfo() to not be too tied to its current usage for tuple-routing, so that it can now be used in the planner (for example, in expand_inherited_rtentry(), instead of find_all_inheritors()). If we could adopt that patch, we can focus on the update partition row movement issues more closely on this thread, rather than the concerns about the order that planner puts partitions into. I checked that we get the same result relation order with both the patches, but I would like to highlight a notable difference here between the approaches taken by our patches. In my patch, I have now taught RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables in the tree, because we need to look at its partition descriptor to collect partition OIDs and bounds. We can defer locking (and opening the relation descriptor of) leaf partitions to a point where planner has determined that the partition will be accessed after all (not pruned), which will be done in a separate patch of course. Sorry again that I didn't share this patch sooner. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/07/26 6:07, Robert Haas wrote: > On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Attached is a WIP patch (make_resultrels_ordered.patch) that generates >> the result rels in canonical order. This patch is kept separate from >> the update-partition-key patch, and can be applied on master branch. Thank you for working on this, Amit! > Hmm, I like the approach you've taken here in general, +1 for the approach. > Is there any real benefit in this "walker" interface? It looks to me > like it might be simpler to just change things around so that it > returns a list of OIDs, like find_all_inheritors, but generated > differently. Then if you want bound-ordering rather than > OID-ordering, you just do this: > > list_free(inhOids); > inhOids = get_partition_oids_in_bound_order(rel); > > That'd remove the need for some if/then logic as you've currently got > in get_next_child(). Yeah, that would make the code much simple, so +1 for Robert's idea. > I think we should always expand in bound order rather than only when > it's a result relation. I think for partition-wise join, we're going > to want to do it this way for all relations in the query, or at least > for all relations in the query that might possibly be able to > participate in a partition-wise join. If there are multiple cases > that are going to need this ordering, it's hard for me to accept the > idea that it's worth the complexity of trying to keep track of when we > expanded things in one order vs. another. There are other > applications of having things in bound order too, like MergeAppend -> > Append strength-reduction (which might not be legal anyway if there > are list partitions with multiple, non-contiguous list bounds or if > any NULL partition doesn't end up in the right place in the order, but > there will be lots of cases where it can work). +1 for that as well. Another benefit from that would be EXPLAIN; we could display partitions for a partitioned table in the same order for Append and ModifyTable (ie, SELECT/UPDATE/DELETE), which I think would make the EXPLAIN result much readable. Best regards, Etsuro Fujita
On 2017/07/25 21:55, Rajkumar Raghuwanshi wrote: > Got one more observation : update... returning is not working with whole > row reference. please take a look. > > postgres=# create table part (a int, b int) partition by range(a); > CREATE TABLE > postgres=# create table part_p1 partition of part for values from > (minvalue) to (0); > CREATE TABLE > postgres=# create table part_p2 partition of part for values from (0) to > (maxvalue); > CREATE TABLE > postgres=# insert into part values (10,1); > INSERT 0 1 > postgres=# insert into part values (20,2); > INSERT 0 1 > postgres=# update part t1 set a = b returning t1; > ERROR: unexpected whole-row reference found in partition key That looks like a bug which exists in HEAD too. I posted a patch in a dedicated thread to address the same [1]. Thanks, Amit [1] https://www.postgresql.org/message-id/9a39df80-871e-6212-0684-f93c83be4097%40lab.ntt.co.jp
On 26 July 2017 at 02:37, Robert Haas <robertmhaas@gmail.com> wrote: > Is there any real benefit in this "walker" interface? It looks to me > like it might be simpler to just change things around so that it > returns a list of OIDs, like find_all_inheritors, but generated > differently. Then if you want bound-ordering rather than > OID-ordering, you just do this: > > list_free(inhOids); > inhOids = get_partition_oids_in_bound_order(rel); > > That'd remove the need for some if/then logic as you've currently got > in get_next_child(). Yes, I had considered that ; i.e., first generating just a list of bound-ordered oids. But that consequently needs all the child tables to be opened and closed twice; once during the list generation, and then while expanding the partitioned table. Agreed, that the second time, heap_open() would not be that expensive because tables would be cached, but still it would require to get the cached relation handle from hash table. Since we anyway want to open the tables, better have a *next() function to go-get the next partition in a fixed order. Actually, there isn't much that the walker next() function does. Any code that wants to traverse bound-wise can do that by its own. The walker function is just a convenient way to make sure everyone traverses in the same order by using this function. Yet to go over other things including your review comments, and Amit Langote's patch on refactoring RelationGetPartitionDispatchInfo(). > On another note, did you do anything about the suggestion Thomas made > in http://postgr.es/m/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com > ? This is still pending on me; plus I think there are some more points. I need to go over those and consolidate a list of todos. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Sorry to be responding this late to the Amit's make_resultrel_ordered > patch itself, but I agree that we should teach the planner to *always* > expand partitioned tables in the partition bound order. Sounds like we have unanimous agreement on that point. Yesterday, I was discussing with Beena Emerson, who is working on run-time partition pruning, that it would also be useful for that purpose, if you're trying to prune based on a range query. > I checked that we get the same result relation order with both the > patches, but I would like to highlight a notable difference here between > the approaches taken by our patches. In my patch, I have now taught > RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables > in the tree, because we need to look at its partition descriptor to > collect partition OIDs and bounds. We can defer locking (and opening the > relation descriptor of) leaf partitions to a point where planner has > determined that the partition will be accessed after all (not pruned), > which will be done in a separate patch of course. That's very desirable, but I believe it introduces a deadlock risk which Amit's patch avoids. A transaction using the code you've written here is eventually going to lock all partitions, BUT it's going to move the partitioned ones to the front of the locking order vs. what find_all_inheritors would do. So, when multi-level partitioning is in use, I think it could happen that some other transaction is accessing the table using a different code path that uses the find_all_inheritors order without modification. If those locks conflict (e.g. query vs. DROP) then there's a deadlock risk. Unfortunately I don't see any easy way around that problem, but maybe somebody else has an idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Sorry to be responding this late to the Amit's make_resultrel_ordered >> patch itself, but I agree that we should teach the planner to *always* >> expand partitioned tables in the partition bound order. > > Sounds like we have unanimous agreement on that point. I too agree. > >> I checked that we get the same result relation order with both the >> patches, but I would like to highlight a notable difference here between >> the approaches taken by our patches. In my patch, I have now taught >> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables >> in the tree, because we need to look at its partition descriptor to >> collect partition OIDs and bounds. We can defer locking (and opening the >> relation descriptor of) leaf partitions to a point where planner has >> determined that the partition will be accessed after all (not pruned), >> which will be done in a separate patch of course. With Amit Langote's patch, we can very well do the locking beforehand by find_all_inheritors(), and then run RelationGetPartitionDispatchInfo() with noLock, so as to remove the deadlock problem. But I think we should keep these two tasks separate, i.e. expanding the partition tree in bound order, and making RelationGetPartitionDispatchInfo() work for the planner. Regarding building the PartitionDispatchInfo in the planner, we should do that only after it is known that partition columns are updated, so it can't be done in expand_inherited_rtentry() because it would be too soon. For planner setup, RelationGetPartitionDispatchInfo() should just build the tupmap for each partitioned table, and then initialize the rest of the fields like tuplslot, reldesc , etc later during execution. So for now, I feel we should just do the changes for making sure the order is same, and then over that, separately modify RelationGetPartitionDispatchInfo() for planner. > > That's very desirable, but I believe it introduces a deadlock risk > which Amit's patch avoids. A transaction using the code you've > written here is eventually going to lock all partitions, BUT it's > going to move the partitioned ones to the front of the locking order > vs. what find_all_inheritors would do. So, when multi-level > partitioning is in use, I think it could happen that some other > transaction is accessing the table using a different code path that > uses the find_all_inheritors order without modification. If those > locks conflict (e.g. query vs. DROP) then there's a deadlock risk. Yes, I agree. Even with single-level partitioning, the leaf partitions ordered by find_all_inheritors() is by oid values, so that's also going to be differently ordered. > > Unfortunately I don't see any easy way around that problem, but maybe > somebody else has an idea. One approach I had considered was to have find_inheritance_children() itself lock the children in bound order, so that everyone will have bound-ordered oids, but that would be too expensive since it requires opening all partitioned tables to initialize partition descriptors. In find_inheritance_children(), we get all oids without opening any tables. But now that I think more of it, it's only the partitioned tables that we have to open, not the leaf partitions; and furthermore, I didn't see calls to find_inheritance_children() and find_all_inheritors() in performance-critical code, except in expand_inherited_rtentry(). 
All of them are in DDL commands; but yes, that can change in the future.

Regarding dynamically locking specific partitions as and when needed, I think this method inherently has the issue of deadlock because the order would be random. So it feels like there is no way around it other than to lock all partitions beforehand.

----------------

Regarding using the first resultrel for mapping RETURNING and WCO, I think we can use (a renamed) getASTriggerResultRelInfo() to get the root result relation, and use the WCO and RETURNING expressions of this relation to do the mapping for child rels. This way, there won't be insert/update-specific code, and we don't need to use the first result relation.

While checking the whole-row bug on the other thread [1], I noticed that the RETURNING/WCO expressions for the per-subplan result rels are formed by considering not just simple vars, but also whole-row vars and other nodes. So for update-tuple-routing, there would be some result-rel WCOs formed using adjust_appendrel_attrs(), while for others, they would be built using map_partition_varattnos() which only considers simple vars. So the bug in [1] would be there for update-partition-key as well, when the tuple is routed into a newly built resultrel. Maybe, while fixing the bug in [1], this might be automatically solved.

----------------

Below are the TODOS at this point :

Fix for bug reported by Rajkumar about update with join.
Do something about two separate mapping tables for Transition tables and update tuple-routing.
GetUpdatedColumns() to be moved to header file.
More test scenarios in regression tests.
Need to check/test whether we are correctly applying insert policies (and not update policies) while inserting a routed tuple.
Use getASTriggerResultRelInfo() for attrno mapping, rather than first resultrel, for generating child WCO/RETURNING expression.
Address Robert's review comments on make_resultrel_ordered.patch.
pgindent.

[1] https://www.postgresql.org/message-id/d86d27ea-cc9d-5dbe-b131-e7dec4017983%40lab.ntt.co.jp

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2017/07/29 2:45, Amit Khandekar wrote: > On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote: >>> I checked that we get the same result relation order with both the >>> patches, but I would like to highlight a notable difference here between >>> the approaches taken by our patches. In my patch, I have now taught >>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables >>> in the tree, because we need to look at its partition descriptor to >>> collect partition OIDs and bounds. We can defer locking (and opening the >>> relation descriptor of) leaf partitions to a point where planner has >>> determined that the partition will be accessed after all (not pruned), >>> which will be done in a separate patch of course. >> >> That's very desirable, but I believe it introduces a deadlock risk >> which Amit's patch avoids. A transaction using the code you've >> written here is eventually going to lock all partitions, BUT it's >> going to move the partitioned ones to the front of the locking order >> vs. what find_all_inheritors would do. So, when multi-level >> partitioning is in use, I think it could happen that some other >> transaction is accessing the table using a different code path that >> uses the find_all_inheritors order without modification. If those >> locks conflict (e.g. query vs. DROP) then there's a deadlock risk. > > Yes, I agree. Even with single-level partitioning, the leaf partitions > ordered by find_all_inheritors() is by oid values, so that's also > going to be differently ordered. We do require to lock the parent first in any case. Doesn't that prevent deadlocks by imparting an implicit order on locking by operations whose locks conflict. Having said that, I think it would be desirable for all code paths to manipulate partitions in the same order. For partitioned tables, I think we can make it the partition bound order by replacing all calls to find_all_inheritors and find_inheritance_children on partitioned table parents with something else that reads partition OIDs from the relcache (PartitionDesc) and traverses the partition tree breadth-first manner. >> Unfortunately I don't see any easy way around that problem, but maybe >> somebody else has an idea. > > One approach I had considered was to have find_inheritance_children() > itself lock the children in bound order, so that everyone will have > bound-ordered oids, but that would be too expensive since it requires > opening all partitioned tables to initialize partition descriptors. In > find_inheritance_children(), we get all oids without opening any > tables. But now that I think more of it, it's only the partitioned > tables that we have to open, not the leaf partitions; and furthermore, > I didn't see calls to find_inheritance_children() and > find_all_inheritors() in performance-critical code, except in > expand_inherited_rtentry(). All of them are in DDL commands; but yes, > that can change in the future. This approach more or less amounts to calling the new RelationGetPartitionDispatchInfo() (per my proposed patch, a version of which I posted upthread.) Maybe we can add a wrapper on top, say, get_all_partition_oids() which throws away other things that RelationGetPartitionDispatchInfo() returned. In addition it locks all the partitions that are returned, unlike only the partitioned ones, which is what RelationGetPartitionDispatchInfo() has been taught to do. 
> Regarding dynamically locking specific partitions as and when needed, > I think this method inherently has the issue of deadlock because the > order would be random. So it feels like there is no way around other > than to lock all partitions beforehand. I'm not sure why the order has to be random. If and when we decide to open and lock a subset of partitions for a given query, it will be done in some canonical order as far as I can imagine. Do you have some specific example in mind? Thanks, Amit
On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/07/29 2:45, Amit Khandekar wrote: >> On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote: >>>> I checked that we get the same result relation order with both the >>>> patches, but I would like to highlight a notable difference here between >>>> the approaches taken by our patches. In my patch, I have now taught >>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables >>>> in the tree, because we need to look at its partition descriptor to >>>> collect partition OIDs and bounds. We can defer locking (and opening the >>>> relation descriptor of) leaf partitions to a point where planner has >>>> determined that the partition will be accessed after all (not pruned), >>>> which will be done in a separate patch of course. >>> >>> That's very desirable, but I believe it introduces a deadlock risk >>> which Amit's patch avoids. A transaction using the code you've >>> written here is eventually going to lock all partitions, BUT it's >>> going to move the partitioned ones to the front of the locking order >>> vs. what find_all_inheritors would do. So, when multi-level >>> partitioning is in use, I think it could happen that some other >>> transaction is accessing the table using a different code path that >>> uses the find_all_inheritors order without modification. If those >>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk. >> >> Yes, I agree. Even with single-level partitioning, the leaf partitions >> ordered by find_all_inheritors() is by oid values, so that's also >> going to be differently ordered. > > We do require to lock the parent first in any case. Doesn't that prevent > deadlocks by imparting an implicit order on locking by operations whose > locks conflict. Yes may be, but I am not too sure at this point. find_all_inheritors() locks only the children, and the parent lock is already locked separately. find_all_inheritors() does not necessitate to lock the children with the same lockmode as the parent. > Having said that, I think it would be desirable for all code paths to > manipulate partitions in the same order. For partitioned tables, I think > we can make it the partition bound order by replacing all calls to > find_all_inheritors and find_inheritance_children on partitioned table > parents with something else that reads partition OIDs from the relcache > (PartitionDesc) and traverses the partition tree breadth-first manner. > >>> Unfortunately I don't see any easy way around that problem, but maybe >>> somebody else has an idea. >> >> One approach I had considered was to have find_inheritance_children() >> itself lock the children in bound order, so that everyone will have >> bound-ordered oids, but that would be too expensive since it requires >> opening all partitioned tables to initialize partition descriptors. In >> find_inheritance_children(), we get all oids without opening any >> tables. But now that I think more of it, it's only the partitioned >> tables that we have to open, not the leaf partitions; and furthermore, >> I didn't see calls to find_inheritance_children() and >> find_all_inheritors() in performance-critical code, except in >> expand_inherited_rtentry(). All of them are in DDL commands; but yes, >> that can change in the future. 
> > This approach more or less amounts to calling the new > RelationGetPartitionDispatchInfo() (per my proposed patch, a version of > which I posted upthread.) Maybe we can add a wrapper on top, say, > get_all_partition_oids() which throws away other things that > RelationGetPartitionDispatchInfo() returned. In addition it locks all the > partitions that are returned, unlike only the partitioned ones, which is > what RelationGetPartitionDispatchInfo() has been taught to do. So there are three different task items here : 1. Arrange the oids in consistent order everywhere. 2. Prepare the Partition Dispatch Info data structure in the planner as against during execution. 3. For update tuple routing, assume that the result rels are ordered consistently to make the searching efficient. #3 depends on #1. So for that, I have come up with a minimum set of changes to have expand_inherited_rtentry() generate the rels in bound order. When we do #2 , it may be possible that we may need to re-do my changes in expand_inherited_rtentry(), but those are minimum. We may even end up having the walker function being used at multiple places, but right now it is not certain. So, I think we can continue the discussion about #1 and #2 in a separate thread. > >> Regarding dynamically locking specific partitions as and when needed, >> I think this method inherently has the issue of deadlock because the >> order would be random. So it feels like there is no way around other >> than to lock all partitions beforehand. > > I'm not sure why the order has to be random. If and when we decide to > open and lock a subset of partitions for a given query, it will be done in > some canonical order as far as I can imagine. Do you have some specific > example in mind? Partitioned table t1 has partitions t1p1 and t1p2 Partitioned table t2 at the same level has partitions t2p1 and t2p2 Tuple routing causes the first row to insert into t2p2, so t2p2 is locked. Next insert locks t1p1 because it inserts into t1p1. But at the same time, somebody does DDL on some parent common to t1 and t2, so it locks the leaf partitions in a fixed specific order, which would be different than the insert lock order because that order depended upon the order of tables that the insert rows were routed to. > > Thanks, > Amit > -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2017/08/02 19:49, Amit Khandekar wrote: > On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> One approach I had considered was to have find_inheritance_children() >>> itself lock the children in bound order, so that everyone will have >>> bound-ordered oids, but that would be too expensive since it requires >>> opening all partitioned tables to initialize partition descriptors. In >>> find_inheritance_children(), we get all oids without opening any >>> tables. But now that I think more of it, it's only the partitioned >>> tables that we have to open, not the leaf partitions; and furthermore, >>> I didn't see calls to find_inheritance_children() and >>> find_all_inheritors() in performance-critical code, except in >>> expand_inherited_rtentry(). All of them are in DDL commands; but yes, >>> that can change in the future. >> >> This approach more or less amounts to calling the new >> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of >> which I posted upthread.) Maybe we can add a wrapper on top, say, >> get_all_partition_oids() which throws away other things that >> RelationGetPartitionDispatchInfo() returned. In addition it locks all the >> partitions that are returned, unlike only the partitioned ones, which is >> what RelationGetPartitionDispatchInfo() has been taught to do. > > So there are three different task items here : > 1. Arrange the oids in consistent order everywhere. > 2. Prepare the Partition Dispatch Info data structure in the planner > as against during execution. > 3. For update tuple routing, assume that the result rels are ordered > consistently to make the searching efficient. That's a good breakdown. > #3 depends on #1. So for that, I have come up with a minimum set of > changes to have expand_inherited_rtentry() generate the rels in bound > order. When we do #2 , it may be possible that we may need to re-do my > changes in expand_inherited_rtentry(), but those are minimum. We may > even end up having the walker function being used at multiple places, > but right now it is not certain. So AFAICS: For performance reasons, we want the order in which leaf partition sub-plans appear in the ModifyTable node (and subsequently leaf partition ResultRelInfos ModifyTableState) to be some known canonical order. That's because we want to map partitions in the insert tuple-routing data structure (which appear in a known canonical order as determined by RelationGetPartititionDispatchInfo) to those appearing in the ModifyTableState. That's so that we can reuse the planner-generated WCO and RETURNING lists in the insert code path when update tuple-routing invokes that path. To implement that, planner should retrieve the list of leaf partition OIDs in the same order as ExecSetupPartitionTupleRouting() retrieves them. Because the latter calls RelationGetPartitionDispatchInfo on the root partitioned table, maybe the planner should do that too, instead of its current method getting OIDs using find_all_inheritors(). But it's currently not possible due to the way RelationGetPartitionDispatchInfo() and involved data structures are designed. One way forward I see is to invent new interface functions: List *get_all_partition_oids(Oid, LOCKMODE) List *get_partition_oids(Oid, LOCKMODE) that resemble find_all_inheritors() and find_inheritance_children(), respectively, but expects that users make sure that they are called only for partitioned tables. 
Needless to mention, OIDs are returned with canonical order determined by that of the partition bounds and partition tree structure. We replace all the calls of the old interface functions with the respective new ones. That means expand_inherited_rtentry (among others) now calls get_all_partition_oids() if the RTE is for a partitioned table and find_all_inheritors() otherwise. > So, I think we can continue the discussion about #1 and #2 in a separate thread. I have started a new thread named "expanding inheritance in partition bound order" and posted a couple of patches [1]. After applying those patches, you can write code for #3 without having to worry about the concerns of partition order, which I guess you've already done. >>> Regarding dynamically locking specific partitions as and when needed, >>> I think this method inherently has the issue of deadlock because the >>> order would be random. So it feels like there is no way around other >>> than to lock all partitions beforehand. >> >> I'm not sure why the order has to be random. If and when we decide to >> open and lock a subset of partitions for a given query, it will be done in >> some canonical order as far as I can imagine. Do you have some specific >> example in mind? > > Partitioned table t1 has partitions t1p1 and t1p2 > Partitioned table t2 at the same level has partitions t2p1 and t2p2 > Tuple routing causes the first row to insert into t2p2, so t2p2 is locked. > Next insert locks t1p1 because it inserts into t1p1. > But at the same time, somebody does DDL on some parent common to t1 > and t2, so it locks the leaf partitions in a fixed specific order, > which would be different than the insert lock order because that order > depended upon the order of tables that the insert rows were routed to. Note that we don't currently do this. That is, lock partitions in an order determined by incoming rows. ExecSetupPartitionTupleRouting() locks (RowExclusiveLock) all the partitions beforehand in the partition bound order. Any future patch that wants to delay locking and opening the relation descriptor of a leaf partition to when a tuple is actually routed to it will have to think hard about the deadlock problem you illustrate above. Aside from the insert case, let's consider locking order when planning a select on a partitioned table. We currently lock all the partitions in advance in expand_inherited_rtentry(). When replacing the current method by some new way, we will first determine all the partitions that satisfy a given query, collect them in an ordered list (some fixed canonical order), and lock them in that order. But maybe, I misunderstood what you said? Thanks, Amit [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp
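A very rough sketch of what such a get_all_partition_oids() might look like under these assumptions (breadth-first over the partition tree, children appended in partition-bound order, each child locked as it is appended; this is not code from any posted patch):

List *
get_all_partition_oids(Oid parentOid, LOCKMODE lockmode)
{
    List       *result;
    ListCell   *lc;

    LockRelationOid(parentOid, lockmode);
    result = list_make1_oid(parentOid);

    /* rels get appended to 'result' while we are iterating over it,
     * giving a breadth-first traversal of the partition tree */
    foreach(lc, result)
    {
        Relation    rel = heap_open(lfirst_oid(lc), NoLock);
        PartitionDesc partdesc = RelationGetPartitionDesc(rel);

        if (partdesc != NULL)
        {
            int     i;

            /* rd_partdesc->oids[] is already in partition-bound order */
            for (i = 0; i < partdesc->nparts; i++)
            {
                LockRelationOid(partdesc->oids[i], lockmode);
                result = lappend_oid(result, partdesc->oids[i]);
            }
        }
        heap_close(rel, NoLock);
    }

    return result;
}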
> > Below are the TODOS at this point : > > Fix for bug reported by Rajkumar about update with join. I had explained the root issue of this bug here : [1] Attached patch includes the fix, which is explained below. Currently in the patch, there is a check if the tuple is concurrently deleted by other session, i.e. when heap_update() returns HeapTupleUpdated. In such case we set concurrently_deleted output param to true. We should also do the same for HeapTupleSelfUpdated return value. In fact, there are other places in ExecDelete() where it can return without doing anything. For e.g. if a BR DELETE trigger prevents the delete from happening, ExecBRDeleteTriggers() returns false, in which case ExecDelete() returns. So what the fix does is : rename concurrently_deleted parameter to delete_skipped so as to indicate a more general status : whether delete has actually happened or was it skipped. And set this param to true only after the delete happens. This allows us to avoid adding a new rows for the trigger case also. Added test scenario for UPDATE with JOIN case, and also TRIGGER case. > Do something about two separate mapping tables for Transition tables > and update tuple-routing. On 1 July 2017 at 03:15, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Would make sense to have a set of functions with names like > GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays > m_convertors_{from,to}_by_{subplan,leaf} the first time they need > them? This was discussed here : [2]. I think even if we have them built when needed, still in presence of both tuple routing and transition tables, we do need separate arrays. So I think rather than dynamic arrays, we can have static arrays but their elements will point to a shared TupleConversionMap structure whenever possible. As already in the patch, in case of insert/update tuple routing, there is a per-leaf partition mt_transition_tupconv_maps array for transition tables, and a separate per-subplan arry mt_resultrel_maps for update tuple routing. *But*, what I am proposing is: for the mt_transition_tupconv_maps[] element for which the leaf partition also exists as a per-subplan result, that array element and the mt_resultrel_maps[] element will point to the same TupleConversionMap structure. This is quite similar to how we are re-using the per-subplan resultrels for the per-leaf result rels. We will re-use the per-subplan TupleConversionMap for the per-leaf mt_transition_tupconv_maps[] elements. Not yet implemented this. > GetUpdatedColumns() to be moved to header file. Done. I have moved it in execnodes.h > More test scenarios in regression tests. > Need to check/test whether we are correctly applying insert policies > (ant not update) while inserting a routed tuple. Yet to do above two. > Use getASTriggerResultRelInfo() for attrno mapping, rather than first > resultrel, for generating child WCO/RETURNING expression. > Regarding generating child WithCheckOption and Returning expressions using those of the root result relation, ModifyTablePath and ModifyTable should have new fields rootReturningList (and rootWithCheckOptions) which would be derived from root->parse->returningList in inheritance_planner(). But then, similar to per-subplan returningList, rootReturningList would have to pass through set_plan_refs()=>set_returning_clause_references() which requires the subplan targetlist to be passed. 
Because of this, for rootReturningList, we require a subplan for root partition, which is not there currently; we have subpans only for child rels. That means we would have to create such plan only for the sake of generating rootReturningList. The other option is to do the way the patch is currently doing in the executor by using the returningList of the first per-subplan result rel to generate the other child returningList (and WithCheckOption). This is working by applying map_partition_varattnos() to the first returningList. But now that we realized that we have to specially handle whole-row vars, map_partition_varattnos() would need some changes to convert whole row vars differently for child-rel-to-child-rel mapping. For childrel-to-childrel conversion, the whole-row var is already wrapped by ConvertRowtypeExpr, but we need to change its Var->vartype to the new child vartype. I think the second option looks easier, but I am open to suggestions, and I am myself still checking the first one. > Address Robert's review comments on make_resultrel_ordered.patch. > > +typedef struct ParentChild > > This is a pretty generic name. Pick something more specific and informative. I have used ChildPartitionInfo. But suggestions welcome. > > +static List *append_rel_partition_oids(List *rel_list, Relation rel); > > One could be forgiven for thinking that this function was just going > to append OIDs, but it actually appends ParentChild structures, so I > think the name needs work. Renamed it to append_child_partitions(). > > +List *append_rel_partition_oids(List *rel_list, Relation rel) > > Style. Please pgindent your patches. I have pgindent'ed changes in nodeModifyTable.c and partition.c, yet to do that for others. > > +#ifdef DEBUG_PRINT_OIDS > + print_oids(*leaf_part_oids); > +#endif > > I'd just rip out this debug stuff once you've got this working, but if > we keep it, it certainly can't have a name as generic as print_oids() > when it's actually doing something with a list of ParentChild > structures. Also, it prints names, not OIDs. And DEBUG_PRINT_OIDS is > no good for the same reasons. Now that I have tested it , I have removed this. Also, the ordered subplans printed in explain output serve the same purpose. > > + if (RelationGetPartitionDesc(rel)) > + walker->rels_list = append_rel_partition_oids(walker->rels_list, rel); > > Every place that calls append_rel_partition_oids guards that call with > if (RelationGetPartitionDesc(...)). It seems to me that it would be > simpler to remove those tests and instead just replace the > Assert(partdesc) inside that function with if (!partdesc) return; Done. > > Is there any real benefit in this "walker" interface? It looks to me > like it might be simpler to just change things around so that it > returns a list of OIDs, like find_all_inheritors, but generated > differently. Then if you want bound-ordering rather than > OID-ordering, you just do this: > > list_free(inhOids); > inhOids = get_partition_oids_in_bound_order(rel); > > That'd remove the need for some if/then logic as you've currently got > in get_next_child(). Have explained this here : https://www.postgresql.org/message-id/CAJ3gD9dQ2FKes8pP6aM-4Tx3ngqWvD8oyOJiDRxLVoQiY76t0A%40mail.gmail.com I am aware that this might get changed once we checkin a separate patch just floated to expand inheritence in bound order. 
> > + is_partitioned_resultrel = > + (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE > + && rti == parse->resultRelation); > > I suspect this isn't correct for a table that contains wCTEs, because > there would in that case be multiple result relations. > > I think we should always expand in bound order rather than only when > it's a result relation. Have changed it to always expand in bound order for partitioned table. [1]. https://www.postgresql.org/message-id/CAKcux6%3Dz38gH4K6YAFi%2BYvo5tHTwBL4tam4VM33CAPZ5dDMk1Q%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
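To make the delete_skipped behaviour above concrete, here is a small SQL sketch (hypothetical object names; the expected outcome is per the description in this mail, not verified against the patch): a BR DELETE trigger on the source partition suppresses the delete, so the row-movement INSERT into the destination partition must be skipped as well, leaving the row untouched.

CREATE TABLE p (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);

-- Returning NULL from a BR DELETE trigger cancels the delete.
CREATE FUNCTION keep_row() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RETURN NULL; END; $$;

CREATE TRIGGER p1_br_del BEFORE DELETE ON p1
  FOR EACH ROW EXECUTE PROCEDURE keep_row();

INSERT INTO p VALUES (5, 'x');

-- This would require moving the row from p1 to p2.  The BR DELETE trigger skips
-- the delete, so with delete_skipped = true no row should be inserted into p2.
UPDATE p SET a = 15 WHERE a = 5;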
On Fri, Aug 4, 2017 at 10:28 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> Below are the TODOS at this point :
>
> Fix for bug reported by Rajkumar about update with join.
I had explained the root issue of this bug here : [1]
Attached patch includes the fix, which is explained below.
Hi Amit,
I have applied the v14 patch and tested it from my side; everything looks good to me. Attaching some test cases and an .out file for reference.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation Attachment
On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> >> Below are the TODOS at this point : >> >> Do something about two separate mapping tables for Transition tables >> and update tuple-routing. > On 1 July 2017 at 03:15, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Would make sense to have a set of functions with names like >> GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays >> m_convertors_{from,to}_by_{subplan,leaf} the first time they need >> them? > > This was discussed here : [2]. I think even if we have them built when > needed, still in presence of both tuple routing and transition tables, > we do need separate arrays. So I think rather than dynamic arrays, we > can have static arrays but their elements will point to a shared > TupleConversionMap structure whenever possible. > As already in the patch, in case of insert/update tuple routing, there > is a per-leaf partition mt_transition_tupconv_maps array for > transition tables, and a separate per-subplan arry mt_resultrel_maps > for update tuple routing. *But*, what I am proposing is: for the > mt_transition_tupconv_maps[] element for which the leaf partition also > exists as a per-subplan result, that array element and the > mt_resultrel_maps[] element will point to the same TupleConversionMap > structure. > > This is quite similar to how we are re-using the per-subplan > resultrels for the per-leaf result rels. We will re-use the > per-subplan TupleConversionMap for the per-leaf > mt_transition_tupconv_maps[] elements. > > Not yet implemented this. The attached patch has the above needed changes. Now we have following map arrays in ModifyTableState. The earlier naming was confusing so I renamed them. mt_perleaf_parentchild_maps : To be used for converting insert/update routed tuples from root to the destination leaf partition. mt_perleaf_childparent_maps : To be used for transition tables for converting back the tuples from leaf partition to root. mt_persubplan_childparent_maps : To be used by both transition tables and update-row movement for their own different purpose for UPDATEs. I also had to add another partition slot mt_rootpartition_tuple_slot alongside mt_partition_tuple_slot. For update-row-movement, in ExecInsert(), we used to have a common slot for root partition's tuple as well as leaf partition tuple. So the former tuple was a transient tuple. But mtstate->mt_transition_capture->tcs_original_insert_tuple requires the tuple to be valid, so we could not pass a transient tuple. Hence another partition slot. ------- But in the first place, while testing transition tables behaviour with update row movement, I found out that transition tables OLD TABLE AND NEW TABLE don't get populated with the rows that are moved to another partition. This is because the operation is ExecDelete() and ExecInsert(), which don't run the transition-related triggers for updates. Even though transition-table-triggers are statement-level, the AR ROW trigger-related functions like ExecARUpdateTriggers() do get run for each row, so that the tables get populated; and they skip the usual row-level trigger stuff. For update-row-movement, we need to teach ExecARUpdateTriggers() to run the transition-related processing for the DELETE+INESRT operation as well. But since delete and insert happen on different tables, we cannot call ExecARUpdateTriggers() at a single place. We need to call it once after ExecDelete() for loading the OLD row, and then after ExecInsert() for loading the NEW row. 
Also, currently ExecARUpdateTriggers() does not allow NULL old tuple or new tuple, but we need to allow it for the above transition table processing. The attached patch has the above needed changes. > >> Use getASTriggerResultRelInfo() for attrno mapping, rather than first >> resultrel, for generating child WCO/RETURNING expression. >> > > Regarding generating child WithCheckOption and Returning expressions > using those of the root result relation, ModifyTablePath and > ModifyTable should have new fields rootReturningList (and > rootWithCheckOptions) which would be derived from > root->parse->returningList in inheritance_planner(). But then, similar > to per-subplan returningList, rootReturningList would have to pass > through set_plan_refs()=>set_returning_clause_references() which > requires the subplan targetlist to be passed. Because of this, for > rootReturningList, we require a subplan for root partition, which is > not there currently; we have subpans only for child rels. That means > we would have to create such plan only for the sake of generating > rootReturningList. > > The other option is to do the way the patch is currently doing in the > executor by using the returningList of the first per-subplan result > rel to generate the other child returningList (and WithCheckOption). > This is working by applying map_partition_varattnos() to the first > returningList. But now that we realized that we have to specially > handle whole-row vars, map_partition_varattnos() would need some > changes to convert whole row vars differently for > child-rel-to-child-rel mapping. For childrel-to-childrel conversion, > the whole-row var is already wrapped by ConvertRowtypeExpr, but we > need to change its Var->vartype to the new child vartype. > > I think the second option looks easier, but I am open to suggestions, > and I am myself still checking the first one. I have done the changes using the second option above. In the attached patch, the same map_partition_varattnos() is called for child-to-child mapping. But in such case, the source child partition already has ConvertRowtypeExpr node, so another ConvertRowtypeExpr node is not added; just the containing var node is updated with the new composite type. In the regression test, I have included different types like numeric, int, text for the partition key columns, so as to test the same. >> More test scenarios in regression tests. >> Need to check/test whether we are correctly applying insert policies >> (ant not update) while inserting a routed tuple. > > Yet to do above two. This is still to do. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
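As a behavioural sketch of what the change above is meant to achieve (hypothetical names; the expected output is inferred from the description, not verified): a statement-level UPDATE trigger with transition tables on the partitioned root should see a moved row once in OLD TABLE and once in NEW TABLE.

CREATE TABLE p (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);

CREATE FUNCTION show_update() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE 'old table: % row(s), new table: % row(s)',
    (SELECT count(*) FROM old_rows), (SELECT count(*) FROM new_rows);
  RETURN NULL;
END; $$;

CREATE TRIGGER p_upd_stmt AFTER UPDATE ON p
  REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
  FOR EACH STATEMENT EXECUTE PROCEDURE show_update();

INSERT INTO p VALUES (5, 'x');

-- The row moves from p1 to p2; with the ExecARUpdateTriggers() calls described
-- above, its pre-update image should appear in old_rows and its post-update
-- image in new_rows, i.e. the trigger should report one row in each.
UPDATE p SET a = 15 WHERE a = 5;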
On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> I am planning to review and test this patch, Seems like this patch needs to be rebased. [dilip@localhost postgresql]$ patch -p1 < ../patches/update-partition-key_v15.patch patching file doc/src/sgml/ddl.sgml patching file doc/src/sgml/ref/update.sgml patching file doc/src/sgml/trigger.sgml patching file src/backend/catalog/partition.c Hunk #3 succeeded at 910 (offset -1 lines). Hunk #4 succeeded at 924 (offset -1 lines). Hunk #5 succeeded at 934 (offset -1 lines). Hunk #6 succeeded at 994 (offset -1 lines). Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines). Hunk #8 FAILED at 1023. Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines). Hunk #10 succeeded at 2069 (offset 2 lines). Hunk #11 succeeded at 2406 (offset 2 lines). 1 out of 11 hunks FAILED -- saving rejects to file src/backend/catalog/partition.c.rej patching file src/backend/commands/copy.c Hunk #2 FAILED at 1426. Hunk #3 FAILED at 1462. Hunk #4 succeeded at 2616 (offset 7 lines). Hunk #5 succeeded at 2726 (offset 8 lines). Hunk #6 succeeded at 2846 (offset 8 lines). 2 out of 6 hunks FAILED -- saving rejects to file src/backend/commands/copy.c.rej patching file src/backend/commands/trigger.c Hunk #4 succeeded at 5261 with fuzz 2. patching file src/backend/executor/execMain.c Hunk #1 succeeded at 65 (offset 1 line). Hunk #2 succeeded at 103 (offset 1 line). Hunk #3 succeeded at 1829 (offset 20 lines). Hunk #4 succeeded at 1860 (offset 20 lines). Hunk #5 succeeded at 1927 (offset 20 lines). Hunk #6 succeeded at 2044 (offset 21 lines). Hunk #7 FAILED at 3210. Hunk #8 FAILED at 3244. Hunk #9 succeeded at 3289 (offset 26 lines). Hunk #10 FAILED at 3340. Hunk #11 succeeded at 3387 (offset 29 lines). Hunk #12 succeeded at 3424 (offset 29 lines). 3 out of 12 hunks FAILED -- saving rejects to file src/backend/executor/execMain.c.rej patching file src/backend/executor/execReplication.c patching file src/backend/executor/nodeModifyTable.c -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Thanks Dilip. I am working on rebasing the patch. Particularly, the partition walker in my patch depended on the fact that all the tables get opened (and then closed) while creating the tuple routing info. But in HEAD, now only the partitioned tables get opened. So need some changes in my patch. The partition walker related changes are going to be inapplicable once the other thread [1] commits the changes for expansion of inheritence in bound-order, but till then I would have to rebase the partition walker changes over HEAD. [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp On 31 August 2017 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>> > > I am planning to review and test this patch, Seems like this patch > needs to be rebased. > > [dilip@localhost postgresql]$ patch -p1 < > ../patches/update-partition-key_v15.patch > patching file doc/src/sgml/ddl.sgml > patching file doc/src/sgml/ref/update.sgml > patching file doc/src/sgml/trigger.sgml > patching file src/backend/catalog/partition.c > Hunk #3 succeeded at 910 (offset -1 lines). > Hunk #4 succeeded at 924 (offset -1 lines). > Hunk #5 succeeded at 934 (offset -1 lines). > Hunk #6 succeeded at 994 (offset -1 lines). > Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines). > Hunk #8 FAILED at 1023. > Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines). > Hunk #10 succeeded at 2069 (offset 2 lines). > Hunk #11 succeeded at 2406 (offset 2 lines). > 1 out of 11 hunks FAILED -- saving rejects to file > src/backend/catalog/partition.c.rej > patching file src/backend/commands/copy.c > Hunk #2 FAILED at 1426. > Hunk #3 FAILED at 1462. > Hunk #4 succeeded at 2616 (offset 7 lines). > Hunk #5 succeeded at 2726 (offset 8 lines). > Hunk #6 succeeded at 2846 (offset 8 lines). > 2 out of 6 hunks FAILED -- saving rejects to file > src/backend/commands/copy.c.rej > patching file src/backend/commands/trigger.c > Hunk #4 succeeded at 5261 with fuzz 2. > patching file src/backend/executor/execMain.c > Hunk #1 succeeded at 65 (offset 1 line). > Hunk #2 succeeded at 103 (offset 1 line). > Hunk #3 succeeded at 1829 (offset 20 lines). > Hunk #4 succeeded at 1860 (offset 20 lines). > Hunk #5 succeeded at 1927 (offset 20 lines). > Hunk #6 succeeded at 2044 (offset 21 lines). > Hunk #7 FAILED at 3210. > Hunk #8 FAILED at 3244. > Hunk #9 succeeded at 3289 (offset 26 lines). > Hunk #10 FAILED at 3340. > Hunk #11 succeeded at 3387 (offset 29 lines). > Hunk #12 succeeded at 3424 (offset 29 lines). > 3 out of 12 hunks FAILED -- saving rejects to file > src/backend/executor/execMain.c.rej > patching file src/backend/executor/execReplication.c > patching file src/backend/executor/nodeModifyTable.c > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Thanks Dilip. I am working on rebasing the patch. Particularly, the > partition walker in my patch depended on the fact that all the tables > get opened (and then closed) while creating the tuple routing info. > But in HEAD, now only the partitioned tables get opened. So need some > changes in my patch. > > The partition walker related changes are going to be inapplicable once > the other thread [1] commits the changes for expansion of inheritence > in bound-order, but till then I would have to rebase the partition > walker changes over HEAD. > > [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp > After recent commit 30833ba154, now the partitions are expanded in depth-first order. It didn't seem worthwhile rebasing my partition walker changes onto the latest code. So in the attached patch, I have removed all the partition walker changes. But RelationGetPartitionDispatchInfo() traverses in breadth-first order, which is different than the update result rels order (because inheritance expansion order is depth-first). So, in order to make the tuple-routing-related leaf partitions in the same order as that of the update result rels, we would have to make changes in RelationGetPartitionDispatchInfo(), which I am not sure whether it is going to be done as part of the thread "expanding inheritance in partition bound order" [1]. For now, in the attached patch, I have reverted back to the hash table method to find the leaf partitions in the update result rels. [1] https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com Thanks -Amit Khandekar
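A hypothetical example of the ordering mismatch described above (partition names invented for illustration):

CREATE TABLE root (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF root FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b);
CREATE TABLE p1a PARTITION OF p1 FOR VALUES FROM (0) TO (50);
CREATE TABLE p1b PARTITION OF p1 FOR VALUES FROM (50) TO (100);
CREATE TABLE p2 PARTITION OF root FOR VALUES FROM (100) TO (200);

-- Depth-first inheritance expansion (order of the UPDATE result rels):
--   p1a, p1b, p2
-- Breadth-first traversal in RelationGetPartitionDispatchInfo()
-- (order of the tuple-routing leaf partitions):
--   p2, p1a, p1b
-- Since the two orders differ, the patch falls back to a hash table to map
-- leaf partitions to the existing per-subplan result rels.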
On Sun, Sep 3, 2017 at 5:10 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Thanks Dilip. I am working on rebasing the patch. Particularly, the >> partition walker in my patch depended on the fact that all the tables >> get opened (and then closed) while creating the tuple routing info. >> But in HEAD, now only the partitioned tables get opened. So need some >> changes in my patch. >> >> The partition walker related changes are going to be inapplicable once >> the other thread [1] commits the changes for expansion of inheritence >> in bound-order, but till then I would have to rebase the partition >> walker changes over HEAD. >> >> [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp >> > > After recent commit 30833ba154, now the partitions are expanded in > depth-first order. It didn't seem worthwhile rebasing my partition > walker changes onto the latest code. So in the attached patch, I have > removed all the partition walker changes. > It seems you have forgotten to attach the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Sep 3, 2017 at 5:10 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> Thanks Dilip. I am working on rebasing the patch. Particularly, the >>> partition walker in my patch depended on the fact that all the tables >>> get opened (and then closed) while creating the tuple routing info. >>> But in HEAD, now only the partitioned tables get opened. So need some >>> changes in my patch. >>> >>> The partition walker related changes are going to be inapplicable once >>> the other thread [1] commits the changes for expansion of inheritence >>> in bound-order, but till then I would have to rebase the partition >>> walker changes over HEAD. >>> >>> [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp >>> >> >> After recent commit 30833ba154, now the partitions are expanded in >> depth-first order. It didn't seem worthwhile rebasing my partition >> walker changes onto the latest code. So in the attached patch, I have >> removed all the partition walker changes. >> > > It seems you have forgotten to attach the patch. Oops sorry. Now attached. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote: > Oops sorry. Now attached. I have done some basic testing and initial review of the patch. I have some comments/doubts. I will continue the review. + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture) + ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid, For passing invalid ItemPointer we are using InvalidOid, this seems bit odd to me are we using simmilar convention some other place? I think it would be better to just pass 0? ------ - if ((event == TRIGGER_EVENT_DELETE && delete_old_table) || - (event == TRIGGER_EVENT_UPDATE && update_old_table)) + if (oldtup != NULL && + ((event == TRIGGER_EVENT_DELETE && delete_old_table) || + (event == TRIGGER_EVENT_UPDATE && update_old_table))) { Tuplestorestate *old_tuplestore; - Assert(oldtup != NULL); Only if TRIGGER_EVENT_UPDATE it is possible that oldtup can be NULL, so we have added an extra check for oldtup and removed the Assert, but if TRIGGER_EVENT_DELETE we never expect it to be NULL. Is it better to put Assert outside the condition check (Assert(oldtup != NULL || event == TRIGGER_EVENT_UPDATE)) ? same for the newtup. I think we should also explain in comments about why oldtup or newtup can be NULL in case of if TRIGGER_EVENT_UPDATE ------- + triggers affect the row being moved. As far as <literal>AFTER ROW</> + triggers are concerned, <literal>AFTER</> <command>DELETE</command> and + <literal>AFTER</> <command>INSERT</command> triggers are applied; but + <literal>AFTER</> <command>UPDATE</command> triggers are not applied + because the <command>UPDATE</command> has been converted to a + <command>DELETE</command> and <command>INSERT</command>. Above comments says that ARUpdate trigger is not fired but below code call ARUpdateTrigger + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture) + ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid, + NULL, + tuple, + NULL, + mtstate->mt_transition_capture); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Oops sorry. Now attached. > > I have done some basic testing and initial review of the patch. I Thanks for taking this up for review. Attached is the updated patch v17, that covers the below points. > + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture) > + ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid, > > For passing invalid ItemPointer we are using InvalidOid, this seems > bit odd to me > are we using simmilar convention some other place? I think it would be better to > just pass 0? Yes that's right. Replaced InvalidOid by NULL since ItemPointer is a pointer. > > ------ > > - if ((event == TRIGGER_EVENT_DELETE && delete_old_table) || > - (event == TRIGGER_EVENT_UPDATE && update_old_table)) > + if (oldtup != NULL && > + ((event == TRIGGER_EVENT_DELETE && delete_old_table) || > + (event == TRIGGER_EVENT_UPDATE && update_old_table))) > { > Tuplestorestate *old_tuplestore; > > - Assert(oldtup != NULL); > > Only if TRIGGER_EVENT_UPDATE it is possible that oldtup can be NULL, > so we have added an extra > check for oldtup and removed the Assert, but if TRIGGER_EVENT_DELETE > we never expect it to be NULL. > > Is it better to put Assert outside the condition check (Assert(oldtup > != NULL || event == TRIGGER_EVENT_UPDATE)) ? > same for the newtup. > > I think we should also explain in comments about why oldtup or newtup > can be NULL in case of if > TRIGGER_EVENT_UPDATE Done all the above. Added two separate asserts, one for DELETE and the other for INSERT. > > ------- > > + triggers affect the row being moved. As far as <literal>AFTER ROW</> > + triggers are concerned, <literal>AFTER</> <command>DELETE</command> and > + <literal>AFTER</> <command>INSERT</command> triggers are applied; but > + <literal>AFTER</> <command>UPDATE</command> triggers are not applied > + because the <command>UPDATE</command> has been converted to a > + <command>DELETE</command> and <command>INSERT</command>. > > Above comments says that ARUpdate trigger is not fired but below code call > ARUpdateTrigger > > + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture) > + ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid, > + NULL, > + tuple, > + NULL, > + mtstate->mt_transition_capture); Actually, since transition tables came in, the functions like ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional purpose of capturing transition table rows, so that the images of the tables are visible when statement triggers are fired that refer to these transition tables. So in the above code, these functions only capture rows, they do not add any event for firing any ROW triggers. AfterTriggerSaveEvent() returns without adding any event if it's called only for transition capture. So even if UPDATE row triggers are defined, they won't get fired in case of row movement, although the updated rows would be captured if transition tables are referenced in these triggers or in the statement triggers. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
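Here is a small sketch (hypothetical names) matching the documentation text quoted above, showing which row-level AFTER triggers one would expect to fire when a row is moved:

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);

CREATE FUNCTION note_event() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RAISE NOTICE '% on %', TG_OP, TG_TABLE_NAME; RETURN NULL; END; $$;

CREATE TRIGGER p1_ar_del AFTER DELETE ON p1 FOR EACH ROW EXECUTE PROCEDURE note_event();
CREATE TRIGGER p1_ar_upd AFTER UPDATE ON p1 FOR EACH ROW EXECUTE PROCEDURE note_event();
CREATE TRIGGER p2_ar_ins AFTER INSERT ON p2 FOR EACH ROW EXECUTE PROCEDURE note_event();

INSERT INTO p VALUES (5);

-- Moving the row from p1 to p2 should fire the AFTER DELETE row trigger on p1 and
-- the AFTER INSERT row trigger on p2, but not the AFTER UPDATE row trigger on p1,
-- since the UPDATE has been turned into a DELETE plus an INSERT.
UPDATE p SET a = 15 WHERE a = 5;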
On 3 September 2017 at 17:10, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > After recent commit 30833ba154, now the partitions are expanded in > depth-first order. It didn't seem worthwhile rebasing my partition > walker changes onto the latest code. So in the attached patch, I have > removed all the partition walker changes. But > RelationGetPartitionDispatchInfo() traverses in breadth-first order, > which is different than the update result rels order (because > inheritance expansion order is depth-first). So, in order to make the > tuple-routing-related leaf partitions in the same order as that of the > update result rels, we would have to make changes in > RelationGetPartitionDispatchInfo(), which I am not sure whether it is > going to be done as part of the thread "expanding inheritance in > partition bound order" [1]. For now, in the attached patch, I have > reverted back to the hash table method to find the leaf partitions in > the update result rels. > > [1] https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com As mentioned by Amit Langote in the above mail thread, he is going to do changes for making RelationGetPartitionDispatchInfo() return the leaf partitions in depth-first order. Once that is done, I will then remove the hash table method for finding leaf partitions in update result rels, and instead use the earlier efficient method that takes advantage of the fact that update result rels and leaf partitions are in the same order. > > Thanks > -Amit Khandekar -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attached is the patch rebased on latest HEAD. Thanks -Amit Khandekar -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Sep 7, 2017 at 6:17 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 3 September 2017 at 17:10, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> After recent commit 30833ba154, now the partitions are expanded in >> depth-first order. It didn't seem worthwhile rebasing my partition >> walker changes onto the latest code. So in the attached patch, I have >> removed all the partition walker changes. But >> RelationGetPartitionDispatchInfo() traverses in breadth-first order, >> which is different than the update result rels order (because >> inheritance expansion order is depth-first). So, in order to make the >> tuple-routing-related leaf partitions in the same order as that of the >> update result rels, we would have to make changes in >> RelationGetPartitionDispatchInfo(), which I am not sure whether it is >> going to be done as part of the thread "expanding inheritance in >> partition bound order" [1]. For now, in the attached patch, I have >> reverted back to the hash table method to find the leaf partitions in >> the update result rels. >> >> [1] https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com > > As mentioned by Amit Langote in the above mail thread, he is going to > do changes for making RelationGetPartitionDispatchInfo() return the > leaf partitions in depth-first order. Once that is done, I will then > remove the hash table method for finding leaf partitions in update > result rels, and instead use the earlier efficient method that takes > advantage of the fact that update result rels and leaf partitions are > in the same order. Has he posted that patch yet? I don't think I saw it, but maybe I missed something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/09/08 18:57, Robert Haas wrote: >> As mentioned by Amit Langote in the above mail thread, he is going to >> do changes for making RelationGetPartitionDispatchInfo() return the >> leaf partitions in depth-first order. Once that is done, I will then >> remove the hash table method for finding leaf partitions in update >> result rels, and instead use the earlier efficient method that takes >> advantage of the fact that update result rels and leaf partitions are >> in the same order. > > Has he posted that patch yet? I don't think I saw it, but maybe I > missed something. I will post on that thread in a moment. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think we can do this even without using an additional infomask bit.
>> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> indicate such an update.
>
> Hmm. How would that work?
>
We can pass a flag say row_moved (or require_row_movement) to
heap_delete which will in turn set InvalidBlockId in ctid instead of
setting it to self. Then the ExecUpdate needs to check for the same
and return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated). Can you explain what difficulty are you
envisioning?
Attached is a WIP patch that incorporates the above logic, although I am yet to check
all the code for places which might be using ip_blkid. I have one small query here:
do we need an error for the HeapTupleSelfUpdated case as well?
Note that the patch should be applied on top of Amit Khandekar's latest patch (v17_rebased).
Regards,
Amul
Attachment
On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote: > On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> >> wrote: >> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> >> > wrote: >> >> I think we can do this even without using an additional infomask bit. >> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to >> >> indicate such an update. >> > >> > Hmm. How would that work? >> > >> >> We can pass a flag say row_moved (or require_row_movement) to >> heap_delete which will in turn set InvalidBlockId in ctid instead of >> setting it to self. Then the ExecUpdate needs to check for the same >> and return an error when heap_update is not successful (result != >> HeapTupleMayBeUpdated). Can you explain what difficulty are you >> envisioning? >> > > Attaching WIP patch incorporates the above logic, although I am yet to check > all the code for places which might be using ip_blkid. I have got a small > query here, > do we need an error on HeapTupleSelfUpdated case as well? > No, because that case is anyway a no-op (or error depending on whether is updated/deleted by same command or later command). Basically, even if the row wouldn't have been moved to another partition, we would not have allowed the command to proceed with the update. This handling is to make commands fail rather than a no-op where otherwise (when the tuple is not moved to another partition) the command would have succeeded. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
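To spell out the user-visible effect being aimed at, here is a two-session sketch (hypothetical names; the exact error behaviour is the proposal under discussion, not something implemented in core):

CREATE TABLE p (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);
INSERT INTO p VALUES (5, 'x');

-- Session A
BEGIN;
UPDATE p SET a = 15 WHERE a = 5;   -- moves the row from p1 to p2; the old version's
                                   -- ctid is marked with InvalidBlockId

-- Session B (concurrently; blocks on the row lock held by A)
UPDATE p SET b = 'y' WHERE a = 5;

-- Session A
COMMIT;

-- Session B can now detect from ip_blkid that the row it waited on was not simply
-- deleted but moved to another partition, and report an error instead of silently
-- updating nothing.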
On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Actually, since transition tables came in, the functions like > ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional > purpose of capturing transition table rows, so that the images of the > tables are visible when statement triggers are fired that refer to > these transition tables. So in the above code, these functions only > capture rows, they do not add any event for firing any ROW triggers. > AfterTriggerSaveEvent() returns without adding any event if it's > called only for transition capture. So even if UPDATE row triggers are > defined, they won't get fired in case of row movement, although the > updated rows would be captured if transition tables are referenced in > these triggers or in the statement triggers. > Ok then I have one more question, With transition table, we can only support statement level trigger and for update statement, we are only going to execute UPDATE statement level trigger? so is there any point of making transition table entry for DELETE/INSERT trigger as those transition table will never be used. Or I am missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 11 September 2017 at 21:12, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> Actually, since transition tables came in, the functions like >> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional >> purpose of capturing transition table rows, so that the images of the >> tables are visible when statement triggers are fired that refer to >> these transition tables. So in the above code, these functions only >> capture rows, they do not add any event for firing any ROW triggers. >> AfterTriggerSaveEvent() returns without adding any event if it's >> called only for transition capture. So even if UPDATE row triggers are >> defined, they won't get fired in case of row movement, although the >> updated rows would be captured if transition tables are referenced in >> these triggers or in the statement triggers. >> > > Ok then I have one more question, > > With transition table, we can only support statement level trigger Yes, we don't support row triggers with transition tables if the table is a partition. > and for update > statement, we are only going to execute UPDATE statement level > trigger? so is there > any point of making transition table entry for DELETE/INSERT trigger > as those transition > table will never be used. But the statement level trigger function can refer to OLD TABLE and NEW TABLE, which will contain all the OLD rows and NEW rows respectively. So the updated rows of the partitions (including the moved ones) need to be captured. So for OLD TABLE, we need to capture the deleted row, and for NEW TABLE, we need to capture the inserted row. In the regression test update.sql, check how the statement trigger trans_updatetrig prints all the updated rows, including the moved ones. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > But the statement level trigger function can refer to OLD TABLE and > NEW TABLE, which will contain all the OLD rows and NEW rows > respectively. So the updated rows of the partitions (including the > moved ones) need to be captured. So for OLD TABLE, we need to capture > the deleted row, and for NEW TABLE, we need to capture the inserted > row. Yes, I agree. So in ExecDelete for OLD TABLE we only need to call ExecARUpdateTriggers which will make the entry in OLD TABLE only if transition table is there otherwise nothing and I guess this part already exists in your patch. And, we are also calling ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete trigger and that is also fine. What I don't understand is that if there is no "ROW- LEVEL delete trigger" and there is only a "statement level delete trigger" with transition table still we are making the entry in transition table of the delete trigger and that will never be used. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > >> But the statement level trigger function can refer to OLD TABLE and >> NEW TABLE, which will contain all the OLD rows and NEW rows >> respectively. So the updated rows of the partitions (including the >> moved ones) need to be captured. So for OLD TABLE, we need to capture >> the deleted row, and for NEW TABLE, we need to capture the inserted >> row. > > Yes, I agree. So in ExecDelete for OLD TABLE we only need to call > ExecARUpdateTriggers which will make the entry in OLD TABLE only if > transition table is there otherwise nothing and I guess this part > already exists in your patch. And, we are also calling > ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete > trigger and that is also fine. What I don't understand is that if > there is no "ROW- LEVEL delete trigger" and there is only a "statement > level delete trigger" with transition table still we are making the > entry in transition table of the delete trigger and that will never be > used. Hmm, ok, that might be happening, since we are calling ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL, and so the deleted tuple gets captured even when there is no UPDATE statement trigger defined, which looks redundant. Will check this. Thanks. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 8 September 2017 at 15:21, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached is the patch rebased on latest HEAD. The patch had bit-rotted again. The rebased version v17_rebased_2.patch also adds some scenarios in update.sql that cover UPDATE row movement from a non-default partition to the default partition and vice versa. > > Thanks > -Amit Khandekar -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
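For reference, the kind of scenario those new tests cover would look roughly like this (a sketch with invented names, not the actual regression test):

CREATE TABLE p (a int) PARTITION BY LIST (a);
CREATE TABLE p12 PARTITION OF p FOR VALUES IN (1, 2);
CREATE TABLE p_def PARTITION OF p DEFAULT;

INSERT INTO p VALUES (1), (7);      -- 1 is routed to p12, 7 to p_def

UPDATE p SET a = 9 WHERE a = 1;     -- moves a row from a non-default partition to the default one
UPDATE p SET a = 2 WHERE a = 7;     -- moves a row from the default partition back to p12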
On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >> I think we can do this even without using an additional infomask bit.
>> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> >> indicate such an update.
>> >
>> > Hmm. How would that work?
>> >
>>
>> We can pass a flag say row_moved (or require_row_movement) to
>> heap_delete which will in turn set InvalidBlockId in ctid instead of
>> setting it to self. Then the ExecUpdate needs to check for the same
>> and return an error when heap_update is not successful (result !=
>> HeapTupleMayBeUpdated). Can you explain what difficulty are you
>> envisioning?
>>
>
> Attaching WIP patch incorporates the above logic, although I am yet to check
> all the code for places which might be using ip_blkid. I have got a small
> query here,
> do we need an error on HeapTupleSelfUpdated case as well?
>
No, because that case is anyway a no-op (or error depending on whether
is updated/deleted by same command or later command). Basically, even
if the row wouldn't have been moved to another partition, we would not
have allowed the command to proceed with the update. This handling is
to make commands fail rather than a no-op where otherwise (when the
tuple is not moved to another partition) the command would have
succeeded.
Thank you.
I've rebased the patch against Amit Khandekar's latest patch (v17_rebased_2).
Also added an ip_blkid validation check in heap_get_latest_tid() and rewrite_heap_tuple(), because the ItemPointerEquals() check alone is no longer sufficient after this patch.
Regards,
Amul
Attachment
On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> >>> But the statement level trigger function can refer to OLD TABLE and >>> NEW TABLE, which will contain all the OLD rows and NEW rows >>> respectively. So the updated rows of the partitions (including the >>> moved ones) need to be captured. So for OLD TABLE, we need to capture >>> the deleted row, and for NEW TABLE, we need to capture the inserted >>> row. >> >> Yes, I agree. So in ExecDelete for OLD TABLE we only need to call >> ExecARUpdateTriggers which will make the entry in OLD TABLE only if >> transition table is there otherwise nothing and I guess this part >> already exists in your patch. And, we are also calling >> ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete >> trigger and that is also fine. What I don't understand is that if >> there is no "ROW- LEVEL delete trigger" and there is only a "statement >> level delete trigger" with transition table still we are making the >> entry in transition table of the delete trigger and that will never be >> used. > > Hmm, ok, that might be happening, since we are calling > ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL, > and so the deleted tuple gets captured even when there is no UPDATE > statement trigger defined, which looks redundant. Will check this. > Thanks. I found out that, in case when there is a DELETE statement trigger using transition tables, it's not only an issue of redundancy; it's a correctness issue. Since for transition tables both DELETE and UPDATE use the same old row tuplestore for capturing OLD table, that table gets duplicate rows: one from ExecARDeleteTriggers() and another from ExecARUpdateTriggers(). In presence of INSERT statement trigger using transition tables, both INSERT and UPDATE events have separate tuplestore, so duplicate rows don't show up in the UPDATE NEW table. But, nevertheless, we need to prevent NEW rows to be collected in the INSERT event tuplestore, and capture the NEW rows only in the UPDATE event tuplestore. In the attached patch, we first call ExecARUpdateTriggers(), and while doing that, we first save the info that a NEW row is already captured (mtstate->mt_transition_capture->tcs_update_old_table == true). If it captured, we pass NULL transition_capture pointer to ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not again capture an extra row. Modified a testcase in update.sql by including DELETE statement trigger that uses transition tables. ------- After commit 77b6b5e9c, the order of leaf partitions returned by RelationGetPartitionDispatchInfo() and the order of the UPDATE result rels are in the same order. Earlier, because of different orders, I had to use a hash table to search for the leaf partitions in the update result rels, so that we could re-use the per-subplan UPDATE ResultRelInfo's. Now since the order is same, in the attached patch, I have removed the hash table method, and instead, iterate over the leaf partition oids and at the same time keep shifting a position over the per-subplan resultrels whenever the resultrel at the position is found to be present in the leaf partitions list. 
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
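A sketch of the situation this fix addresses (hypothetical names; the expected count follows from the description above): an UPDATE statement trigger and a DELETE statement trigger both declare an OLD transition table, and a moved row must be captured only once, for the UPDATE.

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);

CREATE FUNCTION count_old() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE '%: old table has % row(s)', TG_OP, (SELECT count(*) FROM oldtab);
  RETURN NULL;
END; $$;

CREATE TRIGGER p_upd_stmt AFTER UPDATE ON p
  REFERENCING OLD TABLE AS oldtab
  FOR EACH STATEMENT EXECUTE PROCEDURE count_old();
CREATE TRIGGER p_del_stmt AFTER DELETE ON p
  REFERENCING OLD TABLE AS oldtab
  FOR EACH STATEMENT EXECUTE PROCEDURE count_old();

INSERT INTO p VALUES (5);

-- One row moves from p1 to p2.  With the fix, the UPDATE trigger's OLD TABLE
-- should contain exactly one row; without it, the shared old-row tuplestore
-- could end up holding the same image twice (once from the delete capture and
-- once from the update capture).
UPDATE p SET a = 15 WHERE a = 5;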
On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote: >>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> > I found out that, in case when there is a DELETE statement trigger > using transition tables, it's not only an issue of redundancy; it's a > correctness issue. Since for transition tables both DELETE and UPDATE > use the same old row tuplestore for capturing OLD table, that table > gets duplicate rows: one from ExecARDeleteTriggers() and another from > ExecARUpdateTriggers(). In presence of INSERT statement trigger using > transition tables, both INSERT and UPDATE events have separate > tuplestore, so duplicate rows don't show up in the UPDATE NEW table. > But, nevertheless, we need to prevent NEW rows to be collected in the > INSERT event tuplestore, and capture the NEW rows only in the UPDATE > event tuplestore. > > In the attached patch, we first call ExecARUpdateTriggers(), and while > doing that, we first save the info that a NEW row is already captured > (mtstate->mt_transition_capture->tcs_update_old_table == true). If it > captured, we pass NULL transition_capture pointer to > ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not > again capture an extra row. > > Modified a testcase in update.sql by including DELETE statement > trigger that uses transition tables. Ok, this fix looks correct to me, I will review the latest patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Sep 18, 2017 at 11:29 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote: >>>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>>> >> In the attached patch, we first call ExecARUpdateTriggers(), and while >> doing that, we first save the info that a NEW row is already captured >> (mtstate->mt_transition_capture->tcs_update_old_table == true). If it >> captured, we pass NULL transition_capture pointer to >> ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not >> again capture an extra row. >> >> Modified a testcase in update.sql by including DELETE statement >> trigger that uses transition tables. > > Ok, this fix looks correct to me, I will review the latest patch. Please find few more comments. + * in which they appear in the PartitionDesc. Also, extract the + * partition key columns of the root partitioned table. Those of the + * child partitions would be collected during recursive expansion. */ + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation); expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, lockmode, &root->append_rel_list, + &all_part_cols, pcinfo->all_part_cols is only used in case of update, I think we can call pull_child_partition_columns only if rte has updateCols? @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo Index parent_relid; List *child_rels; + Bitmapset *all_part_cols; } PartitionedChildRelInfo; I might be missing something, but do we really need to store all_part_cols inside the PartitionedChildRelInfo, can't we call pull_child_partition_columns directly inside inheritance_planner whenever we realize that RTE has some updateCols and we want to check the overlap? +extern void partition_walker_init(PartitionWalker *walker, Relation rel); +extern Relation partition_walker_next(PartitionWalker *walker, + Relation *parent); + I don't see these functions are used anywhere? +typedef struct PartitionWalker +{ + List *rels_list; + ListCell *cur_cell; +} PartitionWalker; + Same as above -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
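For context on why the partition key columns of every table in the tree matter (and not just the root's), a hypothetical example:

CREATE TABLE p (a int, c int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (c);
CREATE TABLE p1a PARTITION OF p1 FOR VALUES FROM (0) TO (50);
CREATE TABLE p1b PARTITION OF p1 FOR VALUES FROM (50) TO (100);

INSERT INTO p VALUES (10, 10);

-- Column c is not part of the root's partition key, but it is part of p1's key,
-- so this UPDATE can still require row movement (p1a -> p1b).  That is why
-- all_part_cols has to collect the key columns of every partitioned table in
-- the tree before deciding whether update tuple routing is needed.
UPDATE p SET c = 60 WHERE c = 10;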
On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Please find few more comments. > > + * in which they appear in the PartitionDesc. Also, extract the > + * partition key columns of the root partitioned table. Those of the > + * child partitions would be collected during recursive expansion. > */ > + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation); > expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, > lockmode, &root->append_rel_list, > + &all_part_cols, > > pcinfo->all_part_cols is only used in case of update, I think we can > call pull_child_partition_columns > only if rte has updateCols? > > @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo > > Index parent_relid; > List *child_rels; > + Bitmapset *all_part_cols; > } PartitionedChildRelInfo; > > I might be missing something, but do we really need to store > all_part_cols inside the > PartitionedChildRelInfo, can't we call pull_child_partition_columns > directly inside > inheritance_planner whenever we realize that RTE has some updateCols > and we want to > check the overlap? One thing we will have to do extra is : Open and close the partitioned rels again. The idea was that we collect the bitmap *while* we are already expanding through the tree and the rel is open. Will check if this is feasible. > > +extern void partition_walker_init(PartitionWalker *walker, Relation rel); > +extern Relation partition_walker_next(PartitionWalker *walker, > + Relation *parent); > + > > I don't see these functions are used anywhere? > > +typedef struct PartitionWalker > +{ > + List *rels_list; > + ListCell *cur_cell; > +} PartitionWalker; > + > > Same as above Yes, this was left out from the earlier implementation. Will have this removed in the next updated patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 1:15 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> Please find few more comments. >> >> + * in which they appear in the PartitionDesc. Also, extract the >> + * partition key columns of the root partitioned table. Those of the >> + * child partitions would be collected during recursive expansion. >> */ >> + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation); >> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, >> lockmode, &root->append_rel_list, >> + &all_part_cols, >> >> pcinfo->all_part_cols is only used in case of update, I think we can >> call pull_child_partition_columns >> only if rte has updateCols? >> >> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo >> >> Index parent_relid; >> List *child_rels; >> + Bitmapset *all_part_cols; >> } PartitionedChildRelInfo; >> >> I might be missing something, but do we really need to store >> all_part_cols inside the >> PartitionedChildRelInfo, can't we call pull_child_partition_columns >> directly inside >> inheritance_planner whenever we realize that RTE has some updateCols >> and we want to >> check the overlap? > > One thing we will have to do extra is : Open and close the > partitioned rels again. The idea was that we collect the bitmap > *while* we are already expanding through the tree and the rel is open. > Will check if this is feasible. Oh, I see. > >> >> +extern void partition_walker_init(PartitionWalker *walker, Relation rel); >> +extern Relation partition_walker_next(PartitionWalker *walker, >> + Relation *parent); >> + >> >> I don't see these functions are used anywhere? >> >> +typedef struct PartitionWalker >> +{ >> + List *rels_list; >> + ListCell *cur_cell; >> +} PartitionWalker; >> + >> >> Same as above > > Yes, this was left out from the earlier implementation. Will have this > removed in the next updated patch. Ok. I will continue my review thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > [ new patch ] This already fails to apply again. In general, I think it would be a good idea to break this up into a patch series rather than have it as a single patch. That would allow some bits to be applied earlier. The main patch will probably still be pretty big, but at least we can make things a little easier by getting some of the cleanup out of the way first. Specific suggestions on what to break out below. If the changes to rewriteManip.c are a marginal efficiency hack and nothing more, then let's commit this part separately before the main patch. If they're necessary for correctness, then please add a comment explaining why they are necessary. There appears to be no reason why the definitions of GetInsertedColumns() and GetUpdatedColumns() need to be moved to a header file as a result of this patch. GetUpdatedColumns() was previously defined in trigger.c and execMain.c and, post-patch, is still called from only those files. GetInsertedColumns() was, and remains, called only from execMain.c. If this were needed I'd suggest doing it as a preparatory patch before the main patch, but it seems we don't need it at all. If I understand correctly, the reason for changing mt_partitions from ResultRelInfo * to ResultRelInfo ** is that, currently, all of the ResultRelInfos for a partitioning hierarchy are allocated as a single chunk, but we can't do that and also reuse the ResultRelInfos created during InitPlan. I suggest that we do this as a preparatory patch. Someone could argue that this is going the wrong way and that we ought to instead make InitPlan() create all of the necessary ResultRelInfos, but it seems to me that eventually we probably want to allow setting up ResultRelInfos on the fly for only those partitions for which we end up needing them. The code already has some provision for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel. I don't think it's this patch's job to try to apply that kind of thing to tuple routing, but it seems like in the long run if we're inserting 1 tuple into a table with 1000 partitions, or performing 1 update that touches the partition key, it would be best not to create ResultRelInfos for all 1000 partitions just for fun. But this sort of thing seems much easier if mt_partitions is ResultRelInfo ** rather than ResultRelInfo *, so I think what you have is going in the right direction. + * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo + * does not belong to subplans, then it already matches the root tuple + * descriptor; although there is no such known scenario where this + * could happen. + */ + if (rootResultRelInfo != resultRelInfo && + mtstate->mt_persubplan_childparent_maps != NULL && + resultRelInfo >= mtstate->resultRelInfo && + resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1) + { + int map_index = resultRelInfo - mtstate->resultRelInfo; I think you should Assert() that it doesn't happen instead of assuming that it doesn't happen. IOW, remove the last two branches of the if-condition, and then add an Assert() that map_index is sane. It is not clear to me why we need both mt_perleaf_childparent_maps and mt_persubplan_childparent_maps. + * Note: if the UPDATE is converted into a DELETE+INSERT as part of + * update-partition-key operation, then this function is also called + * separately for DELETE and INSERT to capture transition table rows. + * In such case, either old tuple or new tuple can be NULL.
That seems pretty strange. I don't quite see how that's going to work correctly. I'm skeptical about the idea that the old tuple capture and new tuple capture can safely happen at different times. I wonder if we should have a reloption controlling whether update-tuple routing is enabled. I wonder how much more expensive it is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with 1000 subpartitions with this patch than without, assuming the update succeeds in both cases. I also wonder how efficient this implementation is in general. For example, suppose you make a table with 1000 partitions each containing 10,000 tuples and update them all, and consider three scenarios: (1) partition key not updated but all tuples subject to non-HOT updates because the updated column is indexed, (2) partition key updated but no tuple movement required as a result, (3) partition key updated and all tuples move to a different partition. It would be useful to compare the times, and also to look at perf profiles and see if there are any obvious sources of inefficiency that can be squeezed out. It wouldn't surprise me if tuple movement is a bit slower than the other scenarios, but it would be nice to know how much slower and whether the bottlenecks are anything that we can easily fix. I don't feel that the performance constraints for this patch should be too tight, because we're talking about being able to do something vs. not being able to do it at all, but we should try to have it not stink. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
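[As a concrete starting point for the three scenarios described above, something like the following could be timed with \timing in psql. It is scaled down to two partitions here, all object names are made up, and scenario (3) of course assumes the row-movement patch is applied.]

CREATE TABLE ptest (pk int, payload int) PARTITION BY RANGE (pk);
CREATE TABLE ptest_1 PARTITION OF ptest FOR VALUES FROM (0) TO (1000000);
CREATE TABLE ptest_2 PARTITION OF ptest FOR VALUES FROM (1000000) TO (2000000);
INSERT INTO ptest SELECT g, g FROM generate_series(0, 999998) g;

-- (1) partition key not updated, but the updates are non-HOT because the
--     updated column is indexed
CREATE INDEX ptest_1_payload ON ptest_1 (payload);
CREATE INDEX ptest_2_payload ON ptest_2 (payload);
UPDATE ptest SET payload = payload + 1;
DROP INDEX ptest_1_payload, ptest_2_payload;

-- (2) partition key updated, but every row stays in its partition
UPDATE ptest SET pk = pk + 1;

-- (3) partition key updated and every row moves to the other partition
UPDATE ptest SET pk = pk + 1000000;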
On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> [ new patch ] > > This already fails to apply again. In general, I think it would be a > good idea to break this up into a patch series rather than have it as > a single patch. That would allow some bits to be applied earlier. > The main patch will probably still be pretty big, but at least we can > make things a little easier by getting some of the cleanup out of the > way first. Specific suggestions on what to break out below. > > If the changes to rewriteManip.c are a marginal efficiency hack and > nothing more, then let's commit this part separately before the main > patch. If they're necessary for correctness, then please add a > comment explaining why they are necessary. Ok. Yes, just wanted to avoid two ConvertRowtypeExpr nodes one over the other. But that was not causing any correctness issue. Will extract these changes into separate patch. > > There appears to be no reason why the definitions of > GetInsertedColumns() and GetUpdatedColumns() need to be moved to a > header file as a result of this patch. GetUpdatedColumns() was > previously defined in trigger.c and execMain.c and, post-patch, is > still called from only those files. GetInsertedColumns() was, and > remains, called only from execMain.c. If this were needed I'd suggest > doing it as a preparatory patch before the main patch, but it seems we > don't need it at all. In earlier versions of the patch, these functions were used in nodeModifyTable.c as well. Now that those calls are not there in this file, I will revert back the changes done for moving the definitions into header file. > > If I understand correctly, the reason for changing mt_partitions from > ResultRelInfo * to ResultRelInfo ** is that, currently, all of the > ResultRelInfos for a partitioning hierarchy are allocated as a single > chunk, but we can't do that and also reuse the ResultRelInfos created > during InitPlan. I suggest that we do this as a preparatory patch. Ok, will prepare a separate patch. Do you mean to include in that patch the changes I did in ExecSetupPartitionTupleRouting() that re-use the ResultRelInfo structures of per-subplan update result rels ? > Someone could argue that this is going the wrong way and that we ought > to instead make InitPlan() create all of the necessarily > ResultRelInfos, but it seems to me that eventually we probably want to > allow setting up ResultRelInfos on the fly for only those partitions > for which we end up needing them. The code already has some provision > for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel. > I don't think it's this patch's job to try to apply that kind of thing > to tuple routing, but it seems like in the long run if we're inserting > 1 tuple into a table with 1000 partitions, or performing 1 update that > touches the partition key, it would be best not to create > ResultRelInfos for all 1000 partitions just for fun. Yes makes sense. > But this sort of > thing seems much easier of mt_partitions is ResultRelInfo ** rather > than ResultRelInfo *, so I think what you have is going in the right > direction. Ok. > > + * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo > + * does not belong to subplans, then it already matches the root tuple > + * descriptor; although there is no such known scenario where this > + * could happen. 
> + */ > + if (rootResultRelInfo != resultRelInfo && > + mtstate->mt_persubplan_childparent_maps != NULL && > + resultRelInfo >= mtstate->resultRelInfo && > + resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1) > + { > + int map_index = resultRelInfo - mtstate->resultRelInfo; > > I think you should Assert() that it doesn't happen instead of assuming > that it doesn't happen. IOW, remove the last two branches of the > if-condition, and then add an Assert() that map_index is sane. Ok. > > It is not clear to me why we need both mt_perleaf_childparent_maps and > mt_persubplan_childparent_maps. mt_perleaf_childparent_maps : This is used for converting transition-captured inserted/modified/deleted tuples from leaf to root partition, because we need to have all the ROWS in the root partition attribute order. This map is used only for tuples that are routed from root to leaf partition during INSERT, or when tuples are routed from one leaf partition to another leaf partition during update row movement. For both of these operations, we need per-leaf maps, because during tuple conversion, the source relation is among the mtstate->mt_partitions. mt_persubplan_childparent_maps : This is used at two places : 1. After an ExecUpdate() updates a row of a per-subplan update result rel, we need to capture the tuple, so again we need to convert to the root partition. Here, the source table is a per-subplan update result rel; so we need to have per-subplan conversion map array. So after UPDATE finishes with one update result rel, node->mt_transition_capture->tcs_map shifts to the next element in the mt_persubplan_childparent_maps array. : ExecModifyTable() { .... node->mt_transition_capture->tcs_map = node->mt_persubplan_childparent_maps[node->mt_whichplan]; .... } 2. In ExecInsert(), if it is part of update tuple routing, we need to convert the tuple from the update result rel to the root partition. So it re-uses this same conversion map. Now, instead of these two maps having separate allocations, I have arranged for the per-leaf map array to re-use the mapping allocations made by per-subplan array elements, similar to how we are doing for re-using the ResultRelInfos. But still the arrays themselves need to be separate. > > + * Note: if the UPDATE is converted into a DELETE+INSERT as part of > + * update-partition-key operation, then this function is also called > + * separately for DELETE and INSERT to capture transition table rows. > + * In such case, either old tuple or new tuple can be NULL. > > That seems pretty strange. I don't quite see how that's going to work > correctly. I'm skeptical about the idea that the old tuple capture > and new tuple capture can safely happen at different times. Actually the tuple capture involves just adding the tuple into the correct tuplestore for a particular event. There is no trigger event added for tuple capture. Calling ExecARUpdateTriggers() with either newtuple NULL or tupleid Invalid makes sure that it does not do anything other than transition capture : @@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo, /* If transition tables are the only reason we're here, return. 
*/ if (trigdesc == NULL || (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) || (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) || - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) || + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL))) return; Even if we imagine a single place or a single function that we could call to do the OLD and NEW row capture, still the end result is going to be the same : OLD row would go into mtstate->mt_transition_capture->tcs_old_tuplestore, and NEW row would end up in mtstate->mt_transition_capture->tcs_update_tuplestore. Note that these are common tuple stores for all the partitions of the partition tree. (Actually I am still rebasing my patch over the recent changes where tcs_update_tuplestore no longer exists; instead we need to use transition_capture->tcs_private->new_tuplestore). When we access the OLD and NEW tables for the UPDATE trigger, there is no longer a correlation as to which row of the OLD TABLE corresponds to which row of the NEW TABLE for a given updated row. So, at exactly which point the OLD row and the NEW row get captured into their respective tuplestores, and in which order, is not important. Whereas, for the usual per ROW triggers, it is critical that the trigger event has both the OLD and NEW row together in the same trigger event, since they both need to be accessible in the same trigger function. Doing the OLD and NEW TABLE row capture separately is essential because the DELETE and INSERT happen on different tables, so we are not even sure if the insert is going to happen (thanks to triggers on partitions, if any). If the insert is skipped, we should not capture that tuple. > > I wonder if we should have a reloption controlling whether > update-tuple routing is enabled. I wonder how much more expensive it > is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with > 1000 subpartitions with this patch than without, assuming the update > succeeds in both cases. You mean to check how much the patch slows down existing updates that involve no row movement? And then have a reloption to disable the logic that causes that slowdown? > > I also wonder how efficient this implementation is in general. For > example, suppose you make a table with 1000 partitions each containing > 10,000 tuples and update them all, and consider three scenarios: (1) > partition key not updated but all tuples subject to non-HOT updates > because the updated column is indexed, (2) partition key updated but > no tuple movement required as a result, (3) partition key updated and > all tuples move to a different partition. It would be useful to > compare the times, and also to look at perf profiles and see if there > are any obvious sources of inefficiency that can be squeezed out. It > wouldn't surprise me if tuple movement is a bit slower than the other > scenarios, but it would be nice to know how much slower and whether > the bottlenecks are anything that we can easily fix. Ok, yeah, that would be helpful to remove any unnecessary slowness that may have been introduced by the patch; will do. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
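[To make the transition-table behaviour being described here concrete, a minimal SQL sketch follows; all object names are invented, and the outcome described in the comment is the patch's intended behaviour as explained above, not something an unpatched server does. A statement-level AFTER UPDATE trigger with transition tables should see the old image of a moved row in the OLD TABLE and its new image in the NEW TABLE, even though the two captures happen at different points.]

CREATE TABLE tpt (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE tpt1 PARTITION OF tpt FOR VALUES FROM (0) TO (10);
CREATE TABLE tpt2 PARTITION OF tpt FOR VALUES FROM (10) TO (20);

CREATE FUNCTION report_update() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE 'OLD TABLE: % row(s), NEW TABLE: % row(s)',
        (SELECT count(*) FROM old_rows), (SELECT count(*) FROM new_rows);
    RETURN NULL;
END $$;

CREATE TRIGGER tpt_upd AFTER UPDATE ON tpt
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE report_update();

INSERT INTO tpt VALUES (1, 'x');

-- With the patch, this moves the row from tpt1 to tpt2; both tuplestores
-- should still end up with one row each.
UPDATE tpt SET a = 15 WHERE a = 1;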
On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> [ new patch ]
86 - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
87 + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
88 + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
89 return;
90 }
Exactly one of oldtup and newtup will be valid at a time. Can we improve
this check accordingly?
For e.g.:
(event == TRIGGER_EVENT_UPDATE && (HeapTupleIsValid(oldtup) ^ ItemPointerIsValid(newtup)))
247
248 + /*
249 + * EDB: In case this is part of update tuple routing, put this row into the
250 + * transition NEW TABLE if we are capturing transition tables. We need to
251 + * do this separately for DELETE and INSERT because they happen on
252 + * different tables.
253 + */
254 + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
255 + ExecARUpdateTriggers(estate, resultRelInfo, NULL,
256 + NULL,
257 + tuple,
258 + NULL,
259 + mtstate->mt_transition_capture);
260 +
261 list_free(recheckIndexes);
267
268 + /*
269 + * EDB: In case this is part of update tuple routing, put this row into the
270 + * transition OLD TABLE if we are capturing transition tables. We need to
271 + * do this separately for DELETE and INSERT because they happen on
272 + * different tables.
273 + */
274 + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
275 + ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
276 + oldtuple,
277 + NULL,
278 + NULL,
279 + mtstate->mt_transition_capture);
280 +
Initially, I wondered why we can't have the above code right after
ExecInsert() & ExecDelete() in ExecUpdate(), respectively.
We can do that for ExecDelete(), but not easily in the ExecInsert() case,
because ExecInsert() internally searches for the correct partition's resultRelInfo
for the insert, and resultRelInfo is restored to the old one before returning
to ExecUpdate(). That's why the current logic seems reasonable for now.
Is there anything that we can do?
Regards,
Amul
I have extracted a couple of changes into preparatory patches, as explained below : On 20 September 2017 at 21:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> [ new patch ] >> >> This already fails to apply again. In general, I think it would be a >> good idea to break this up into a patch series rather than have it as >> a single patch. That would allow some bits to be applied earlier. >> The main patch will probably still be pretty big, but at least we can >> make things a little easier by getting some of the cleanup out of the >> way first. Specific suggestions on what to break out below. >> >> If the changes to rewriteManip.c are a marginal efficiency hack and >> nothing more, then let's commit this part separately before the main >> patch. If they're necessary for correctness, then please add a >> comment explaining why they are necessary. > > Ok. Yes, just wanted to avoid two ConvertRowtypeExpr nodes one over > the other. But that was not causing any correctness issue. Will > extract these changes into separate patch. The patch for the above change is : 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch > >> >> There appears to be no reason why the definitions of >> GetInsertedColumns() and GetUpdatedColumns() need to be moved to a >> header file as a result of this patch. GetUpdatedColumns() was >> previously defined in trigger.c and execMain.c and, post-patch, is >> still called from only those files. GetInsertedColumns() was, and >> remains, called only from execMain.c. If this were needed I'd suggest >> doing it as a preparatory patch before the main patch, but it seems we >> don't need it at all. > > In earlier versions of the patch, these functions were used in > nodeModifyTable.c as well. Now that those calls are not there in this > file, I will revert back the changes done for moving the definitions > into header file. Did the above , and included in the attached revised patch update-partition-key_v19.patch. > >> >> If I understand correctly, the reason for changing mt_partitions from >> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the >> ResultRelInfos for a partitioning hierarchy are allocated as a single >> chunk, but we can't do that and also reuse the ResultRelInfos created >> during InitPlan. I suggest that we do this as a preparatory patch. > > Ok, will prepare a separate patch. Do you mean to include in that > patch the changes I did in ExecSetupPartitionTupleRouting() that > re-use the ResultRelInfo structures of per-subplan update result rels > ? Above changes are in attached 0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch. Patches are to be applied in this order : 0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch update-partition-key_v19.patch -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 21 September 2017 at 19:52, amul sul <sulamul@gmail.com> wrote: > On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar <amitdkhan.pg@gmail.com> > wrote: >> >> On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote: >> > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> >> > wrote: >> >> [ new patch ] > > > 86 - (event == TRIGGER_EVENT_UPDATE && > !trigdesc->trig_update_after_row)) > 87 + (event == TRIGGER_EVENT_UPDATE && > !trigdesc->trig_update_after_row) || > 88 + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup > == NULL))) > 89 return; > 90 } > > > Either of oldtup or newtup will be valid at a time & vice versa. Can we > improve > this check accordingly? > > For e.g.: > (event == TRIGGER_EVENT_UPDATE && )(HeapTupleIsValid(oldtup) ^ > ItemPointerIsValid(newtup))))) Ok, I will be doing this as below : - (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL))) + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) At other places in the function, oldtup and newtup are checked for NULL, so to be consistent, I haven't used HeapTupleIsValid. Actually, it won't happen that both oldtup and newtup are NULL ... in either of delete, insert, or update, but I haven't added an Assert for this, because that has been true even on HEAD. Will include the above minor change in the next patch when more changes come in. > > > 247 > 248 + /* > 249 + * EDB: In case this is part of update tuple routing, put this row > into the > 250 + * transition NEW TABLE if we are capturing transition tables. We > need to > 251 + * do this separately for DELETE and INSERT because they happen on > 252 + * different tables. > 253 + */ > 254 + if (mtstate->operation == CMD_UPDATE && > mtstate->mt_transition_capture) > 255 + ExecARUpdateTriggers(estate, resultRelInfo, NULL, > 256 + NULL, > 257 + tuple, > 258 + NULL, > 259 + mtstate->mt_transition_capture); > 260 + > 261 list_free(recheckIndexes); > > 267 > 268 + /* > 269 + * EDB: In case this is part of update tuple routing, put this row > into the > 270 + * transition OLD TABLE if we are capturing transition tables. We > need to > 271 + * do this separately for DELETE and INSERT because they happen on > 272 + * different tables. > 273 + */ > 274 + if (mtstate->operation == CMD_UPDATE && > mtstate->mt_transition_capture) > 275 + ExecARUpdateTriggers(estate, resultRelInfo, tupleid, > 276 + oldtuple, > 277 + NULL, > 278 + NULL, > 279 + mtstate->mt_transition_capture); > 280 + > > Initially, I wondered that why can't we have above code right after > ExecInsert() & ExecIDelete() in ExecUpdate respectively? > > We can do that for ExecIDelete() but not easily in the ExecInsert() case, > because ExecInsert() internally searches the correct partition's > resultRelInfo > for an insert and before returning to ExecUpdate resultRelInfo is restored > to the old one. That's why current logic seems to be reasonable for now. > Is there anything that we can do? Yes, resultRelInfo is different when we return from ExecInsert(). Also, I think the trigger and transition capture be done immediately after the rows are inserted. This is true for existing code also. Furthermore, there is a dependency of ExecARUpdateTriggers() on ExecARInsertTriggers(). transition_capture is passed NULL if we already captured the tuple in ExecARUpdateTriggers(). It looks simpler to do all this at a single place. 
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
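[The point above about not capturing a tuple when the insert is skipped can be illustrated at the SQL level with a sketch like the following; names are invented and the outcome described is the patch's intent as explained in this thread. A BEFORE INSERT row trigger on the destination partition can suppress the insert, in which case no NEW-table row should be captured for the moved tuple.]

CREATE TABLE spt (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE spt1 PARTITION OF spt FOR VALUES FROM (0) TO (10);
CREATE TABLE spt2 PARTITION OF spt FOR VALUES FROM (10) TO (20);

CREATE FUNCTION skip_insert() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RETURN NULL; END $$;

CREATE TRIGGER spt2_skip BEFORE INSERT ON spt2
    FOR EACH ROW EXECUTE PROCEDURE skip_insert();

INSERT INTO spt VALUES (1, 'x');

-- The DELETE part of the row movement happens on spt1, but the BEFORE trigger
-- cancels the INSERT into spt2, so the row simply disappears and should not be
-- added to the transition NEW TABLE.
UPDATE spt SET a = 15 WHERE a = 1;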
Below are some performance figures. Overall, there does not appear to be a noticeable difference in the figures in partition key updates with and without row movement (which is surprising), and non-partition-key updates with and without the patch. All the values are in milliseconds. Configuration : shared_buffers = 8GB maintenance_work_mem = 4GB synchronous_commit = off checkpoint_timeout = 15min checkpoint_completion_target = 0.9 log_line_prefix = '%t [%p] ' max_wal_size = 5GB max_connections = 200 The attached files were used to create a partition tree made up of 16 partitioned tables, each containing 125 partitions. First half of the 2000 partitions are filled with 10 million rows. Update row movement moves the data to the other half of the partitions. gen.sql : Creates the partitions. insert.data : This data file is uploaded here [1]. Used "COPY ptab from '$PWD/insert.data' " index.sql : Optionally, Create index on column d. The schema looks like this : CREATE TABLE ptab (a date, b int, c int, d int) PARTITION BY RANGE (a, b); CREATE TABLE ptab_1_1 PARTITION OF ptab for values from ('1900-01-01', 1) to ('1900-01-01', 7501) PARTITION BY range (c); CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1 for values from (1) to (81); CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1 for values from (81) to (161); .......... .......... CREATE TABLE ptab_1_2 PARTITION OF ptab for values from ('1900-01-01', 7501) to ('1900-01-01', 15001) PARTITION BY range (c); .......... .......... On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote: > I wonder how much more expensive it > is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with > 1000 subpartitions with this patch than without, assuming the update > succeeds in both cases. UPDATE query used : UPDATE ptab set d = d + 1 where d = 1; -- where d is not a partition key of any of the partitions. This query updates 8 rows out of 10 million rows. With HEAD : 2953.691 , 2862.298 , 2855.286 , 2835.879 (avg : 2876) With Patch : 2933.719 , 2832.463 , 2749.979 , 2820.416 (avg : 2834) (All the values are in milliseconds.) > suppose you make a table with 1000 partitions each containing > 10,000 tuples and update them all, and consider three scenarios: (1) > partition key not updated but all tuples subject to non-HOT updates > because the updated column is indexed, (2) partition key updated but > no tuple movement required as a result, (3) partition key updated and > all tuples move to a different partition. Note that the following figures do not represent a consistent set of figures. They keep on varying. For e.g. , even though the partition-key-update without row movement appears to have taken a bit more time with patch than with HEAD, a new set of tests run might even end up the other way round. NPK : 42089 (patch) NPKI : 81593 (patch) PK : 45250 (patch) , 44944 (HEAD) PKR : 46701 (patch) The above figures are in milliseconds. The explanations of the above short-forms : NPK : Update of column that is not a partition-key. UPDATE query used : UPDATE ptab set d = d + 1 ; This update *all* rows. NPKI : Update of column that is not a partition-key. And this column is indexed (Used attached file index.sql). UPDATE query used : UPDATE ptab set d = d + 1 ; This update *all* rows. PK : Update of partition key, but row movement does not occur. There are no indexed columns. UPDATE query used : UPDATE ptab set a = a + '1 hour'::interval ; PKR : Update of partition key, with all rows moved to other partitions. There are no indexed columns. 
UPDATE query used : UPDATE ptab set a = a + '2 years'::interval ; [1] https://drive.google.com/open?id=0B_YJCqIAxKjeN3hMXzdDejlNYmlpWVJpaU9mWUhFRVhXTG5Z -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
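[Since gen.sql itself is an attachment and not quoted here, the following is only a rough sketch of how one of the sixteen second-level partitioned tables could have its leaf partitions generated, assuming the ptab and ptab_1_1 definitions shown above; the leaf count of 125 and the width of 80 per leaf are guesses based on the quoted ranges.]

DO $$
DECLARE
    i int;
BEGIN
    FOR i IN 0..124 LOOP
        EXECUTE format(
            'CREATE TABLE ptab_1_1_%s PARTITION OF ptab_1_1 FOR VALUES FROM (%s) TO (%s)',
            i + 1, 1 + i * 80, 1 + (i + 1) * 80);
    END LOOP;
END
$$;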
On Wed, Sep 13, 2017 at 4:24 PM, amul sul <sulamul@gmail.com> wrote: > > > On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote: >> > On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> >> > wrote: >> >> >> >> On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> >> >> wrote: >> >> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila >> >> > <amit.kapila16@gmail.com> >> >> > wrote: >> >> >> I think we can do this even without using an additional infomask >> >> >> bit. >> >> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to >> >> >> indicate such an update. >> >> > >> >> > Hmm. How would that work? >> >> > >> >> >> >> We can pass a flag say row_moved (or require_row_movement) to >> >> heap_delete which will in turn set InvalidBlockId in ctid instead of >> >> setting it to self. Then the ExecUpdate needs to check for the same >> >> and return an error when heap_update is not successful (result != >> >> HeapTupleMayBeUpdated). Can you explain what difficulty are you >> >> envisioning? >> >> >> > >> > Attaching WIP patch incorporates the above logic, although I am yet to >> > check >> > all the code for places which might be using ip_blkid. I have got a >> > small >> > query here, >> > do we need an error on HeapTupleSelfUpdated case as well? >> > >> >> No, because that case is anyway a no-op (or error depending on whether >> is updated/deleted by same command or later command). Basically, even >> if the row wouldn't have been moved to another partition, we would not >> have allowed the command to proceed with the update. This handling is >> to make commands fail rather than a no-op where otherwise (when the >> tuple is not moved to another partition) the command would have >> succeeded. >> > Thank you. > > I've rebased patch against Amit Khandekar's latest patch (v17_rebased_2). > Also, added ip_blkid validation check in heap_get_latest_tid(), rewrite_heap_tuple() > & rewrite_heap_tuple() function, because only ItemPointerEquals() check is no > longer sufficient after this patch. FYI, I have posted this patch in a separate thread : https://postgr.es/m/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com Regards, Amul -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
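[A two-session SQL sketch of the concurrent case this ip_blkid/InvalidBlockId change is aimed at; table names are invented, and the error described at the end is the intended behaviour being discussed here, not something an unpatched server does.]

CREATE TABLE cpt (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE cpt1 PARTITION OF cpt FOR VALUES FROM (0) TO (10);
CREATE TABLE cpt2 PARTITION OF cpt FOR VALUES FROM (10) TO (20);
INSERT INTO cpt VALUES (1, 'x');

-- Session A:
BEGIN;
UPDATE cpt SET a = 15 WHERE a = 1;   -- row moves from cpt1 to cpt2 (DELETE + INSERT)

-- Session B (blocks on the row lock held by session A):
UPDATE cpt SET b = 'y' WHERE a = 1;

-- Session A:
COMMIT;

-- With ip_blkid set to InvalidBlockId by the DELETE, session B can tell that
-- its target row was moved (not merely deleted) and raise an error instead of
-- silently updating nothing.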
On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > The patch for the above change is : > 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch Thinking about this a little more, I'm wondering about how this case arises. I think that for this patch to avoid multiple conversions, we'd have to be calling map_variable_attnos on an expression and then calling map_variable_attnos on that expression again. >>> If I understand correctly, the reason for changing mt_partitions from >>> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the >>> ResultRelInfos for a partitioning hierarchy are allocated as a single >>> chunk, but we can't do that and also reuse the ResultRelInfos created >>> during InitPlan. I suggest that we do this as a preparatory patch. >> >> Ok, will prepare a separate patch. Do you mean to include in that >> patch the changes I did in ExecSetupPartitionTupleRouting() that >> re-use the ResultRelInfo structures of per-subplan update result rels >> ? > > Above changes are in attached > 0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch. No, not all of those changes. Just the adjustments to make ModifyTableState's mt_partitions be of type ResultRelInfo ** rather than ResultRelInfo *, and anything closely related to that. Not, for example, the num_update_rri stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 30 September 2017 at 01:26, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 29, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> The patch for the above change is : >>> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch >> >> Thinking about this a little more, I'm wondering about how this case >> arises. I think that for this patch to avoid multiple conversions, >> we'd have to be calling map_variable_attnos on an expression and then >> calling map_variable_attnos on that expression again. We are not calling map_variable_attnos() twice. The first time it calls, there is already the ConvertRowtypeExpr node if the expression is a whole row var. This node is already added from adjust_appendrel_attrs(). So the conversion is done by two different functions. For ConvertRowtypeExpr, map_variable_attnos_mutator() recursively calls map_variable_attnos_mutator() for ConvertRowtypeExpr->arg with coerced_var=true. > > I guess I didn't quite finish this thought, sorry. Maybe it's > obvious, but the point I was going for is: why would we do that, vs. > just converting once? The first time ConvertRowtypeExpr node gets added in the expression is when adjust_appendrel_attrs() is called for each of the child tables. Here, for each of the child table, when the parent parse tree is converted into the child parse tree, the whole row var (in RETURNING or WITH CHECK OPTIONS expr) is wrapped with ConvertRowtypeExpr(), so child parse tree (or the child WCO expr) has this ConvertRowtypeExpr node. The second time this node is added is during update-tuple-routing in ExecInitModifyTable(), when map_partition_varattnos() is called for each of the partitions to convert from the first per-subplan RETURNING/WCO expression to the RETURNING/WCO expression belonging to the leaf partition. This second conversion happens for the leaf partitions which are not already present in per-subplan UPDATE result rels. So the first conversion is from parent to child while building per-subplan plans, and the second is from first per-subplan child to another child for building expressions of the leaf partitions. So suppose the root partitioned table RETURNING expression is a whole row var wr(r) where r is its composite type representing the root table type. Then, one of its UPDATE child tables will have its RETURNING expression converted like this : wr(r) ===> CRE(r) -> wr(c1) where CRE(r) represents ConvertRowtypeExpr of result type r, which has its arg pointing to wr(c1) which is a whole row var of composite type c1 for the child table c1. So this node converts from composite type of child table to composite type of root table. Now, when the second conversion occurs for the leaf partition (i.e. during update-tuple-routing), the conversion looks like this : CRE(r) -> wr(c1) ===> CRE(r) -> wr(c2) But W/o the 0002*ConvertRowtypeExpr*.patch the conversion would have looked like this : CRE(r) -> wr(c1) ===> CRE(r) -> CRE(c1) -> wr(c2) In short, we omit the intermediate CRE(c1) node. While writing this down, I observed that after multi-level partition tree expansion was introduced, the child table expressions are not converted directly from the root. Instead, they are converted from their immediate parent. So there is a chain of conversions : to leaf from its parent, to that parent from its parent, and so on from the root. 
Effectively, during the first conversion, there are that many ConvertRowtypeExpr nodes, one above the other, already present in the UPDATE result rel expressions. But my patch handles the optimization only for the leaf partition conversions.
If the expression already has a CRE : CRE(rr) -> wr(r)
Parent-to-child conversion : CRE(p) -> wr(r) ===> CRE(rr) -> CRE(r) -> wr(c1)
With the patch : CRE(rr) -> CRE(r) -> wr(c1) ===> CRE(rr) -> CRE(r) -> wr(c2)
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
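[For anyone trying to reproduce where these whole-row vars and ConvertRowtypeExpr chains come from, here is a minimal multi-level example; names are invented and the UPDATE that moves the row assumes the patch. Referencing the table name itself in RETURNING produces a whole-row var of the root's rowtype, which has to be built from a leaf tuple and converted up through each intermediate rowtype.]

CREATE TABLE rt (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE mt PARTITION OF rt FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (a);
CREATE TABLE lt1 PARTITION OF mt FOR VALUES FROM (0) TO (50);
CREATE TABLE lt2 PARTITION OF mt FOR VALUES FROM (50) TO (100);
INSERT INTO rt VALUES (10, 'x');

-- rt in the RETURNING list is a whole-row reference to the root's rowtype
UPDATE rt SET a = 60 WHERE a = 10 RETURNING rt;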
On Tue, Oct 3, 2017 at 8:16 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > While writing this down, I observed that after multi-level partition > tree expansion was introduced, the child table expressions are not > converted directly from the root. Instead, they are converted from > their immediate parent. So there is a chain of conversions : to leaf > from its parent, to that parent from its parent, and so on from the > root. Effectively, during the first conversion, there are that many > ConvertRowtypeExpr nodes one above the other already present in the > UPDATE result rel expressions. But my patch handles the optimization > only for the leaf partition conversions. > > If already has CRE : CRE(rr) -> wr(r) > Parent-to-child conversion ::: CRE(p) -> wr(r) ===> CRE(rr) -> > CRE(r) -> wr(c1) > W patch : CRE(rr) -> CRE(r) -> wr(c1) ===> CRE(rr) -> CRE(r) -> wr(c2) Maybe adjust_appendrel_attrs() should have a similar provision for avoiding extra ConvertRowTypeExpr nodes? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 30 September 2017 at 01:23, Robert Haas <robertmhaas@gmail.com> wrote: >>>> If I understand correctly, the reason for changing mt_partitions from >>>> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the >>>> ResultRelInfos for a partitioning hierarchy are allocated as a single >>>> chunk, but we can't do that and also reuse the ResultRelInfos created >>>> during InitPlan. I suggest that we do this as a preparatory patch. >>> >>> Ok, will prepare a separate patch. Do you mean to include in that >>> patch the changes I did in ExecSetupPartitionTupleRouting() that >>> re-use the ResultRelInfo structures of per-subplan update result rels >>> ? >> >> Above changes are in attached >> 0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch. > > No, not all of those changes. Just the adjustments to make > ModifyTableState's mt_partitions be of type ResultRelInfo ** rather > than ResultRelInfo *, and anything closely related to that. Not, for > example, the num_update_rri stuff. Ok. Attached is the patch modified to have changes only to handle array of ResultRelInfo * instead of array of ResultRelInfo. ------- On 4 October 2017 at 01:08, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Oct 3, 2017 at 8:16 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> While writing this down, I observed that after multi-level partition >> tree expansion was introduced, the child table expressions are not >> converted directly from the root. Instead, they are converted from >> their immediate parent. So there is a chain of conversions : to leaf >> from its parent, to that parent from its parent, and so on from the >> root. Effectively, during the first conversion, there are that many >> ConvertRowtypeExpr nodes one above the other already present in the >> UPDATE result rel expressions. But my patch handles the optimization >> only for the leaf partition conversions. > > Maybe adjust_appendrel_attrs() should have a similar provision for > avoiding extra ConvertRowTypeExpr nodes? Yeah, I think we should be able to do that. Will check. ------ On 19 September 2017 at 13:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> Please find few more comments. >> >> + * in which they appear in the PartitionDesc. Also, extract the >> + * partition key columns of the root partitioned table. Those of the >> + * child partitions would be collected during recursive expansion. >> */ >> + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation); >> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, >> lockmode, &root->append_rel_list, >> + &all_part_cols, >> >> pcinfo->all_part_cols is only used in case of update, I think we can >> call pull_child_partition_columns >> only if rte has updateCols? >> >> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo >> >> Index parent_relid; >> List *child_rels; >> + Bitmapset *all_part_cols; >> } PartitionedChildRelInfo; >> >> I might be missing something, but do we really need to store >> all_part_cols inside the >> PartitionedChildRelInfo, can't we call pull_child_partition_columns >> directly inside >> inheritance_planner whenever we realize that RTE has some updateCols >> and we want to >> check the overlap? > > One thing we will have to do extra is : Open and close the > partitioned rels again. The idea was that we collect the bitmap > *while* we are already expanding through the tree and the rel is open. > Will check if this is feasible. 
While giving more thought on this suggestion of Dilip's, I found out that pull_child_partition_columns() is getting called with child_rel and its immediate parent. That means, it maps the child rel attributes to its immediate parent. If that immediate parent is not the root partrel, then the conversion is not sufficient. We need to map child rel attnos to root partrel attnos. So for a partition tree with 3 or more levels, with the bottom partitioned rel having different att ordering than the root, this will not work. Before the commit that enabled recursive multi-level partition tree expansion, pull_child_partition_columns() was always getting called with child_rel and root rel. So this issue crept up when I rebased over this commit, overlooking the fact that parent rel is the immediate parent, not the root parent. Anyways, I think Dilip's suggestion makes sense : we can do the finding-all-part-cols work separately in inheritance_planner() using the partitioned_rels handle. Re-opening the partitioned tables should be cheap, because they have already been opened earlier, so they are available in relcache. So did this as he suggested using new function get_all_partition_cols(). While doing that, I have ensured that we use the root rel to map all the child rel attnos. So the above issue is fixed now. Also added test scenarios that test the above issue. Namely, made the partition tree 3 level, and added some specific scenarios where it used to wrongly error out without trying to move the tuple, because it determined partition-key is not updated. --------- Though we re-use the update result rels, the WCO and Returning expressions were not getting re-used from those update result rels. This check was missing : @@ -2059,7 +2380,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) for (i = 0; i < mtstate->mt_num_partitions; i++) { Relation partrel; List *rlist; resultRelInfo = mtstate->mt_partitions[i]; + + /* + * If we are referring to a resultRelInfo from one of the update + * result rels, that result rel would already have a returningList + * built. + */ + if (resultRelInfo->ri_projectReturning) + continue; + partrel = resultRelInfo->ri_RelationDesc; Added this check in the patch. ---------- On 22 September 2017 at 16:13, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 21 September 2017 at 19:52, amul sul <sulamul@gmail.com> wrote: >> >> 86 - (event == TRIGGER_EVENT_UPDATE && >> !trigdesc->trig_update_after_row)) >> 87 + (event == TRIGGER_EVENT_UPDATE && >> !trigdesc->trig_update_after_row) || >> 88 + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup >> == NULL))) >> 89 return; >> 90 } >> >> >> Either of oldtup or newtup will be valid at a time & vice versa. Can we >> improve >> this check accordingly? >> >> For e.g.: >> (event == TRIGGER_EVENT_UPDATE && )(HeapTupleIsValid(oldtup) ^ >> ItemPointerIsValid(newtup))))) > >Ok, I will be doing this as below : >- (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL))) >+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) Have done this in the attached patch. -------- Attached are these patches : Preparatory patches : 0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch Main patch : update-partition-key_v20.patch Thanks -Amit Khandekar -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
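[The kind of scenario the new tests cover can be reproduced at the SQL level roughly as follows; names are invented. A three-level tree in which the intermediate parent is attached with a different column order than the root means that mapping the updated columns only to the immediate parent's attnos would miss that b is a partition key further down.]

CREATE TABLE r3 (a int, b int, c int) PARTITION BY RANGE (a);
CREATE TABLE m3 (c int, b int, a int) PARTITION BY RANGE (b);   -- different attribute order
ALTER TABLE r3 ATTACH PARTITION m3 FOR VALUES FROM (0) TO (100);
CREATE TABLE l3_1 PARTITION OF m3 FOR VALUES FROM (0) TO (50);
CREATE TABLE l3_2 PARTITION OF m3 FOR VALUES FROM (50) TO (100);
INSERT INTO r3 VALUES (1, 1, 1);

-- b is a partition key column only at the second level; with the fix this is
-- recognized as a partition-key update and the row is moved from l3_1 to l3_2
-- instead of erroring out.
UPDATE r3 SET b = 75 WHERE a = 1;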
On Wed, Oct 4, 2017 at 9:51 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Preparatory patches : > 0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch > 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch > Main patch : > update-partition-key_v20.patch Committed 0001 with a few tweaks and 0002 unchanged. Please check whether everything looks OK. Is anybody still reviewing the main patch here? (It would be good if the answer is "yes".) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/10/13 6:18, Robert Haas wrote: > Is anybody still reviewing the main patch here? (It would be good if > the answer is "yes".) I am going to try to look at the latest version over the weekend and early next week. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi Amit. On 2017/10/04 22:51, Amit Khandekar wrote: > Main patch : > update-partition-key_v20.patch Guess you're already working on it but the patch needs a rebase. A couple of hunks in the patch to execMain.c and nodeModifyTable.c fail. Meanwhile a few comments: +void +pull_child_partition_columns(Bitmapset **bitmapset, + Relation rel, + Relation parent) Nitpick: don't we normally list the output argument(s) at the end? Also, "bitmapset" could be renamed to something that conveys what it contains? + if (partattno != 0) + child_keycols = + bms_add_member(child_keycols, + partattno - FirstLowInvalidHeapAttributeNumber); + } + foreach(lc, partexprs) + { Elsewhere (in quite a few places), we don't iterate over partexprs separately like this, although I'm not saying it is bad, just different from other places. + * the transition tuplestores can be built. Furthermore, if the transition + * capture is happening for UPDATEd rows being moved to another partition due + * partition-key change, then this function is called once when the row is + * deleted (to capture OLD row), and once when the row is inserted to another + * partition (to capture NEW row). This is done separately because DELETE and + * INSERT happen on different tables. Extra space at the beginning from the 2nd line onwards. + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) Is there some reason why a bitwise operator is used here? + * 'update_rri' has the UPDATE per-subplan result rels. Could you explain why they are being received as input here? + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects + * with on entry for every leaf partition (required to convert input tuple + * based on the root table's rowtype to a leaf partition's rowtype after + * tuple routing is done) Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that they are maps needed for "tuple conversion". And the other field holding the reverse map as leaf_rev_tupconv_maps. Either that or use underscores to separate words, but then it gets too long I guess. + tuple = ConvertPartitionTupleSlot(mtstate, + mtstate->mt_perleaf_parentchild_maps[leaf_part_index], + The 2nd line here seems to have gone over 80 characters. ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex interface. I guess it could simply have the following interface: static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate, HeapTuple tuple, boolis_update); And figure out, based on the value of is_update, which map to use and which slot to set *p_new_slot to (what is now "new_slot" argument). You're getting mtstate here anyway, which contains all the information you need here. It seems better to make that (selecting which map and which slot) part of the function's implementation if we're having this function at all, imho. Maybe I'm missing some details there, but my point still remains that we should try to put more logic in that function instead of it just do the mechanical tuple conversion. + * We have already checked partition constraints above, so skip them + * below. How about: ", so skip checking here."? ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to try to reuse the per-subplan child-to-parent map as per-leaf child-to-parent map could be simplified a bit. I mean the following code: + /* + * But for Updates, we can share the per-subplan maps with the per-leaf + * maps. 
+ */ + update_rri_index = 0; + update_rri = mtstate->resultRelInfo; + if (mtstate->mt_nplans > 0) + cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc); - /* Choose the right set of partitions */ - if (mtstate->mt_partition_dispatch_info != NULL) + for (i = 0; i < numResultRelInfos; ++i) + { <snip> How about (pseudo-code): j = 0;for (i = 0; i < n_leaf_parts; i++){ if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid) { leaf_childparent_map[i]= subplan_childparent_map[j]; j++; } else { leaf_childparent_map[i] = new map }} I think the above would also be useful in ExecSetupPartitionTupleRouting() where you've added similar code to try to reuse per-subplan ResultRelInfos. In ExecInitModifyTable(), can we try to minimize the number of places where update_tuple_routing_needed is being set. Currently, it's being set in 3 places: + bool update_tuple_routing_needed = node->part_cols_updated; & + /* + * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may + * need to do update tuple routing. + */ + if (resultRelInfo->ri_TrigDesc && + resultRelInfo->ri_TrigDesc->trig_update_before_row && + operation == CMD_UPDATE) + update_tuple_routing_needed = true; & + /* Decide whether we need to perform update tuple routing. */ + if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) + update_tuple_routing_needed = false; In the following: ExecSetupPartitionTupleRouting(rel, + (operation == CMD_UPDATE ? + mtstate->resultRelInfo : NULL), + (operation == CMD_UPDATE ? nplans : 0), Can the second parameter be made to not span two lines? It was a bit hard for me to see that there two new parameters. + * Construct mapping from each of the resultRelInfo attnos to the root Maybe it's odd to say "resultRelInfo attno", because it's really the underlying partition whose attnos we're talking about as being possibly different from the root table's attnos. + * descriptor. In such case we need to convert tuples to the root s/In such case/In such a case,/ By the way, I've seen in a number of places that the patch calls "root table" a partition. Not just in comments, but also a variable appears to be given a name which contains rootpartition. I can see only one instance where root is called a partition in the existing source code, but it seems to have been introduced only recently: allpaths.c:1333: * A root partition will already have a + * qual for each partition. Note that, if there are SubPlans in there, + * they all end up attached to the one parent Plan node. The sentence starting with "Note that, " is a bit unclear. + Assert(update_tuple_routing_needed || + (operation == CMD_INSERT && + list_length(node->withCheckOptionLists) == 1 && + mtstate->mt_nplans == 1)); The comment I complained about above is perhaps about this Assert. - List *mapped_wcoList; + List *mappedWco; Not sure why this rename. After this rename, it's now inconsistent with the code above which handles non-partitioned case, which still calls it wcoList. Maybe, because you introduced firstWco and then this line: + firstWco = linitial(node->withCheckOptionLists); but note that each member of node->withCheckOptionLists is also a list, so the original naming. Also, further below, you're assigning mappedWco to a List * field. + resultRelInfo->ri_WithCheckOptions = mappedWco; Comments on the optimizer changes: +get_all_partition_cols(List *rtables, Did you mean rtable? get_all_partition_cols() seems to go over the rtable as many times as there are partitioned tables in the tree. Is there a way to do this work somewhere else? 
Maybe when the partitioned_rels list is built in the first place. But that would require us to make changes to extract partition columns in some place (prepunion.c) where it's hard to justify why it's being done there at all. + get_all_partition_cols(root->parse->rtable, top_parentRTindex, + partitioned_rels, &all_part_cols); Two more spaces needed on the 2nd line. + * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset + * of all partitioning columns used by the partitioned table or any + * descendent. + * Dead comment? Aha, so here's where all_part_cols was being set before... + TupleTableSlot *mt_rootpartition_tuple_slot; I guess I was complaining about this field where you call root a partition. Maybe, mt_root_tuple_slot would suffice. Thanks again for working on this. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Amit. > > On 2017/10/04 22:51, Amit Khandekar wrote: >> Main patch : >> update-partition-key_v20.patch > > Guess you're already working on it but the patch needs a rebase. A couple > of hunks in the patch to execMain.c and nodeModifyTable.c fail. Thanks for taking up this review Amit. Attached is the rebased version. Will get back on your review comments and updated patch soon. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > + * the transition tuplestores can be built. Furthermore, if the transition > + * capture is happening for UPDATEd rows being moved to another > partition due > + * partition-key change, then this function is called once when the row is > + * deleted (to capture OLD row), and once when the row is inserted to > another > + * partition (to capture NEW row). This is done separately because > DELETE and > + * INSERT happen on different tables. > > Extra space at the beginning from the 2nd line onwards. Just observed that the existing comment lines use tab instead of spaces. I have now used tab for the new comments, instead of the multiple spaces. > > + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup > == NULL)))) > > Is there some reason why a bitwise operator is used here? That exact condition means that the function is called for transition capture for updated rows being moved to another partition. For this scenario, either the oldtup or the newtup is NULL. I wanted to exactly capture that condition there. I think the bitwise operator is more user-friendly in emphasizing the point that it is indeed an "either a or b, not both" condition. > > + * 'update_rri' has the UPDATE per-subplan result rels. > > Could you explain why they are being received as input here? Added the explanation in the comments. > > + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects > + * with on entry for every leaf partition (required to convert input > tuple > + * based on the root table's rowtype to a leaf partition's rowtype after > + * tuple routing is done) > > Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that > they are maps needed for "tuple conversion". And the other field holding > the reverse map as leaf_rev_tupconv_maps. Either that or use underscores > to separate words, but then it gets too long I guess. In master branch, now this param is already there with the name "tup_conv_maps". In the rebased version in the earlier mail, I haven't again changed it. I think "tup_conv_maps" looks clear enough. > > > + tuple = ConvertPartitionTupleSlot(mtstate, > + > mtstate->mt_perleaf_parentchild_maps[leaf_part_index], > + > > The 2nd line here seems to have gone over 80 characters. > > ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex > interface. I guess it could simply have the following interface: > > static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate, > HeapTuple tuple, bool is_update); > > And figure out, based on the value of is_update, which map to use and > which slot to set *p_new_slot to (what is now "new_slot" argument). > You're getting mtstate here anyway, which contains all the information you > need here. It seems better to make that (selecting which map and which > slot) part of the function's implementation if we're having this function > at all, imho. Maybe I'm missing some details there, but my point still > remains that we should try to put more logic in that function instead of > it just do the mechanical tuple conversion. I tried to see how the interface would look if we do that way. 
Here is how the code looks :

static TupleTableSlot *
ConvertPartitionTupleSlot(ModifyTableState *mtstate,
                          bool for_update_tuple_routing,
                          int map_index,
                          HeapTuple *tuple,
                          TupleTableSlot *slot)
{
    TupleConversionMap *map;
    TupleTableSlot *new_slot;

    if (for_update_tuple_routing)
    {
        map = mtstate->mt_persubplan_childparent_maps[map_index];
        new_slot = mtstate->mt_rootpartition_tuple_slot;
    }
    else
    {
        map = mtstate->mt_perleaf_parentchild_maps[map_index];
        new_slot = mtstate->mt_partition_tuple_slot;
    }

    if (!map)
        return slot;

    *tuple = do_convert_tuple(*tuple, map);

    /*
     * Change the partition tuple slot descriptor, as per converted tuple.
     */
    ExecSetSlotDescriptor(new_slot, map->outdesc);
    ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);

    return new_slot;
}

It looks like the interface does not simplify much, and on top of that, we
end up with more lines in that function. Also, the caller anyway has to be
aware of whether the map_index is an index into the leaf partitions or into
the update subplans. So it is not as if the caller can avoid knowing whether
the mapping should come from mt_persubplan_childparent_maps or
mt_perleaf_parentchild_maps.

> + * We have already checked partition constraints above, so skip them
> + * below.
>
> How about: ", so skip checking here."?

Ok, I have made it this way :

 * We have already checked partition constraints above, so skip
 * checking them here.

> ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
> try to reuse the per-subplan child-to-parent map as per-leaf
> child-to-parent map could be simplified a bit. I mean the following code:
>
> + /*
> +  * But for Updates, we can share the per-subplan maps with the per-leaf
> +  * maps.
> +  */
> + update_rri_index = 0;
> + update_rri = mtstate->resultRelInfo;
> + if (mtstate->mt_nplans > 0)
> +     cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
>
> - /* Choose the right set of partitions */
> - if (mtstate->mt_partition_dispatch_info != NULL)
> + for (i = 0; i < numResultRelInfos; ++i)
> + {
> <snip>
>
> How about (pseudo-code):
>
> j = 0;
> for (i = 0; i < n_leaf_parts; i++)
> {
>     if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
>     {
>         leaf_childparent_map[i] = subplan_childparent_map[j];
>         j++;
>     }
>     else
>     {
>         leaf_childparent_map[i] = new map
>     }
> }
>
> I think the above would also be useful in ExecSetupPartitionTupleRouting()
> where you've added similar code to try to reuse per-subplan ResultRelInfos.

Did something like that in the attached patch. Please have a look. After we
conclude on that, will do the same for ExecSetupPartitionTupleRouting() as
well.

> In ExecInitModifyTable(), can we try to minimize the number of places
> where update_tuple_routing_needed is being set. Currently, it's being set
> in 3 places:

Will see if we can skip some checks (TODO).

> In the following:
>
>     ExecSetupPartitionTupleRouting(rel,
> +                                   (operation == CMD_UPDATE ?
> +                                    mtstate->resultRelInfo : NULL),
> +                                   (operation == CMD_UPDATE ? nplans
> : 0),
>
> Can the second parameter be made to not span two lines? It was a bit hard
> for me to see that there are two new parameters.

I think it is safe to just pass mtstate->resultRelInfo. Inside
ExecSetupPartitionTupleRouting() we should anyway check only the nplans
param (and not update_rri) to decide whether it is for insert or update. So
did the same.
> > + * Construct mapping from each of the resultRelInfo attnos to the root > > Maybe it's odd to say "resultRelInfo attno", because it's really the > underlying partition whose attnos we're talking about as being possibly > different from the root table's attnos. Changed : resultRelInfo => partition > > + * descriptor. In such case we need to convert tuples to the root > > s/In such case/In such a case,/ Done. > > By the way, I've seen in a number of places that the patch calls "root > table" a partition. Not just in comments, but also a variable appears to > be given a name which contains rootpartition. I can see only one instance > where root is called a partition in the existing source code, but it seems > to have been introduced only recently: > > allpaths.c:1333: * A root partition will already have a Changed to either this : root partition => root partitioned table or this if we have to refer to it too often : root partition => root > > + * qual for each partition. Note that, if there are SubPlans in > there, > + * they all end up attached to the one parent Plan node. > > The sentence starting with "Note that, " is a bit unclear. > > + Assert(update_tuple_routing_needed || > + (operation == CMD_INSERT && > + list_length(node->withCheckOptionLists) == 1 && > + mtstate->mt_nplans == 1)); > > The comment I complained about above is perhaps about this Assert. > > - List *mapped_wcoList; > + List *mappedWco; > > Not sure why this rename. After this rename, it's now inconsistent with > the code above which handles non-partitioned case, which still calls it > wcoList. Maybe, because you introduced firstWco and then this line: > > + firstWco = linitial(node->withCheckOptionLists); > > but note that each member of node->withCheckOptionLists is also a list, so > the original naming. Also, further below, you're assigning mappedWco to > a List * field. > > + resultRelInfo->ri_WithCheckOptions = mappedWco; > > > Comments on the optimizer changes: > > +get_all_partition_cols(List *rtables, > > Did you mean rtable? > > > + get_all_partition_cols(root->parse->rtable, top_parentRTindex, > + partitioned_rels, &all_part_cols); > > Two more spaces needed on the 2nd line. > > > > +void > +pull_child_partition_columns(Bitmapset **bitmapset, > + Relation rel, > + Relation parent) > > Nitpick: don't we normally list the output argument(s) at the end? Also, > "bitmapset" could be renamed to something that conveys what it contains? > > + if (partattno != 0) > + child_keycols = > + bms_add_member(child_keycols, > + partattno - > FirstLowInvalidHeapAttributeNumber); > + } > + foreach(lc, partexprs) > + { > > Elsewhere (in quite a few places), we don't iterate over partexprs > separately like this, although I'm not saying it is bad, just different > from other places. > > get_all_partition_cols() seems to go over the rtable as many times as > there are partitioned tables in the tree. Is there a way to do this work > somewhere else? Maybe when the partitioned_rels list is built in the > first place. But that would require us to make changes to extract > partition columns in some place (prepunion.c) where it's hard to justify > why it's being done there at all. > > > + * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset > + * of all partitioning columns used by the partitioned table or any > + * descendent. > + * > > Dead comment? Aha, so here's where all_part_cols was being set before... 
> > + TupleTableSlot *mt_rootpartition_tuple_slot; > > I guess I was complaining about this field where you call root a > partition. Maybe, mt_root_tuple_slot would suffice. Will get back with the above comments (TODO) -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
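A side note on the (oldtup == NULL) ^ (newtup == NULL) test discussed above:
each equality comparison in C evaluates to an int 0 or 1, so the bitwise
operator here behaves exactly like a logical exclusive-or. A minimal
standalone illustration (not from the patch):

#include <stdio.h>

int
main(void)
{
    void   *oldtup;
    void   *newtup;
    int     x = 1;

    /* Exactly one of the two pointers is NULL: condition is true. */
    oldtup = NULL;
    newtup = &x;
    printf("%d\n", (oldtup == NULL) ^ (newtup == NULL));    /* prints 1 */

    /* Both NULL (or both non-NULL): condition is false. */
    oldtup = NULL;
    newtup = NULL;
    printf("%d\n", (oldtup == NULL) ^ (newtup == NULL));    /* prints 0 */

    return 0;
}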
Below I have addressed the remaining review comments : On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > > In ExecInitModifyTable(), can we try to minimize the number of places > where update_tuple_routing_needed is being set. Currently, it's being set > in 3 places: I think the way it's done seems ok. For each resultRelInfo, update_tuple_routing_needed is updated in case that resultRel has partition cols changed. And at that point, we don't have rel opened, so we can't check if that rel is partitioned. So another check is required outside of the loop. > > + * qual for each partition. Note that, if there are SubPlans in > there, > + * they all end up attached to the one parent Plan node. > > The sentence starting with "Note that, " is a bit unclear. > > + Assert(update_tuple_routing_needed || > + (operation == CMD_INSERT && > + list_length(node->withCheckOptionLists) == 1 && > + mtstate->mt_nplans == 1)); > > The comment I complained about above is perhaps about this Assert. That is an existing comment. On HEAD, the "parent Plan" refers to mtstate->mt_plans[0]. Now in the patch, for the parent node in ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the parent plan refers to this mtstate node. BTW, the reason I had changed the parent node to mtstate->ps is : Other places in that code use mtstate->ps while initializing expressions : /* * Build a projection for each result rel. */ resultRelInfo->ri_projectReturning = ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, resultRelInfo->ri_RelationDesc->rd_att); ........... /* build DO UPDATE WHERE clause expression */ if (node->onConflictWhere) { ExprState *qualexpr; qualexpr = ExecInitQual((List *) node->onConflictWhere, &mtstate->ps); .... } I think wherever we initialize expressions belonging to a plan, we should use that plan as the parent. WithCheckOptions are fields of ModifyTableState. > > - List *mapped_wcoList; > + List *mappedWco; > > Not sure why this rename. After this rename, it's now inconsistent with > the code above which handles non-partitioned case, which still calls it > wcoList. Maybe, because you introduced firstWco and then this line: > > + firstWco = linitial(node->withCheckOptionLists); > > but note that each member of node->withCheckOptionLists is also a list, so > the original naming. Also, further below, you're assigning mappedWco to > a List * field. > > + resultRelInfo->ri_WithCheckOptions = mappedWco; Done. Reverted mappedWco to mapped_wcoList. And firstWco to first_wcoList. > > > Comments on the optimizer changes: > > +get_all_partition_cols(List *rtables, > > Did you mean rtable? I did mean rtables. It's a list of rtables. > > > + get_all_partition_cols(root->parse->rtable, top_parentRTindex, > + partitioned_rels, &all_part_cols); > > Two more spaces needed on the 2nd line. Done. > > > > +void > +pull_child_partition_columns(Bitmapset **bitmapset, > + Relation rel, > + Relation parent) > > Nitpick: don't we normally list the output argument(s) at the end? Agreed. Done. > Also, "bitmapset" could be renamed to something that conveys what it contains? Renamed it to partcols > > + if (partattno != 0) > + child_keycols = > + bms_add_member(child_keycols, > + partattno - > FirstLowInvalidHeapAttributeNumber); > + } > + foreach(lc, partexprs) > + { > > Elsewhere (in quite a few places), we don't iterate over partexprs > separately like this, although I'm not saying it is bad, just different > from other places. 
I think you are suggesting we do it like how it's done in is_partition_attr(). Can you please let me know other places we do this same way ? I couldn't find. > > get_all_partition_cols() seems to go over the rtable as many times as > there are partitioned tables in the tree. Is there a way to do this work > somewhere else? Maybe when the partitioned_rels list is built in the > first place. But that would require us to make changes to extract > partition columns in some place (prepunion.c) where it's hard to justify > why it's being done there at all. See below ... > > > + * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset > + * of all partitioning columns used by the partitioned table or any > + * descendent. > + * > > Dead comment? Removed. > Aha, so here's where all_part_cols was being set before... Yes, and we used to have PartitionedChildRelInfo.all_part_cols field for that. We used to populate that while traversing through the partition tree in expand_inherited_rtentry(). I agreed with Dilip's opinion that this would unnecessarily add up some processing even when the query is not a DML. And also, we don't have to have PartitionedChildRelInfo.all_part_cols. For the earlier implementation, check v18 patch or earlier versions. > > + TupleTableSlot *mt_rootpartition_tuple_slot; > > I guess I was complaining about this field where you call root a > partition. Maybe, mt_root_tuple_slot would suffice. Done. Attached v22 patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi Amit. Thanks a lot for updated patches and sorry that I couldn't get to looking at your emails sooner. Note that I'm replying here to both of your emails, but looking at only the latest v22 patch. On 2017/10/24 0:15, Amit Khandekar wrote: > On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup >> == NULL)))) >> >> Is there some reason why a bitwise operator is used here? > > That exact condition means that the function is called for transition > capture for updated rows being moved to another partition. For this > scenario, either the oldtup or the newtup is NULL. I wanted to exactly > capture that condition there. I think the bitwise operator is more > user-friendly in emphasizing the point that it is indeed an "either a > or b, not both" condition. I see. In that case, since this patch adds the new condition, a note about it in the comment just above would be good, because the situation you describe here seems to arise only during update-tuple-routing, IIUC. >> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects >> + * with on entry for every leaf partition (required to convert input >> tuple >> + * based on the root table's rowtype to a leaf partition's rowtype after >> + * tuple routing is done) >> >> Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that >> they are maps needed for "tuple conversion". And the other field holding >> the reverse map as leaf_rev_tupconv_maps. Either that or use underscores >> to separate words, but then it gets too long I guess. > > In master branch, now this param is already there with the name > "tup_conv_maps". In the rebased version in the earlier mail, I haven't > again changed it. I think "tup_conv_maps" looks clear enough. OK. In the latest patch: + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used + * instead of allocating new ones while generating the array of all leaf + * partition result rels. Instead of: "These are re-used instead of allocating new ones while generating the array of all leaf partition result rels." how about: "There is no need to allocate a new ResultRellInfo entry for leaf partitions for which one already exists in this array" >> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex >> interface. I guess it could simply have the following interface: >> >> static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate, >> HeapTuple tuple, bool is_update); >> >> And figure out, based on the value of is_update, which map to use and >> which slot to set *p_new_slot to (what is now "new_slot" argument). >> You're getting mtstate here anyway, which contains all the information you >> need here. It seems better to make that (selecting which map and which >> slot) part of the function's implementation if we're having this function >> at all, imho. Maybe I'm missing some details there, but my point still >> remains that we should try to put more logic in that function instead of >> it just do the mechanical tuple conversion. > > I tried to see how the interface would look if we do that way. 
> Here is how the code looks :
>
> static TupleTableSlot *
> ConvertPartitionTupleSlot(ModifyTableState *mtstate,
>                           bool for_update_tuple_routing,
>                           int map_index,
>                           HeapTuple *tuple,
>                           TupleTableSlot *slot)
> {
>     TupleConversionMap *map;
>     TupleTableSlot *new_slot;
>
>     if (for_update_tuple_routing)
>     {
>         map = mtstate->mt_persubplan_childparent_maps[map_index];
>         new_slot = mtstate->mt_rootpartition_tuple_slot;
>     }
>     else
>     {
>         map = mtstate->mt_perleaf_parentchild_maps[map_index];
>         new_slot = mtstate->mt_partition_tuple_slot;
>     }
>
>     if (!map)
>         return slot;
>
>     *tuple = do_convert_tuple(*tuple, map);
>
>     /*
>      * Change the partition tuple slot descriptor, as per converted tuple.
>      */
>     ExecSetSlotDescriptor(new_slot, map->outdesc);
>     ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);
>
>     return new_slot;
> }
>
> It looks like the interface does not much simplify, and above that, we
> have more number of lines in that function. Also, the caller anyway
> has to be aware whether the map_index is the index into the leaf
> partitions or the update subplans. So it is not like the caller does
> not have to be aware about whether the mapping should be
> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.

Hmm, I think we should try to make it so that the caller doesn't have to be
aware of that. And by caller I guess you mean ExecInsert(), which should not
be a place, IMHO, where to try to introduce a lot of new logic specific to
update tuple routing.

ISTM, ModifyTableState now has one too many TupleConversionMap pointer
arrays after the patch, creating the need to choose from in the first place.
AIUI -

* mt_perleaf_parentchild_maps:

  - each entry is a map to convert root parent's tuples to a given leaf
    partition's format

  - used to be called mt_partition_tupconv_maps and is needed when
    tuple-routing is in use; for both INSERT and UPDATE with tuple-routing

  - as many entries in the array as there are leaf partitions and stored
    in the partition bound order

* mt_perleaf_childparent_maps:

  - each entry is a map to convert a leaf partition's tuples to the root
    parent's format

  - newly added by this patch and seems to be needed for UPDATE with
    tuple-routing for two needs: 1. tuple-routing should start with a
    tuple in root parent format whereas the tuple received is in leaf
    partition format when ExecInsert() called for update-tuple-routing
    (by ExecUpdate), 2. after tuple-routing, we must capture the tuple
    inserted into the partition in the transition tuplestore which accepts
    tuples in root parent's format

  - as many entries in the array as there are leaf partitions and stored
    in the partition bound order

* mt_persubplan_childparent_maps:

  - each entry is a map to convert a child table's tuples to the root
    parent's format

  - used to be called mt_transition_tupconv_maps and needed for converting
    child tuples to the root parent's format when storing them in the
    transition tuplestore which accepts tuples in root parent's format

  - as many entries in the array as there are sub-plans in mt_plans and
    stored in either the partition bound order or unknown order (the
    latter in the regular inheritance case)

I think we could combine the last two into one. The only apparent reason
for them to be separate seems to be that the subplan array might contain
less entries than the perleaf array and ExecInsert() has only enough
information to calculate the offset of a map in the persubplan array.
That is, resultRelInfo of the leaf partition that ExecInsert starts with in
the update-tuple-routing case comes from the mtstate->resultRelInfo array,
which contains only mt_nplans entries. So, if we only have the array with
entries for *all* partitions, it's hard to get the offset of the map to use
in that array.

I suggest we don't add a new map array and a significant amount of new code
to initialize the same and to implement the logic to choose the correct
array to get the map from. Instead, we could simply add an array of
integers with mt_nplans entries. Each entry is an offset of a given
sub-plan in the array containing entries of something for *all* partitions.
Since we are teaching ExecSetupPartitionTupleRouting() to reuse
ResultRelInfos from mtstate->resultRelInfo, there is a suitable place to
construct such an array. Let's say the array is called
mt_subplan_partition_offsets[]. Let ExecSetupPartitionTupleRouting() also
initialize the parent-to-partition maps for *all* partitions, in the
update-tuple-routing case. Then add a quick-return check in
ExecSetupTransitionCaptureState() to see if the map has already been set by
ExecSetupPartitionTupleRouting(). Since we're using the same map for two
purposes, we could rename mt_transition_tupconv_maps to something that
doesn't bind it to its use only for transition tuple capture.

With that, now there are no persubplan and perleaf arrays for ExecInsert()
to pick from to select a map to pass to ConvertPartitionTupleSlot(), or
maybe even no need for the separate function. The tuple-routing code block
in ExecInsert would look like below (writing resultRelInfo as just Rel):

rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel

if (rootRel != Rel)     /* update tuple-routing active */
{
    int subplan_off = Rel - mtstate->Rel[0];
    int leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off];

    if (mt_transition_tupconv_maps[leaf_off])
    {
        /*
         * Convert to root format using
         * mt_transition_tupconv_maps[leaf_off]
         */
        slot = mt_root_tuple_slot;      /* for tuple-routing */
        /* Store the converted tuple into slot */
    }
}

/* Existing tuple-routing flow follows */
new_leaf = ExecFindPartition(rootRel, slot, ...)

if (mtstate->transition_capture)
{
    transition_capture_map = mt_transition_tupconv_maps[new_leaf]
}

if (mt_partition_tupconv_maps[new_leaf])
{
    /*
     * Convert to leaf format using mt_partition_tupconv_maps[new_leaf]
     */
    slot = mt_partition_tuple_slot;
    /* Store the converted tuple into slot */
}

>> ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
>> try to reuse the per-subplan child-to-parent map as per-leaf
>> child-to-parent map could be simplified a bit. I mean the following code:
>>
>> + /*
>> +  * But for Updates, we can share the per-subplan maps with the per-leaf
>> +  * maps.
>> +  */
>> + update_rri_index = 0;
>> + update_rri = mtstate->resultRelInfo;
>> + if (mtstate->mt_nplans > 0)
>> +     cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
>>
>> - /* Choose the right set of partitions */
>> - if (mtstate->mt_partition_dispatch_info != NULL)
>> + for (i = 0; i < numResultRelInfos; ++i)
>> + {
>> <snip>
>>
>> How about (pseudo-code):
>>
>> j = 0;
>> for (i = 0; i < n_leaf_parts; i++)
>> {
>>     if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
>>     {
>>         leaf_childparent_map[i] = subplan_childparent_map[j];
>>         j++;
>>     }
>>     else
>>     {
>>         leaf_childparent_map[i] = new map
>>     }
>> }
>>
>> I think the above would also be useful in ExecSetupPartitionTupleRouting()
>> where you've added similar code to try to reuse per-subplan ResultRelInfos.
>
> Did something like that in the attached patch. Please have a look.
> After we conclude on that, will do the same for
> ExecSetupPartitionTupleRouting() as well.

Yeah, ExecSetupTransitionCaptureState() looks better in v22, but as I
explained above, we may not need to change the function so much. The
approach, OTOH, should be adopted for ExecSetupPartitionTupleRouting().

>> In the following:
>>
>>     ExecSetupPartitionTupleRouting(rel,
>> +                                   (operation == CMD_UPDATE ?
>> +                                    mtstate->resultRelInfo : NULL),
>> +                                   (operation == CMD_UPDATE ? nplans
>> : 0),
>>
>> Can the second parameter be made to not span two lines? It was a bit hard
>> for me to see that there two new parameters.
>
> I think it is safe to just pass mtstate->resultRelInfo. Inside
> ExecSetupPartitionTupleRouting() we should anyways check only the
> nplans param (and not update_rri) to decide whether it is for insert
> or update. So did the same.

OK.

>> By the way, I've seen in a number of places that the patch calls "root
>> table" a partition. Not just in comments, but also a variable appears to
>> be given a name which contains rootpartition. I can see only one instance
>> where root is called a partition in the existing source code, but it seems
>> to have been introduced only recently:
>>
>> allpaths.c:1333: * A root partition will already have a
>
> Changed to either this :
> root partition => root partitioned table
> or this if we have to refer to it too often :
> root partition => root

That seems fine, thanks.

On 2017/10/25 15:10, Amit Khandekar wrote:
> On 16 October 2017 at 08:28, Amit Langote wrote:
>> In ExecInitModifyTable(), can we try to minimize the number of places
>> where update_tuple_routing_needed is being set. Currently, it's being set
>> in 3 places:
>
> I think the way it's done seems ok. For each resultRelInfo,
> update_tuple_routing_needed is updated in case that resultRel has
> partition cols changed. And at that point, we don't have rel opened,
> so we can't check if that rel is partitioned. So another check is
> required outside of the loop.

I understood why now.

>> + * qual for each partition. Note that, if there are SubPlans in there,
>> + * they all end up attached to the one parent Plan node.
>>
>> The sentence starting with "Note that, " is a bit unclear.
>>
>> + Assert(update_tuple_routing_needed ||
>> +        (operation == CMD_INSERT &&
>> +         list_length(node->withCheckOptionLists) == 1 &&
>> +         mtstate->mt_nplans == 1));
>>
>> The comment I complained about above is perhaps about this Assert.
>
> That is an existing comment.

Sorry, my bad.

> On HEAD, the "parent Plan" refers to
> mtstate->mt_plans[0]. Now in the patch, for the parent node in
> ExecInitQual(), mtstate->ps is passed rather than mt_plans[0].
So the > parent plan refers to this mtstate node. Hmm, I'm not really sure if doing that (passing mtstate->ps) would be accurate. In the update tuple routing case, it seems that it's better to pass the correct parent PlanState pointer to ExecInitQual(), that is, one corresponding to the partition's sub-plan. At least I get that feeling by looking at how parent is used downstream to that ExecInitQual() call, but there *may* not be anything to worry about there after all. I'm unsure. > BTW, the reason I had changed the parent node to mtstate->ps is : > Other places in that code use mtstate->ps while initializing > expressions : > > /* > * Build a projection for each result rel. > */ > resultRelInfo->ri_projectReturning = > ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, > resultRelInfo->ri_RelationDesc->rd_att); > > ........... > > /* build DO UPDATE WHERE clause expression */ > if (node->onConflictWhere) > { > ExprState *qualexpr; > > qualexpr = ExecInitQual((List *) node->onConflictWhere, > &mtstate->ps); > .... > } > > I think wherever we initialize expressions belonging to a plan, we > should use that plan as the parent. WithCheckOptions are fields of > ModifyTableState. You may be right, but I see for WithCheckOptions initialization specifically that the non-tuple-routing code passes the actual sub-plan when initializing the WCO for a given result rel. >> Comments on the optimizer changes: >> >> +get_all_partition_cols(List *rtables, >> >> Did you mean rtable? > > I did mean rtables. It's a list of rtables. It's not, AFAIK. rtable (range table) is a list of range table entries, which is also what seems to get passed to get_all_partition_cols for that argument (root->parse->rtable, which is not a list of lists). Moreover, there are no existing instances of this naming within the planner other than those that this patch introduces: $ grep rtables src/backend/optimizer/ planner.c:114: static void get_all_partition_cols(List *rtables, planner.c:1063: get_all_partition_cols(List *rtables, planner.c:1069: Oid root_relid = getrelid(root_rti, rtables); planner.c:1078: Oid relid = getrelid(rti, rtables); OTOH, dependency.c does have rtables, but it's actually a list of range tables. For example: dependency.c:1360: context.rtables = list_make1(rtable); >> + if (partattno != 0) >> + child_keycols = >> + bms_add_member(child_keycols, >> + partattno - >> FirstLowInvalidHeapAttributeNumber); >> + } >> + foreach(lc, partexprs) >> + { >> >> Elsewhere (in quite a few places), we don't iterate over partexprs >> separately like this, although I'm not saying it is bad, just different >> from other places. > > I think you are suggesting we do it like how it's done in > is_partition_attr(). Can you please let me know other places we do > this same way ? I couldn't find. OK, not as many as I thought there would be, but there are following beside is_partition_attrs(): partition.c: get_range_nulltest() partition.c: get_qual_for_range() relcache.c: RelationBuildPartitionKey() >> Aha, so here's where all_part_cols was being set before... > > Yes, and we used to have PartitionedChildRelInfo.all_part_cols field > for that. We used to populate that while traversing through the > partition tree in expand_inherited_rtentry(). I agreed with Dilip's > opinion that this would unnecessarily add up some processing even when > the query is not a DML. And also, we don't have to have > PartitionedChildRelInfo.all_part_cols. For the earlier implementation, > check v18 patch or earlier versions. 
Hmm, I think I have to agree with both you and Dilip that that would add some redundant processing to other paths. > Attached v22 patch. Thanks again. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
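A minimal sketch of how the mt_subplan_partition_offsets[] array proposed
above could be filled while the leaf-partition result rels are set up. The
names num_leaf_parts, leaf_oids, leaf_part_rri and the helper
create_leaf_result_rel() are hypothetical, chosen only for illustration;
this is not the actual patch code. It assumes, as discussed above, that
both the per-subplan UPDATE result rels and the leaf partitions are in
partition bound order:

    int     i;
    int     j = 0;

    for (i = 0; i < num_leaf_parts; i++)
    {
        if (j < num_update_rri &&
            leaf_oids[i] == RelationGetRelid(update_rri[j].ri_RelationDesc))
        {
            /* Reuse the UPDATE subplan's ResultRelInfo for this partition. */
            leaf_part_rri[i] = &update_rri[j];

            /* Remember where subplan j lives in the per-partition arrays. */
            mtstate->mt_subplan_partition_offsets[j] = i;
            j++;
        }
        else
        {
            /* No UPDATE subplan for this partition; build a fresh one. */
            leaf_part_rri[i] = create_leaf_result_rel(mtstate, leaf_oids[i]);
        }
    }

With such an array in place, any code that starts from a subplan index can
translate it into an index into a per-partition array in constant time.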
On Wed, Oct 25, 2017 at 11:40 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Below I have addressed the remaining review comments :

The changes to trigger.c still make me super-nervous. Hey THOMAS MUNRO, any
chance you could review that part?

+   /* The caller must have already locked all the partitioned tables. */
+   root_rel = heap_open(root_relid, NoLock);
+   *all_part_cols = NULL;
+   foreach(lc, partitioned_rels)
+   {
+       Index       rti = lfirst_int(lc);
+       Oid         relid = getrelid(rti, rtables);
+       Relation    part_rel = heap_open(relid, NoLock);
+
+       pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+       heap_close(part_rel, NoLock);

I don't like the fact that we're opening and closing the relation here just
to get information on the partitioning columns. I think it would be better
to do this someplace that already has the relation open and store the
details in the RelOptInfo. set_relation_partition_info() looks like the
right spot.

+void
+pull_child_partition_columns(Relation rel,
+                             Relation parent,
+                             Bitmapset **partcols)

This code has a lot in common with is_partition_attr(). I'm not sure it's
worth trying to unify them, but it could be done.

+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,

Instead of " : ", you could just write "is the".

+ * For Updates, if the leaf partition is already present in the
+ * per-subplan result rels, we re-use that rather than initialize a
+ * new result rel. The per-subplan resultrels and the resultrels of
+ * the leaf partitions are both in the same canonical order. So while

It would be good to explain the reason. Also, Updates shouldn't be
capitalized here.

+   Assert(cur_update_rri <= update_rri + num_update_rri - 1);

Maybe just cur_update_rri < update_rri + num_update_rri, or even
current_update_rri - update_rri < num_update_rri.

Also, +1 for Amit Langote's idea of trying to merge
mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote: > Also, +1 for Amit Langote's idea of trying to merge > mt_perleaf_childparent_maps with mt_persubplan_childparent_maps. Currently I am trying to see if it simplifies things if we do that. We will be merging these arrays into one, but we are adding a new int[] array that maps subplans to leaf partitions. Will get back with how it looks finally. Robert, Amit , I will get back with your other review comments. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/11/07 14:40, Amit Khandekar wrote: > On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote: > >> Also, +1 for Amit Langote's idea of trying to merge >> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps. > > Currently I am trying to see if it simplifies things if we do that. We > will be merging these arrays into one, but we are adding a new int[] > array that maps subplans to leaf partitions. Will get back with how it > looks finally. One thing to note is that the int[] array I mentioned will be much faster to compute than going to convert_tuples_by_name() to build the additional maps array. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The changes to trigger.c still make me super-nervous. Hey THOMAS
> MUNRO, any chance you could review that part?

Looking, but here's one silly thing that jumped out at me while getting
started with this patch. I cannot seem to convince my macOS system to agree
with the expected sort order from :show_data, where underscores precede
numbers:

   part_a_10_a_20 | a |  10 | 200 |  1 |
   part_a_1_a_10  | a |   1 |   1 |  1 |
 - part_d_1_15    | b |  15 | 146 |  1 |
 - part_d_1_15    | b |  16 | 147 |  2 |
   part_d_15_20   | b |  17 | 155 | 16 |
   part_d_15_20   | b |  19 | 155 | 19 |
 + part_d_1_15    | b |  15 | 146 |  1 |
 + part_d_1_15    | b |  16 | 147 |  2 |

It seems that macOS (like older BSDs) just doesn't know how to sort Unicode
and falls back to sorting the bits. I expect that means that the test will
also fail on any other OS with "make check LC_COLLATE=C". I believe our
regression tests are supposed to pass with a wide range of collations
including C, so I wonder if this means we should stick a leading zero on
those single digit numbers, or something, to stabilise the output.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The changes to trigger.c still make me super-nervous. Hey THOMAS
>> MUNRO, any chance you could review that part?
>
> Looking, but here's one silly thing that jumped out at me while
> getting started with this patch. I cannot seem to convince my macOS
> system to agree with the expected sort order from :show_data, where
> underscores precede numbers:
>
>    part_a_10_a_20 | a |  10 | 200 |  1 |
>    part_a_1_a_10  | a |   1 |   1 |  1 |
>  - part_d_1_15    | b |  15 | 146 |  1 |
>  - part_d_1_15    | b |  16 | 147 |  2 |
>    part_d_15_20   | b |  17 | 155 | 16 |
>    part_d_15_20   | b |  19 | 155 | 19 |
>  + part_d_1_15    | b |  15 | 146 |  1 |
>  + part_d_1_15    | b |  16 | 147 |  2 |
>
> It seems that macOS (like older BSDs) just doesn't know how to sort
> Unicode and falls back to sorting the bits. I expect that means that
> the test will also fail on any other OS with "make check
> LC_COLLATE=C". I believe our regression tests are supposed to pass
> with a wide range of collations including C, so I wonder if this means
> we should stick a leading zero on those single digit numbers, or
> something, to stabilise the output.

I would prefer to retain the partition names. I have now added a
COLLATE "C" for partname, like this :

-\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'

Thomas, can you please try the attached incremental patch
regress_locale_changes.patch and check if the test passes? The patch is to
be applied on the main v22 patch. If the test passes, I will include these
changes (also for list_parted) in the upcoming v23 patch.

Thanks
-Amit Khandekar

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Thomas, can you please try the attached incremental patch > regress_locale_changes.patch and check if the test passes ? The patch > is to be applied on the main v22 patch. If the test passes, I will > include these changes (also for list_parted) in the upcoming v23 > patch. That looks good. Thanks. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
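For anyone reproducing the ordering difference outside the regression test,
a minimal standalone example (not part of the patch) shows why the "C"
collation flips those rows: '5' (0x35) compares before '_' (0x5f) when
sorting raw bytes, whereas the locale originally used to generate the
expected output sorted the underscore first.

SELECT v
FROM (VALUES ('part_d_1_15'), ('part_d_15_20')) AS t(v)
ORDER BY v COLLATE "C";
-- Under COLLATE "C", part_d_15_20 is returned first; the test's original
-- expected output assumed the opposite order.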
On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> The changes to trigger.c still make me super-nervous. Hey THOMAS >>> MUNRO, any chance you could review that part? At first, it seemed quite strange to me that row triggers and statement triggers fire different events for the same modification. Row triggers see DELETE + INSERT (necessarily because different tables are involved), but this fact is hidden from the target table's statement triggers. The alternative would be for all triggers to see consistent events and transitions. Instead of having your special case code in ExecInsert and ExecDelete that creates the two halves of a 'synthetic' UPDATE for the transition tables, you'd just let the existing ExecInsert and ExecDelete code do its thing, and you'd need a flag to record that you should also fire INSERT/DELETE after statement triggers if any rows moved. After sleeping on this question, I am coming around to the view that the way you have it is right. The distinction isn't really between row triggers and statement triggers, it's between triggers at different levels in the hierarchy. It just so happens that we currently only fire target table statement triggers and leaf table row triggers. Future development ideas that seem consistent with your choice: 1. If we ever allow row triggers with transition tables on child tables, then I think *their* transition tables should certainly see the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be inconsistent with the OLD and NEW variables in a single trigger invocation. (These were prohibited mainly due to lack of time and (AFAIK) limited usefulness; I think they would need probably need their own separate tuplestores, or possibly some kind of filtering.) 2. If we ever allow row triggers on partitioned tables (ie that fire when its children are modified), then I think their UPDATE trigger should probably fire when a row moves between any two (grand-)*child tables, just as you have it for target table statement triggers. It doesn't matter that the view from parent tables' triggers is inconsistent with the view from leaf table triggers: it's a feature that we 'hide' partitioning from the user to the extent we can so that you can treat the partitioned table just like a table. Any other views? As for the code, I haven't figured out how to break it yet, and I'm wondering if there is some way to refactor so that ExecInsert and ExecDelete don't have to record pseudo-UPDATE trigger events. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > ISTM, ModifyTableState now has one too > many TupleConversionMap pointer arrays after the patch, creating the need > to choose from in the first place. AIUI - > > * mt_perleaf_parentchild_maps: > > - each entry is a map to convert root parent's tuples to a given leaf > partition's format > > - used to be called mt_partition_tupconv_maps and is needed when tuple- > routing is in use; for both INSERT and UPDATE with tuple-routing > > - as many entries in the array as there are leaf partitions and stored > in the partition bound order > > * mt_perleaf_childparent_maps: > > - each entry is a map to convert a leaf partition's tuples to the root > parent's format > > - newly added by this patch and seems to be needed for UPDATE with > tuple-routing for two needs: 1. tuple-routing should start with a > tuple in root parent format whereas the tuple received is in leaf > partition format when ExecInsert() called for update-tuple-routing (by > ExecUpdate), 2. after tuple-routing, we must capture the tuple > inserted into the partition in the transition tuplestore which accepts > tuples in root parent's format > > - as many entries in the array as there are leaf partitions and stored > in the partition bound order > > * mt_persubplan_childparent_maps: > > - each entry is a map to convert a child table's tuples to the root > parent's format > > - used to be called mt_transition_tupconv_maps and needed for converting > child tuples to the root parent's format when storing them in the > transition tuplestore which accepts tuples in root parent's format > > - as many entries in the array as there are sub-plans in mt_plans and > stored in either the partition bound order or unknown order (the > latter in the regular inheritance case) thanks for the detailed description. Yet that's correct. > > I think we could combine the last two into one. The only apparent reason > for them to be separate seems to be that the subplan array might contain > less entries than perleaf array and ExecInsert() has only enough > information to calculate the offset of a map in the persubplan array. > That is, resultRelInfo of leaf partition that ExecInsert starts with in > the update-tuple-routing case comes from mtstate->resultRelInfo array > which contains only mt_nplans entries. So, if we only have the array with > entries for *all* partitions, it's hard to get the offset of the map to > use in that array. > > I suggest we don't add a new map array and a significant amount of new > code to initialize the same and to implement the logic to choose the > correct array to get the map from. Instead, we could simply add an array > of integers with mt_nplans entries. Each entry is an offset of a given > sub-plan in the array containing entries of something for *all* > partitions. Since, we are teaching ExecSetupPartitionTupleRouting() to > reuse ResultRelInfos from mtstate->resultRelInfos, there is a suitable > place to construct such array. Let's say the array is called > mt_subplan_partition_offsets[]. Let ExecSetupPartitionTupleRouting() also > initialize the parent-to-partition maps for *all* partitions, in the > update-tuple-routing case. Then add a quick-return check in > ExecSetupTransitionCaptureState() to see if the map has already been set > by ExecSetupPartitionTupleRouting(). 
> Since we're using the same map for two purposes, we could rename
> mt_transition_tupconv_maps to something that doesn't bind it to its use
> only for transition tuple capture.

I was trying hard to verify whether this is really going to simplify the
code. We are removing one array and adding one. In my approach, the map
structures are anyway shared, they are not duplicated. Because I have
separate arrays to access the tuple conversion map partition-based or
subplan-based, there is no need for extra logic to get into the
per-partition array. But on the other hand, we need not do that many
changes in ExecSetupTransitionCaptureState() that I have done, although my
patch hasn't resulted in more lines in that function; it has just changed
the logic. Also, each time we access the map, we need to know whether it is
per-plan or per-partition, according to a set of factors like whether
transition tables are there and whether tuple routing is there.

But I realized that one plus point of your approach is that it is going to
be extensible if we later need to have some more per-subplan information
that is already there in a partition-wise array. In that case, we just need
to re-use the int[] map; we don't have to create two new separate arrays;
just create one per-leaf array, and use the map to get into one of its
elements, given a per-subplan index.

So I went ahead and did the changes :

New mtstate maps :

TupleConversionMap **mt_parentchild_tupconv_maps;
        /* Per-partition map for tuple conversion from root to leaf */
TupleConversionMap **mt_childparent_tupconv_maps;
        /* Per-plan/per-partition map for tuple conversion from child to root */
int *mt_subplan_partition_offsets;
        /* Stores position of update result rels in leaf partitions */

We need to know whether mt_childparent_tupconv_maps is per-plan or
per-partition. Each time this map is accessed, it is tedious to go through
conditions that determine whether that map is per-partition or not. Here
are the conditions :

For transition tables :
    per-leaf map needed : in presence of tuple routing (insert or update,
    whichever)
    per-plan map needed : in presence of simple update (i.e. routing not
    involved)
For update tuple routing :
    per-plan map needed : always

So instead, added a new bool mtstate->mt_is_tupconv_perpart field that is
set to true only while setting up transition tables, and that too only when
tuple routing is to be done.

Since both transition tables and update tuple routing need a child-parent
map, extracted the code to build the map into a common function
ExecSetupChildParentMap(). (I think I could have done this earlier also.)

Each time we need to access this map, we not only have to use the int[]
maps, we also need to first check if it's a per-leaf map. So put this logic
in tupconv_map_for_subplan() and used it everywhere we need the map.

Attached is the v23 patch that has just the above changes (and is also
rebased on the hash-partitioning changes, like update.sql). I am still
doing some sanity testing on this, although regression passes. I am yet to
respond to the other review comments; will do that with a v24 patch.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
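For illustration, here is roughly what the tupconv_map_for_subplan() lookup
described above could reduce to. This is only a sketch built from the field
names mentioned in the preceding mail (mt_is_tupconv_perpart,
mt_subplan_partition_offsets, mt_childparent_tupconv_maps), not necessarily
the exact v23 code:

/*
 * Return the child-to-root conversion map for a given UPDATE subplan,
 * regardless of whether the array was built per-subplan or per-leaf
 * partition.
 */
static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
    int         index = whichplan;

    /*
     * When the maps were built per leaf partition (transition capture in
     * the presence of tuple routing), translate the subplan index into the
     * corresponding leaf-partition index first.
     */
    if (mtstate->mt_is_tupconv_perpart)
        index = mtstate->mt_subplan_partition_offsets[whichplan];

    return mtstate->mt_childparent_tupconv_maps[index];
}

Callers then never need to know which flavour of array was built; that
decision stays inside this one helper.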
On 9 November 2017 at 09:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> The changes to trigger.c still make me super-nervous. Hey THOMAS >>>> MUNRO, any chance you could review that part? > > At first, it seemed quite strange to me that row triggers and > statement triggers fire different events for the same modification. > Row triggers see DELETE + INSERT (necessarily because different > tables are involved), but this fact is hidden from the target table's > statement triggers. > > The alternative would be for all triggers to see consistent events and > transitions. Instead of having your special case code in ExecInsert > and ExecDelete that creates the two halves of a 'synthetic' UPDATE for > the transition tables, you'd just let the existing ExecInsert and > ExecDelete code do its thing, and you'd need a flag to record that you > should also fire INSERT/DELETE after statement triggers if any rows > moved. Yeah I also had thought about that. But thought that change was too invasive. For e.g. letting ExecARInsertTriggers() do the transition capture even when transition_capture->tcs_update_new_table is set. I was also thinking of having a separate function to *only* add the transition table rows. So in ExecInsert, call this one instead of ExecARUpdateTriggers(). But realized that the existing ExecARUpdateTriggers() looks like a better, robust interface with all its checks. Just that calling ExecARUpdateTriggers() sounds like we are also firing trigger; we are not firing any trigger or saving any event, we are just adding the transition row. > > After sleeping on this question, I am coming around to the view that > the way you have it is right. The distinction isn't really between > row triggers and statement triggers, it's between triggers at > different levels in the hierarchy. It just so happens that we > currently only fire target table statement triggers and leaf table row > triggers. Yes. And rows are there only in leaf partitions. So we have to simulate as though the target table has these rows. Like you mentioned, the user has to get the impression of a normal table. So we have to do something extra to capture the rows. > Future development ideas that seem consistent with your choice: > > 1. If we ever allow row triggers with transition tables on child > tables, then I think *their* transition tables should certainly see > the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be > inconsistent with the OLD and NEW variables in a single trigger > invocation. (These were prohibited mainly due to lack of time and > (AFAIK) limited usefulness; I think they would need probably need > their own separate tuplestores, or possibly some kind of filtering.) As we know, for row triggers on leaf partitions, we treat them as normal tables, so a trigger written on a leaf partition sees only the local changes. The trigger is unaware whether the insert is part of an UPDATE row movement. Similarly, the transition table referenced by that row trigger function should see only the NEW table, not the old table. > > 2. 
> If we ever allow row triggers on partitioned tables (ie that fire
> when its children are modified), then I think their UPDATE trigger
> should probably fire when a row moves between any two (grand-)*child
> tables, just as you have it for target table statement triggers.

Yes, I agree.

> It doesn't matter that the view from parent tables' triggers is
> inconsistent with the view from leaf table triggers: it's a feature
> that we 'hide' partitioning from the user to the extent we can so that
> you can treat the partitioned table just like a table.
>
> Any other views?

I think that because there is no provision for a row trigger on a
partitioned table, users who want to have a common trigger on a partition
subtree have no choice but to create the same trigger individually on the
leaf partitions. And that's the reason we cannot handle an update row
movement with triggers without anomalies.

Thanks
-Amit Khandekar

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
[ update-partition-key_v23.patch ]

Hi Amit,

Thanks for working on this. I'm looking forward to seeing this go in.

So... I've signed myself up to review the patch, and I've just had a look at
it (after first reading this entire email thread!).

Overall the patch looks like it's in quite a good shape. I think I do agree
with Robert about the UPDATE anomaly that's been discussed. I don't think
we're painting ourselves into any corner by not having this working
correctly right away. Anyone who's using some trigger workaround for the
current lack of support for updating the partition key is already going to
have the same issues, so at least this will save them some trouble
implementing triggers and give them much better performance. I see you've
documented this fact too, which is good.

I'm writing this email now as I've just run out of review time for today.
Here's what I noted down during my first pass:

1. Closing command tags in docs should not be abbreviated

triggers are concerned, <literal>AFTER</> <command>DELETE</command> and

This changed in c29c5789. I think Peter will be happy if you don't
abbreviate the closing tags.

2. "about to do" would read better as "about to perform"

concurrent session, and it is about to do an <command>UPDATE</command>

I think this paragraph could be more clear if we identified the sessions
with a number. Perhaps:

Suppose session 1 is performing an <command>UPDATE</command> on a partition
key, and meanwhile session 2 tries to perform an <command>UPDATE</command>
or <command>DELETE</command> operation on the same row. Session 2 can
silently miss the row due to session 1's activity. In such a case, session
2's <command>UPDATE</command>/<command>DELETE</command>, being unaware of
the row's movement, interprets that the row has just been deleted, so there
is nothing to be done for this row. Whereas, in the usual case where the
table is not partitioned, or where there is no row movement, the second
session would have identified the newly updated row and carried out the
<command>UPDATE</command>/<command>DELETE</command> on this new row version.

3. Integer width. get_partition_natts returns int but we assign to int16.

int16 partnatts = get_partition_natts(key);

Confusingly, get_partition_col_attnum() returns int16 instead of AttrNumber,
but that's a pre-existing problem.

4. The following code could just be:

pull_varattnos(partexprs, 1, &child_keycols);

instead of:

foreach(lc, partexprs)
{
    Node *expr = (Node *) lfirst(lc);

    pull_varattnos(expr, 1, &child_keycols);
}

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to do
something special when the DELETE/INSERT is a partition move? I have audit
tables in mind here; it may appear as though a user performed a DELETE when
they actually performed an UPDATE. Giving visibility of this to the trigger
function will allow the application to work around this.

6. change "row" to "a row" and "old" to "the old"

* depending on whether the event is for row being deleted from old

But to be honest, I'm having trouble parsing the comment. I think it would
be better to say explicitly when the row will be NULL rather than
"depending on whether the event".

7. I'm confused with how this change came about. If the old comment was
correct here then the comment you're referring to here should remain in
ExecPartitionCheck(), but you're saying it's in ExecConstraints().

/* See the comments in ExecConstraints. */
If the comment really is in ExecConstraints(), then you might want to give
an overview of what you mean, then reference ExecConstraints() if more
details are required.

8. I'm having trouble parsing this comment:

 * 'update_rri' has the UPDATE per-subplan result rels.

I think "has" should be "contains"?

9. Also, this should likely be reworded:

 * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
 * this is 0.

'num_update_rri' number of elements in 'update_rri' array or zero for
INSERT.

10. There should be no space before the '?'

/* Is this leaf partition present in the update resultrel ? */

11. I'm struggling to understand this comment:

 * This is required when converting tuple as per root
 * partition tuple descriptor.

"tuple" should probably be "the tuple", but not quite sure what you mean by
"as per root". I may have misunderstood, but maybe it should read:

 * This is required when we convert the partition's tuple to
 * be compatible with the partitioned table's tuple descriptor.

12. I think "as well" would be better written as "either".

 * If we didn't open the partition rel, it means we haven't
 * initialized the result rel as well.

13. I'm unsure what is meant by the following comment:

 * Verify result relation is a valid target for insert operation. Even
 * for updates, we are doing this for tuple-routing, so again, we need
 * to check the validity for insert operation.

I'm not quite sure where UPDATE comes in here as we're only checking for
INSERT?

14. Use of underscores instead of camelCase.

COPY_SCALAR_FIELD(part_cols_updated);

I know you're not the first one to break this as "partitioned_rels" does not
follow it either, but that's probably not a good enough reason to break away
from camelCase any further. I'd suggest "partColsUpdated". But after a
re-think, maybe cols is incorrect. All columns are partitioned, it's the key
columns that we care about, so how about "partKeyUpdate"?

15. Are you sure that you mean "root" here?

 * All the child partition attribute numbers are converted to the root
 * partitioned table.

Surely this is just the target relation. "parent" maybe? A sub-partitioned
table might be the target of an UPDATE too.

15. I see get_all_partition_cols() is just used once to check if
parent_rte->updatedCols contains any partition keys. Would it not be better
to reform that function and pass parent_rte->updatedCols in and abort as
soon as you see a single match? Maybe the function could return bool and be
named partitioned_key_overlaps(), that way your assignment in
inheritance_planner() would just become:

part_cols_updated = partitioned_key_overlaps(root->parse->rtable,
                                             top_parentRTindex,
                                             partitioned_rels,
                                             parent_rte->updatedCols);

or something like that anyway. (A sketch of this idea follows at the end of
this review.)

16. Typo in comment

 * 'part_cols_updated' if any partitioning columns are being updated, either
 * from the named relation or a descendent partitione table.

"partitione" should be "partitioned". Also, normally for bool parameters, we
might word things like "True if ..." rather than just "if". You probably
should follow the camelCase I mentioned in 14 here too.

17. Comment needs a few changes:

 * ConvertPartitionTupleSlot -- convenience function for converting tuple and
 * storing it into a tuple slot provided through 'new_slot', which typically
 * should be one of the dedicated partition tuple slot. Passes the partition
 * tuple slot back into output param p_old_slot. If no mapping present, keeps
 * p_old_slot unchanged.
 *
 * Returns the converted tuple.

There are a few typos here.
For example, "tuple" should be "a tuple", but maybe the comment should just be worded like: * ConvertPartitionTupleSlot -- convenience function for tuple conversion* using 'map'. The tuple, if converted, is storedin 'new_slot' and* 'p_old_slot' is set to the original partition tuple slot. If map is NULL,* then the original tupleis returned unmodified, otherwise the converted* tuple is returned. 18. Line goes over 80 chars. TransitionCaptureState *transition_capture = mtstate->mt_transition_capture; Better just to split the declaration and assignment. 19. Confusing comment: /* * If the original operation is UPDATE, the root partitioned table * needs to be fetched from mtstate->rootResultRelInfo. */ It's not that clear here how you determine this is an UPDATE of a partitioned key. 20. This code looks convoluted: rootResultRelInfo = (mtstate->rootResultRelInfo ? mtstate->rootResultRelInfo : resultRelInfo); /* * If the resultRelInfo is not the root partitioned table (which * happens for UPDATE), we should convert the tuple into root's tuple * descriptor, since ExecFindPartition() starts the search from root. * The tuple conversion map list is in the order of * mtstate->resultRelInfo[], so to retrieve the one for this resultRel, * we need to know the position of the resultRel in * mtstate->resultRelInfo[]. */ if (rootResultRelInfo != resultRelInfo) { rootResultRelInfo is assigned via a ternary expression which makes the subsequent if test seem a little strange. Would it not be better to test: if (mtstate->rootResultRelInfo) { rootResultRelInfo = mtstate->rootResultRelInfo ... other stuff ... } else rootResultRelInfo = resultRelInfo; Then above the if test you can explain that rootResultRelInfo is only set during UPDATE of partition keys, as per #19. 21. How come you renamed mt_partition_tupconv_maps[] to mt_parentchild_tupconv_maps[]? 22. Comment in ExecInsert() could be worded better. /* * In case this is part of update tuple routing, put this row into the * transition NEW TABLE if we are capturing transition tables. We need to * do this separately for DELETE and INSERT because they happen on * different tables. */ /* * This INSERT may be the result of a partition-key-UPDATE. If so, * and we're required to capture transition tables then we'd better * record this as a statement level UPDATE on the target relation. * We're not interested in the statement level DELETE or INSERT as * these occur on the individual partitions, none of which are the * target of this the UPDATE statement. */ A similar comment could use a similar improvement in ExecDelete() 23. Line is longer than 80 chars. TransitionCaptureState *transition_capture = mtstate->mt_transition_capture; 24. I know from reading the thread this name has changed before, but I think delete_skipped seems like the wrong name for this variable in: if (delete_skipped) *delete_skipped = true; Skipped is the wrong word here as that indicates like we had some sort of choice and that we decided not to. However, that's not the case when the tuple was concurrently deleted. Would it not be better to call it "tuple_deleted" or even "success" and reverse the logic? It's just a bit confusing that you're setting this to skipped before anything happens. It would be nicer if there was a better way to do this whole thing as it's a bit of a wart in the code. I understand why the code exists though. 
Also, I wonder if it's better to always pass a boolean here to save having to test for NULL before setting it, that way you might consider putting the success = false just before the return NULL, then do success = true after the tuple is gone. Failing that, putting something like:

success = false; /* not yet! */

where you're doing the if (delete_skipped) test might also be better.

25. Comment "we should" should be "we must".

/*
 * For some reason if DELETE didn't happen (for e.g. trigger
 * prevented it, or it was already deleted by self, or it was
 * concurrently deleted by another transaction), then we should
 * skip INSERT as well, otherwise, there will be effectively one
 * new row inserted.

Maybe just:

/* If the DELETE operation was unsuccessful, then we must not
 * perform the INSERT into the new partition.

"for e.g." is not really correct in English. "For example, ..." or just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've written "For exempli gratia", which translates to "For for example".

26. You're not really explaining what's going on here:

if (mtstate->mt_transition_capture)
    saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

You have a comment later to say you're about to "Revert back to the transition capture map", but I missed the part that explained about modifying it in the first place.

27. Comment does not explain how we're skipping checking the partition constraint check in:

* We have already checked partition constraints above, so skip
* checking them here.

Maybe something like:

* We've already checked the partition constraint above, however, we
* must still ensure the tuple passes all other constraints, so we'll
* call ExecConstraints() and have it validate all remaining checks.

28. For table WITH OIDs, the OID should probably follow the new tuple for partition-key-UPDATEs.

CREATE TABLE p (a BOOL NOT NULL, b INT NOT NULL) PARTITION BY LIST (a) WITH OIDS;
CREATE TABLE P_true PARTITION OF p FOR VALUES IN('t');
CREATE TABLE P_false PARTITION OF p FOR VALUES IN('f');
INSERT INTO p VALUES('t', 10);

SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_true   | 16792 | t
(1 row)

UPDATE p SET a = 'f'; -- partition-key-UPDATE (oid has changed (it probably shouldn't have))

SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_false  | 16793 | f
(1 row)

UPDATE p SET b = 20; -- non-partition-key-UPDATE (oid remains the same)

SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_false  | 16793 | f
(1 row)

I'll try to continue with the review tomorrow, but I think some other reviews are also looming too.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
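To make the control flow in points 24 and 25 concrete, here is a minimal, self-contained C sketch of the idea under discussion: report through a boolean out-parameter whether the DELETE really removed the old row, and perform the INSERT into the new partition only when it did. All names here (exec_delete_stub, move_row, and so on) are illustrative stand-ins, not code from the patch.

#include <stdbool.h>
#include <stdio.h>

/* Possible outcomes of the DELETE half of a partition-key UPDATE. */
typedef enum
{
    DELETE_OK,
    DELETE_BLOCKED_BY_TRIGGER,
    DELETE_ALREADY_GONE
} delete_outcome;

/*
 * Stand-in for ExecDelete(): reports through *tuple_deleted whether the row
 * was really removed, rather than through a "skipped" flag that reads
 * backwards.
 */
static void
exec_delete_stub(delete_outcome outcome, bool *tuple_deleted)
{
    *tuple_deleted = false;     /* not yet! */
    if (outcome != DELETE_OK)
        return;                 /* trigger suppressed it, or row was already gone */
    /* ... the actual heap delete would happen here ... */
    *tuple_deleted = true;
}

/* Stand-in for the ExecInsert() call that routes the row to the new partition. */
static void
exec_insert_stub(void)
{
    printf("row inserted into new partition\n");
}

static void
move_row(delete_outcome outcome)
{
    bool        tuple_deleted;

    exec_delete_stub(outcome, &tuple_deleted);

    /*
     * If the DELETE did not actually remove the old row, inserting the new
     * version would leave an extra row behind, so skip the INSERT entirely.
     */
    if (!tuple_deleted)
    {
        printf("delete did not happen; skipping insert\n");
        return;
    }
    exec_insert_stub();
}

int
main(void)
{
    move_row(DELETE_OK);
    move_row(DELETE_BLOCKED_BY_TRIGGER);
    return 0;
}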
On Fri, Nov 10, 2017 at 4:42 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is v23 patch that has just the above changes (and also
> rebased on hash-partitioning changes, like update.sql). I am still
> doing some sanity testing on this, although regression passes.

The test coverage[1] is 96.62%. Nice work. Here are the bits that aren't covered:

In partition.c's pull_child_partition_columns(), the following loop is never run:

+   foreach(lc, partexprs)
+   {
+       Node       *expr = (Node *) lfirst(lc);
+
+       pull_varattnos(expr, 1, &child_keycols);
+   }

In nodeModifyTable.c, the following conditional branches are never run:

    if (mtstate->mt_oc_transition_capture != NULL)
+   {
+       Assert(mtstate->mt_is_tupconv_perpart == true);
        mtstate->mt_oc_transition_capture->tcs_map =
-           mtstate->mt_transition_tupconv_maps[leaf_part_index];
+           mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+   }

    if (node->mt_oc_transition_capture != NULL)
    {
-       Assert(node->mt_transition_tupconv_maps != NULL);
        node->mt_oc_transition_capture->tcs_map =
-           node->mt_transition_tupconv_maps[node->mt_whichplan];
+           tupconv_map_for_subplan(node, node->mt_whichplan);
    }

Is there any reason we shouldn't be able to test these paths?

[1] https://codecov.io/gh/postgresql-cfbot/postgresql/commit/a3beb8d8f598a64d75aa4b3afc143a5d3e3f7826

-- Thomas Munro http://www.enterprisedb.com
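The second hunk above replaces a direct per-subplan array lookup with tupconv_map_for_subplan(). The underlying idea, storing conversion maps once per leaf partition and translating a subplan index through an offsets array, can be modelled in isolation like this; the names and sizes are made up for illustration and are not the patch's actual data structures.

#include <stdio.h>

/*
 * Toy model: conversion maps are stored once per leaf partition, and a
 * separate offsets array records which leaf each UPDATE subplan targets,
 * so a per-subplan lookup is just one extra indirection.
 */
#define NUM_LEAVES   4
#define NUM_SUBPLANS 2

static const char *leaf_maps[NUM_LEAVES] = {
    "map(leaf0->parent)",
    "map(leaf1->parent)",
    "map(leaf2->parent)",
    "map(leaf3->parent)"
};

/* subplan i targets leaf subplan_leaf_offsets[i] */
static const int subplan_leaf_offsets[NUM_SUBPLANS] = {1, 3};

/* Stand-in for tupconv_map_for_subplan(): translate a subplan index to a leaf index. */
static const char *
map_for_subplan(int whichplan)
{
    return leaf_maps[subplan_leaf_offsets[whichplan]];
}

int
main(void)
{
    int         i;

    for (i = 0; i < NUM_SUBPLANS; i++)
        printf("subplan %d uses %s\n", i, map_for_subplan(i));
    return 0;
}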
On 14 November 2017 at 01:55, David Rowley <david.rowley@2ndquadrant.com> wrote: > I'll try to continue with the review tomorrow, but I think some other > reviews are also looming too. I started looking at this again today. Here's the remainder of my review. 29. ExecSetupChildParentMap gets called here for non-partitioned relations. Maybe that's not the best function name? The function only seems to do that when perleaf is True. Is a leaf a partition of a partitioned table? It's not that clear the meaning here. /* * If we found that we need to collect transition tuples then we may also * need tuple conversion maps for any children that have TupleDescs that * aren't compatible with the tuplestores. (We can share these maps * between the regular and ON CONFLICT cases.) */ if (mtstate->mt_transition_capture != NULL || mtstate->mt_oc_transition_capture != NULL) { int numResultRelInfos; numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ? mtstate->mt_num_partitions : mtstate->mt_nplans); ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos, (mtstate->mt_partition_dispatch_info != NULL)); 30. The following chunk of code is giving me a headache trying to verify which arrays are which size: ExecSetupPartitionTupleRouting(rel, mtstate->resultRelInfo, (operation == CMD_UPDATE ? nplans : 0), node->nominalRelation, estate, &partition_dispatch_info, &partitions, &partition_tupconv_maps, &subplan_leaf_map, &partition_tuple_slot, &num_parted, &num_partitions); mtstate->mt_partition_dispatch_info = partition_dispatch_info; mtstate->mt_num_dispatch = num_parted; mtstate->mt_partitions = partitions; mtstate->mt_num_partitions = num_partitions; mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps; mtstate->mt_subplan_partition_offsets = subplan_leaf_map; mtstate->mt_partition_tuple_slot = partition_tuple_slot; mtstate->mt_root_tuple_slot = MakeTupleTableSlot(); I know this patch is not completely responsible for it, but you're not making things any better. Would it not be better to invent some PartitionTupleRouting struct and make that struct a member of ModifyTableState and CopyState, then just pass the pointer to that struct to ExecSetupPartitionTupleRouting() and have it fill in the required details? I think the complexity of this is already on the high end, I think you really need to do the refactor before this gets any worse. The signature of the function is a bit scary! extern void ExecSetupPartitionTupleRouting(Relation rel, ResultRelInfo *update_rri, int num_update_rri, Index resultRTindex, EState *estate, PartitionDispatch **pd, ResultRelInfo ***partitions, TupleConversionMap ***tup_conv_maps, int **subplan_leaf_map, TupleTableSlot **partition_tuple_slot, int *num_parted, int *num_partitions); What do you think? 31. The following code seems incorrect: /* * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may * need to do update tuple routing. */ if (resultRelInfo->ri_TrigDesc && resultRelInfo->ri_TrigDesc->trig_update_before_row && operation == CMD_UPDATE) update_tuple_routing_needed = true; Shouldn't this be setting update_tuple_routing_needed to false if there are no before row update triggers? Otherwise, you're setting it to true regardless of if there are any partition key columns being UPDATEd. That would make the work you're doing in inheritance_planner() to set part_cols_updated a waste of time. Also, this bit of code is a bit confused. /* Decide whether we need to perform update tuple routing. 
*/
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
    update_tuple_routing_needed = false;

/*
 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
 * partition key.
 */
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
    (operation == CMD_INSERT || update_tuple_routing_needed))

The first if test would not be required if you fixed the code where you set update_tuple_routing_needed = true regardless if its a partitioned table or not.

So basically, you need to take the node->part_cols_updated from the planner, if that's true then perform your test for before row update triggers, set a bool to false if there are none, then proceed to setup the partition tuple routing for partition table inserts or if your bool is still true. Right?

32. "WCO" abbreviation is not that common and might need to be expanded.

* Below are required as reference objects for mapping partition
* attno's in expressions such as WCO and RETURNING.

Searching for other comments which mention "WCO" they're all around places that is easy to understand they mean "With Check Option", e.g. next to a variable with a more descriptive name. That's not the case here.

33. "are anyway newly allocated", should "anyway" be "always"? Otherwise, it does not make sense.

* If this result rel is one of the subplan result rels, let
* ExecEndPlan() close it. For INSERTs, this does not apply because
* all leaf partition result rels are anyway newly allocated.

34. Comment added which mentions a member that does not exist.

* all_part_cols contains all attribute numbers from the parent that are
* used as partitioning columns by the parent or some descendent which is
* itself partitioned.
*

I've not looked at the test coverage as I see Thomas has been looking at that in some detail.

I'm going to set this patch as waiting for author now. Thanks again for working on this.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
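Point 31 is essentially about when update tuple routing needs to be set up at all. Below is a small standalone model of that decision as it is framed in this thread: routing is only relevant for a partitioned target, and then either because the statement assigns to a partition key column or because a BEFORE UPDATE row trigger could change the key at run time. The struct and function names are invented for the example and are not the patch's code.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative inputs to the decision discussed in point 31 (names are stand-ins). */
typedef struct
{
    bool        target_is_partitioned; /* relkind == RELKIND_PARTITIONED_TABLE */
    bool        part_key_updated;      /* planner saw a partition-key column in SET */
    bool        has_br_update_trigger; /* a BEFORE UPDATE row trigger might change the key */
} update_target_info;

/*
 * Update tuple routing is only ever needed for a partitioned target, and then
 * either because the statement itself assigns to a key column, or because a
 * BEFORE UPDATE row trigger could modify the key at run time.
 */
static bool
update_tuple_routing_needed(const update_target_info *t)
{
    if (!t->target_is_partitioned)
        return false;
    return t->part_key_updated || t->has_br_update_trigger;
}

int
main(void)
{
    update_target_info plain = {false, false, true};
    update_target_info keyupd = {true, true, false};
    update_target_info trig = {true, false, true};

    printf("plain table: %d\n", update_tuple_routing_needed(&plain));
    printf("key updated: %d\n", update_tuple_routing_needed(&keyupd));
    printf("BR trigger:  %d\n", update_tuple_routing_needed(&trig));
    return 0;
}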
David Rowley wrote:

> 5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
> do something special when the DELETE/INSERT is a partition move? I
> have audit tables in mind here it may appear as though a user
> performed a DELETE when they actually performed an UPDATE giving
> visibility of this to the trigger function will allow the application
> to work around this.

+1 I think we do need a flag that can be inspected from the user trigger function.

> 9. Also, this should likely be reworded:
>
> * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
> * this is 0.
>
> 'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

Also:

/pgsql/source/master/src/backend/executor/execMain.c: In function 'ExecSetupPartitionTupleRouting':
/pgsql/source/master/src/backend/executor/execMain.c:3401:18: warning: 'leaf_part_arr' may be used uninitialized in this function [-Wmaybe-uninitialized]
  leaf_part_rri = leaf_part_arr + i;
  ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~

I think using num_update_rri==0 as a flag to indicate INSERT is strange. I suggest passing an additional boolean -- or maybe just split the whole function in two, one for updates and another for inserts, say ExecSetupPartitionTupleRoutingForInsert() and ExecSetupPartitionTupleRoutingForUpdate(). They seem to share almost no code, and the current flow is hard to read; maybe just add a common subroutine for the lower bottom of the loop.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
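As a rough illustration of the split suggested here, the two entry points could share a helper for the per-partition work at the bottom of the loop, with only the UPDATE variant re-using the already-initialized per-subplan result rels. This is only a shape sketch with made-up names, not the real ExecSetupPartitionTupleRouting() code.

#include <stdbool.h>
#include <stdio.h>

/*
 * Shape of the suggested split (all names are illustrative stand-ins): two
 * thin entry points, one per statement type, sharing a common helper for the
 * per-partition work at the bottom of the loop.
 */
static void
init_one_partition(int leaf_index, bool reuse_existing)
{
    /* heap_open / InitResultRelInfo / convert_tuples_by_name would live here */
    printf("leaf %d: %s result rel\n", leaf_index,
           reuse_existing ? "reusing subplan" : "allocating new");
}

static void
setup_routing_for_insert(int num_leaves)
{
    int         i;

    for (i = 0; i < num_leaves; i++)
        init_one_partition(i, false);   /* INSERT always allocates new result rels */
}

static void
setup_routing_for_update(int num_leaves, const bool *leaf_has_subplan)
{
    int         i;

    for (i = 0; i < num_leaves; i++)
        init_one_partition(i, leaf_has_subplan[i]); /* reuse UPDATE subplan rels */
}

int
main(void)
{
    bool        has_subplan[3] = {true, false, true};

    setup_routing_for_insert(3);
    setup_routing_for_update(3, has_subplan);
    return 0;
}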
Thanks David Rowley, Alvaro Herrera and Thomas Munro for stepping in for the reviews ! In the attached patch v24, I have addressed Amit Langote's remaining review points, and David Rowley's comments upto point #26. Yet to address : Robert's few suggestions. All of Alvaro's comments. David's points from #27 to #34. Thomas's point about adding remaining test coverage on transition tables. Below has the responses for both Amit's and David's comments, starting with Amit's .... =============== On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/10/24 0:15, Amit Khandekar wrote: >> On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> >>> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup >>> == NULL)))) >>> >>> Is there some reason why a bitwise operator is used here? >> >> That exact condition means that the function is called for transition >> capture for updated rows being moved to another partition. For this >> scenario, either the oldtup or the newtup is NULL. I wanted to exactly >> capture that condition there. I think the bitwise operator is more >> user-friendly in emphasizing the point that it is indeed an "either a >> or b, not both" condition. > > I see. In that case, since this patch adds the new condition, a note > about it in the comment just above would be good, because the situation > you describe here seems to arise only during update-tuple-routing, IIUC. Done. Please check. > + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used > + * instead of allocating new ones while generating the array of all leaf > + * partition result rels. > > Instead of: > > "These are re-used instead of allocating new ones while generating the > array of all leaf partition result rels." > > how about: > > "There is no need to allocate a new ResultRellInfo entry for leaf > partitions for which one already exists in this array" Ok. I have made it like this : + * 'update_rri' contains the UPDATE per-subplan result rels. For the output param + * 'partitions', we don't allocate new ResultRelInfo objects for + * leaf partitions for which they are already available in 'update_rri'. > >>> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex >>> interface. I guess it could simply have the following interface: >>> >>> static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate, >>> HeapTuple tuple, bool is_update); >>> >>> And figure out, based on the value of is_update, which map to use and >>> which slot to set *p_new_slot to (what is now "new_slot" argument). >>> You're getting mtstate here anyway, which contains all the information you >>> need here. It seems better to make that (selecting which map and which >>> slot) part of the function's implementation if we're having this function >>> at all, imho. Maybe I'm missing some details there, but my point still >>> remains that we should try to put more logic in that function instead of >>> it just do the mechanical tuple conversion. >> >> I tried to see how the interface would look if we do that way. 
Here is >> how the code looks : >> >> static TupleTableSlot * >> ConvertPartitionTupleSlot(ModifyTableState *mtstate, >> bool for_update_tuple_routing, >> int map_index, >> HeapTuple *tuple, >> TupleTableSlot *slot) >> { >> TupleConversionMap *map; >> TupleTableSlot *new_slot; >> >> if (for_update_tuple_routing) >> { >> map = mtstate->mt_persubplan_childparent_maps[map_index]; >> new_slot = mtstate->mt_rootpartition_tuple_slot; >> } >> else >> { >> map = mtstate->mt_perleaf_parentchild_maps[map_index]; >> new_slot = mtstate->mt_partition_tuple_slot; >> } >> >> if (!map) >> return slot; >> >> *tuple = do_convert_tuple(*tuple, map); >> >> /* >> * Change the partition tuple slot descriptor, as per converted tuple. >> */ >> ExecSetSlotDescriptor(new_slot, map->outdesc); >> ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true); >> >> return new_slot; >> } >> >> It looks like the interface does not much simplify, and above that, we >> have more number of lines in that function. Also, the caller anyway >> has to be aware whether the map_index is the index into the leaf >> partitions or the update subplans. So it is not like the caller does >> not have to be aware about whether the mapping should be >> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps. > > Hmm, I think we should try to make it so that the caller doesn't have to > be aware of that. And by caller I guess you mean ExecInsert(), which > should not be a place, IMHO, where to try to introduce a lot of new logic > specific to update tuple routing. I think, for ExecInsert() since we have already given the job of routing the tuple from root partitioned table to a partition, it makes sense to give the function the additional job of routing the tuple from any partition to any partition. ExecInsert() can be looked at as doing this job : "insert a tuple into the right partition; the original tuple can belong to any partition" > With that, now there are no persubplan and perleaf arrays for ExecInsert() > to pick from to select a map to pass to ConvertPartitionTupleSlot(), or > maybe even no need for the separate function. The tuple-routing code > block in ExecInsert would look like below (writing resultRelInfo as just Rel): > > rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel > > if (rootRel != Rel) /* update tuple-routing active */ > { > int subplan_off = Rel - mtstate->Rel[0]; > int leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off]; > > if (mt_transition_tupconv_maps[leaf_off]) > { > /* > * Convert to root format using > * mt_transition_tupconv_maps[leaf_off] > */ > > slot = mt_root_tuple_slot; /* for tuple-routing */ > > /* Store the converted tuple into slot */ > } > } > > /* Existing tuple-routing flow follows */ > new_leaf = ExecFindPartition(rootRel, slot, ...) > > if (mtstate->transition_capture) > { > transition_capture_map = mt_transition_tupconv_maps[new_leaf] > } > > if (mt_partition_tupconv_maps[new_leaf]) > { > /* > * Convert to leaf format using mt_partition_tupconv_maps[new_leaf] > */ > > slot = mt_partition_tuple_slot; > > /* Store the converted tuple into slot */ > } > After doing the changes for the int[] array map in the previous patch version, I still feel that ConvertPartitionTupleSlot() should be retained. We save some repeated lines of code saved. >> On HEAD, the "parent Plan" refers to >> mtstate->mt_plans[0]. Now in the patch, for the parent node in >> ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the >> parent plan refers to this mtstate node. 
> > Hmm, I'm not really sure if doing that (passing mtstate->ps) would be > accurate. In the update tuple routing case, it seems that it's better to > pass the correct parent PlanState pointer to ExecInitQual(), that is, one > corresponding to the partition's sub-plan. At least I get that feeling by > looking at how parent is used downstream to that ExecInitQual() call, but > there *may* not be anything to worry about there after all. I'm unsure. > >> BTW, the reason I had changed the parent node to mtstate->ps is : >> Other places in that code use mtstate->ps while initializing >> expressions : >> >> /* >> * Build a projection for each result rel. >> */ >> resultRelInfo->ri_projectReturning = >> ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, >> resultRelInfo->ri_RelationDesc->rd_att); >> >> ........... >> >> /* build DO UPDATE WHERE clause expression */ >> if (node->onConflictWhere) >> { >> ExprState *qualexpr; >> >> qualexpr = ExecInitQual((List *) node->onConflictWhere, >> &mtstate->ps); >> .... >> } >> >> I think wherever we initialize expressions belonging to a plan, we >> should use that plan as the parent. WithCheckOptions are fields of >> ModifyTableState. > > You may be right, but I see for WithCheckOptions initialization > specifically that the non-tuple-routing code passes the actual sub-plan > when initializing the WCO for a given result rel. Yes that's true. The problem with WithCheckOptions for newly allocated partition result rels is : we can't use a subplan for the parent parameter because there is no subplan for it. But I will still think on it a bit more (TODO). > >>> Comments on the optimizer changes: >>> >>> +get_all_partition_cols(List *rtables, >>> >>> Did you mean rtable? >> >> I did mean rtables. It's a list of rtables. > > It's not, AFAIK. rtable (range table) is a list of range table entries, > which is also what seems to get passed to get_all_partition_cols for that > argument (root->parse->rtable, which is not a list of lists). > > Moreover, there are no existing instances of this naming within the > planner other than those that this patch introduces: > > $ grep rtables src/backend/optimizer/ > planner.c:114: static void get_all_partition_cols(List *rtables, > planner.c:1063: get_all_partition_cols(List *rtables, > planner.c:1069: Oid root_relid = getrelid(root_rti, rtables); > planner.c:1078: Oid relid = getrelid(rti, rtables); > > OTOH, dependency.c does have rtables, but it's actually a list of range > tables. For example: > > dependency.c:1360: context.rtables = list_make1(rtable); Yes, Ok. To be consistent with naming convention at multiple places, I have changed it to rtable. > >>> + if (partattno != 0) >>> + child_keycols = >>> + bms_add_member(child_keycols, >>> + partattno - >>> FirstLowInvalidHeapAttributeNumber); >>> + } >>> + foreach(lc, partexprs) >>> + { >>> >>> Elsewhere (in quite a few places), we don't iterate over partexprs >>> separately like this, although I'm not saying it is bad, just different >>> from other places. >> >> I think you are suggesting we do it like how it's done in >> is_partition_attr(). Can you please let me know other places we do >> this same way ? I couldn't find. > > OK, not as many as I thought there would be, but there are following > beside is_partition_attrs(): > > partition.c: get_range_nulltest() > partition.c: get_qual_for_range() > relcache.c: RelationBuildPartitionKey() > Ok, I think I will first address Robert's suggestion of re-using is_partition_attrs() for pull_child_partition_columns(). 
If I do that, this discussion won't be applicable, so I am deferring this one. (TODO) ============= Below are my responses to David's comments upto point #26 : On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > [ update-partition-key_v23.patch ] > > Hi Amit, > > Thanks for working on this. I'm looking forward to seeing this go in. > > So... I've signed myself up to review the patch, and I've just had a > look at it, (after first reading this entire email thread!). Thanks a lot for your extensive review. > > Overall the patch looks like it's in quite a good shape. Nice to hear that. > I think I do agree with Robert about the UPDATE anomaly that's been discussed. > I don't think we're painting ourselves into any corner by not having > this working correctly right away. Anyone who's using some trigger > workarounds for the current lack of support for updating the partition > key is already going to have the same issues, so at least this will > save them some troubles implementing triggers and give them much > better performance. I believe you are referring to the concurrency anomaly. Yes I agree on that. By the way, (you may be already aware), there is a separate mail thread going on to address this anamoly, so that we don't silently proceed with the UPDATE without error : https://www.postgresql.org/message-id/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw%40mail.gmail.com > 1. Closing command tags in docs should not be abbreviated > > triggers are concerned, <literal>AFTER</> <command>DELETE</command> and > > This changed in c29c5789. I think Peter will be happy if you don't > abbreviate the closing tags. Added the tag. I had done most of the corrections after I rebased over this commit, but I think I missed some of those with <literal> tag. > > 2. "about to do" would read better as "about to perform" > > concurrent session, and it is about to do an <command>UPDATE</command> > > I think this paragraph could be more clear if we identified the > sessions with a number. > > Perhaps: > Suppose, session 1 is performing an <command>UPDATE</command> on a > partition key, meanwhile, session 2 tries to perform an <command>UPDATE > </command> or <command>DELETE</command> operation on the same row. > Session 2 can silently miss the row due to session 1's activity. In > such a case, session 2's <command>UPDATE</command>/<command>DELETE > </command>, being unaware of the row's movement, interprets this that the > row has just been deleted, so there is nothing to be done for this row. > Whereas, in the usual case where the table is not partitioned, or where > there is no row movement, the second session would have identified the > newly updated row and carried <command>UPDATE</command>/<command>DELETE > </command> on this new row version. Done like above, with slight changes. > > > 3. Integer width. get_partition_natts returns int but we assign to int16. > > int16 partnatts = get_partition_natts(key); > > Confusingly get_partition_col_attnum() returns int16 instead of AttrNumber > but that's existingly not correct. > > 4. The following code could just pull_varattnos(partexprs, 1, &child_keycols); > > foreach(lc, partexprs) > { > Node *expr = (Node *) lfirst(lc); > > pull_varattnos(expr, 1, &child_keycols); > } I will defer this till I address Robert's request to try and see if we can have a common code for pull_child_partition_columns() and is_partition_attr(). (TODO) > > 5. Triggers. 
Do we need a new "TG_" tag to allow trigger functions to > do something > special when the DELETE/INSERT is a partition move? I have audit > tables in mind here > it may appear as though a user performed a DELETE when they actually > performed an UPDATE > giving visibility of this to the trigger function will allow the > application to work around this. I feel it's too early to add a user-visible variable for such purpose. Currently we don't support triggers on partitioned tables, and so a user who wants to have a common trigger for a partition subtree has no choice but to install the same trigger on all the leaf partitions under it. And so we have to live with a not-very-obvious behaviour of firing triggers even for the delete/insert part of the update row movement. > > 6. change "row" to "a row" and "old" to "the old" > > * depending on whether the event is for row being deleted from old > > But to be honest, I'm having trouble parsing the comment. I think it > would be better to > say explicitly when the row will be NULL rather than "depending on > whether the event" I have put it this way now : * For INSERT events newtup should be non-NULL, for DELETE events * oldtup should be non-NULL, whereas for UPDATE events normally both * oldtup and newtup are non-NULL. But for UPDATE event fired for * capturing transition tuples during UPDATE partition-key row * movement, oldtup is NULL when the event is for row being inserted, * whereas newtup is NULL when the event is for row being deleted. > > 7. I'm confused with how this change came about. If the old comment > was correct here then the comment you're referring to here should > remain in ExecPartitionCheck(), but you're saying it's in > ExecConstraints(). > > /* See the comments in ExecConstraints. */ > > If the comment really is in ExecConstraints(), then you might want to > give an overview of what you mean, then reference ExecConstraints() if > more details are required. I have put it this way : * Need to first convert the tuple to the root partitioned table's row * type. For details, check similar comments in ExecConstraints(). Basically, the comment to be referred in ExecConstraints() is this : * If the tuple has been routed, it's been converted to the * partition's rowtype, which might differ from the root * table's. We must convert it back to the root table's * rowtype so that val_desc shown error message matches the * input tuple. > > 8. I'm having trouble parsing this comment: > > * 'update_rri' has the UPDATE per-subplan result rels. > > I think "has" should be "contains" ? Ok, changed it to 'contains'. > > 9. Also, this should likely be reworded: > > * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT, > * this is 0. > > 'num_update_rri' number of elements in 'update_rri' array or zero for INSERT. Done. > > 10. There should be no space before the '?' > > /* Is this leaf partition present in the update resultrel ? */ Done. > > 11. I'm struggling to understand this comment: > > * This is required when converting tuple as per root > * partition tuple descriptor. > > "tuple" should probably be "the tuple", but not quite sure what you > mean by "as per root". > > I may have misunderstood, but maybe it should read: > > * This is required when we convert the partition's tuple to > * be compatible with the partitioned table's tuple descriptor. 
ri_PartitionRoot is set to NULL while creating the result rels for each of the UPDATE subplans; and it is required to be set to the root table for leaf partitions created for tuple routing so that the error message displays the row in root tuple descriptor. Because we re-use the same result rels for the per-partition array, we need to set it for them here. I have reworded the comment this way : * This is required when we convert the partition's tuple to be * compatible with the root partitioned table's tuple * descriptor. When generating the per-subplan UPDATE result * rels, this was not set. Let me know if this is clear enough. > > 12. I think "as well" would be better written as "either". > > * If we didn't open the partition rel, it means we haven't > * initialized the result rel as well. Done. > > 13. I'm unsure what is meant by the following comment: > > * Verify result relation is a valid target for insert operation. Even > * for updates, we are doing this for tuple-routing, so again, we need > * to check the validity for insert operation. > > I'm not quite sure where UPDATE comes in here as we're only checking for INSERT? Here, "Even for update" means "Even when ExecSetupPartitionTupleRouting() is called for an UPDATE operation". > > 14. Use of underscores instead of camelCase. > > COPY_SCALAR_FIELD(part_cols_updated); > > I know you're not the first one to break this as "partitioned_rels" > does not follow it either, but that's probably not a good enough > reason to break away from camelCase any further. > > I'd suggest "partColsUpdated". But after a re-think, maybe cols is > incorrect. All columns are partitioned, it's the key columns that we > care about, so how about "partKeyUpdate" Sure. I have used partKeyUpdated as against partKeyUpdate. > > 15. Are you sure that you mean "root" here? > > * All the child partition attribute numbers are converted to the root > * partitioned table. > > Surely this is just the target relation. "parent" maybe? A > sub-partitioned table might be the target of an UPDATE too. Here the root means the root of the partition subtree, which is also the UPDATE target relation. I think in other places we call it the root even though it may also have ancestors. It is the root of the subtree in question. This is similar to how we have named the ModifyTableState->rootResultRelInfo field. Note that Robert has requested to collect the partition cols at some other place where we have already the table open. So this function itself may change. > > 15. I see get_all_partition_cols() is just used once to check if > parent_rte->updatedCols contains and partition keys. > > Would it not be better to reform that function and pass > parent_rte->updatedCols in and abort as soon as you see a single > match? > > Maybe the function could return bool and be named > partitioned_key_overlaps(), that way your assignment in > inheritance_planner() would just become: > > part_cols_updated = partitioned_key_overlaps(root->parse->rtable, > top_parentRTindex, partitioned_rels, parent_rte->updatedCols); > > or something like that anyway. I am going to think on all of this when I start checking if we can have some common code for pull_child_partition_columns() and is_partition_attr(). (TODO) One thing to note is : Usually the user is not going to modify partition cols. So typically we would need to scan through all the partitioned tables to check if the partition key is modified. 
So to make this scan more efficient, avoid the "bitmap_overlap" operation for each of the partitioned tables separately, and instead, collect them first from all partitioned tables, and then do a single overlap operation. This way we make the normal updates a tiny bit fast, at the expense of tiny-bit slower partition-key-updates because we don't abort the scan as soon as we find the partition key updated. > > 16. Typo in comment > > * 'part_cols_updated' if any partitioning columns are being updated, either > * from the named relation or a descendent partitione table. > > "partitione" should be "partitioned". Also, normally for bool > parameters, we might word things like "True if ..." rather than just "if" > > You probably should follow camelCase I mentioned in 14 here too. Done. Similar to the other bool param canSetTag, made it : "'partColsUpdated' is true if any ..." > > 17. Comment needs a few changes: > > * ConvertPartitionTupleSlot -- convenience function for converting tuple and > * storing it into a tuple slot provided through 'new_slot', which typically > * should be one of the dedicated partition tuple slot. Passes the partition > * tuple slot back into output param p_old_slot. If no mapping present, keeps > * p_old_slot unchanged. > * > * Returns the converted tuple. > > There are a few typos here. For example, "tuple" should be "a tuple", > but maybe the comment should just be worded like: > > * ConvertPartitionTupleSlot -- convenience function for tuple conversion > * using 'map'. The tuple, if converted, is stored in 'new_slot' and > * 'p_old_slot' is set to the original partition tuple slot. If map is NULL, > * then the original tuple is returned unmodified, otherwise the converted > * tuple is returned. Modified, with some changes. p_old_slot name is a bit confusing. So I have renamed it to p_my_slot. Here is how it looks now : * ConvertPartitionTupleSlot -- convenience function for tuple conversion using * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is * updated with the 'new_slot'. 'new_slot' typically should be one of the * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged. * * Returns the converted tuple, unless map is NULL, in which case original * tuple is returned unmodified. > > 18. Line goes over 80 chars. > > TransitionCaptureState *transition_capture = mtstate->mt_transition_capture; > > Better just to split the declaration and assignment. Done. > > 19. Confusing comment: > > /* > * If the original operation is UPDATE, the root partitioned table > * needs to be fetched from mtstate->rootResultRelInfo. > */ > > It's not that clear here how you determine this is an UPDATE of a > partitioned key. > > 20. This code looks convoluted: > > rootResultRelInfo = (mtstate->rootResultRelInfo ? > mtstate->rootResultRelInfo : resultRelInfo); > > /* > * If the resultRelInfo is not the root partitioned table (which > * happens for UPDATE), we should convert the tuple into root's tuple > * descriptor, since ExecFindPartition() starts the search from root. > * The tuple conversion map list is in the order of > * mtstate->resultRelInfo[], so to retrieve the one for this resultRel, > * we need to know the position of the resultRel in > * mtstate->resultRelInfo[]. > */ > if (rootResultRelInfo != resultRelInfo) > { > > rootResultRelInfo is assigned via a ternary expression which makes the > subsequent if test seem a little strange. 
> > Would it not be better to test: > > if (mtstate->rootResultRelInfo) > { > rootResultRelInfo = mtstate->rootResultRelInfo > ... other stuff ... > } > else > rootResultRelInfo = resultRelInfo; > > Then above the if test you can explain that rootResultRelInfo is only > set during UPDATE of partition keys, as per #19. Giving more thought on this, I think to avoid confusion to the reader, we better have an explicit (operation == CMD_UPDATE) condition, and in that block, assert that mtstate->rootResultRelInfo is non-NULL. I have accordingly shuffled the if conditions. I think this is simple and clear. Please check. > > 21. How come you renamed mt_partition_tupconv_maps[] to > mt_parentchild_tupconv_maps[]? mt_transition_tupconv_maps must be renamed to a more general map name because it is not only used for transition capture but also for update tuple routing. And we have mt_partition_tupconv_maps which is already a general name. So to distinguish between the two tupconv maps, I prepended "parent-child" or "child-parent" to "tupconv_maps". > > 22. Comment in ExecInsert() could be worded better. > > /* > * In case this is part of update tuple routing, put this row into the > * transition NEW TABLE if we are capturing transition tables. We need to > * do this separately for DELETE and INSERT because they happen on > * different tables. > */ > > /* > * This INSERT may be the result of a partition-key-UPDATE. If so, > * and we're required to capture transition tables then we'd better > * record this as a statement level UPDATE on the target relation. > * We're not interested in the statement level DELETE or INSERT as > * these occur on the individual partitions, none of which are the > * target of this the UPDATE statement. > */ > > A similar comment could use a similar improvement in ExecDelete() I want to emphasize the fact that we need to do the OLD and NEW row separately for DELETE and INSERT. And also, I think we need not mention about statement triggers, though the transition table capture with partitions currently is supported only for statement triggers. We should only worry about capturing the row if mtstate->mt_transition_capture != NULL, without having to know whether it is for statement trigger or not. Below is how the comment looks now after I did some changes as per your suggestion about wording : * If this INSERT is part of a partition-key-UPDATE and we are capturing * transition tables, put this row into the transition NEW TABLE. * (Similarly we need to add the deleted row in OLD TABLE). We need to do * this separately for DELETE and INSERT because they happen on different * tables. > > 23. Line is longer than 80 chars. > > TransitionCaptureState *transition_capture = mtstate->mt_transition_capture; Done. > > 24. I know from reading the thread this name has changed before, but I > think delete_skipped seems like the wrong name for this variable in: > > if (delete_skipped) > *delete_skipped = true; > > Skipped is the wrong word here as that indicates like we had some sort > of choice and that we decided not to. However, that's not the case > when the tuple was concurrently deleted. Would it not be better to > call it "tuple_deleted" or even "success" and reverse the logic? It's > just a bit confusing that you're setting this to skipped before > anything happens. It would be nicer if there was a better way to do > this whole thing as it's a bit of a wart in the code. I understand why > the code exists though. I think "success" sounds like : if it is false, ExecDelete has failed. 
So I have chosen "tuple_deleted". "tuple_actually_deleted" might sound still better, but it is too long. > Also, I wonder if it's better to always pass a boolean here to save > having to test for NULL before setting it, that way you might consider > putting the success = false just before the return NULL, then do > success = true after the tuple is gone. > Failing that, putting: something like: success = false; /* not yet! */ > where you're doing the if (deleted_skipped) test is might also be > better. I didn't really understand this. > > 25. Comment "we should" should be "we must". > > /* > * For some reason if DELETE didn't happen (for e.g. trigger > * prevented it, or it was already deleted by self, or it was > * concurrently deleted by another transaction), then we should > * skip INSERT as well, otherwise, there will be effectively one > * new row inserted. > > Maybe just: > /* If the DELETE operation was unsuccessful, then we must not > * perform the INSERT into the new partition. I think we better mention some scenarios of why this can happen , otherwise its confusing to the reader why the delete can't happen, or why we shouldn't error out in that case. > > "for e.g." is not really correct in English. "For example, ..." or > just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've > written "For exempli gratia", which translates to "For for example". I see. Good to know that. Done. > > 26. You're not really explaining what's going on here: > > if (mtstate->mt_transition_capture) > saved_tcs_map = mtstate->mt_transition_capture->tcs_map; > > You have a comment later to say you're about to "Revert back to the > transition capture map", but I missed the part that explained about > modifying it in the first place. I have now added main comments while saving the map, and I refer to this comment while reverting back the map. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
The following contains replies to David's remaining comments , i.e. from #27 onwards, followed by replies to Alvaro's review comments. Attached is the revised patch v25. ===================== On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote: > > 27. Comment does not explain how we're skipping checking the partition > constraint check in: > > * We have already checked partition constraints above, so skip > * checking them here. > > Maybe something like: > > * We've already checked the partition constraint above, however, we > * must still ensure the tuple passes all other constraints, so we'll > * call ExecConstraints() and have it validate all remaining checks. Done. > > 28. For table WITH OIDs, the OID should probably follow the new tuple > for partition-key-UPDATEs. > I understand that as far as possible we want to simulate the UPDATE as if it's a normal table update. But for system columns, I think we should avoid that; and instead, let the system handle it the way it is handling (i.e. the new row in a table should always have a new OID.) > 29. ExecSetupChildParentMap gets called here for non-partitioned relations. > Maybe that's not the best function name? The function only seems to do > that when perleaf is True. I didn't clearly understand this, particularly, what task you were referring to when you said "the function only seems to do that" ? The function does setup child-parent map even when perleaf=false. The function name is chosen that way because the map is always a child-to-root map, but the map array elements may be arranged in the order of the per-partition array 'mtstate->mt_partitions[]', or in the order of the per-subplan result rels 'mtstate->resultRelInfo[]' > > Is a leaf a partition of a partitioned table? It's not that clear the > meaning here. Leaf partition means it is a child of a partitioned table, but it itself is not a partitioned table. I have added more comments for the function ExecSetupChildParentMap() (both, at the function header and inside). Please check and let me know if you still have questions. > > 30. The following chunk of code is giving me a headache trying to > verify which arrays are which size: > > ExecSetupPartitionTupleRouting(rel, > mtstate->resultRelInfo, > (operation == CMD_UPDATE ? nplans : 0), > node->nominalRelation, > estate, > &partition_dispatch_info, > &partitions, > &partition_tupconv_maps, > &subplan_leaf_map, > &partition_tuple_slot, > &num_parted, &num_partitions); > mtstate->mt_partition_dispatch_info = partition_dispatch_info; > mtstate->mt_num_dispatch = num_parted; > mtstate->mt_partitions = partitions; > mtstate->mt_num_partitions = num_partitions; > mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps; > mtstate->mt_subplan_partition_offsets = subplan_leaf_map; > mtstate->mt_partition_tuple_slot = partition_tuple_slot; > mtstate->mt_root_tuple_slot = MakeTupleTableSlot(); > > I know this patch is not completely responsible for it, but you're not > making things any better. > > Would it not be better to invent some PartitionTupleRouting struct and > make that struct a member of ModifyTableState and CopyState, then just > pass the pointer to that struct to ExecSetupPartitionTupleRouting() > and have it fill in the required details? I think the complexity of > this is already on the high end, I think you really need to do the > refactor before this gets any worse. > Ok. I am currently working on doing this change. So not yet included in the attached patch. 
Will send yet another revised patch for this change. (TODO) > > 31. The following code seems incorrect: > > /* > * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may > * need to do update tuple routing. > */ > if (resultRelInfo->ri_TrigDesc && > resultRelInfo->ri_TrigDesc->trig_update_before_row && > operation == CMD_UPDATE) > update_tuple_routing_needed = true; > > Shouldn't this be setting update_tuple_routing_needed to false if > there are no before row update triggers? Otherwise, you're setting it > to true regardless of if there are any partition key columns being > UPDATEd. That would make the work you're doing in > inheritance_planner() to set part_cols_updated a waste of time. The point of setting it to true regardless of whether the partition key is updated is : even if partition key is not explicitly modified by the UPDATE, a before-row trigger can update it later. But we can never know whether it is actually going to update. So if there are BR UPDATE triggers on result rels of any of the subplans, we *always* setup the tuple routing. This approach was concluded in the earlier discussions about trigger handling. > > Also, this bit of code is a bit confused. > > /* Decide whether we need to perform update tuple routing. */ > if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) > update_tuple_routing_needed = false; > > /* > * Build state for tuple routing if it's an INSERT or if it's an UPDATE of > * partition key. > */ > if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE && > (operation == CMD_INSERT || update_tuple_routing_needed)) > > > The first if test would not be required if you fixed the code where > you set update_tuple_routing_needed = true regardless if its a > partitioned table or not. The place where I set update_tuple_routing_needed to true unconditionally, we don't have the relation open, so we don't know whether it is a partitioned table. Hence, set it anyways, and then revert it to false if it's not a partitioned table after all. > > So basically, you need to take the node->part_cols_updated from the > planner, if that's true then perform your test for before row update > triggers, set a bool to false if there are none, then proceed to setup > the partition tuple routing for partition table inserts or if your > bool is still true. Right? I think if we look at "update_tuple_routing_needed" as meaning that update tuple routing *may be* required, then the logic as-is makes sense: Set the variable if we see that we may require to do update routing. And the conditions for that are : either node->partKeyUpdated is true, or there is a BR UPDATE trigger and the operation is UPDATE. So set this variable for those conditions, and revert it back to false later if it is found that it's not a partitioned table. So I have retained the existing logic in the patch, but with some additional comments to make this logic clear to the reader. > > 32. "WCO" abbreviation is not that common and might need to be expanded. > > * Below are required as reference objects for mapping partition > * attno's in expressions such as WCO and RETURNING. > > Searching for other comments which mention "WCO" they're all around > places that is easy to understand they mean "With Check Option", e.g. > next to a variable with a more descriptive name. That's not the case > here. Ok. Changed WCO to WithCheckOptions. > > 33. "are anyway newly allocated", should "anyway" be "always"? > Otherwise, it does not make sense. > OK. 
Changed this : * because all leaf partition result rels are anyway newly allocated. to this (also removed 'all') : * because leaf partition result rels are always newly allocated. > > 34. Comment added which mentions a member that does not exist. > > * all_part_cols contains all attribute numbers from the parent that are > * used as partitioning columns by the parent or some descendent which is > * itself partitioned. > * Oops. Left-overs from earlier patch. Removed the comment. ===================== Below are Alvaro's review comments : On 14 November 2017 at 22:22, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > David Rowley wrote: > >> 5. Triggers. Do we need a new "TG_" tag to allow trigger functions to >> do something special when the DELETE/INSERT is a partition move? I >> have audit tables in mind here it may appear as though a user >> performed a DELETE when they actually performed an UPDATE giving >> visibility of this to the trigger function will allow the application >> to work around this. > > +1 I think we do need a flag that can be inspected from the user > trigger function. What I feel is : it's too early to do such changes. I think we should first get in the core patch, and then consider this request and any further enhancements. > >> 9. Also, this should likely be reworded: >> >> * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT, >> * this is 0. >> >> 'num_update_rri' number of elements in 'update_rri' array or zero for INSERT. > > Also: > > /pgsql/source/master/src/backend/executor/execMain.c: In function 'ExecSetupPartitionTupleRouting': > /pgsql/source/master/src/backend/executor/execMain.c:3401:18: warning: 'leaf_part_arr' may be used uninitialized in thisfunction [-Wmaybe-uninitialized] > leaf_part_rri = leaf_part_arr + i; > ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ > Right. I have now made "leaf_part_arr = NULL" during declaration. Actually leaf_part_arr is used only for inserts; but for compiler-sake we should add this initialization. > I think using num_update_rri==0 as a flag to indicate INSERT is strange. > I suggest passing an additional boolean -- I think adding another param looks redundant. To make the condition more readable, I have introduced a new local variable : bool is_update = (num_update_rri > 0); > or maybe just split the whole > function in two, one for updates and another for inserts, say > ExecSetupPartitionTupleRoutingForInsert() and > ExecSetupPartitionTupleRoutingForUpdate(). They seem to > share almost no code, and the current flow is hard to read; maybe just > add a common subroutine for the lower bottom of the loop. So there are two common code sections. One is the initial code which initializes various arrays and output params. And the 2nd common code is the 2nd half of the for loop block that includes calls to heap_open(), InitResultRelInfo(), convert_tuples_by_name(), CheckValidResultRel() and others. So it looks like there is a lot of common code. We would need to have two functions, one to have the initialization code, and the other to run the later half of the loop. And, heap_open() and InitResultRelInfo() need to be called only if partrel (which needs to be passed as function param) is NULL. Rather than this, I think this condition is better placed in-line in ExecSetupPartitionTupleRouting() for clarity. I am feeling it's not worth doing the shuffling. We are extracting the code into two functions only to avoid the "if num_update_rri" conditions. That's why I feel having a "is_update" variable would solve the purpose. 
The hard-to-understand code, I presume, is the update part where it tries to re-use already-existing result rels, and this part would remain anyway, although in a separate function.

-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Thanks Amit. Looking at the latest v25 patch. On 2017/11/16 23:50, Amit Khandekar wrote: > Below has the responses for both Amit's and David's comments, starting > with Amit's .... > On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/10/24 0:15, Amit Khandekar wrote: >>> On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>> >>>> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup >>>> == NULL)))) >>>> >>>> Is there some reason why a bitwise operator is used here? >>> >>> That exact condition means that the function is called for transition >>> capture for updated rows being moved to another partition. For this >>> scenario, either the oldtup or the newtup is NULL. I wanted to exactly >>> capture that condition there. I think the bitwise operator is more >>> user-friendly in emphasizing the point that it is indeed an "either a >>> or b, not both" condition. >> >> I see. In that case, since this patch adds the new condition, a note >> about it in the comment just above would be good, because the situation >> you describe here seems to arise only during update-tuple-routing, IIUC. > > Done. Please check. Looks fine. >> + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used >> + * instead of allocating new ones while generating the array of all leaf >> + * partition result rels. >> >> Instead of: >> >> "These are re-used instead of allocating new ones while generating the >> array of all leaf partition result rels." >> >> how about: >> >> "There is no need to allocate a new ResultRellInfo entry for leaf >> partitions for which one already exists in this array" > > Ok. I have made it like this : > > + * 'update_rri' contains the UPDATE per-subplan result rels. For the > output param > + * 'partitions', we don't allocate new ResultRelInfo objects for > + * leaf partitions for which they are already available > in 'update_rri'. Sure. >>> It looks like the interface does not much simplify, and above that, we >>> have more number of lines in that function. Also, the caller anyway >>> has to be aware whether the map_index is the index into the leaf >>> partitions or the update subplans. So it is not like the caller does >>> not have to be aware about whether the mapping should be >>> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps. >> >> Hmm, I think we should try to make it so that the caller doesn't have to >> be aware of that. And by caller I guess you mean ExecInsert(), which >> should not be a place, IMHO, where to try to introduce a lot of new logic >> specific to update tuple routing. > > I think, for ExecInsert() since we have already given the job of > routing the tuple from root partitioned table to a partition, it makes > sense to give the function the additional job of routing the tuple > from any partition to any partition. ExecInsert() can be looked at as > doing this job : "insert a tuple into the right partition; the > original tuple can belong to any partition" Yeah, that's one way of looking at that. But I think ExecInsert() as it is today thinks it's got a *new* tuple and that's it. I think the newly introduced code in it to find out that it is not so (that the tuple actually comes from some other partition), that it's really the update-turned-into-delete-plus-insert, and then switch to the root partitioned table's ResultRelInfo, etc. really belongs outside of it. Maybe in its caller, which is ExecUpdate(). 
I mean why not add this code to the block in ExecUpdate() that handles update-row-movement. Just before calling ExecInsert() to do the re-routing seems like a good place to do all that. For example, try the attached incremental patch that applies on top of yours. I can see after applying it that diffs to ExecInsert() are now just some refactoring ones and there are no significant additions making it look like supporting update-row-movement required substantial changes to how ExecInsert() itself works. > After doing the changes for the int[] array map in the previous patch > version, I still feel that ConvertPartitionTupleSlot() should be > retained. We save some repeated lines of code saved. OK. >> You may be right, but I see for WithCheckOptions initialization >> specifically that the non-tuple-routing code passes the actual sub-plan >> when initializing the WCO for a given result rel. > > Yes that's true. The problem with WithCheckOptions for newly allocated > partition result rels is : we can't use a subplan for the parent > parameter because there is no subplan for it. But I will still think > on it a bit more (TODO). Alright. >>> I think you are suggesting we do it like how it's done in >>> is_partition_attr(). Can you please let me know other places we do >>> this same way ? I couldn't find. >> >> OK, not as many as I thought there would be, but there are following >> beside is_partition_attrs(): >> >> partition.c: get_range_nulltest() >> partition.c: get_qual_for_range() >> relcache.c: RelationBuildPartitionKey() >> > > Ok, I think I will first address Robert's suggestion of re-using > is_partition_attrs() for pull_child_partition_columns(). If I do that, > this discussion won't be applicable, so I am deferring this one. > (TODO) Sure, no problem. Thanks, Amit
Attachment
On 21 November 2017 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>
>> 30. The following chunk of code is giving me a headache trying to
>> verify which arrays are which size:
>>
>> ExecSetupPartitionTupleRouting(rel,
>>                                mtstate->resultRelInfo,
>>                                (operation == CMD_UPDATE ? nplans : 0),
>>                                node->nominalRelation,
>>                                estate,
>>                                &partition_dispatch_info,
>>                                &partitions,
>>                                &partition_tupconv_maps,
>>                                &subplan_leaf_map,
>>                                &partition_tuple_slot,
>>                                &num_parted, &num_partitions);
>> mtstate->mt_partition_dispatch_info = partition_dispatch_info;
>> mtstate->mt_num_dispatch = num_parted;
>> mtstate->mt_partitions = partitions;
>> mtstate->mt_num_partitions = num_partitions;
>> mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
>> mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
>> mtstate->mt_partition_tuple_slot = partition_tuple_slot;
>> mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
>>
>> I know this patch is not completely responsible for it, but you're not
>> making things any better.
>>
>> Would it not be better to invent some PartitionTupleRouting struct and
>> make that struct a member of ModifyTableState and CopyState, then just
>> pass the pointer to that struct to ExecSetupPartitionTupleRouting()
>> and have it fill in the required details? I think the complexity of
>> this is already on the high end, I think you really need to do the
>> refactor before this gets any worse.
>>
> Ok. I am currently working on doing this change. So not yet included in
> the attached patch. Will send yet another revised patch for this change.

Attached incremental patch encapsulate_partinfo.patch (to be applied over the latest v25 patch) contains changes to move all the partition-related information into the new structure PartitionTupleRouting. Further to that, I also moved PartitionDispatchInfo into this structure. So it looks like this:

typedef struct PartitionTupleRouting
{
    PartitionDispatch *partition_dispatch_info;
    int         num_dispatch;
    ResultRelInfo **partitions;
    int         num_partitions;
    TupleConversionMap **parentchild_tupconv_maps;
    int        *subplan_partition_offsets;
    TupleTableSlot *partition_tuple_slot;
    TupleTableSlot *root_tuple_slot;
} PartitionTupleRouting;

So this structure now encapsulates *all* the partition-tuple-routing-related information. So ModifyTableState now has only one tuple-routing-related field 'mt_partition_tuple_routing'. It is changed like this:

@@ -976,24 +976,14 @@ typedef struct ModifyTableState
     TupleTableSlot *mt_existing;    /* slot to store existing target tuple in */
     List       *mt_excludedtlist;   /* the excluded pseudo relation's tlist */
     TupleTableSlot *mt_conflproj;   /* CONFLICT ... SET ... projection target */
-    struct PartitionDispatchData **mt_partition_dispatch_info;
-                                    /* Tuple-routing support info */
-    int         mt_num_dispatch;    /* Number of entries in the above array */
-    int         mt_num_partitions;  /* Number of members in the following
-                                     * arrays */
-    ResultRelInfo **mt_partitions;  /* Per partition result relation pointers */
-    TupleTableSlot *mt_partition_tuple_slot;
-    TupleTableSlot *mt_root_tuple_slot;
+    struct PartitionTupleRouting *mt_partition_tuple_routing;
+                                    /* Tuple-routing support info */
     struct TransitionCaptureState *mt_transition_capture;
                                     /* controls transition table population for specified operation */
     struct TransitionCaptureState *mt_oc_transition_capture;
                                     /* controls transition table population for INSERT...ON CONFLICT UPDATE */
-    TupleConversionMap **mt_parentchild_tupconv_maps;
-                                    /* Per partition map for tuple conversion from root to leaf */
     TupleConversionMap **mt_childparent_tupconv_maps;
                                     /* Per plan/partition map for tuple conversion from child to root */
     bool        mt_is_tupconv_perpart;  /* Is the above map per-partition ? */
-    int        *mt_subplan_partition_offsets;
-                                    /* Stores position of update result rels in leaf partitions */
 } ModifyTableState;

So the code in nodeModifyTable.c (and similar code in copy.c) is accordingly adjusted to use mtstate->mt_partition_tuple_routing. The places where we use the (mtstate->mt_partition_dispatch_info != NULL) condition to run tuple-routing code, I have replaced with mtstate->mt_partition_tuple_routing != NULL. If you are ok with the incremental patch, I can extract this change into a separate preparatory patch to be applied on PG master. Thanks -Amit Khandekar
Attachment
On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote: > + /* The caller must have already locked all the partitioned tables. */ > + root_rel = heap_open(root_relid, NoLock); > + *all_part_cols = NULL; > + foreach(lc, partitioned_rels) > + { > + Index rti = lfirst_int(lc); > + Oid relid = getrelid(rti, rtables); > + Relation part_rel = heap_open(relid, NoLock); > + > + pull_child_partition_columns(part_rel, root_rel, all_part_cols); > + heap_close(part_rel, NoLock); > > I don't like the fact that we're opening and closing the relation here > just to get information on the partitioning columns. I think it would > be better to do this someplace that already has the relation open and > store the details in the RelOptInfo. set_relation_partition_info() > looks like the right spot. It seems that, for UPDATE, baserel RelOptInfos are created only for the subplan relations, not for the partitioned tables. I verified that build_simple_rel() does not get called for partitioned tables for UPDATE. In earlier versions of the patch, we used to collect the partition keys while expanding the partition tree, so that we could get them while the relations are open. After some reviews, I was inclined to think that the collection logic is better moved out into inheritance_planner(), because it involves pulling the attributes from partition key expressions, and the bitmap operation, which would otherwise be done unnecessarily for SELECTs as well. On the other hand, if we collect the partition keys separately in inheritance_planner(), then, as you say, we need to open the relations. Opening a relation that is already in the relcache and already locked involves only a hash lookup. Do you think this is expensive? I am open to either of these approaches. If we collect the partition keys in expand_partitioned_rtentry(), we need to pass the root relation as well, so that we can convert the partition key attributes to the root rel's descriptor. And the other thing is, maybe we can check beforehand (in expand_inherited_rtentry) whether the rootrte's updatedCols is empty, which I think implies that it's not an UPDATE operation. If so, we can just skip collecting the partition keys. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2017/11/23 21:57, Amit Khandekar wrote: > If we collect the partition keys in expand_partitioned_rtentry(), we > need to pass the root relation also, so that we can convert the > partition key attributes to root rel descriptor. And the other thing > is, may be, we can check beforehand (in expand_inherited_rtentry) > whether the rootrte's updatedCols is empty, which I think implies that > it's not an UPDATE operation. If yes, we can just skip collecting the > partition keys. Yeah, it seems like a good idea after all to check in expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty and if so check if any of the updatedCols are partition keys. If we find some, then it will suffice to just set a simple flag in the PartitionedChildRelInfo that will be created for the root table. That should be done *after* we have visited all the tables in the partition tree including some that might be partitioned and hence will provide their partition keys. The following block in expand_inherited_rtentry() looks like a good spot: if (rte->inh && partitioned_child_rels != NIL) { PartitionedChildRelInfo *pcinfo; pcinfo = makeNode(PartitionedChildRelInfo); Thanks, Amit
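To make the suggestion concrete, that block might end up looking roughly like this. It is only a sketch: is_partition_key_update is the flag name the later patch versions in this thread adopt, partkey_updated is an assumed local variable, and the remaining lines just mirror the existing PG 10 code, so none of this is the actual patch text.

    if (rte->inh && partitioned_child_rels != NIL)
    {
        PartitionedChildRelInfo *pcinfo;

        pcinfo = makeNode(PartitionedChildRelInfo);
        pcinfo->parent_relid = rti;
        pcinfo->child_rels = partitioned_child_rels;
        /* true if any column being updated is used in some partition key */
        pcinfo->is_partition_key_update = partkey_updated;
        root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
    }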
On 24 November 2017 at 10:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/11/23 21:57, Amit Khandekar wrote: >> If we collect the partition keys in expand_partitioned_rtentry(), we >> need to pass the root relation also, so that we can convert the >> partition key attributes to root rel descriptor. And the other thing >> is, may be, we can check beforehand (in expand_inherited_rtentry) >> whether the rootrte's updatedCols is empty, which I think implies that >> it's not an UPDATE operation. If yes, we can just skip collecting the >> partition keys. > > Yeah, it seems like a good idea after all to check in > expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty > and if so check if any of the updatedCols are partition keys. If we find > some, then it will suffice to just set a simple flag in the > PartitionedChildRelInfo that will be created for the root table. That > should be done *after* we have visited all the tables in the partition > tree including some that might be partitioned and hence will provide their > partition keys. The following block in expand_inherited_rtentry() looks > like a good spot: > > if (rte->inh && partitioned_child_rels != NIL) > { > PartitionedChildRelInfo *pcinfo; > > pcinfo = makeNode(PartitionedChildRelInfo); Yes, I am thinking about something like that. Thanks. I am also working on your suggestion of moving the convert-to-root-descriptor logic from ExecInsert() to ExecUpdate(). So, in the upcoming patch version, I am intending to include the above two, and if possible, Robert's idea of re-using is_partition_attr() for pull_child_partition_columns(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Mon, Nov 27, 2017 at 5:28 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > So, in the upcoming patch version, I am intending to include the above > two, and if possible, Robert's idea of re-using is_partition_attr() > for pull_child_partition_columns(). Discussions are still going on, so moved to next CF. -- Michael
On 27 November 2017 at 13:58, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 24 November 2017 at 10:52, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/11/23 21:57, Amit Khandekar wrote: >>> If we collect the partition keys in expand_partitioned_rtentry(), we >>> need to pass the root relation also, so that we can convert the >>> partition key attributes to root rel descriptor. And the other thing >>> is, may be, we can check beforehand (in expand_inherited_rtentry) >>> whether the rootrte's updatedCols is empty, which I think implies that >>> it's not an UPDATE operation. If yes, we can just skip collecting the >>> partition keys. >> >> Yeah, it seems like a good idea after all to check in >> expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty >> and if so check if any of the updatedCols are partition keys. If we find >> some, then it will suffice to just set a simple flag in the >> PartitionedChildRelInfo that will be created for the root table. That >> should be done *after* we have visited all the tables in the partition >> tree including some that might be partitioned and hence will provide their >> partition keys. The following block in expand_inherited_rtentry() looks >> like a good spot: >> >> if (rte->inh && partitioned_child_rels != NIL) >> { >> PartitionedChildRelInfo *pcinfo; >> >> pcinfo = makeNode(PartitionedChildRelInfo); > > Yes, I am thinking about something like that. Thanks. In expand_partitioned_rtentry(), rather than collecting the partition key attributes of all partitioned tables, I am now checking whether parentrte->updatedCols has any partition key attributes. If an earlier parentrte's updatedCols was already found to have partition keys, we don't continue to check more. Also, rather than converting all the partition key attributes to be compatible with the root's tuple descriptor, it is better to compare against each partitioned table's own updatedCols while we have its handle handy. Each parentrte's updatedCols has exactly the same attributes as the root's, just with the ordering possibly changed. So it is safe to compare using the updatedCols of the intermediate partitioned rels rather than those of the root rel. And the advantage is: we have now gotten rid of the conversion mapping from each of the partitioned tables to the root that was earlier done in pull_child_partition_columns() in the previous patches. PartitionedChildRelInfo now has an is_partition_key_update field. This is updated using get_partitioned_child_rels(). > I am also working on your suggestion of moving the > convert-to-root-descriptor logic from ExecInsert() to ExecUpdate(). Done. > > So, in the upcoming patch version, I am intending to include the above > two, and if possible, Robert's idea of re-using is_partition_attr() > for pull_child_partition_columns(). Done. Now, is_partition_attr() is renamed to has_partition_attrs(). This function now accepts a bitmapset of attnums instead of a single attnum. Moved this function from tablecmds.c to partition.c. This is now re-used, and the earlier pull_child_partition_columns() is removed. Attached is v26, which has all of the above points covered. Also, this patch contains the incremental changes from encapsulate_partinfo.patch attached in [1]. In the next version, I will extract them out again and keep them as a separate preparatory patch.
[1] https://www.postgresql.org/message-id/CAJ3gD9f86H64e4OCjFFszWW7f4EeyriSaFL8SvJs2yOUbc8VEw%40mail.gmail.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
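For reference, the per-parent check described in the mail above might look roughly like this. This is only a sketch, not the patch's code: parentrte, parentrel and partkey_updated are assumed local names, and has_partition_attrs() is assumed here to take the same FirstLowInvalidHeapAttributeNumber-offset bitmapset representation that updatedCols uses.

    /* In expand_partitioned_rtentry(), once for each partitioned parent */
    if (!partkey_updated &&
        !bms_is_empty(parentrte->updatedCols) &&
        has_partition_attrs(parentrel, parentrte->updatedCols, NULL))
        partkey_updated = true;

The early exit on partkey_updated is the "don't continue to check more" behaviour described above: once one parent's updatedCols is known to touch a partition key, there is nothing further to learn from the remaining parents.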
On 29 November 2017 at 17:25, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Also, this > patch contains the incremental changes that were attached in the patch > encapsulate_partinfo.patch attached in [1]. In the next version, I > will extract them out again and keep them as a separate preparatory > patch. As mentioned above, attached is encapsulate_partinfo_preparatory.patch. This addresses David Rowley's request to move all the partition-related information into the new structure PartitionTupleRouting, so that for ExecSetupPartitionTupleRouting() we can pass a pointer to this structure instead of the many parameters that we currently pass [1]. The main update-partition-key patch is to be applied over the above preparatory patch. Attached is its v27 version. This version addresses Thomas Munro's comments: On 14 November 2017 at 01:32, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Nov 10, 2017 at 4:42 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Attached is v23 patch that has just the above changes (and also >> rebased on hash-partitioning changes, like update.sql). I am still >> doing some sanity testing on this, although regression passes. > > The test coverage[1] is 96.62%. Nice work. Here are the bits that > aren't covered: > > In partition.c's pull_child_partition_columns(), the following loop is > never run: > > + foreach(lc, partexprs) > + { > + Node *expr = (Node *) lfirst(lc); > + > + pull_varattnos(expr, 1, &child_keycols); > + } In update.sql, part_c_100_200 is now partitioned by range(abs(d)). So now the new function has_partition_attrs() (in recent patch versions, this function has replaced pull_child_partition_columns) goes through the above code segment. This was indeed an important part left uncovered. Thanks. > > In nodeModifyTable.c, the following conditional branches are never run: > > if (mtstate->mt_oc_transition_capture != NULL) > + { > + Assert(mtstate->mt_is_tupconv_perpart == true); > mtstate->mt_oc_transition_capture->tcs_map = > - > mtstate->mt_transition_tupconv_maps[leaf_part_index]; > + > mtstate->mt_childparent_tupconv_maps[leaf_part_index]; > + } I think this code segment is never hit even without the patch: ON CONFLICT is not supported for partitions, and this code segment runs only for partitions. > > > if (node->mt_oc_transition_capture != NULL) > { > - > Assert(node->mt_transition_tupconv_maps != NULL); > > node->mt_oc_transition_capture->tcs_map = > - > node->mt_transition_tupconv_maps[node->mt_whichplan]; > + > tupconv_map_for_subplan(node, node->mt_whichplan); > } Here also, I verified that none of the regression tests hit this segment. The reason might be that this segment is run when an UPDATE moves on to the next subplan, and mtstate->mt_oc_transition_capture is never allocated for UPDATEs. [1] https://www.postgresql.org/message-id/CAJ3gD9f86H64e4OCjFFszWW7f4EeyriSaFL8SvJs2yOUbc8VEw%40mail.gmail.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
While addressing Thomas's point about test scenarios not yet covered, I observed the following. Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on the target table. Now in ExecUpdate(), the corresponding WCO qual gets executed *before* the partition constraint check, as per existing behaviour. And the qual succeeds. And then, because the partition key was updated, the row is moved to another partition. Here, suppose there is a BR INSERT trigger which modifies the row, and the resultant row actually would *not* pass the UPDATE RLS policy. But for this partition, since it is an INSERT, only INSERT RLS WCO quals are executed. So effectively, from the user's perspective, a row that an RLS WITH CHECK policy was defined to reject gets updated successfully. This is because the policy is not checked *after* a row trigger in the new partition is executed. Attached is a test case that reproduces this issue. I think, in case of row movement, we should defer calling ExecWithCheckOptions() until the row is inserted using ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should be called using WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK (I recall Amit Langote was of this opinion), as below:

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate,
 	 * we are looking for at this point.
 	 */
 	if (resultRelInfo->ri_WithCheckOptions != NIL)
-		ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
+		ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ?
+							  WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK),
 							 resultRelInfo, slot, estate);

It can be argued that, since in the case of triggers we always execute INSERT row triggers for rows inserted as part of update row movement, we should be consistent and execute INSERT WCOs and not UPDATE WCOs for such rows. But note that the row triggers we execute are defined on the leaf partitions, whereas the RLS policies being executed are defined for the target partitioned table, not the leaf partition. Hence it makes sense to execute them as per the original operation on the target table. This is similar to why we execute UPDATE statement triggers even when the row is eventually inserted into another partition: the UPDATE statement trigger was defined for the target table, not the leaf partition. Barring any objections, I am going to send a revised patch that fixes the above issue as described. Thanks -Amit Khandekar
Attachment
On 30 November 2017 at 18:56, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > While addressing Thomas's point about test scenarios not yet covered, > I observed the following ... > > Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on > the target table. Now In ExecUpdate(), the corresponding WCO qual gets > executed *before* the partition constraint check, as per existing > behaviour. And the qual succeeds. And then because of partition-key > updated, the row is moved to another partition. Here, suppose there is > a BR INSERT trigger which modifies the row, and the resultant row > actually would *not* pass the UPDATE RLS policy. But for this > partition, since it is an INSERT, only INSERT RLS WCO quals are > executed. > > So effectively, with a user-perspective, an RLS WITH CHECK policy that > was defined to reject an updated row, is getting updated successfully. > This is because the policy is not checked *after* a row trigger in the > new partition is executed. > > Attached is a test case that reproduces this issue. > > I think, in case of row-movement, we should defer calling > ExecWithCheckOptions() until the row is being inserted using > ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should > be called using WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK > (I recall Amit Langote was of this opinion) as below : > > --- a/src/backend/executor/nodeModifyTable.c > +++ b/src/backend/executor/nodeModifyTable.c > @@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate, > * we are looking for at this point. > */ > if (resultRelInfo->ri_WithCheckOptions != NIL) > - ExecWithCheckOptions(WCO_RLS_INSERT_CHECK, > + ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ? > + WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK), > resultRelInfo, slot, estate); Attached is v28 patch which has the fix for this issue as described above. In ExecUpdate(), if partition constraint fails, we skip ExecWithCheckOptions (), and later in ExecInsert() it gets called with WCO_RLS_UPDATE_CHECK. Added a few test scenarios for the same, in regress/sql/update.sql. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
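A minimal sketch of the ExecUpdate() side of this fix, to show the intended ordering. Variable names are assumptions, and ExecPartitionCheck() is assumed here to return the check result rather than erroring out (as per the refactoring patch discussed later in this thread), so treat this as an illustration rather than the patch's actual code.

    bool        partition_check_passed;

    /* Check the partition constraint before the UPDATE WITH CHECK OPTIONs. */
    partition_check_passed =
        (resultRelInfo->ri_PartitionCheck == NIL ||
         ExecPartitionCheck(resultRelInfo, slot, estate));

    if (partition_check_passed)
    {
        /* The row stays in this partition; run the UPDATE WCOs now. */
        if (resultRelInfo->ri_WithCheckOptions != NIL)
            ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
                                 resultRelInfo, slot, estate);
    }
    else
    {
        /*
         * Row movement: leave the WCO check to the ExecInsert() that
         * re-routes the row, which runs it as WCO_RLS_UPDATE_CHECK after
         * any BR INSERT triggers on the destination partition have fired.
         */
    }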
On 1 December 2017 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached is v28 patch which has the fix for this issue as described > above. In ExecUpdate(), if partition constraint fails, we skip > ExecWithCheckOptions (), and later in ExecInsert() it gets called with > WCO_RLS_UPDATE_CHECK. Amit Langote informed me off-list (along with suggestions for changes) that my patch needs a rebase. Attached is the rebased version. I have also bumped the patch version number (now v29), because this has additional changes, again suggested by Amit L: because ExecSetupPartitionTupleRouting() now has an mtstate parameter, there is no need to pass update_rri and num_update_rri, since they can be retrieved from mtstate. The preparatory patch is also rebased. Thanks Amit Langote. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
Thanks for the updated patches, Amit. Some review comments. Forgot to remove the description of update_rri and num_update_rri in the header comment of ExecSetupPartitionTupleRouting(). - +extern void pull_child_partition_columns(Relation rel, + Relation parent, + Bitmapset **partcols); It seems you forgot to remove this declaration in partition.h, because I don't find it defined or used anywhere. I think some of the changes that are currently part of the main patch are better taken out into their own patches, because having those diffs appear in the main patch is kind of distracting. Just like you now have a patch that introduces a PartitionTupleRouting structure. I know that leads to too many patches, but it helps to easily tell less substantial changes from the substantial ones. 1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps. 2. Patch that introduces has_partition_attrs() in place of is_partition_attr() 3. Patch to change the names of map_partition_varattnos() arguments 4. Patch that does the refactoring involving ExecConstrains(), ExecPartitionCheck(), and the introduction of ExecPartitionCheckEmitError() Regarding ExecSetupChildParentMap(), it seems to me that it could simply be declared as static void ExecSetupChildParentMap(ModifyTableState *mtstate); Looking at the places from where it's called, it seems that you're just extracting information from mtstate and passing the same for the rest of its arguments. mt_is_tupconv_perpart seems like it's unnecessary. Its function could be fulfilled by inspecting the state of some other fields of ModifyTableState. For example, in the case of an update (operation == CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always assume that mt_childparent_tupconv_maps has entries for all partitions. If it's NULL, then there would be only entries for partitions that have sub-plans. tupconv_map_for_subplan() looks like it could be done as a macro. Thanks, Amit
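For the ExecSetupChildParentMap() point above, a simplified sketch of the narrower interface, covering only the per-subplan case. Everything here (including the root result relation lookup) is an assumption based on the fields discussed in this thread, not the actual patch code; the point is only that the function can pull what it needs out of mtstate itself.

    static void
    ExecSetupChildParentMap(ModifyTableState *mtstate)
    {
        ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
        int         nplans = mtstate->mt_nplans;
        TupleDesc   rootdesc;
        int         i;

        /* The "parent" row type is that of the root partitioned table. */
        rootdesc = RelationGetDescr(mtstate->rootResultRelInfo ?
                                    mtstate->rootResultRelInfo->ri_RelationDesc :
                                    resultRelInfos[0].ri_RelationDesc);

        mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
            palloc0(nplans * sizeof(TupleConversionMap *));

        /* One child-to-root map per subplan result relation. */
        for (i = 0; i < nplans; i++)
            mtstate->mt_childparent_tupconv_maps[i] =
                convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
                                       rootdesc,
                                       gettext_noop("could not convert row type"));
    }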
On Wed, Dec 13, 2017 at 5:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Amit Langote informed me off-list, - along with suggestions for > changes - that my patch needs a rebase. Attached is the rebased > version. I have also bumped the patch version number (now v29), > because this as additional changes, again, suggested by Amit L : > Because ExecSetupPartitionTupleRouting() has mtstate parameter now, > no need to pass update_rri and num_update_rri, since they can be > retrieved from mtstate. > > Also, the preparatory patch is also rebased. Reviewing the preparatory patch: + PartitionTupleRouting *partition_tuple_routing; + /* Tuple-routing support info */ Something's wrong with the formatting here. - PartitionDispatch **pd, - ResultRelInfo ***partitions, - TupleConversionMap ***tup_conv_maps, - TupleTableSlot **partition_tuple_slot, - int *num_parted, int *num_partitions) + PartitionTupleRouting **partition_tuple_routing) Since we're consolidating all of ExecSetupPartitionTupleRouting's output parameters into a single structure, I think it might make more sense to have it just return that value. I think it's only done with output parameter today because there are so many different things being produced, and we can't return them all. + PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing; This is just nitpicking, but I don't find "ptr" to be the greatest variable name; it looks too much like "pointer". Maybe we could use "routing" or "proute" or something. It seems to me that we could improve things here by adding a function ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the various heap_close(), ExecDropSingleTupleTableSlot(), and ExecCloseIndices() operations which are currently performed in CopyFrom() and, by separate code, in ExecEndModifyTable(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
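A minimal sketch of what the suggested ExecCleanupTupleRouting() helper could look like, using the PartitionTupleRouting fields shown upthread. The body is an assumption pieced together from the existing CopyFrom()/ExecEndModifyTable() cleanup code, not the actual patch; in the UPDATE case it would additionally have to skip leaf partitions that are already owned by the subplan result rels, which is omitted here.

    void
    ExecCleanupTupleRouting(PartitionTupleRouting *proute)
    {
        int     i;

        /*
         * Close the partitioned tables used for routing.  Entry 0 is the
         * root target table itself, which the caller owns and closes, and
         * whose tupslot is NULL, so start from 1.
         */
        for (i = 1; i < proute->num_dispatch; i++)
        {
            PartitionDispatch pd = proute->partition_dispatch_info[i];

            heap_close(pd->reldesc, NoLock);
            ExecDropSingleTupleTableSlot(pd->tupslot);
        }

        /* Close the leaf partitions and their indices. */
        for (i = 0; i < proute->num_partitions; i++)
        {
            ResultRelInfo *resultRelInfo = proute->partitions[i];

            ExecCloseIndices(resultRelInfo);
            heap_close(resultRelInfo->ri_RelationDesc, NoLock);
        }

        /* Release the standalone slots used for routing, if any. */
        if (proute->root_tuple_slot)
            ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
        if (proute->partition_tuple_slot)
            ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
    }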
On Fri, Dec 15, 2017 at 7:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Reviewing the preparatory patch: I started another review pass over the main patch, so here are some comments about that. This is unfortunately not a complete review, however. - map = ptr->partition_tupconv_maps[leaf_part_index]; + map = ptr->parentchild_tupconv_maps[leaf_part_index]; I don't think there's any reason to rename this. In previous patch versions, you had multiple arrays of tuple conversion maps in this structure, but the refactoring eliminated that. Likewise, I'm not sure I get the point of mt_transition_tupconv_maps -> mt_childparent_tupconv_maps. That seems like it could similarly be left alone. + /* + * If transition tables are the only reason we're here, return. As + * mentioned above, we can also be here during update tuple routing in + * presence of transition tables, in which case this function is called + * separately for oldtup and newtup, so either can be NULL, not both. + */ if (trigdesc == NULL || (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) || (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) || - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row)) + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) || + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) I guess this is correct, but it seems awfully fragile. Can't we have some more explicit signaling about whether we're only here for transition tables, rather than deducing it based on exactly one of oldtup and newtup being NULL? + /* Initialization specific to update */ + if (mtstate && mtstate->operation == CMD_UPDATE) + { + ModifyTable *node = (ModifyTable *) mtstate->ps.plan; + + is_update = true; + update_rri = mtstate->resultRelInfo; + num_update_rri = list_length(node->plans); + } I guess I don't see why we need a separate "if" block for this. Neither is_update nor update_rri nor num_update_rri are used until we get to the block that begins with "if (is_update)". Why not just change that block to test if (mtstate && mtstate->operation == CMD_UPDATE)" and put the rest of these initializations inside that block? + int num_update_rri = 0, + update_rri_index = 0; ... + update_rri_index = 0; It's already 0. + leaf_part_rri = &update_rri[update_rri_index]; ... + leaf_part_rri = leaf_part_arr + i; These are doing the same kind of thing, but using different styles. I prefer the former style, so I'd change the second one to &leaf_part_arr[i]. Alternatively, you could change the first one to update_rri + update_rri_indx. But it's strange to see the same variable initialized in two different ways just a few lines apart. + if (!partrel) + { + /* + * We locked all the partitions above including the leaf + * partitions. Note that each of the newly opened relations in + * *partitions are eventually closed by the caller. + */ + partrel = heap_open(leaf_oid, NoLock); + InitResultRelInfo(leaf_part_rri, + partrel, + resultRTindex, + rel, + estate->es_instrument); + } Hmm, isn't there a problem here? Before, we opened all the relations here and the caller closed them all. But now, we're only opening some of them. If the caller closes them all, then they will be closing some that we opened and some that we didn't. That seems quite bad, because the reference counts that are incremented and decremented by opening and closing should all end up at 0. 
Maybe I'm confused because it seems like this would break in any scenario where even 1 relation was already opened and surely you must have tested that case... but if there's some reason this works, I don't know what it is, and the comment doesn't tell me. +static HeapTuple +ConvertPartitionTupleSlot(ModifyTableState *mtstate, + TupleConversionMap *map, + HeapTuple tuple, + TupleTableSlot *new_slot, + TupleTableSlot **p_my_slot) This function doesn't use the mtstate argument at all. + * (Similarly we need to add the deleted row in OLD TABLE). We need to do The period should be before, not after, the closing parenthesis. + * Now that we have already captured NEW TABLE row, any AR INSERT + * trigger should not again capture it below. Arrange for the same. A more American style would be something like "We've already captured the NEW TABLE row, so make sure any AR INSERT trigger fired below doesn't capture it again." (Similarly for the other case.) + /* The delete has actually happened, so inform that to the caller */ + if (tuple_deleted) + *tuple_deleted = true; In the US, we inform the caller, not inform that to the caller. In other words, here the direct object of "inform" is the person or thing getting the information (in this case, "the caller"), not the information being conveyed (in this case, "that"). I realize your usage is probably typical for your country... + Assert(mtstate->mt_is_tupconv_perpart == true); We usually just Assert(thing_that_should_be_true), not Assert(thing_that_should_be_true == true). + * In case this is part of update tuple routing, put this row into the + * transition OLD TABLE if we are capturing transition tables. We need to + * do this separately for DELETE and INSERT because they happen on + * different tables. Maybe "...OLD table, but only if we are..." Should it be capturing transition tables or capturing transition tuples? I'm not sure. + * partition, in which case, we should check the RLS CHECK policy just In the US, the second comma in this sentence is incorrect and should be removed. + * When an UPDATE is run with a leaf partition, we would not have + * partition tuple routing setup. In that case, fail with run with -> run on would not -> will not setup -> set up + * deleted by another transaction), then we should skip INSERT as + * well, otherwise, there will be effectively one new row inserted. skip INSERT -> skip the insert well, otherwise -> well; otherwise I would also change "there will be effectively one new row inserted" to "an UPDATE could cause an increase in the total number of rows across all partitions, which is clearly wrong". + /* + * UPDATEs set the transition capture map only when a new subplan + * is chosen. But for INSERTs, it is set for each row. So after + * INSERT, we need to revert back to the map created for UPDATE; + * otherwise the next UPDATE will incorrectly use the one created + * for INESRT. So first save the one created for UPDATE. + */ + if (mtstate->mt_transition_capture) + saved_tcs_map = mtstate->mt_transition_capture->tcs_map; UPDATEs -> Updates INESRT -> INSERT I wonder if there is some more elegant way to handle this problem. Basically, the issue is that ExecInsert() is stomping on mtstate->mt_transition_capture, and your solution is to save and restore the value you want to have there. But maybe we could instead find a way to get ExecInsert() not to stomp on that state in the first place. 
It seems like the ON CONFLICT stuff handled that by adding a second TransitionCaptureState pointer to ModifyTable, thus mt_transition_capture and mt_oc_transition_capture. By that precedent, we could add mt_utr_transition_capture or similar, and maybe that's the way to go. It seems a bit unsatisfying, but so does what you have now. + * 2. For capturing transition tables that are partitions. For UPDATEs, we need This isn't worded well. A transition table is never a partition; transition tables and partitions are two different kinds of things. + * If per-leaf map is required and the map is already created, that map + * has to be per-leaf. If that map is per-subplan, we won't be able to + * access the maps leaf-partition-wise. But if the map is per-leaf, we + * will be able to access the maps subplan-wise using the + * subplan_partition_offsets map using function + * tupconv_map_for_subplan(). So if the callers might need to access + * the map both leaf-partition-wise and subplan-wise, they should make + * sure that the first time this function is called, it should be + * called with perleaf=true so that the map created is per-leaf, not + * per-subplan. This sounds complicated and fragile. It ends up meaning that mt_childparent_tupconv_maps is sometimes indexed by subplan number and sometimes by partition leaf index, which is extremely confusing and likely to lead to coding errors, either in this patch or in future ones. Would it be reasonable to just always do this by partition leaf index, even if we don't strictly need that set of mappings? That's all I've got for now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 14 December 2017 at 08:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > Forgot to remove the description of update_rri and num_update_rri in the > header comment of ExecSetupPartitionTupleRouting(). > > - > +extern void pull_child_partition_columns(Relation rel, > + Relation parent, > + Bitmapset **partcols); > > It seems you forgot to remove this declaration in partition.h, because I > don't find it defined or used anywhere. Done, both of the above. The attached v30 patch has these changes. > > I think some of the changes that are currently part of the main patch are > better taken out into their own patches, because having those diffs appear > in the main patch is kind of distracting. Just like you now have a patch > that introduces a PartitionTupleRouting structure. I know that leads to > too many patches, but it helps to easily tell less substantial changes > from the substantial ones. Done. Created patches as shown below: > > 1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps. As per Robert's suggestion, reverted the renaming of this field. > > 2. Patch that introduces has_partition_attrs() in place of > is_partition_attr() 0002-Changed-is_partition_attr-to-has_partition_attrs.patch > > 3. Patch to change the names of map_partition_varattnos() arguments 0003-Renaming-parameters-of-map_partition_var_attnos.patch > > 4. Patch that does the refactoring involving ExecConstrains(), > ExecPartitionCheck(), and the introduction of > ExecPartitionCheckEmitError() 0004-Refactor-CheckConstraint-related-code.patch The preparatory patches are to be applied in order of the patch numbers, followed by the main patch update-partition-key_v30.patch. > > > Regarding ExecSetupChildParentMap(), it seems to me that it could simply > be declared as > > static void ExecSetupChildParentMap(ModifyTableState *mtstate); > > Looking at the places from where it's called, it seems that you're just > extracting information from mtstate and passing the same for the rest of > its arguments. > Agreed. But the last parameter per_leaf might be necessary. I will defer this until I address Robert's concern about the complexity of the related code. > mt_is_tupconv_perpart seems like it's unnecessary. Its function could be > fulfilled by inspecting the state of some other fields of > ModifyTableState. For example, in the case of an update (operation == > CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always > assume that mt_childparent_tupconv_maps has entries for all partitions. > If it's NULL, then there would be only entries for partitions that have > sub-plans. I think it's better to have this field separately, for code clarity, to avoid repeatedly evaluating multiple conditions, and to allow some significant Asserts() that use this field. > > tupconv_map_for_subplan() looks like it could be done as a macro. Or maybe an inline function. I will again defer this, for the same reason as the deferred item above about the ExecSetupChildParentMap parameters. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote: > Reviewing the preparatory patch: > > + PartitionTupleRouting *partition_tuple_routing; > + /* Tuple-routing support info */ > > Something's wrong with the formatting here. Moved the comment above the declaration. > > - PartitionDispatch **pd, > - ResultRelInfo ***partitions, > - TupleConversionMap ***tup_conv_maps, > - TupleTableSlot **partition_tuple_slot, > - int *num_parted, int *num_partitions) > + PartitionTupleRouting **partition_tuple_routing) > > Since we're consolidating all of ExecSetupPartitionTupleRouting's > output parameters into a single structure, I think it might make more > sense to have it just return that value. I think it's only done with > output parameter today because there are so many different things > being produced, and we can't return them all. You mean ExecSetupPartitionTupleRouting() will return the structure (not pointer to structure), and the caller will get the copy of the structure like this ? : mtstate->mt_partition_tuple_routing = ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate); I am ok with that, but just wanted to confirm if that is what you are saying. I don't recall seeing a structure return value in PG code, so not sure if it is conventional in PG to do that. Hence, I am somewhat inclined to keep it as output param. It also avoids a structure copy. Another way is for ExecSetupPartitionTupleRouting() to palloc this structure, and return its pointer, but then caller would have to anyway do a structure copy, so that's not convenient, and I don't think you are suggesting this way either. > > + PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing; > > This is just nitpicking, but I don't find "ptr" to be the greatest > variable name; it looks too much like "pointer". Maybe we could use > "routing" or "proute" or something. Done. Renamed it to "proute". > > It seems to me that we could improve things here by adding a function > ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the > various heap_close(), ExecDropSingleTupleTableSlot(), and > ExecCloseIndices() operations which are currently performed in > CopyFrom() and, by separate code, in ExecEndModifyTable(). > Done. Changes are kept in a new preparatory patch 0005-Organize-cleanup-done-for-partition-tuple-routing.patch Yet to address your other review comments. Attached is patch v31. (Preparatory patches to be applied in order of patch numbers, followed by the main patch) Thanks -Amit
Attachment
- 0001-Encapsulate-partition-related-info-in-a-structure_v2.patch
- 0002-Changed-is_partition_attr-to-has_partition_attrs.patch
- 0003-Renaming-parameters-of-map_partition_var_attnos.patch
- 0004-Refactor-CheckConstraint-related-code.patch
- 0005-Organize-cleanup-done-for-partition-tuple-routing.patch
- update-partition-key_v31.patch
On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote: > started another review pass over the main patch, so here are > some comments about that. I am yet to address all the comments, but meanwhile, below are some specific points ... > + if (!partrel) > + { > + /* > + * We locked all the partitions above including the leaf > + * partitions. Note that each of the newly opened relations in > + * *partitions are eventually closed by the caller. > + */ > + partrel = heap_open(leaf_oid, NoLock); > + InitResultRelInfo(leaf_part_rri, > + partrel, > + resultRTindex, > + rel, > + estate->es_instrument); > + } > > Hmm, isn't there a problem here? Before, we opened all the relations > here and the caller closed them all. But now, we're only opening some > of them. If the caller closes them all, then they will be closing > some that we opened and some that we didn't. That seems quite bad, > because the reference counts that are incremented and decremented by > opening and closing should all end up at 0. Maybe I'm confused > because it seems like this would break in any scenario where even 1 > relation was already opened and surely you must have tested that > case... but if there's some reason this works, I don't know what it > is, and the comment doesn't tell me. In ExecCleanupTupleRouting(), we are closing only those newly opened partitions. We skip those which are actually part of the update result rels. > + /* > + * UPDATEs set the transition capture map only when a new subplan > + * is chosen. But for INSERTs, it is set for each row. So after > + * INSERT, we need to revert back to the map created for UPDATE; > + * otherwise the next UPDATE will incorrectly use the one created > + * for INESRT. So first save the one created for UPDATE. > + */ > + if (mtstate->mt_transition_capture) > + saved_tcs_map = mtstate->mt_transition_capture->tcs_map; > > I wonder if there is some more elegant way to handle this problem. > Basically, the issue is that ExecInsert() is stomping on > mtstate->mt_transition_capture, and your solution is to save and > restore the value you want to have there. But maybe we could instead > find a way to get ExecInsert() not to stomp on that state in the first > place. It seems like the ON CONFLICT stuff handled that by adding a > second TransitionCaptureState pointer to ModifyTable, thus > mt_transition_capture and mt_oc_transition_capture. By that > precedent, we could add mt_utr_transition_capture or similar, and > maybe that's the way to go. It seems a bit unsatisfying, but so does > what you have now. In case of ON CONFLICT, if there are both INSERT and UPDATE statement triggers referencing transition tables, both of the triggers need to independently populate their own transition tables, and hence the need for two separate transition states : mt_transition_capture and mt_oc_transition_capture. But in case of update-tuple-routing, the INSERT statement trigger won't come into picture. So the same mt_transition_capture can serve the purpose of populating the transition table with OLD and NEW rows. So I think it would be too redundant, if not incorrect, to have a whole new transition state for update tuple routing. I will see if it turns out better to have two tcs_maps in TransitionCaptureState, one for update and one for insert. But this, on first look, does not look good. > + * If per-leaf map is required and the map is already created, that map > + * has to be per-leaf. If that map is per-subplan, we won't be able to > + * access the maps leaf-partition-wise. 
But if the map is per-leaf, we > + * will be able to access the maps subplan-wise using the > + * subplan_partition_offsets map using function > + * tupconv_map_for_subplan(). So if the callers might need to access > + * the map both leaf-partition-wise and subplan-wise, they should make > + * sure that the first time this function is called, it should be > + * called with perleaf=true so that the map created is per-leaf, not > + * per-subplan. > > This sounds complicated and fragile. It ends up meaning that > mt_childparent_tupconv_maps is sometimes indexed by subplan number and > sometimes by partition leaf index, which is extremely confusing and > likely to lead to coding errors, either in this patch or in future > ones. Even if we always index the map by leaf partition, while accessing the map the code still needs to be aware of whether the index number with which we are accessing the map is the subplan number or leaf partition number: If the access is by subplan number, use subplan_partition_offsets to convert to the leaf partition index. So the function tupconv_map_for_subplan() is anyways necessary for accessing using subplan index. Only thing that will change is : tupconv_map_for_subplan() will not have to check if the the map is indexed by leaf partition or not. But that complexity is hidden in this function alone; the outside code need not worry about that. If the access is by leaf partition number, I think you are worried here that the map might have been incorrectly indexed by subplan, and the code might access it partition-wise. Currently we access the map by leaf-partition-index only when setting up mtstate->mt_*transition_capture->tcs_map during inserts. At that place, there is an Assert(mtstate->mt_is_tupconv_perpart == true). May be, we can have another function tupconv_map_for_partition() rather than directly accessing mt_childparent_tupconv_maps[], and have this Assert() in that function. What do you say ? I am more inclined towards avoiding an always-leaf-partition-indexed map for additional reasons mentioned below ... > Would it be reasonable to just always do this by partition leaf > index, even if we don't strictly need that set of mappings? If there are no transition tables in picture, we don't require per-leaf child-parent conversion. So, this would mean that the tuple conversion maps will be set up for all (say, 100) leaf partitions even if there are only, say, a couple of update plans. I feel this would unnecessarily increase the startup cost of update-partition-key operation. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
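For reference, a sketch of the accessor being discussed, written from the field names used in this thread; the exact body is an assumption, not the patch's code.

    static TupleConversionMap *
    tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
    {
        /*
         * If the map array was built per leaf partition, translate the
         * subplan index into the corresponding leaf-partition index first.
         */
        if (mtstate->mt_is_tupconv_perpart)
        {
            PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
            int         leaf_index;

            Assert(proute != NULL);
            leaf_index = proute->subplan_partition_offsets[whichplan];
            return mtstate->mt_childparent_tupconv_maps[leaf_index];
        }

        /* Otherwise the array is indexed by subplan number directly. */
        return mtstate->mt_childparent_tupconv_maps[whichplan];
    }

Callers that index by subplan always go through this helper; only the transition-capture code indexes the array directly by leaf partition, which is why it asserts mt_is_tupconv_perpart first.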
On 23 December 2017 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote: >> - PartitionDispatch **pd, >> - ResultRelInfo ***partitions, >> - TupleConversionMap ***tup_conv_maps, >> - TupleTableSlot **partition_tuple_slot, >> - int *num_parted, int *num_partitions) >> + PartitionTupleRouting **partition_tuple_routing) >> >> Since we're consolidating all of ExecSetupPartitionTupleRouting's >> output parameters into a single structure, I think it might make more >> sense to have it just return that value. I think it's only done with >> output parameter today because there are so many different things >> being produced, and we can't return them all. > > You mean ExecSetupPartitionTupleRouting() will return the structure > (not pointer to structure), and the caller will get the copy of the > structure like this ? : > > mtstate->mt_partition_tuple_routing = > ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate); > > I am ok with that, but just wanted to confirm if that is what you are > saying. I don't recall seeing a structure return value in PG code, so > not sure if it is conventional in PG to do that. Hence, I am somewhat > inclined to keep it as output param. It also avoids a structure copy. > > Another way is for ExecSetupPartitionTupleRouting() to palloc this > structure, and return its pointer, but then caller would have to > anyway do a structure copy, so that's not convenient, and I don't > think you are suggesting this way either. I'm pretty sure Robert is suggesting that ExecSetupPartitionTupleRouting pallocs the memory for the structure, sets it up then returns a pointer to the new struct. That's not very unusual. It seems unusual for a function to return void and modify a single parameter pointer to get the value to the caller rather than just to return that value. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
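A sketch of the calling convention David describes, with a deliberately trimmed parameter list; this is not the actual function, only an illustration of returning a palloc'd struct.

    PartitionTupleRouting *
    ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
    {
        PartitionTupleRouting *proute;

        proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));

        /* ... walk the partition tree and fill in proute's fields ... */

        return proute;
    }

    /* Caller: */
    mtstate->mt_partition_tuple_routing =
        ExecSetupPartitionTupleRouting(mtstate, rel);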
On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote: > > - map = ptr->partition_tupconv_maps[leaf_part_index]; > + map = ptr->parentchild_tupconv_maps[leaf_part_index]; > > I don't think there's any reason to rename this. In previous patch > versions, you had multiple arrays of tuple conversion maps in this > structure, but the refactoring eliminated that. Done in an earlier version of the patch. > > Likewise, I'm not sure I get the point of mt_transition_tupconv_maps > -> mt_childparent_tupconv_maps. That seems like it could similarly be > left alone. We need to change it's name because now this map is not only used for transition capture, but also for update-tuple-routing. Does it look ok for you if, for readability, we keep the childparent tag ? Or else, we can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps" looks more informative. > > + /* > + * If transition tables are the only reason we're here, return. As > + * mentioned above, we can also be here during update tuple routing in > + * presence of transition tables, in which case this function is called > + * separately for oldtup and newtup, so either can be NULL, not both. > + */ > if (trigdesc == NULL || > (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) || > (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) || > - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row)) > + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) || > + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) > > I guess this is correct, but it seems awfully fragile. Can't we have > some more explicit signaling about whether we're only here for > transition tables, rather than deducing it based on exactly one of > oldtup and newtup being NULL? I had given a thought on this earlier. I felt, even the pre-existing conditions like "!trigdesc->trig_update_after_row" are all indirect ways to determine that this function is called only to capture transition tables, and thought that it may have been better to have separate parameter transition_table_only. But then decided that I can continue on similar lines and add another such condition to indicate that we are only capturing update-routed tuples. Instead of adding another parameter to AfterTriggerSaveEvent(), I had also considered another approach: Put the transition-tuples-capture logic part of AfterTriggerSaveEvent() into a helper function CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead of calling ExecARUpdateTriggers(), call this function CaptureTransitionTables(). I then dropped this idea and thought rather to call ExecARUpdateTriggers() which neatly does the required checks and other things like locking the old tuple via GetTupleForTrigger(). So if we go by CaptureTransitionTables(), we would need to do what ExecARUpdateTriggers() does before calling CaptureTransitionTables(). This is doable. If you think this is worth doing so as to get rid of the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that. > > + /* Initialization specific to update */ > + if (mtstate && mtstate->operation == CMD_UPDATE) > + { > + ModifyTable *node = (ModifyTable *) mtstate->ps.plan; > + > + is_update = true; > + update_rri = mtstate->resultRelInfo; > + num_update_rri = list_length(node->plans); > + } > > I guess I don't see why we need a separate "if" block for this. 
> Neither is_update nor update_rri nor num_update_rri are used until we > get to the block that begins with "if (is_update)". Why not just > change that block to test if (mtstate && mtstate->operation == > CMD_UPDATE)" and put the rest of these initializations inside that > block? Done. > > + int num_update_rri = 0, > + update_rri_index = 0; > ... > + update_rri_index = 0; > > It's already 0. Done. Retained the comment that mentions why we need to set it to 0, and added a note in the end that we have already done this during initialization. > > + leaf_part_rri = &update_rri[update_rri_index]; > ... > + leaf_part_rri = leaf_part_arr + i; > > These are doing the same kind of thing, but using different styles. I > prefer the former style, so I'd change the second one to > &leaf_part_arr[i]. Alternatively, you could change the first one to > update_rri + update_rri_indx. But it's strange to see the same > variable initialized in two different ways just a few lines apart. > Done. Used the first style. > > +static HeapTuple > +ConvertPartitionTupleSlot(ModifyTableState *mtstate, > + TupleConversionMap *map, > + HeapTuple tuple, > + TupleTableSlot *new_slot, > + TupleTableSlot **p_my_slot) > > This function doesn't use the mtstate argument at all. Removed mtstate. > > + * (Similarly we need to add the deleted row in OLD TABLE). We need to do > > The period should be before, not after, the closing parenthesis. Done. > > + * Now that we have already captured NEW TABLE row, any AR INSERT > + * trigger should not again capture it below. Arrange for the same. > > A more American style would be something like "We've already captured > the NEW TABLE row, so make sure any AR INSERT trigger fired below > doesn't capture it again." (Similarly for the other case.) Done. > > + /* The delete has actually happened, so inform that to the caller */ > + if (tuple_deleted) > + *tuple_deleted = true; > > In the US, we inform the caller, not inform that to the caller. In > other words, here the direct object of "inform" is the person or thing > getting the information (in this case, "the caller"), not the > information being conveyed (in this case, "that"). I realize your > usage is probably typical for your country... Changed it to "inform the caller about the same" > > + Assert(mtstate->mt_is_tupconv_perpart == true); > > We usually just Assert(thing_that_should_be_true), not > Assert(thing_that_should_be_true == true). Ok. Changed it to Assert(mtstate->mt_is_tupconv_perpart) > > + * In case this is part of update tuple routing, put this row into the > + * transition OLD TABLE if we are capturing transition tables. We need to > + * do this separately for DELETE and INSERT because they happen on > + * different tables. > > Maybe "...OLD table, but only if we are..." > > Should it be capturing transition tables or capturing transition > tuples? I'm not sure. Changed it to "capturing transition tuples". In trigger.c, I see this short form notation as well as a long-form notation like "capturing tuples in transition tables". But not seen anywhere "capturing transition tables", and it does seem odd. > > + * partition, in which case, we should check the RLS CHECK policy just > > In the US, the second comma in this sentence is incorrect and should be removed. Done. > > + * When an UPDATE is run with a leaf partition, we would not have > + * partition tuple routing setup. In that case, fail with > > run with -> run on > would not -> will not > setup -> set up Done. 
> > + * deleted by another transaction), then we should skip INSERT as > + * well, otherwise, there will be effectively one new row inserted. > > skip INSERT -> skip the insert > well, otherwise -> well; otherwise > > I would also change "there will be effectively one new row inserted" > to "an UPDATE could cause an increase in the total number of rows > across all partitions, which is clearly wrong". Done both. > > + /* > + * UPDATEs set the transition capture map only when a new subplan > + * is chosen. But for INSERTs, it is set for each row. So after > + * INSERT, we need to revert back to the map created for UPDATE; > + * otherwise the next UPDATE will incorrectly use the one created > + * for INESRT. So first save the one created for UPDATE. > + */ > + if (mtstate->mt_transition_capture) > + saved_tcs_map = mtstate->mt_transition_capture->tcs_map; > > UPDATEs -> Updates Done. I believe you want to do this only if it's a plural ? In the same para, also changed "INSERTs" to "inserts". > INESRT -> INSERT Done. > > + * 2. For capturing transition tables that are partitions. For UPDATEs, we need > > This isn't worded well. A transition table is never a partition; > transition tables and partitions are two different kinds of things. Yeah. Changed it to : "For capturing transition tuples when the target table is a partitioned table." Attached v32 patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
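As a reader aid, the row-movement control flow that several of the points above refer to can be condensed into a short sketch of the ExecUpdate() branch (paraphrased, not the patch text; the condition name partition_constraint_failed and the exact argument lists are only illustrative):

    if (partition_constraint_failed)
    {
        bool        tuple_deleted;

        /*
         * Delete from the old partition first, but ask whether the delete
         * really happened.
         */
        ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
                   &tuple_deleted, false /* processReturning */, canSetTag);

        /*
         * If a concurrent session already deleted the row, skip the insert;
         * otherwise an UPDATE could cause an increase in the total number of
         * rows across all partitions, which is clearly wrong.
         */
        if (!tuple_deleted)
            return NULL;

        /* Route the new version of the row to the appropriate partition. */
        return ExecInsert(mtstate, slot, planSlot, NULL, ONCONFLICT_NONE,
                          estate, canSetTag);
    }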
On 2 January 2018 at 10:56, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 23 December 2017 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote: >>> - PartitionDispatch **pd, >>> - ResultRelInfo ***partitions, >>> - TupleConversionMap ***tup_conv_maps, >>> - TupleTableSlot **partition_tuple_slot, >>> - int *num_parted, int *num_partitions) >>> + PartitionTupleRouting **partition_tuple_routing) >>> >>> Since we're consolidating all of ExecSetupPartitionTupleRouting's >>> output parameters into a single structure, I think it might make more >>> sense to have it just return that value. I think it's only done with >>> output parameter today because there are so many different things >>> being produced, and we can't return them all. >> >> You mean ExecSetupPartitionTupleRouting() will return the structure >> (not pointer to structure), and the caller will get the copy of the >> structure like this ? : >> >> mtstate->mt_partition_tuple_routing = >> ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate); >> >> I am ok with that, but just wanted to confirm if that is what you are >> saying. I don't recall seeing a structure return value in PG code, so >> not sure if it is conventional in PG to do that. Hence, I am somewhat >> inclined to keep it as output param. It also avoids a structure copy. >> >> Another way is for ExecSetupPartitionTupleRouting() to palloc this >> structure, and return its pointer, but then caller would have to >> anyway do a structure copy, so that's not convenient, and I don't >> think you are suggesting this way either. > > I'm pretty sure Robert is suggesting that > ExecSetupPartitionTupleRouting pallocs the memory for the structure, > sets it up then returns a pointer to the new struct. That's not very > unusual. It seems unusual for a function to return void and modify a > single parameter pointer to get the value to the caller rather than > just to return that value. Sorry, my mistake. Earlier I somehow was under the impression that the callers of ExecSetupPartitionTupleRouting() already have this structure palloc'ed, and that they pass the address of this structure. I now can see that both CopyStateData->partition_tuple_routing and ModifyTableState->mt_partition_tuple_routing are pointers, not structures. So it makes perfect sense for ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry for the noise. Will share the change in an upcoming patch version. Thanks! -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
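As an aside, the two calling conventions being compared reduce to something like the following minimal sketch of the palloc-and-return style that was settled on (the real argument list of ExecSetupPartitionTupleRouting is longer than shown here):

    PartitionTupleRouting *
    ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
    {
        PartitionTupleRouting *proute = (PartitionTupleRouting *)
            palloc0(sizeof(PartitionTupleRouting));

        /* ... fill in dispatch info, leaf ResultRelInfos, conversion maps ... */

        return proute;
    }

    /* caller side, e.g. in ExecInitModifyTable() or CopyFrom() */
    mtstate->mt_partition_tuple_routing =
        ExecSetupPartitionTupleRouting(mtstate, rel);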
On 1 January 2018 at 21:43, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote: >> + /* >> + * UPDATEs set the transition capture map only when a new subplan >> + * is chosen. But for INSERTs, it is set for each row. So after >> + * INSERT, we need to revert back to the map created for UPDATE; >> + * otherwise the next UPDATE will incorrectly use the one created >> + * for INESRT. So first save the one created for UPDATE. >> + */ >> + if (mtstate->mt_transition_capture) >> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map; >> >> I wonder if there is some more elegant way to handle this problem. >> Basically, the issue is that ExecInsert() is stomping on >> mtstate->mt_transition_capture, and your solution is to save and >> restore the value you want to have there. But maybe we could instead >> find a way to get ExecInsert() not to stomp on that state in the first >> place. It seems like the ON CONFLICT stuff handled that by adding a >> second TransitionCaptureState pointer to ModifyTable, thus >> mt_transition_capture and mt_oc_transition_capture. By that >> precedent, we could add mt_utr_transition_capture or similar, and >> maybe that's the way to go. It seems a bit unsatisfying, but so does >> what you have now. > > In case of ON CONFLICT, if there are both INSERT and UPDATE statement > triggers referencing transition tables, both of the triggers need to > independently populate their own transition tables, and hence the need > for two separate transition states : mt_transition_capture and > mt_oc_transition_capture. But in case of update-tuple-routing, the > INSERT statement trigger won't come into picture. So the same > mt_transition_capture can serve the purpose of populating the > transition table with OLD and NEW rows. So I think it would be too > redundant, if not incorrect, to have a whole new transition state for > update tuple routing. > > I will see if it turns out better to have two tcs_maps in > TransitionCaptureState, one for update and one for insert. But this, > on first look, does not look good. Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So upd_del_tcs_maps will be updated only after we start with the next UPDATE subplan, whereas insert_tcs_maps will keep on getting updated for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But the issue is : even if the event is TRIGGER_EVENT_UPDATE, we don't know whether this is caused by a normal update or as part of an insert into new partition during partition-key-update. So blindly using upd_del_tcs_maps is incorrect. If the event is caused by the later, we should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not have the information in trigger.c as to what caused this event. So, overall, it would not work, and even if we make it work by passing or storing some more information somewhere, the AfterTriggerSaveEvent() logic will become too complicated. So I can't think of anything else, but to keep the way I did, i.e. reverting back the tcs_map once insert finishes. We so a similar thing for reverting back the estate->es_result_relation_info. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
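To make the save-and-restore being defended above concrete, the relevant part of ExecUpdate() in the patch is roughly the following (condensed, not the exact patch text):

    TupleConversionMap *saved_tcs_map = NULL;

    if (mtstate->mt_transition_capture)
        saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

    /* ExecInsert() sets tcs_map to the map of whichever partition it routes to. */
    slot = ExecInsert(mtstate, slot, planSlot, NULL, ONCONFLICT_NONE,
                      estate, canSetTag);

    /*
     * Restore the map that was set up for the current UPDATE subplan, so that
     * subsequent non-row-movement updates do not use the map left behind by
     * the insert.
     */
    if (mtstate->mt_transition_capture)
        mtstate->mt_transition_capture->tcs_map = saved_tcs_map;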
On 20 December 2017 at 11:52, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 14 December 2017 at 08:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> Regarding ExecSetupChildParentMap(), it seems to me that it could simply >> be declared as >> >> static void ExecSetupChildParentMap(ModifyTableState *mtstate); >> >> Looking at the places from where it's called, it seems that you're just >> extracting information from mtstate and passing the same for the rest of >> its arguments. > > Agreed. But the last parameter per_leaf might be necessary. I will > defer this until I address Robert's concern about the complexity of > the related code. Removed those parameters, but kept perleaf. The map required for update-tuple-routing is a per-subplan one despite the presence of partition tuple routing. And we cannot deduce from mtstate whether update tuple routing is true. So for this case, the caller has to explicitly specify that per-subplan map has to be created. >> >> tupconv_map_for_subplan() looks like it could be done as a macro. > > Or may be inline function. I will again defer this for similar reason > as the above deferred item about ExecSetupChildParentMap parameters. > Made it inline. Did the above changes in attached update-partition-key_v33.patch On 3 January 2018 at 11:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 2 January 2018 at 10:56, David Rowley <david.rowley@2ndquadrant.com> wrote: >> I'm pretty sure Robert is suggesting that >> ExecSetupPartitionTupleRouting pallocs the memory for the structure, >> sets it up then returns a pointer to the new struct. That's not very >> unusual. It seems unusual for a function to return void and modify a >> single parameter pointer to get the value to the caller rather than >> just to return that value. > > Sorry, my mistake. Earlier I somehow was under the impression that the > callers of ExecSetupPartitionTupleRouting() already have this > structure palloc'ed, and that they pass address of this structure. I > now can see that both CopyStateData->partition_tuple_routing and > ModifyTableState->mt_partition_tuple_routing are pointers, not > structures. So it make perfect sense for > ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry > for the noise. Will share the change in an upcoming patch version. > Thanks ! ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *. Did this change in v3 version of 0001-Encapsulate-partition-related-info-in-a-structure.patch -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
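For context, the inline helper mentioned above presumably ends up shaped roughly like this; the shape is a guess based on the discussion, and the exact member names (mt_is_tupconv_perpart, mt_subplan_partition_offsets, mt_childparent_tupconv_maps) follow the terminology used in this thread rather than any final patch:

    static inline TupleConversionMap *
    tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
    {
        if (mtstate->mt_is_tupconv_perpart)
        {
            /* Map is indexed by leaf partition; translate the subplan index. */
            int     leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];

            return mtstate->mt_childparent_tupconv_maps[leaf_index];
        }

        /* Map is indexed directly by subplan. */
        return mtstate->mt_childparent_tupconv_maps[whichplan];
    }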
> On 3 January 2018 at 11:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> [...] So it make perfect sense for >> ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry >> for the noise. Will share the change in an upcoming patch version. >> Thanks ! > > ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *. Thanks for changing. I've just done almost a complete review of v32. (v33 came along a bit sooner than I thought). I've not finished looking at the regression tests yet, but here are a few things, some may have been changed in v33, I've not looked yet. Also apologies in advance if anything seems nitpicky. 1. "by INSERT" -> "by an INSERT" in: from the original partition followed by <command>INSERT</command> into the 2. "and INSERT" -> "and an INSERT" in: a <command>DELETE</command> and <command>INSERT</command>. As far as 3. "due partition-key change" -> "due to the partition-key being changed" in: * capture is happening for UPDATEd rows being moved to another partition due * partition-key change, then this function is called once when the row is 4. "inserted to another" -> "inserted into another" in: * deleted (to capture OLD row), and once when the row is inserted to another 5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for UPDATE events" (plural) * oldtup and newtup are non-NULL. But for UPDATE event fired for I'm unsure if you need singular or plural. It perhaps does not matter. 6. "for row" -> "for a row" in: * movement, oldtup is NULL when the event is for row being inserted, Likewise in: * whereas newtup is NULL when the event is for row being deleted. 7. In the following fragment the code does not do what the comment says: /* * If transition tables are the only reason we're here, return. As * mentioned above, we can also be here during update tuple routing in * presence of transition tables, in which case this function is called * separately for oldtup and newtup, so either can be NULL, not both. */ if (trigdesc == NULL || (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) || (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) || (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) || (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) return; With the comment; "so either can be NULL, not both.", I'd expect a boolean OR not an XOR. maybe the comment is better written as: "so we expect exactly one of them to be non-NULL" (I know you've been discussing with Robert, so I've not checked v33 to see if this still exists) 8. I'm struggling to make sense of this: /* * Save a tuple conversion map to convert a tuple routed to this * partition from the parent's type to the partition's. */ Maybe it's better to write this as: /* * Generate a tuple conversion map to convert tuples of the parent's * type into the partition's type. */ 9. insert should be capitalised here and should be prefixed with "an": /* * Verify result relation is a valid target for insert operation. Even * for updates, we are doing this for tuple-routing, so again, we need * to check the validity for insert operation. */ CheckValidResultRel(leaf_part_rri, CMD_INSERT); Maybe it's better to write: /* * Verify result relation is a valid target for an INSERT. An UPDATE of * a partition-key becomes a DELETE/INSERT operation, so this check is * still required when the operation is CMD_UPDATE. */ 10. 
The following code would be more clear if you replaced mtstate->mt_transition_capture with transition_capture. if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture && mtstate->mt_transition_capture->tcs_update_new_table) { ExecARUpdateTriggers(estate, resultRelInfo, NULL, NULL, tuple, NULL, mtstate->mt_transition_capture); /* * Now that we have already captured NEW TABLE row, any AR INSERT * trigger should not again capture it below. Arrange for the same. */ transition_capture = NULL; } You are, after all, doing: transition_capture = mtstate->mt_transition_capture; at the top of the function. There are a few other places you're also accessing mtstate->mt_transition_capture. 11. Should tuple_deleted and process_returning be camelCase like the other params?: static TupleTableSlot * ExecDelete(ModifyTableState *mtstate, ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot, EPQState *epqstate, EState *estate, bool *tuple_deleted, bool process_returning, bool canSetTag) 12. The following comment talks about "target table descriptor", which I think is a good term. In several other places, you mention "root", which I take it to mean "target table". * This map array is required for two purposes : * 1. For update-tuple-routing. We need to convert the tuple from the subplan * result rel to the root partitioned table descriptor. * 2. For capturing transition tuples when the target table is a partitioned * table. For updates, we need to convert the tuple from subplan result rel to * target table descriptor, and for inserts, we need to convert the inserted * tuple from leaf partition to the target table descriptor. I'd personally rather we always talked about "target" rather than "root". I understand there's probably many places in the code where we talk about the target table as "root", but I really think we need to fix that, so I'd rather not see the problem get any worse before it gets better. The comment block might also look better if you tab indent after the 1. and 2. then on each line below it. Also the space before the ':' is not correct. 13. Does the following code really need to palloc0 rather than just palloc? /* * Build array of conversion maps from each child's TupleDesc to the * one used in the tuplestore. The map pointers may be NULL when no * conversion is necessary, which is hopefully a common case for * partitions. */ mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * numResultRelInfos); I don't see any case in the initialization of the array where any of the elements are not assigned a value, so I think palloc() is fine. 14. I don't really like the way tupconv_map_for_subplan() works. It would be nice to have two separate functions for this, but looking a bit more at it, it seems the caller won't just need to always call exactly one of those functions. I don't have any ideas to improve it, so this is just a note. 15. I still don't really like the way ExecInitModifyTable() sets and unsets update_tuple_routing_needed. 
I know we talked about this before, but couldn't you just change: if (resultRelInfo->ri_TrigDesc && resultRelInfo->ri_TrigDesc->trig_update_before_row && operation == CMD_UPDATE) update_tuple_routing_needed = true; To: if (resultRelInfo->ri_TrigDesc && resultRelInfo->ri_TrigDesc->trig_update_before_row && node->partitioned_rels != NIL && operation == CMD_UPDATE) update_tuple_routing_needed = true; and get rid of: /* * If it's not a partitioned table after all, UPDATE tuple routing should * not be attempted. */ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) update_tuple_routing_needed = false; looking at inheritance_planner(), partitioned_rels is only set to a non-NIL value if parent_rte->relkind == RELKIND_PARTITIONED_TABLE. 16. "named" -> "target" in: * 'partKeyUpdated' is true if any partitioning columns are being updated, * either from the named relation or a descendent partitioned table. I guess we're calling this one of; root, named, target :-( 17. You still have the following comment in ModifyTableState but you've moved all those fields out to PartitionTupleRouting: /* Tuple-routing support info */ 18. Should the following not be just called partKeyUpdate (without the 'd')? bool partKeyUpdated; /* some part key in hierarchy updated */ This occurs in the planner were the part key is certainly being updated. 19. In pathnode.h you've named a parameter partColsUpdated, but the function in the .c file calls it partKeyUpdated. I'll try to look at the tests tomorrow and also do some testing. So far I've only read the code and the docs. Overall, the patch appears to look quite good. Good to see the various cleanups going in like the new PartitionTupleRouting struct. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Robert, for tracking purpose, below I have consolidated your review items on which we are yet to conclude. Let me know if you have more comments on the points which I made. ------------------ 1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert() ------------------ >> + /* >> + * UPDATEs set the transition capture map only when a new subplan >> + * is chosen. But for INSERTs, it is set for each row. So after >> + * INSERT, we need to revert back to the map created for UPDATE; >> + * otherwise the next UPDATE will incorrectly use the one created >> + * for INESRT. So first save the one created for UPDATE. >> + */ >> + if (mtstate->mt_transition_capture) >> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map; >> >> I wonder if there is some more elegant way to handle this problem. >> Basically, the issue is that ExecInsert() is stomping on >> mtstate->mt_transition_capture, and your solution is to save and >> restore the value you want to have there. But maybe we could instead >> find a way to get ExecInsert() not to stomp on that state in the first >> place. It seems like the ON CONFLICT stuff handled that by adding a >> second TransitionCaptureState pointer to ModifyTable, thus >> mt_transition_capture and mt_oc_transition_capture. By that >> precedent, we could add mt_utr_transition_capture or similar, and >> maybe that's the way to go. It seems a bit unsatisfying, but so does >> what you have now. > > In case of ON CONFLICT, if there are both INSERT and UPDATE statement > triggers referencing transition tables, both of the triggers need to > independently populate their own transition tables, and hence the need > for two separate transition states : mt_transition_capture and > mt_oc_transition_capture. But in case of update-tuple-routing, the > INSERT statement trigger won't come into picture. So the same > mt_transition_capture can serve the purpose of populating the > transition table with OLD and NEW rows. So I think it would be too > redundant, if not incorrect, to have a whole new transition state for > update tuple routing. > > I will see if it turns out better to have two tcs_maps in > TransitionCaptureState, one for update and one for insert. But this, > on first look, does not look good. Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So upd_del_tcs_maps will be updated only after we start with the next UPDATE subplan, whereas insert_tcs_maps will keep on getting updated for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But the issue is : even if the event is TRIGGER_EVENT_UPDATE, we don't know whether this is caused by a normal update or as part of an insert into new partition during partition-key-update. So blindly using upd_del_tcs_maps is incorrect. If the event is caused by the later, we should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not have the information in trigger.c as to what caused this event. So, overall, it would not work, and even if we make it work by passing or storing some more information somewhere, the AfterTriggerSaveEvent() logic will become too complicated. So I can't think of anything else, but to keep the way I did, i.e. reverting back the tcs_map once insert finishes. We so a similar thing for reverting back the estate->es_result_relation_info. ------------------ 2. 
mt_childparent_tupconv_maps is indexed by subplan or partition leaf index. ------------------ > + * If per-leaf map is required and the map is already created, that map > + * has to be per-leaf. If that map is per-subplan, we won't be able to > + * access the maps leaf-partition-wise. But if the map is per-leaf, we > + * will be able to access the maps subplan-wise using the > + * subplan_partition_offsets map using function > + * tupconv_map_for_subplan(). So if the callers might need to access > + * the map both leaf-partition-wise and subplan-wise, they should make > + * sure that the first time this function is called, it should be > + * called with perleaf=true so that the map created is per-leaf, not > + * per-subplan. > > This sounds complicated and fragile. It ends up meaning that > mt_childparent_tupconv_maps is sometimes indexed by subplan number and > sometimes by partition leaf index, which is extremely confusing and > likely to lead to coding errors, either in this patch or in future > ones. Even if we always index the map by leaf partition, while accessing the map the code still needs to be aware of whether the index number with which we are accessing the map is the subplan number or leaf partition number: If the access is by subplan number, use subplan_partition_offsets to convert to the leaf partition index. So the function tupconv_map_for_subplan() is anyways necessary for accessing using subplan index. Only thing that will change is : tupconv_map_for_subplan() will not have to check if the the map is indexed by leaf partition or not. But that complexity is hidden in this function alone; the outside code need not worry about that. If the access is by leaf partition number, I think you are worried here that the map might have been incorrectly indexed by subplan, and the code might access it partition-wise. Currently we access the map by leaf-partition-index only when setting up mtstate->mt_*transition_capture->tcs_map during inserts. At that place, there is an Assert(mtstate->mt_is_tupconv_perpart == true). May be, we can have another function tupconv_map_for_partition() rather than directly accessing mt_childparent_tupconv_maps[], and have this Assert() in that function. What do you say ? I am more inclined towards avoiding an always-leaf-partition-indexed map for additional reasons mentioned below ... > Would it be reasonable to just always do this by partition leaf > index, even if we don't strictly need that set of mappings? If there are no transition tables in picture, we don't require per-leaf child-parent conversion. So, this would mean that the tuple conversion maps will be set up for all (say, 100) leaf partitions even if there are only, say, a couple of update plans. I feel this would unnecessarily increase the startup cost of update-partition-key operation. ------------------ 3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps ------------------ > > Likewise, I'm not sure I get the point of mt_transition_tupconv_maps > -> mt_childparent_tupconv_maps. That seems like it could similarly be > left alone. We need to change it's name because now this map is not only used for transition capture, but also for update-tuple-routing. Does it look ok for you if, for readability, we keep the childparent tag ? Or else, we can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps" looks more informative. ------------------- 4. 
Explicit signaling for "we are only here for transition tables" ------------------- > > + /* > + * If transition tables are the only reason we're here, return. As > + * mentioned above, we can also be here during update tuple routing in > + * presence of transition tables, in which case this function is called > + * separately for oldtup and newtup, so either can be NULL, not both. > + */ > if (trigdesc == NULL || > (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) || > (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) || > - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row)) > + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) || > + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) > > I guess this is correct, but it seems awfully fragile. Can't we have > some more explicit signaling about whether we're only here for > transition tables, rather than deducing it based on exactly one of > oldtup and newtup being NULL? I had given a thought on this earlier. I felt, even the pre-existing conditions like "!trigdesc->trig_update_after_row" are all indirect ways to determine that this function is called only to capture transition tables, and thought that it may have been better to have separate parameter transition_table_only. But then decided that I can continue on similar lines and add another such condition to indicate that we are only capturing update-routed tuples. Instead of adding another parameter to AfterTriggerSaveEvent(), I had also considered another approach: Put the transition-tuples-capture logic part of AfterTriggerSaveEvent() into a helper function CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead of calling ExecARUpdateTriggers(), call this function CaptureTransitionTables(). I then dropped this idea and thought rather to call ExecARUpdateTriggers() which neatly does the required checks and other things like locking the old tuple via GetTupleForTrigger(). So if we go by CaptureTransitionTables(), we would need to do what ExecARUpdateTriggers() does before calling CaptureTransitionTables(). This is doable. If you think this is worth doing so as to get rid of the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.
On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote: > I'll try to look at the tests tomorrow and also do some testing. So > far I've only read the code and the docs. There are a few more things I noticed on another pass I made today: 20. "carried" -> "carried out the" + would have identified the newly updated row and carried + <command>UPDATE</command>/<command>DELETE</command> on this new row 21. Extra new line + <xref linkend="ddl-partitioning-declarative-limitations">. + </para> 22. In copy.c CopyFrom() you have the following code: /* * We might need to convert from the parent rowtype to the * partition rowtype. */ map = proute->partition_tupconv_maps[leaf_part_index]; if (map) { Relation partrel = resultRelInfo->ri_RelationDesc; tuple = do_convert_tuple(tuple, map); /* * We must use the partition's tuple descriptor from this * point on. Use a dedicated slot from this point on until * we're finished dealing with the partition. */ slot = proute->partition_tuple_slot; Assert(slot != NULL); ExecSetSlotDescriptor(slot, RelationGetDescr(partrel)); ExecStoreTuple(tuple, slot, InvalidBuffer, true); } Should this use ConvertPartitionTupleSlot() instead? 23. Why write; last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans; when you can write; last_resultRelInfo = mtstate->resultRelInfo[mtstate->mt_nplans];? 24. In ExecCleanupTupleRouting(), do you think that you could just have a special case loop for (mtstate && mtstate->operation == CMD_UPDATE)? /* * If this result rel is one of the UPDATE subplan result rels, let * ExecEndPlan() close it. For INSERT or COPY, this does not apply * because leaf partition result rels are always newly allocated. */ if (is_update && resultRelInfo >= first_resultRelInfo && resultRelInfo < last_resultRelInfo) continue; Something like: if (mtstate && mtstate->operation == CMD_UPDATE) { ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo; ResultRelInfo *last_resultRelInfo = mtstate->resultRelInfo[mtstate->mt_nplans]; for (i = 0; i < proute->num_partitions; i++) { ResultRelInfo *resultRelInfo = proute->partitions[i]; /* * Leave any resultRelInfos that belong to the UPDATE's subplan * list. These will be closed during executor shutdown. */ if (resultRelInfo >= first_resultRelInfo && resultRelInfo < last_resultRelInfo) continue; ExecCloseIndices(resultRelInfo); heap_close(resultRelInfo->ri_RelationDesc, NoLock); } } else { for (i = 0; i < proute->num_partitions; i++) { ResultRelInfo *resultRelInfo = proute->partitions[i]; ExecCloseIndices(resultRelInfo); heap_close(resultRelInfo->ri_RelationDesc, NoLock); } } -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
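Regarding point 22, if CopyFrom() were switched over to the helper, the quoted block would presumably collapse to something like the call below. This is only a guess at the resulting call, based on the helper's signature quoted earlier in the thread (with the mtstate argument already dropped):

    /*
     * Convert the tuple to the partition's rowtype if a map exists, switching
     * to the dedicated partition tuple slot in that case.
     */
    tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
                                      tuple,
                                      proute->partition_tuple_slot,
                                      &slot);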
On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *. > > Did this change in v3 version of > 0001-Encapsulate-partition-related-info-in-a-structure.patch I'll have to come back to some of the other open issues, but 0001 and 0005 look good to me now, so I pushed those as a single commit after fixing a few things that pgindent didn't like. I also think 0002 and 0003 look basically good, so I pushed those two as a single commit also. But the comment changes in 0003 didn't seem extensive enough to me so I made a few more changes there along the way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 5 January 2018 at 03:04, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *. >> >> Did this change in v3 version of >> 0001-Encapsulate-partition-related-info-in-a-structure.patch > > I'll have to come back to some of the other open issues, but 0001 and > 0005 look good to me now, so I pushed those as a single commit after > fixing a few things that pgindent didn't like. I also think 0002 and > 0003 look basically good, so I pushed those two as a single commit > also. But the comment changes in 0003 didn't seem extensive enough to > me so I made a few more changes there along the way. Thanks. Attached is a rebased update-partition-key_v34.patch, which also has the changes as per David Rowley's review comments as explained below. The above patch is to be applied over the last remaining preparatory patch, now named (and attached) : 0001-Refactor-CheckConstraint-related-code.patch On 3 January 2018 at 19:22, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've not finished looking at the regression tests yet, but here are a > few things, some may have been changed in v33, I've not looked yet. > Also apologies in advance if anything seems nitpicky. No worries. In fact, it's good to do this right now, otherwise it's difficult to notice and fix at later point of time. Thanks. > > 1. "by INSERT" -> "by an INSERT" in: > > from the original partition followed by <command>INSERT</command> into the > > 2. "and INSERT" -> "and an INSERT" in: > > a <command>DELETE</command> and <command>INSERT</command>. As far as > > 3. "due partition-key change" -> "due to the partition-key being changed" in: > > * capture is happening for UPDATEd rows being moved to another partition due > * partition-key change, then this function is called once when the row is > > 4. "inserted to another" -> "inserted into another" in: > > * deleted (to capture OLD row), and once when the row is inserted to another > > 5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for > UPDATE events" (plural) > > * oldtup and newtup are non-NULL. But for UPDATE event fired for > > I'm unsure if you need singular or plural. It perhaps does not matter. > > 6. "for row" -> "for a row" in: > > * movement, oldtup is NULL when the event is for row being inserted, > > Likewise in: > > * whereas newtup is NULL when the event is for row being deleted. Done all of the above. > > 7. In the following fragment the code does not do what the comment says: > > /* > * If transition tables are the only reason we're here, return. As > * mentioned above, we can also be here during update tuple routing in > * presence of transition tables, in which case this function is called > * separately for oldtup and newtup, so either can be NULL, not both. > */ > if (trigdesc == NULL || > (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) || > (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) || > (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) || > (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))) > return; > > With the comment; "so either can be NULL, not both.", I'd expect a > boolean OR not an XOR. > > maybe the comment is better written as: > > "so we expect exactly one of them to be non-NULL" Ok. 
Made it : "so we expect exactly one of them to be NULL" > > (I know you've been discussing with Robert, so I've not checked v33 to > see if this still exists) Yes, it's not yet concluded. > > 8. I'm struggling to make sense of this: > > /* > * Save a tuple conversion map to convert a tuple routed to this > * partition from the parent's type to the partition's. > */ > > Maybe it's better to write this as: > > /* > * Generate a tuple conversion map to convert tuples of the parent's > * type into the partition's type. > */ This is existing code; not from my patch. > > 9. insert should be capitalised here and should be prefixed with "an": > > /* > * Verify result relation is a valid target for insert operation. Even > * for updates, we are doing this for tuple-routing, so again, we need > * to check the validity for insert operation. > */ > CheckValidResultRel(leaf_part_rri, CMD_INSERT); > > Maybe it's better to write: > > /* > * Verify result relation is a valid target for an INSERT. An UPDATE of > * a partition-key becomes a DELETE/INSERT operation, so this check is > * still required when the operation is CMD_UPDATE. > */ Done. Instead of DELETE/INSERT, used DELETE+INSERT. > > 10. The following code would be more clear if you replaced > mtstate->mt_transition_capture with transition_capture. > > if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture > && mtstate->mt_transition_capture->tcs_update_new_table) > { > ExecARUpdateTriggers(estate, resultRelInfo, NULL, > NULL, > tuple, > NULL, > mtstate->mt_transition_capture); > > /* > * Now that we have already captured NEW TABLE row, any AR INSERT > * trigger should not again capture it below. Arrange for the same. > */ > transition_capture = NULL; > } > > You are, after all, doing: > > transition_capture = mtstate->mt_transition_capture; > > at the top of the function. There are a few other places you're also > accessing mtstate->mt_transition_capture. Actually I wanted to be able to have a temporary variable that has it's scope only for ExecARInsertTriggers(). But because that wasn't possible, had to declare it at the top. I feel if we use transition_capture all over, and if some future code below the NULL assignment starts using transition_capture, it will wrongly get the left-over NULL value. Instead, what I have done is : used a special variable name only for this purpose : ar_insert_trig_tcs, so that code won't use this variable, by looking at it's name. And also moved it's assignment down to where it is used the first time. Similarly for ExecDelete(), used ar_delete_trig_tcs. > > 11. Should tuple_deleted and process_returning be camelCase like the > other params?: > > static TupleTableSlot * > ExecDelete(ModifyTableState *mtstate, > ItemPointer tupleid, > HeapTuple oldtuple, > TupleTableSlot *planSlot, > EPQState *epqstate, > EState *estate, > bool *tuple_deleted, > bool process_returning, > bool canSetTag) Done. > > 12. The following comment talks about "target table descriptor", which > I think is a good term. In several other places, you mention "root", > which I take it to mean "target table". > > * This map array is required for two purposes : > * 1. For update-tuple-routing. We need to convert the tuple from the subplan > * result rel to the root partitioned table descriptor. > * 2. For capturing transition tuples when the target table is a partitioned > * table. 
For updates, we need to convert the tuple from subplan result rel to > * target table descriptor, and for inserts, we need to convert the inserted > * tuple from leaf partition to the target table descriptor. > > I'd personally rather we always talked about "target" rather than > "root". I understand there's probably many places in the code > where we talk about the target table as "root", but I really think we > need to fix that, so I'd rather not see the problem get any worse > before it gets better. Not very sure if that's true at all places. In some contexts, it makes sense to use root to emphasize that it is the root partitioned table. E.g. : + * For ExecInsert(), make it look like we are inserting into the + * root. + */ + Assert(mtstate->rootResultRelInfo != NULL); + estate->es_result_relation_info = mtstate->rootResultRelInfo; + * resultRelInfo is one of the per-subplan resultRelInfos. So we + * should convert the tuple into root's tuple descriptor, since + * ExecInsert() starts the search from root. The tuple conversion > > The comment block might also look better if you tab indent after the > 1. and 2. then on each line below it. Used spaces instead of tab, because tab was taking it too much away from the numbers, which looked odd. > Also the space before the ':' is not correct. Done > > 13. Does the following code really need to palloc0 rather than just palloc? > > /* > * Build array of conversion maps from each child's TupleDesc to the > * one used in the tuplestore. The map pointers may be NULL when no > * conversion is necessary, which is hopefully a common case for > * partitions. > */ > mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **) > palloc0(sizeof(TupleConversionMap *) * numResultRelInfos); > > I don't see any case in the initialization of the array where any of > the elements are not assigned a value, so I think palloc() is fine. Right. Used palloc(). > > 14. I don't really like the way tupconv_map_for_subplan() works. It > would be nice to have two separate functions for this, but looking a > bit more at it, it seems the caller won't just need to always call > exactly one of those functions. I don't have any ideas to improve it, > so this is just a note. I am assuming you mean one function for the case where mt_is_tupconv_perpart is true, and the other function when it is not true. The idea is, the caller should not have to worry if the map is per-subplan or not. > > 15. I still don't really like the way ExecInitModifyTable() sets and > unsets update_tuple_routing_needed. I know we talked about this > before, but couldn't you just change: > > if (resultRelInfo->ri_TrigDesc && > resultRelInfo->ri_TrigDesc->trig_update_before_row && > operation == CMD_UPDATE) > update_tuple_routing_needed = true; > > To: > > if (resultRelInfo->ri_TrigDesc && > resultRelInfo->ri_TrigDesc->trig_update_before_row && > node->partitioned_rels != NIL && > operation == CMD_UPDATE) > update_tuple_routing_needed = true; > > and get rid of: > ..... > if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) > update_tuple_routing_needed = false; > > looking at inheritance_planner(), partitioned_rels is only set to a > non-NIL value if parent_rte->relkind == RELKIND_PARTITIONED_TABLE. 
> Initially update_tuple_routing_needed can be already true because of : bool update_tuple_routing_needed = node->partKeyUpdated; So if it's not a partitioned table and update_tuple_routing_needed is set to true due to the above declaration, the variable will remain true if we don't check the relkind in the end, which means the final conclusion will be that update-tuple-routing is required, when it is really not. Now, I understand that node->partKeyUpdated will not be true if it's a partitioned table, but I think we better play safe here. partKeyUpdated as per its name implies whether any of the partition key columns are updated; it does not imply whether the target table is a partitioned table or just a partition. > 16. "named" -> "target" in: > > * 'partKeyUpdated' is true if any partitioning columns are being updated, > * either from the named relation or a descendent partitioned table. > > I guess we're calling this one of; root, named, target :-( Changed it to: * either from the target relation or a descendent partitioned table. > > 17. You still have the following comment in ModifyTableState but > you've moved all those fields out to PartitionTupleRouting: > > /* Tuple-routing support info */ This comment applies to mt_partition_tuple_routing field. > > 18. Should the following not be just called partKeyUpdate (without the 'd')? > > bool partKeyUpdated; /* some part key in hierarchy updated */ > > This occurs in the planner were the part key is certainly being updated. > Actually the way it is named, it can mean : the partition key "is updated" or "..has been updated" or "..is being updated" all of which make sense. This sounds consistent with the name RangeTblEntry->updatedCols that means "which of the columns are being updated". > 19. In pathnode.h you've named a parameter partColsUpdated, but the > function in the .c file calls it partKeyUpdated. Renamed partColsUpdated to partKeyUpdated. > > I'll try to look at the tests tomorrow and also do some testing. So > far I've only read the code and the docs. Thanks David. Your review is valuable. > 20. "carried" -> "carried out the" > > + would have identified the newly updated row and carried > + <command>UPDATE</command>/<command>DELETE</command> on this new row Done. > > 21. Extra new line > > + <xref linkend="ddl-partitioning-declarative-limitations">. > + > </para> Done. I am not sure when exactly, but this line has started giving compile errors, seemingly because > should be />. Fixed it. > > 22. In copy.c CopyFrom() you have the following code: > > /* > * We might need to convert from the parent rowtype to the > * partition rowtype. > */ > map = proute->partition_tupconv_maps[leaf_part_index]; > if (map) > { > Relation partrel = resultRelInfo->ri_RelationDesc; > > tuple = do_convert_tuple(tuple, map); > > /* > * We must use the partition's tuple descriptor from this > * point on. Use a dedicated slot from this point on until > * we're finished dealing with the partition. > */ > slot = proute->partition_tuple_slot; > Assert(slot != NULL); > ExecSetSlotDescriptor(slot, RelationGetDescr(partrel)); > ExecStoreTuple(tuple, slot, InvalidBuffer, true); > } > > Should this use ConvertPartitionTupleSlot() instead? I will have a look at it to see if we can use ConvertPartitionTupleSlot() without any changes. (TODO) > > 23. Why write; > > last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans; > > when you can write; > > last_resultRelInfo = mtstate->resultRelInfo[mtstate->mt_nplans];? 
You meant : (with &) > last_resultRelInfo = &mtstate->resultRelInfo[mtstate->mt_nplans];? I think both are equally good, and equally readable. In this case, we don't even want the array element, so why not just increment the pointer to a particular offset. > > > 24. In ExecCleanupTupleRouting(), do you think that you could just > have a special case loop for (mtstate && mtstate->operation == > CMD_UPDATE)? > > /* > * If this result rel is one of the UPDATE subplan result rels, let > * ExecEndPlan() close it. For INSERT or COPY, this does not apply > * because leaf partition result rels are always newly allocated. > */ > if (is_update && > resultRelInfo >= first_resultRelInfo && > resultRelInfo < last_resultRelInfo) > continue; > > Something like: > > if (mtstate && mtstate->operation == CMD_UPDATE) > { > ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo; > ResultRelInfo *last_resultRelInfo = > mtstate->resultRelInfo[mtstate->mt_nplans]; > > for (i = 0; i < proute->num_partitions; i++) > { > ResultRelInfo *resultRelInfo = proute->partitions[i]; > > /* > * Leave any resultRelInfos that belong to the UPDATE's subplan > * list. These will be closed during executor shutdown. > */ > if (resultRelInfo >= first_resultRelInfo && > resultRelInfo < last_resultRelInfo) > continue; > > ExecCloseIndices(resultRelInfo); > heap_close(resultRelInfo->ri_RelationDesc, NoLock); > } > } > else > { > for (i = 0; i < proute->num_partitions; i++) > { > ResultRelInfo *resultRelInfo = proute->partitions[i]; > > ExecCloseIndices(resultRelInfo); > heap_close(resultRelInfo->ri_RelationDesc, NoLock); > } > } I thought it's not worth having two separate loops in order to reduce one if(is_update) condition in case of inserts. Although we will have one less is_update check per partition, the code is not running per-row. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > The above patch is to be applied over the last remaining preparatory > patch, now named (and attached) : > 0001-Refactor-CheckConstraint-related-code.patch Committed that one, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote: > I'll try to look at the tests tomorrow and also do some testing. I've made a pass over the tests. Again, sometimes I'm probably a bit pedantic. The reason for that is that the tests are not that easy to follow. Moving creation and cleanup of objects closer to where they're used and no longer needed makes it easier to read through and verify the tests. There are some genuine mistakes in there too. 1. NEW.c = NEW.c + 1; -- Make even number odd, or vice versa This seems to be worded as if there'd only ever be one number. I think it should be plural and read "Make even numbers odd, and vice versa" 2. The following comment does not make a huge amount of sense. -- UPDATE with -- partition key or non-partition columns, with different column ordering, -- triggers. Should "or" be "on"? Does ", triggers" mean "with triggers"? 3. The follow test tries to test a BEFORE DELETE trigger stopping a DELETE on sub_part1, but going by the SELECT, there are no rows in that table to stop being DELETEd. select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4; tableoid | a | b | c ------------+---+----+---- list_part1 | 2 | 52 | 50 list_part1 | 3 | 6 | 60 sub_part2 | 1 | 2 | 10 sub_part2 | 1 | 2 | 70 (4 rows) drop trigger parted_mod_b ON sub_part1 ; -- If BR DELETE trigger prevented DELETE from happening, we should also skip -- the INSERT if that delete is part of UPDATE=>DELETE+INSERT. create or replace function func_parted_mod_b() returns trigger as $$ begin return NULL; end $$ language plpgsql; create trigger trig_skip_delete before delete on sub_part1 for each row execute procedure func_parted_mod_b(); update list_parted set b = 1 where c = 70; select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4; tableoid | a | b | c ------------+---+----+---- list_part1 | 2 | 52 | 50 list_part1 | 3 | 6 | 60 sub_part1 | 1 | 1 | 70 sub_part2 | 1 | 2 | 10 (4 rows) You've added the BEFORE DELETE trigger to sub_part1, but you can see the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so the test is not working as you've commented. It's probably a good idea to RAISE NOTICE 'something useful here'; in the trigger function to verify they're actually being called in the test. 4. I think the final drop function in the following should be before the UPDATE FROM test. You've already done some cleanup for that test by doing "drop trigger trig_skip_delete ON sub_part1 ;" drop trigger trig_skip_delete ON sub_part1 ; -- UPDATE partition-key with FROM clause. If join produces multiple output -- rows for the same row to be modified, we should tuple-route the row only once. -- There should not be any rows inserted. create table non_parted (id int); insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3); update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1; select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4; tableoid | a | b | c ------------+---+----+---- list_part1 | 2 | 1 | 70 list_part1 | 2 | 2 | 10 list_part1 | 2 | 52 | 50 list_part1 | 3 | 6 | 60 (4 rows) drop table non_parted; drop function func_parted_mod_b(); Also, there's a space before the ; in the drop trigger above. Can that be removed? 5. The following comment: -- update to a partition should check partition bound constraint for the new tuple. -- If partition key is updated, the row should be moved to the appropriate -- partition. 
updatable views using partitions should enforce the check options -- for the rows that have been moved. Can this be changed a bit? I think it's not accurate to say that an update to a partition key causes the row to move. The row movement only occurs when the new tuple does not match the partition bound and another partition exists that does have a partition bound that matches the tuple. How about: -- When a partitioned table receives an UPDATE to the partitioned key and the -- new values no longer meet the partition's bound, the row must be moved to -- the correct partition for the new partition key (if one exists). We must -- also ensure that updatable views on partitioned tables properly enforce any -- WITH CHECK OPTION that is defined. The situation with triggers in this case -- also requires thorough testing as partition key updates causing row -- movement convert UPDATEs into DELETE+INSERT. 6. What does the following actually test? -- This tests partition-key UPDATE on a partitioned table that does not have any child partitions update part_b_10_b_20 set b = b - 6; There are no records in that partition, or anywhere in the hierarchy. Are you just testing that there's no error? If so then the comment should say so. 7. I think the following comment: -- As mentioned above, the partition creation is intentionally kept in descending bound order. should instead say: -- Create some more partitions following the above pattern of descending bound -- order, but let's make the situation a bit more complex by having the -- attribute numbers of the columns vary from their parent partition. 8. Just to make the tests a bit easier to follow, can you move the following down to where you're first using it: create table mintab(c1 int); insert into mintab values (120); and CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION; 9. It seems that the existing part of update.sql capitalises SQL keywords, but you mostly don't. I understand we're not always consistent, but can you keep it the same as the existing part of the file? 10. Stray space before trailing ':' -- fail (row movement happens only within the partition subtree) : 11. Can the following become: -- succeeds, row movement , check option passes -- success, update with row movement, check option passes: Seems there's also quite a mix of comment formats in your tests. You're using either one of; ok, success, succeeds followed by sometimes a comma, and sometimes a reason in parentheses. The existing part of the file seems to use: -- fail, <reason>: and just -- <reason>: for non-failures. Would be great to stick to what's there. 12. The following comment seems to indicate that you're installing triggers on all leaf partitions, but that's not the case: -- Install BR triggers on child partition, so that transition tuple conversion takes place. maybe you should write "on some child partitions"? Or did you mean to define a trigger on them all? 13. Stray space at the end of the case statement: update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96; 14. Stray space in the USING clause: create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true); 15. we -> we're -- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number. 16. 
The comment probably should be before the "update range_parted", not the "set session authorization": -- This should fail with RLS violation error while moving row from -- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number. set session authorization regress_range_parted_user; update range_parted set a = 'b', c = 151 where a = 'a' and c = 200; 17. trigger -> the trigger function -- part_d_1_15, because trigger makes 'c' value an even number. likewise in: -- This should fail with RLS violation error because trigger makes 'c' value -- an odd number. 18. Why two RESET SESSION AUTHORIZATIONs? reset session authorization; drop trigger trig_d_1_15 ON part_d_1_15; drop function func_d_1_15(); -- Policy expression contains SubPlan reset session authorization; 19. The following should be cleaned up in the final test that its used on rather than randomly after the next test after it: drop table mintab; 20. Comment is not worded very well: -- UPDATE which does not modify partition key of partitions that are chosen for update. Does "partitions that are chosen for update" mean "the UPDATE target"? I'm also not quite sure what the test is testing. In the past I've written tests that have a header comment as -- Ensure that <what the test is testing>. Perhaps if you can't think of what you're ensuring with the test, then the test might not be that worthwhile. 21. The following comment could be improved: -- Triggers can cause UPDATE row movement if it modified partition key. Might be better to write: -- Tests for BR UPDATE triggers changing the partition key. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 4, 2018 at 1:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > ------------------ > 1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert() > ------------------ > >>> It seems like the ON CONFLICT stuff handled that by adding a >>> second TransitionCaptureState pointer to ModifyTable, thus >>> mt_transition_capture and mt_oc_transition_capture. By that >>> precedent, we could add mt_utr_transition_capture or similar, and >>> maybe that's the way to go. It seems a bit unsatisfying, but so does >>> what you have now. >> >> In case of ON CONFLICT, if there are both INSERT and UPDATE statement >> triggers referencing transition tables, both of the triggers need to >> independently populate their own transition tables, and hence the need >> for two separate transition states : mt_transition_capture and >> mt_oc_transition_capture. But in case of update-tuple-routing, the >> INSERT statement trigger won't come into picture. So the same >> mt_transition_capture can serve the purpose of populating the >> transition table with OLD and NEW rows. So I think it would be too >> redundant, if not incorrect, to have a whole new transition state for >> update tuple routing. >> >> I will see if it turns out better to have two tcs_maps in >> TransitionCaptureState, one for update and one for insert. But this, >> on first look, does not look good. > > Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and > insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. That's not what I suggested. If you look at what I wrote, I floated the idea of having two TransitionCaptureStates, not two separate maps within the same TransitionCaptureState. > ------------------ > 2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index. > ------------------ > ------------------ > 3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps > ------------------ > > We need to change it's name because now this map is not only used for > transition capture, but also for update-tuple-routing. Does it look ok > for you if, for readability, we keep the childparent tag ? Or else, we > can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps" > looks more informative. I see your point: the array is being renamed because it now has more than one purpose. But that's also what I'm complaining about with regard to point #2: the same array is being used for more than one purpose. That's generally bad style. If you have two loops in a function, it's best to declare two separate loop variables rather than reusing the same variable. This lets the compiler detect, for example, an error where the second loop variable is used before it's initialized, which would be undetectable if you reused the same variable in both places. Although that particular benefit doesn't pertain in this case, I maintain that having a single structure member that is indexed one of two different ways is a bad idea. If I understand correctly, the way we got here is that, in earlier patch versions, you had two arrays of maps, but it wasn't clear why we needed both of them, and David suggested replacing one of them with an array of indexes instead, in the hopes of reducing confusion. However, it looks to me like that didn't really work out. 
If we always needed both maps, or even if we always needed the per-leaf map, it would have been a good idea, but it seems here that we can need either the per-leaf map or the per-subplan map or both or neither, and we want to avoid computing all of the per-leaf conversion maps if we only need per-subplan access. I think one way to fix this might be to build the per-leaf maps on demand. Just because we're doing UPDATE tuple routing doesn't necessarily mean we'll actually need a TupleConversionMap for every child. So we could allocate an array with one byte per leaf, where 0 means we don't know whether tuple conversion is necessary, 1 means it is not, and 2 means it is, or something like that. Then we have a second array with conversion maps. We provide a function tupconv_map_for_leaf() or similar that checks the array; if it finds 1, it returns NULL; if it finds 2, it returns the conversion map previously calculated. If it finds 0, it calls convert_tuples_by_name, caches the result for later, updates the one-byte-per-leaf array with the appropriate value, and returns the just-computed conversion map. (The reason I'm suggesting 0/1/2 instead of just true/false is to reduce cache misses; if we find a 1 in the first array we don't need to access the second array at all.) If that doesn't seem like a good idea for some reason, then my second choice would be to leave mt_transition_tupconv_maps named the way it is currently and have a separate mt_update_tupconv_maps, with the two pointing, if both are initialized and as far as possible, to the same TupleConversionMap objects. > ------------------- > 4. Explicit signaling for "we are only here for transition tables" > ------------------- > > I had given a thought on this earlier. I felt, even the pre-existing > conditions like "!trigdesc->trig_update_after_row" are all indirect > ways to determine that this function is called only to capture > transition tables, and thought that it may have been better to have > separate parameter transition_table_only. I see your point. I guess it's not really this patch's job to solve this problem, although I think this is going to need some refactoring in the not-too-distant future. So I think the way you did it is probably OK. > Instead of adding another parameter to AfterTriggerSaveEvent(), I had > also considered another approach: Put the transition-tuples-capture > logic part of AfterTriggerSaveEvent() into a helper function > CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead > of calling ExecARUpdateTriggers(), call this function > CaptureTransitionTables(). I then dropped this idea and thought rather > to call ExecARUpdateTriggers() which neatly does the required checks > and other things like locking the old tuple via GetTupleForTrigger(). > So if we go by CaptureTransitionTables(), we would need to do what > ExecARUpdateTriggers() does before calling CaptureTransitionTables(). > This is doable. If you think this is worth doing so as to get rid of > the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that. Duplicating logic elsewhere to avoid this problem here doesn't seem like a good plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> The above patch is to be applied over the last remaining preparatory >> patch, now named (and attached) : >> 0001-Refactor-CheckConstraint-related-code.patch > > Committed that one, too. Some more comments on the main patch: I don't really like the fact that ExecCleanupTupleRouting() now takes a ModifyTableState as an argument, particularly because of the way that is using that argument. To figure out whether a ResultRelInfo was pre-existing or one it created, it checks whether the pointer address of the ResultRelInfo is >= mtstate->resultRelInfo and < mtstate->resultRelInfo + mtstate->mt_nplans. However, that means that ExecCleanupTupleRouting() ends up knowing about the memory allocation pattern used by ExecInitModifyTable(), which seems like a slightly dangerous amount of action at a distance. I think it would be better for the PartitionTupleRouting structure to explicitly indicate which ResultRelInfos should be closed, for example by storing a Bitmapset *input_partitions. (Here, by "input", I mean "provided from the mtstate rather than created by the PartitionTupleRouting structure; other naming suggestions welcome.) When ExecSetupPartitionTupleRouting latches onto a partition, it can do proute->input_partitions = bms_add_member(proute->input_partitons, i). In ExecCleanupTupleRouting, it can do if (bms_is_member(proute->input_partitions, i)) continue. We have a test, in the regression test suite for file_fdw, which generates the message "cannot route inserted tuples to a foreign table". I think we should have a similar test for the case where an UPDATE tries to move a tuple from a regular partition to a foreign table partition. I'm not sure if it should fail with the same error or a different one, but I think we should have a test that it fails cleanly and with a nice error message of some sort. The comment for get_partitioned_child_rels() claims that it sets is_partition_key_update, but it really sets *is_partition_key_update. And I think instead of "is a partition key" it should say "is used in the partition key either of the relation whose RTI is specified or of any child relation." I propose "used in" instead of "is" because there can be partition expressions, and the rest is to clarify that child partition keys matter. create_modifytable_path uses partColsUpdated rather than partKeyUpdated, which actually seems like better terminology. I propose partKeyUpdated -> partColsUpdated everywhere. Also, why use is_partition_key_update for basically the same thing in some other places? I propose changing that to partColsUpdated as well. The capitalization of the first comment hunk in execPartition.h is strange. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
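To make the suggested bookkeeping concrete, here is a minimal sketch; the input_partitions field is hypothetical, taken only from the description above, and note that bms_is_member() takes the member index first and the set second:

    /* In ExecSetupPartitionTupleRouting, when a ResultRelInfo is taken from the mtstate: */
    proute->input_partitions = bms_add_member(proute->input_partitions, i);

    /* In ExecCleanupTupleRouting, leave those ResultRelInfos for the ModifyTable node to close: */
    if (bms_is_member(i, proute->input_partitions))
        continue;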
On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote: > > 1. > > NEW.c = NEW.c + 1; -- Make even number odd, or vice versa > > This seems to be worded as if there'd only ever be one number. I think > it should be plural and read "Make even numbers odd, and vice versa" Done. > > 2. The following comment does not make a huge amount of sense. > > -- UPDATE with > -- partition key or non-partition columns, with different column ordering, > -- triggers. > > Should "or" be "on"? Does ", triggers" mean "with triggers"? Actually I was trying to summarize what kinds of scenarios are going to be tested. Now I think we don't have to give this summary. Rather, we should describe each of the scenarios individually. But I did want to use list partitions at least in a subset of update-partition-key scenarios. So I have removed this comment, and replaced it by : -- Some more update-partition-key test scenarios below. This time use list -- partitions. > > 3. The follow test tries to test a BEFORE DELETE trigger stopping a > DELETE on sub_part1, but going by the SELECT, there are no rows in > that table to stop being DELETEd. > > select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4; > tableoid | a | b | c > ------------+---+----+---- > list_part1 | 2 | 52 | 50 > list_part1 | 3 | 6 | 60 > sub_part2 | 1 | 2 | 10 > sub_part2 | 1 | 2 | 70 > (4 rows) > > drop trigger parted_mod_b ON sub_part1 ; > -- If BR DELETE trigger prevented DELETE from happening, we should also skip > -- the INSERT if that delete is part of UPDATE=>DELETE+INSERT. > create or replace function func_parted_mod_b() returns trigger as $$ > begin return NULL; end $$ language plpgsql; > create trigger trig_skip_delete before delete on sub_part1 > for each row execute procedure func_parted_mod_b(); > update list_parted set b = 1 where c = 70; > select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4; > tableoid | a | b | c > ------------+---+----+---- > list_part1 | 2 | 52 | 50 > list_part1 | 3 | 6 | 60 > sub_part1 | 1 | 1 | 70 > sub_part2 | 1 | 2 | 10 > (4 rows) > > You've added the BEFORE DELETE trigger to sub_part1, but you can see > the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so > the test is not working as you've commented. > > It's probably a good idea to RAISE NOTICE 'something useful here'; in > the trigger function to verify they're actually being called in the > test. Done. The trigger should have been for sub_part2, not sub_part1. Corrected that. Also, dropped the trigger and again tested the UPDATE. > > 4. I think the final drop function in the following should be before > the UPDATE FROM test. You've already done some cleanup for that test > by doing "drop trigger trig_skip_delete ON sub_part1 ;" > > drop trigger trig_skip_delete ON sub_part1 ; > -- UPDATE partition-key with FROM clause. If join produces multiple output > -- rows for the same row to be modified, we should tuple-route the row > only once. > -- There should not be any rows inserted. > create table non_parted (id int); > insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3); > update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1; > select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4; > tableoid | a | b | c > ------------+---+----+---- > list_part1 | 2 | 1 | 70 > list_part1 | 2 | 2 | 10 > list_part1 | 2 | 52 | 50 > list_part1 | 3 | 6 | 60 > (4 rows) > > drop table non_parted; > drop function func_parted_mod_b(); Done. 
Moved it to relevant place. > > Also, there's a space before the ; in the drop trigger above. Can that > be removed? Removed. > > 5. The following comment: > > -- update to a partition should check partition bound constraint for > the new tuple. > -- If partition key is updated, the row should be moved to the appropriate > -- partition. updatable views using partitions should enforce the check options > -- for the rows that have been moved. > > Can this be changed a bit? I think it's not accurate to say that an > update to a partition key causes the row to move. The row movement > only occurs when the new tuple does not match the partition bound and > another partition exists that does have a partition bound that matches > the tuple. How about: > > -- When a partitioned table receives an UPDATE to the partitioned key and the > -- new values no longer meet the partition's bound, the row must be moved to > -- the correct partition for the new partition key (if one exists). We must > -- also ensure that updatable views on partitioned tables properly enforce any > -- WITH CHECK OPTION that is defined. The situation with triggers in this case > -- also requires thorough testing as partition key updates causing row > -- movement convert UPDATEs into DELETE+INSERT. Done. > > 6. What does the following actually test? > > -- This tests partition-key UPDATE on a partitioned table that does > not have any child partitions > update part_b_10_b_20 set b = b - 6; > > There are no records in that partition, or anywhere in the hierarchy. > Are you just testing that there's no error? If so then the comment > should say so. Yes, I understand that there won't be any update scan plans. But, with the modifications done in ExecInitModifyTable(), I wanted to run that code with this scenario where there are no partitions, to make sure it does not behave weirdly or crash. Any suggestions for comments, given this perspective ? For now, I have made the comment this way: -- Check that partition-key UPDATE works sanely on a partitioned table that does not have any child partitions. > > 7. I think the following comment: > > -- As mentioned above, the partition creation is intentionally kept in > descending bound order. > > should instead say: > > -- Create some more partitions following the above pattern of descending bound > -- order, but let's make the situation a bit more complex by having the > -- attribute numbers of the columns vary from their parent partition. Done. > > 8. Just to make the tests a bit easier to follow, can you move the > following down to where you're first using it: > > create table mintab(c1 int); > insert into mintab values (120); > > and > > CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 > from mintab) WITH CHECK OPTION; Done. > > 9. It seems that the existing part of update.sql capitalises SQL > keywords, but you mostly don't. I understand we're not always > consistent, but can you keep it the same as the existing part of the > file? Done. > > 10. Stray space before trailing ':' > > -- fail (row movement happens only within the partition subtree) : Done, at other applicable places also. > > 11. Can the following become: > > -- succeeds, row movement , check option passes > > -- success, update with row movement, check option passes: > > Seems there's also quite a mix of comment formats in your tests. > > You're using either one of; ok, success, succeeds followed by > sometimes a comma, and sometimes a reason in parentheses. 
The existing > part of the file seems to use: > > -- fail, <reason>: > > and just > > -- <reason>: > > for non-failures. > > Would be great to stick to what's there. There were existing lines where "ok, " was used. So, now used this everywhere : ok, ... fail, ... > > 12. The following comment seems to indicate that you're installing > triggers on all leaf partitions, but that's not the case: > > -- Install BR triggers on child partition, so that transition tuple > conversion takes place. > > maybe you should write "on some child partitions"? Or did you mean to > define a trigger on them all? Trigger should be installed at least on the partitions onto which rows are moved. I have corrected the comment accordingly. Actually, to test transition tuple conversion with update-row-movement, it requires a statement level trigger that references transition tables. And trans_updatetrig already was dropped. So transition tuple conversion for rows being inserted did not get tested (I had manually tested though). So I have moved down the drop statement. > > 13. Stray space at the end of the case statement: > > update range_parted set c = (case when c = 96 then 110 else c + 1 end > ) where a = 'b' and b > 10 and c >= 96; Done. > > 14. Stray space in the USING clause: > > create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true); Done > > 15. we -> we're > -- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number. Changed it to "we are" > > 16. The comment probably should be before the "update range_parted", > not the "set session authorization": > -- This should fail with RLS violation error while moving row from > -- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number. > set session authorization regress_range_parted_user; > update range_parted set a = 'b', c = 151 where a = 'a' and c = 200; Moved "set session authorization" statement above the comment. > > 17. trigger -> the trigger function > > -- part_d_1_15, because trigger makes 'c' value an even number. > > likewise in: > > -- This should fail with RLS violation error because trigger makes 'c' value > -- an odd number. I have made changes to the comment to make it clearer. Finally, the statement contains phrase "trigger at the destination partition again makes it an even number". With this phrase, "trigger function at destination partition" looks odd. So I think "trigger at destination partition makes ..." looks ok. It is implied that it is the trigger function that is actually changing the value. > > 18. Why two RESET SESSION AUTHORIZATIONs? > > reset session authorization; > drop trigger trig_d_1_15 ON part_d_1_15; > drop function func_d_1_15(); > -- Policy expression contains SubPlan > reset session authorization; The second reset is actually in a different paragraph. The reason it's there is to ensure we have reset it regardless of the earlier cleanup. > > 19. The following should be cleaned up in the final test that its used > on rather than randomly after the next test after it: > > drop table mintab; Done. > > 20. Comment is not worded very well: > > -- UPDATE which does not modify partition key of partitions that are > chosen for update. > > Does "partitions that are chosen for update" mean "the UPDATE target"? Actually it means the partitions participating in the update subplans, i.e the unpruned ones. I have modified the comment as : -- Test update-partition-key, where the unpruned partitions do not have their -- partition keys updated. 
> > I'm also not quite sure what the test is testing. In the past I've > written tests that have a header comment as -- Ensure that <what the > test is testing>. Perhaps if you can't think of what you're ensuring > with the test, then the test might not be that worthwhile. I am just testing that the update behaves sanely in the particular scenario. BTW, it was a conscious decision that in this particular scenario, we still conclude internally that update-tuple-routing is needed, and do the tuple routing setup. > > 21. The following comment could be improved: > > -- Triggers can cause UPDATE row movement if it modified partition key. > > Might be better to write: > > -- Tests for BR UPDATE triggers changing the partition key. Done. I have also followed this suggestion of yours: > > 22. In copy.c CopyFrom() you have the following code: > > /* > * We might need to convert from the parent rowtype to the > * partition rowtype. > */ > map = proute->partition_tupconv_maps[leaf_part_index]; > if (map) > { > Relation partrel = resultRelInfo->ri_RelationDesc; > > tuple = do_convert_tuple(tuple, map); > > /* > * We must use the partition's tuple descriptor from this > * point on. Use a dedicated slot from this point on until > * we're finished dealing with the partition. > */ > slot = proute->partition_tuple_slot; > Assert(slot != NULL); > ExecSetSlotDescriptor(slot, RelationGetDescr(partrel)); > ExecStoreTuple(tuple, slot, InvalidBuffer, true); > } > > Should this use ConvertPartitionTupleSlot() instead? Attached v35 patch. Thanks.
Thanks for making those changes. On 11 January 2018 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Yes, I understand that there won't be any update scan plans. But, with > the modifications done in ExecInitModifyTable(), I wanted to run that > code with this scenario where there are no partitions, to make sure it > does not behave weirdly or crash. Any suggestions for comments, given > this perspective ? For now, I have made the comment this way: > > -- Check that partition-key UPDATE works sanely on a partitioned table > that does not have any child partitions. Sounds good. >> 18. Why two RESET SESSION AUTHORIZATIONs? >> >> reset session authorization; >> drop trigger trig_d_1_15 ON part_d_1_15; >> drop function func_d_1_15(); >> -- Policy expression contains SubPlan >> reset session authorization; > > The second reset is actually in a different paragraph. The reason it's > there is to ensure we have reset it regardless of the earlier cleanup. hmm, I was reviewing the .out file, which does not have the empty lines. Still seems a bit surplus. > Attached v35 patch. Thanks. Thanks. I'll try to look at it soon. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 11 January 2018 at 10:44, David Rowley <david.rowley@2ndquadrant.com> wrote: >>> 18. Why two RESET SESSION AUTHORIZATIONs? >>> >>> reset session authorization; >>> drop trigger trig_d_1_15 ON part_d_1_15; >>> drop function func_d_1_15(); >>> -- Policy expression contains SubPlan >>> reset session authorization; >> >> The second reset is actually in a different paragraph. The reason it's >> there is to ensure we have reset it regardless of the earlier cleanup. > > hmm, I was reviewing the .out file, which does not have the empty > lines. Still seems a bit surplus. I believe the output file does not have the blank lines present in the .sql file. I was referring to the paragraph in the *.sql* file. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 9 January 2018 at 23:07, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 4, 2018 at 1:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> ------------------ >> 1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert() >> ------------------ >> >>>> It seems like the ON CONFLICT stuff handled that by adding a >>>> second TransitionCaptureState pointer to ModifyTable, thus >>>> mt_transition_capture and mt_oc_transition_capture. By that >>>> precedent, we could add mt_utr_transition_capture or similar, and >>>> maybe that's the way to go. It seems a bit unsatisfying, but so does >>>> what you have now. >>> >>> In case of ON CONFLICT, if there are both INSERT and UPDATE statement >>> triggers referencing transition tables, both of the triggers need to >>> independently populate their own transition tables, and hence the need >>> for two separate transition states : mt_transition_capture and >>> mt_oc_transition_capture. But in case of update-tuple-routing, the >>> INSERT statement trigger won't come into picture. So the same >>> mt_transition_capture can serve the purpose of populating the >>> transition table with OLD and NEW rows. So I think it would be too >>> redundant, if not incorrect, to have a whole new transition state for >>> update tuple routing. >>> >>> I will see if it turns out better to have two tcs_maps in >>> TransitionCaptureState, one for update and one for insert. But this, >>> on first look, does not look good. >> >> Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and >> insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. > > That's not what I suggested. If you look at what I wrote, I floated > the idea of having two TransitionCaptureStates, not two separate maps > within the same TransitionCaptureState. In the first paragraph of my explanation, I was explaining why two Transition capture states does not look like a good idea to me : >>> In case of ON CONFLICT, if there are both INSERT and UPDATE statement >>> triggers referencing transition tables, both of the triggers need to >>> independently populate their own transition tables, and hence the need >>> for two separate transition states : mt_transition_capture and >>> mt_oc_transition_capture. But in case of update-tuple-routing, the >>> INSERT statement trigger won't come into picture. So the same >>> mt_transition_capture can serve the purpose of populating the >>> transition table with OLD and NEW rows. So I think it would be too >>> redundant, if not incorrect, to have a whole new transition state for >>> update tuple routing. And in the next para, I explained about the other alternative of having two separate maps as against transition states. > >> ------------------ >> 2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index. >> ------------------ >> ------------------ >> 3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps >> ------------------ >> >> We need to change it's name because now this map is not only used for >> transition capture, but also for update-tuple-routing. Does it look ok >> for you if, for readability, we keep the childparent tag ? Or else, we >> can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps" >> looks more informative. > > I see your point: the array is being renamed because it now has more > than one purpose. But that's also what I'm complaining about with > regard to point #2: the same array is being used for more than one > purpose. That's generally bad style. 
If you have two loops in a > function, it's best to declare two separate loop variables rather than > reusing the same variable. This lets the compiler detect, for > example, an error where the second loop variable is used before it's > initialized, which would be undetectable if you reused the same > variable in both places. Although that particular benefit doesn't > pertain in this case, I maintain that having a single structure member > that is indexed one of two different ways is a bad idea. > > If I understand correctly, the way we got here is that, in earlier > patch versions, you had two arrays of maps, but it wasn't clear why we > needed both of them, and David suggested replacing one of them with an > array of indexes instead, in the hopes of reducing confusion. Slight correction; it was suggested by Amit Langote; not by David. > However, it looks to me like that didn't really work out. If we > always needed both maps, or even if we always needed the per-leaf map, > it would have been a good idea, but it seems here that we can need > either the per-leaf map or the per-subplan map or both or neither, and > we want to avoid computing all of the per-leaf conversion maps if we > only need per-subplan access. I was ok with either mine or Amit Langote's approach. His approach uses array of offsets to leaf-partition array, which sounded to me like it may be re-usable for some similar purpose later. > > I think one way to fix this might be to build the per-leaf maps on > demand. Just because we're doing UPDATE tuple routing doesn't > necessarily mean we'll actually need a TupleConversionMap for every > child. So we could allocate an array with one byte per leaf, where 0 > means we don't know whether tuple conversion is necessary, 1 means it > is not, and 2 means it is, or something like that. Then we have a > second array with conversion maps. We provide a function > tupconv_map_for_leaf() or similar that checks the array; if it finds > 1, it returns NULL; if it finds 2, it returns the conversion map > previously calculated. If it finds 0, it calls convert_tuples_by_name, > caches the result for later, updates the one-byte-per-leaf array with > the appropriate value, and returns the just-computed conversion map. > (The reason I'm suggesting 0/1/2 instead of just true/false is to > reduce cache misses; if we find a 1 in the first array we don't need > to access the second array at all.) > > If that doesn't seem like a good idea for some reason, then my second > choice would be to leave mt_transition_tupconv_maps named the way it > is currently and have a separate mt_update_tupconv_maps, with the two > pointing, if both are initialized and as far as possible, to the same > TupleConversionMap objects. So there are two independent optimizations we are talking about : 1. Create the map only when needed. We may not require a map for a leaf partition if there is no insert happening to that partition. And, the insert may be part of update-tuple-routing or a plain INSERT tuple-routing. Also, we may not require map for *every* subplan. It may happen that many of the update subplans do not return any tuples, in which case we don't require the maps for the partitions corresponding to those subplans. This optimization was also suggested by Thomas Munro initially. 2. In case of UPDATE, for partitions that take part in update scans, there should be a single map; there should not be two separate maps, one for accessing per-subplan and the other for accessing per-leaf. 
My approach for this was to have a per-leaf array and a per-subplan array, but they should share the maps wherever possible. I think this is what you are suggesting in your second choice. The other approach is as suggested by Amit Langote (which is present in the latest versions of the patch), where we have an array of maps, and a subplan-offsets array. So your preference is for #1. But I think this optimization is not specific for update-tuple-routing. This was applicable for inserts also, from the beginning. And we can do this on-demand stuff for subplan maps also. Both optimizations are good, and they are independently required. But I think optimization#2 is purely relevant to update-tuple-routing, so we should do it now. We can do optimization #1 as a general optimization, over and above optimization #2. So my opinion is, we do #1 not as part of update-tuple-routing patch. For optimization#2 (i.e. your second choice), I can revert back to the way I had earlier used two different arrays, with per-leaf array re-using the per-subplan maps. Let me know if you are ok with this plan. Then later once we do optimization #1, the maps will not be just shared between per-subplan and per-leaf arrays, they will also be created only when required. Regarding the array names ... Regardless of any approach, we are going to require two array maps, one is per-subplan, and the other per-leaf. Now, for transition capture, we would require both of these maps: per-subplan for capturing updated rows, and per-leaf for routed rows. And during update-tuple-routing, for converting the tuple from source partition to root partition, we require only per-subplan map. So if we name the per-subplan map as mt_transition_tupconv_maps, it implies the per-leaf map is not used for transition capture, which is incorrect. Similar thing, if we name the per-leaf map as mt_transition_tupconv_maps. Update-tuple-routing uses only per-subplan map. So per-subplan map can be named mt_update_tupconv_maps. But again, how can we name the per-leaf map ? Noting all this, I feel we can go with names according to the structure of maps. Something like : mt_perleaf_tupconv_maps, and mt_persubplan_tupconv_maps. Other suggestions welcome. > >> ------------------- >> 4. Explicit signaling for "we are only here for transition tables" >> ------------------- >> >> I had given a thought on this earlier. I felt, even the pre-existing >> conditions like "!trigdesc->trig_update_after_row" are all indirect >> ways to determine that this function is called only to capture >> transition tables, and thought that it may have been better to have >> separate parameter transition_table_only. > > I see your point. I guess it's not really this patch's job to solve > this problem, although I think this is going to need some refactoring > in the not-too-distant future. So I think the way you did it is > probably OK. > >> Instead of adding another parameter to AfterTriggerSaveEvent(), I had >> also considered another approach: Put the transition-tuples-capture >> logic part of AfterTriggerSaveEvent() into a helper function >> CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead >> of calling ExecARUpdateTriggers(), call this function >> CaptureTransitionTables(). I then dropped this idea and thought rather >> to call ExecARUpdateTriggers() which neatly does the required checks >> and other things like locking the old tuple via GetTupleForTrigger(). 
>> So if we go by CaptureTransitionTables(), we would need to do what >> ExecARUpdateTriggers() does before calling CaptureTransitionTables(). >> This is doable. If you think this is worth doing so as to get rid of >> the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that. > > Duplicating logic elsewhere to avoid this problem here doesn't seem > like a good plan. Yeah, ok. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > In the first paragraph of my explanation, I was explaining why two > Transition capture states does not look like a good idea to me : Oh, sorry. I didn't read what you wrote carefully enough, I guess. I see your points. I think that there is probably a general need for some refactoring here. AfterTriggerSaveEvent() got significantly more complicated and harder to understand with the arrival of transition tables, and this patch is adding more complexity still. It's also adding complexity in other places to make ExecInsert() and ExecDelete() usable for the semi-internal DELETE/INSERT operations being produced when we split a partition key update into a DELETE and INSERT pair. It would be awfully nice to have some better way to separate out each of the different things we might or might not want to do depending on the situation: capture old tuple, capture new tuple, fire before triggers, fire after triggers, count processed rows, set command tag, perform actual heap operation, update indexes, etc. However, I don't have a specific idea how to do it better, so maybe we should just get this committed for now and perhaps, with more eyes on the code, someone will have a good idea. > Slight correction; it was suggested by Amit Langote; not by David. Oh, OK, sorry. > So there are two independent optimizations we are talking about : > > 1. Create the map only when needed. > 2. In case of UPDATE, for partitions that take part in update scans, > there should be a single map; there should not be two separate maps, > one for accessing per-subplan and the other for accessing per-leaf. These optimizations aren't completely independent. Optimization #2 can be implemented in several different ways. The way you've chosen to do it is to index the same array in two different ways depending on whether per-leaf indexing is not needed, which I think is unacceptable. Another approach, which I proposed upthread, is to always built the per-leaf mapping, but you pointed out that this could involve doing a lot of unnecessary work in the case where most leaves were pruned. However, if you also implement #1, then that problem goes away. In other words, depending on the design you choose for #2, you may or may not need to also implement optimization #1 to get good performance. To put that another way, I think Amit's idea of keeping a subplan-offsets array is a pretty good one. From your comments, you do too. But if we want to keep that, then we need a way to avoid the expense of populating it for leaves that got pruned, except when we are doing update row movement. Otherwise, I don't see much choice but to jettison the subplan-offsets array and just maintain two separate arrays of mappings. > Regarding the array names ... > > Noting all this, I feel we can go with names according to the > structure of maps. Something like : mt_perleaf_tupconv_maps, and > mt_persubplan_tupconv_maps. Other suggestions welcome. I'd probably do mt_per_leaf_tupconv_maps, since inserting an underscore between some but not all words seems strange. But OK otherwise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12 January 2018 at 01:18, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> In the first paragraph of my explanation, I was explaining why two >> Transition capture states does not look like a good idea to me : > > Oh, sorry. I didn't read what you wrote carefully enough, I guess. > > I see your points. I think that there is probably a general need for > some refactoring here. AfterTriggerSaveEvent() got significantly more > complicated and harder to understand with the arrival of transition > tables, and this patch is adding more complexity still. It's also > adding complexity in other places to make ExecInsert() and > ExecDelete() usable for the semi-internal DELETE/INSERT operations > being produced when we split a partition key update into a DELETE and > INSERT pair. It would be awfully nice to have some better way to > separate out each of the different things we might or might not want > to do depending on the situation: capture old tuple, capture new > tuple, fire before triggers, fire after triggers, count processed > rows, set command tag, perform actual heap operation, update indexes, > etc. However, I don't have a specific idea how to do it better, so > maybe we should just get this committed for now and perhaps, with more > eyes on the code, someone will have a good idea. > >> Slight correction; it was suggested by Amit Langote; not by David. > > Oh, OK, sorry. > >> So there are two independent optimizations we are talking about : >> >> 1. Create the map only when needed. >> 2. In case of UPDATE, for partitions that take part in update scans, >> there should be a single map; there should not be two separate maps, >> one for accessing per-subplan and the other for accessing per-leaf. > > These optimizations aren't completely independent. Optimization #2 > can be implemented in several different ways. The way you've chosen > to do it is to index the same array in two different ways depending on > whether per-leaf indexing is not needed, which I think is > unacceptable. Another approach, which I proposed upthread, is to > always built the per-leaf mapping, but you pointed out that this could > involve doing a lot of unnecessary work in the case where most leaves > were pruned. However, if you also implement #1, then that problem > goes away. In other words, depending on the design you choose for #2, > you may or may not need to also implement optimization #1 to get good > performance. > > To put that another way, I think Amit's idea of keeping a > subplan-offsets array is a pretty good one. From your comments, you > do too. But if we want to keep that, then we need a way to avoid the > expense of populating it for leaves that got pruned, except when we > are doing update row movement. Otherwise, I don't see much choice but > to jettison the subplan-offsets array and just maintain two separate > arrays of mappings. Ok. So giving more thought on our both's points, here's what I feel we can do ... With the two arrays mt_per_leaf_tupconv_maps and mt_per_subplan_tupconv_maps, we want the following things : 1. Create the map on-demand. 2. If possible, try to share the maps between the per-subplan and per-leaf arrays. 
For this, option 1 is : ------- Both the arrays elements are made of this structure : typedef struct TupleConversionMapInfo { uint8 map_required; /* 0 : Not known if map is required */ /* 1 : map is created/required */ /* 2 : map is not necessary */ TupleConversionMap *map; } TupleConversionMapInfo; Arrays look like this : TupleConversionMapInfo mt_per_subplan_tupconv_maps[]; TupleConversionMapInfo mt_per_leaf_tupconv_maps[]; When a per-subplan array is to be accessed at index i, a macro get_tupconv_map(mt_per_subplan_tupconv_maps, i, forleaf=false) will be called. This will create a new map if necessary, populate the array element fields, and it will also copy this info into a corresponding array element in the per-leaf array. To get to the per-leaf array element, we need a subplan-offsets array. Whereas, if the per-leaf array element is already populated, this info will be copied into the subplan element in the opposite direction. When a per-leaf array is to be accessed at index i, get_tupconv_map(mt_per_leaf_tupconv_maps, i, forleaf=true) will be called. Here, it will similarly update the per-leaf array element. But it will not try to access the corresponding per-subplan array because we don't have such offset array. This is how the macro will look like : #define get_tupconv_map(mapinfo, i, perleaf) ((mapinfo[i].map_required == 2) ? NULL : ((mapinfo[i].map_required == 1) ? mapinfo[i].map : create_new_map(mapinfo, i, perleaf))) where create_new_map() will take care of populating the array element on both the arrays, and then return the map if created, or NULL if not required. ------- Option 2 : Elements of both arrays are pointers to TupleConversionMapInfo structure. Arrays look like this : TupleConversionMapInfo *mt_per_subplan_tupconv_maps[]; TupleConversionMapInfo *mt_per_leaf_tupconv_maps[]; typedef struct TupleConversionMapInfo { uint8 map_required; /* 0 : map is not required, 1 : ... */ TupleConversionMap *map; } So in ExecInitModifyTable(), for each of the array elements of both arrays, we palloc TupleConversionMap structure, and wherever applicable, a common palloc'ed structure is shared between the two arrays. This way, subplan-offsets array is not required. In this case, the macro get_tupconv_map() similarly populates the structure, but it does not have to access the other map array, because the structures are already shared in the two arrays. The problem with this option is : since we have to share some of the structures allocated by the array elements, we have to build the two arrays together, but in the code the arrays are to be allocated when required at different points, like update_tuple_routing required and transition tables required. Also, beforehand we have to individually palloc memory for TupleConversionMapInfo for all the array elements, as against allocating memory in a single palloc of the whole array as in option 1. As of this writing, I am writing code relevant to adding the on-demand logic, and I anticipate option 1 would turn out better than option 2. But I would like to know if you are ok with both of these options. ------------ The reason why I am having map_required field inside a structure along with the map, as against a separate array, is so that we can do the on-demand allocation for both per-leaf array and per-subplan array. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > The reason why I am having map_required field inside a structure along > with the map, as against a separate array, is so that we can do the > on-demand allocation for both per-leaf array and per-subplan array. Putting the map_required field inside the structure with the map makes it completely silly to do the 0/1/2 thing, because the whole structure is going to be on the same cache line anyway. It won't save anything to access the flag instead of a pointer in the same struct. Also, the uint8 will be followed by 7 bytes of padding, because the pointer that follows will need to begin on an 8-byte boundary (at least, on 64-bit machines), so this will use more memory. What I suggest is: #define MT_CONVERSION_REQUIRED_UNKNOWN 0 #define MT_CONVERSION_REQUIRED_YES 1 #define MT_CONVERSION_REQUIRED_NO 2 In ModifyTableState: uint8 *mt_per_leaf_tupconv_required; TupleConversionMap **mt_per_leaf_tupconv_maps; In PartitionTupleRouting: int *subplan_partition_offsets; When you initialize the ModifyTableState, do this: mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) * numResultRelInfos); mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap *) * numResultRelInfos); When somebody needs a map, then (1) if they need it by subplan index, first use subplan_partition_offsets to convert it to a per-leaf index (2) then write a function that takes the per-leaf index and does this: switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index]) { case MT_CONVERSION_REQUIRED_UNKNOWN: map = convert_tuples_by_name(...); if (map == NULL) mtstate->mt_per_leaf_tupconv_required[leaf_part_index] = MT_CONVERSION_REQUIRED_NO; else { mtstate->mt_per_leaf_tupconv_required[leaf_part_index] = MT_CONVERSION_REQUIRED_YES; mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map; } return map; case MT_CONVERSION_REQUIRED_YES: return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index]; case MT_CONVERSION_REQUIRED_NO: return NULL; } -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12 January 2018 at 20:24, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> The reason why I am having map_required field inside a structure along >> with the map, as against a separate array, is so that we can do the >> on-demand allocation for both per-leaf array and per-subplan array. > > Putting the map_required field inside the structure with the map makes > it completely silly to do the 0/1/2 thing, because the whole structure > is going to be on the same cache line anyway. It won't save anything > to access the flag instead of a pointer in the same struct. I see. Got it. > Also, > the uint8 will be followed by 7 bytes of padding, because the pointer > that follows will need to begin on an 8-byte boundary (at least, on > 64-bit machines), so this will use more memory. > > What I suggest is: > > #define MT_CONVERSION_REQUIRED_UNKNOWN 0 > #define MT_CONVERSION_REQUIRED_YES 1 > #define MT_CONVERSION_REQUIRED_NO 2 > > In ModifyTableState: > > uint8 *mt_per_leaf_tupconv_required; > TupleConversionMap **mt_per_leaf_tupconv_maps; > > In PartitionTupleRouting: > > int *subplan_partition_offsets; > > When you initialize the ModifyTableState, do this: > > mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) * > numResultRelInfos); > mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap > *) * numResultRelInfos); > A few points below where I wanted to confirm that we are on the same page ... > When somebody needs a map, then > > (1) if they need it by subplan index, first use > subplan_partition_offsets to convert it to a per-leaf index Before that, we need to check if there *is* an offset array. If there are no partitions, there is only going to be a per-subplan array, there won't be an offsets array. But I guess, you are saying : "do the on-demand allocation only for leaf partitions; if there are no partitions, the per-subplan maps will always be allocated for each of the subplans from the beginning" . So if there is no offset array, just return mtstate->mt_per_subplan_tupconv_maps[subplan_index] without any further checks. > > (2) then write a function that takes the per-leaf index and does this: > > switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index]) > { > case MT_CONVERSION_REQUIRED_UNKNOWN: > map = convert_tuples_by_name(...); > if (map == NULL) > mtstate->mt_per_leaf_tupconv_required[leaf_part_index] = > MT_CONVERSION_REQUIRED_NO; > else > { > mtstate->mt_per_leaf_tupconv_required[leaf_part_index] = > MT_CONVERSION_REQUIRED_YES; > mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map; > } > return map; > case MT_CONVERSION_REQUIRED_YES: > return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index]; > case MT_CONVERSION_REQUIRED_NO: > return NULL; > } Yeah, right. But after that, I am not sure then why is mt_per_sub_plan_maps[] array needed ? We are always going to convert the subplan index into leaf index, so per-subplan map array will not come into picture. Or are you saying, it will be allocated and used only when there are no partitions ? From one of your earlier replies, you did mention about trying to share the maps between the two arrays, that means you were considering both arrays being used at the same time. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> (1) if they need it by subplan index, first use >> subplan_partition_offsets to convert it to a per-leaf index > > Before that, we need to check if there *is* an offset array. If there > are no partitions, there is only going to be a per-subplan array, > there won't be an offsets array. But I guess, you are saying : "do the > on-demand allocation only for leaf partitions; if there are no > partitions, the per-subplan maps will always be allocated for each of > the subplans from the beginning" . So if there is no offset array, > just return mtstate->mt_per_subplan_tupconv_maps[subplan_index] > without any further checks. Oops. I forgot that there might not be partitions. I was assuming that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd always use subplan_partition_offsets. But that won't work in the inheritance case. > But after that, I am not sure then why is mt_per_sub_plan_maps[] array > needed ? We are always going to convert the subplan index into leaf > index, so per-subplan map array will not come into picture. Or are you > saying, it will be allocated and used only when there are no > partitions ? From one of your earlier replies, you did mention about > trying to share the maps between the two arrays, that means you were > considering both arrays being used at the same time. We'd use them both at the same time if we didn't have, or didn't use, subplan_partition_offsets, but if we have subplan_partition_offsets and can use it then we don't need mt_per_sub_plan_maps. I guess I'm inclined to keep mt_per_sub_plan_maps for the case where there are no partitions, but not use it when partitions are present. What do you think about that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
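In code form, the arrangement described above might look roughly like this; the function and field names (including mt_partition_tuple_routing) are assumptions based on the discussion rather than code from the patch, and tupconv_map_for_leaf() is the on-demand per-leaf lookup sketched earlier in the thread:

static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int subplan_index)
{
    PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;

    /* With partitions present, go through the offsets array to the per-leaf map. */
    if (proute != NULL && proute->subplan_partition_offsets != NULL)
        return tupconv_map_for_leaf(mtstate,
                                    proute->subplan_partition_offsets[subplan_index]);

    /* Plain inheritance case: fall back to the per-subplan array. */
    return mtstate->mt_per_subplan_tupconv_maps[subplan_index];
}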
On 13 January 2018 at 02:56, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> (1) if they need it by subplan index, first use >>> subplan_partition_offsets to convert it to a per-leaf index >> >> Before that, we need to check if there *is* an offset array. If there >> are no partitions, there is only going to be a per-subplan array, >> there won't be an offsets array. But I guess, you are saying : "do the >> on-demand allocation only for leaf partitions; if there are no >> partitions, the per-subplan maps will always be allocated for each of >> the subplans from the beginning" . So if there is no offset array, >> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index] >> without any further checks. > > Oops. I forgot that there might not be partitions. I was assuming > that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd > always use subplan_partition_offsets. Both that won't work in the > inheritance case. > >> But after that, I am not sure then why is mt_per_sub_plan_maps[] array >> needed ? We are always going to convert the subplan index into leaf >> index, so per-subplan map array will not come into picture. Or are you >> saying, it will be allocated and used only when there are no >> partitions ? From one of your earlier replies, you did mention about >> trying to share the maps between the two arrays, that means you were >> considering both arrays being used at the same time. > > We'd use them both at the same time if we didn't have, or didn't use, > subplan_partition_offsets, but if we have subplan_partition_offsets > and can use it then we don't need mt_per_sub_plan_maps. > > I guess I'm inclined to keep mt_per_sub_plan_maps for the case where > there are no partitions, but not use it when partitions are present. > What do you think about that? Even where partitions are present, in the usual case where there are no transition tables we won't require per-leaf map at all [1]. So I think we should keep mt_per_sub_plan_maps only for the case where per-leaf map is not allocated. And we will not allocate mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words, exactly one of the two maps will be allocated. This is turning out to be close to what's already there in the last patch versions: use a single map array, and an offsets array. The difference is : in the patch I am using the *same* variable for the two maps. Where as, now we are talking about two different array variables for maps, but only allocating one of them. Are you ok with this ? I think the thing you were against was to have a common *variable* for two purposes. But above, I am saying we have two variables but assign a map array to only *one* of them and leave the other unused. --------- Regarding the on-demand map allocation .... Where mt_per_sub_plan_maps is allocated, we won't have the on-demand allocation: all the maps will be allocated initially. The reason is becaues the map_is_required array is only per-leaf. Or else, again, we need to keep another map_is_required array for per-subplan. May be we can support the on-demand stuff for subplan maps also, but only as a separate change after we are done with update-partition-key. --------- Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a bool array, and name it : mt_per_leaf_map_not_required. When it is true for a given index, it means, we have already called convert_tuples_by_name() and it returned NULL; i.e. 
it means we are sure that the map is not required. A false value means we need to call convert_tuples_by_name() if the map pointer is still NULL, and then set mt_per_leaf_map_not_required to (map == NULL). Instead of a bool array, we can even make it a Bitmapset. But I think access would become slower compared to an array, particularly because it is going to be a heavily used function. --------- [1] - For update-tuple-routing, only per-subplan access is required; - For transition tables, per-subplan access is required, and additionally per-leaf access is required when tuples are update-routed. - So if both update-tuple-routing and transition tables are required, both of the maps are needed. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
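A compact sketch of the bool-array variant described above; the names are placeholders following the mail rather than code from the patch, and the convert_tuples_by_name() arguments (leaf descriptor to root descriptor) are elided:

static TupleConversionMap *
tupconv_map_for_leaf(ModifyTableState *mtstate, int leaf_index)
{
    /* An earlier call already established that no conversion is needed. */
    if (mtstate->mt_per_leaf_map_not_required[leaf_index])
        return NULL;

    /* Build the map on first use; remember if it turned out to be unnecessary. */
    if (mtstate->mt_per_leaf_tupconv_maps[leaf_index] == NULL)
    {
        mtstate->mt_per_leaf_tupconv_maps[leaf_index] =
            convert_tuples_by_name(...);    /* leaf descriptor -> root descriptor */
        mtstate->mt_per_leaf_map_not_required[leaf_index] =
            (mtstate->mt_per_leaf_tupconv_maps[leaf_index] == NULL);
    }

    return mtstate->mt_per_leaf_tupconv_maps[leaf_index];
}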
On 10 January 2018 at 02:30, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> The above patch is to be applied over the last remaining preparatory >>> patch, now named (and attached) : >>> 0001-Refactor-CheckConstraint-related-code.patch >> >> Committed that one, too. > > Some more comments on the main patch: > > I don't really like the fact that ExecCleanupTupleRouting() now takes > a ModifyTableState as an argument, particularly because of the way > that is using that argument. To figure out whether a ResultRelInfo > was pre-existing or one it created, it checks whether the pointer > address of the ResultRelInfo is >= mtstate->resultRelInfo and < > mtstate->resultRelInfo + mtstate->mt_nplans. However, that means that > ExecCleanupTupleRouting() ends up knowing about the memory allocation > pattern used by ExecInitModifyTable(), which seems like a slightly > dangerous amount of action at a distance. I think it would be better > for the PartitionTupleRouting structure to explicitly indicate which > ResultRelInfos should be closed, for example by storing a Bitmapset > *input_partitions. (Here, by "input", I mean "provided from the > mtstate rather than created by the PartitionTupleRouting structure; > other naming suggestions welcome.) When > ExecSetupPartitionTupleRouting latches onto a partition, it can do > proute->input_partitions = bms_add_member(proute->input_partitons, i). > In ExecCleanupTupleRouting, it can do if > (bms_is_member(proute->input_partitions, i)) continue. Did the changes. But, instead of a new bitmapet, I used the offset array for the purpose. As per our parallel discussion on tup-conversion maps, it is almost finalized that the subplan-partition offset map is good to have. So I have used that offset array to determine whether a partition is present in the subplan. I used the assumption that subplan and partition array have their partitions in the same order. > > We have a test, in the regression test suite for file_fdw, which > generates the message "cannot route inserted tuples to a foreign > table". I think we should have a similar test for the case where an > UPDATE tries to move a tuple from a regular partition to a foreign > table partition. Added an UPDATE scenario in contrib/file_fdw/input/file_fdw.source. > I'm not sure if it should fail with the same error > or a different one, but I think we should have a test that it fails > cleanly and with a nice error message of some sort. The update-tuple-routing goes through the same ExecInsert() code, so it fails at the same place with the same error message. > > The comment for get_partitioned_child_rels() claims that it sets > is_partition_key_update, but it really sets *is_partition_key_update. > And I think instead of "is a partition key" it should say "is used in > the partition key either of the relation whose RTI is specified or of > any child relation." I propose "used in" instead of "is" because > there can be partition expressions, and the rest is to clarify that > child partition keys matter. Fixed. > > create_modifytable_path uses partColsUpdated rather than > partKeyUpdated, which actually seems like better terminology. I > propose partKeyUpdated -> partColsUpdated everywhere. Also, why use > is_partition_key_update for basically the same thing in some other > places? I propose changing that to partColsUpdated as well. Done. 
> > The capitalization of the first comment hunk in execPartition.h is strange. I think you are referring to : * subplan_partition_offsets int Array ordered by UPDATE subplans. Each Changed Array to array. Didn't change UPDATE. Attached v36 patch.
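A rough illustration of how the offsets array mentioned above can replace the proposed Bitmapset in ExecCleanupTupleRouting(), relying on the stated assumption that subplans and leaf partitions appear in the same order; the variable and field names here are illustrative only:

int subplan_index = 0;

for (i = 0; i < proute->num_partitions; i++)
{
    /* Skip ResultRelInfos that came from the mtstate; they are closed elsewhere. */
    if (proute->subplan_partition_offsets != NULL &&
        subplan_index < mtstate->mt_nplans &&
        proute->subplan_partition_offsets[subplan_index] == i)
    {
        subplan_index++;
        continue;
    }

    /* ... close the ResultRelInfo that tuple routing itself created ... */
}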
On 14 January 2018 at 17:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 13 January 2018 at 02:56, Robert Haas <robertmhaas@gmail.com> wrote:
> > I guess I'm inclined to keep mt_per_sub_plan_maps for the case where there are no partitions, but not use it when partitions are present. What do you think about that?
>
> Even where partitions are present, in the usual case where there are no transition tables we won't require per-leaf map at all [1]. So I think we should keep mt_per_sub_plan_maps only for the case where per-leaf map is not allocated. And we will not allocate mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words, exactly one of the two maps will be allocated.
>
> This is turning out to be close to what's already there in the last patch versions: use a single map array, and an offsets array. The difference is : in the patch I am using the *same* variable for the two maps. Whereas, now we are talking about two different array variables for maps, but only allocating one of them.
>
> Are you ok with this ? I think the thing you were against was to have a common *variable* for two purposes. But above, I am saying we have two variables but assign a map array to only *one* of them and leave the other unused.
>
> ---------
>
> Regarding the on-demand map allocation ....
> Where mt_per_sub_plan_maps is allocated, we won't have the on-demand allocation: all the maps will be allocated initially. The reason is because the map_is_required array is only per-leaf. Or else, again, we need to keep another map_is_required array for per-subplan. Maybe we can support the on-demand stuff for subplan maps also, but only as a separate change after we are done with update-partition-key.
>
> ---------
>
> Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a bool array, and name it : mt_per_leaf_map_not_required. When it is true for a given index, it means we have already called convert_tuples_by_name() and it returned NULL; i.e. it means we are sure that the map is not required. A false value means we need to call convert_tuples_by_name() if it is NULL, and then set mt_per_leaf_map_not_required to (map == NULL).
>
> Instead of a bool array, we can even make it a Bitmapset. But I think access would become slower as compared to an array, particularly because it is going to be a heavily used function.

I went ahead and did the above changes. I haven't yet merged these changes in the main patch. Instead, I have attached it as an incremental patch to be applied on the main v36 patch. The incremental patch is not yet quite polished, and quite a bit of cosmetic changes might be required, plus testing. But am posting it in case I have some early feedback.

Details :

The per-subplan map array variable is kept in ModifyTableState :

- TupleConversionMap **mt_childparent_tupconv_maps;
- /* Per plan/partition map for tuple conversion from child to root */
- bool mt_is_tupconv_perpart; /* Is the above map per-partition ? */
+ TupleConversionMap **mt_per_subplan_tupconv_maps;
+ /* Per plan map for tuple conversion from child to root */
} ModifyTableState;

The per-leaf array variable and the not_required array are kept in PartitionTupleRouting :

- TupleConversionMap **partition_tupconv_maps;
+ TupleConversionMap **parent_child_tupconv_maps;
+ TupleConversionMap **child_parent_tupconv_maps;
+ bool *child_parent_tupconv_map_not_reqd;

As you can see above, all the arrays are per-partition, so I removed the per-leaf tag from these array names. Instead, I renamed the existing partition_tupconv_maps to parent_child_tupconv_maps, and named the new per-leaf array child_parent_tupconv_maps.

There are two separate functions, ExecSetupChildParentMapForLeaf() and ExecSetupChildParentMapForSubplan(), since most of their code is different. And because of this, we can re-use ExecSetupChildParentMapForLeaf() in both copy.c and nodeModifyTable.c. Even inserts/copy will benefit from the on-demand map allocation, because there is now a function TupConvMapForLeaf() that is called in both copy.c and ExecInsert(). This is the function that does the on-demand allocation.

Attached is the incremental patch conversion_map_changes.patch that has the above changes. It is to be applied over the latest main patch (update-partition-key_v36.patch).
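To make the on-demand allocation concrete, here is a rough C sketch of what TupConvMapForLeaf() could look like, using the field names from this mail. The exact signature and the surrounding structures are assumptions for illustration, not the patch text itself :

static TupleConversionMap *
TupConvMapForLeaf(PartitionTupleRouting *proute,
                  ResultRelInfo *rootRelInfo, int leaf_index)
{
    TupleConversionMap **map =
        &proute->child_parent_tupconv_maps[leaf_index];

    /* We already found out earlier that no conversion is needed. */
    if (proute->child_parent_tupconv_map_not_reqd[leaf_index])
        return NULL;

    /* Build the map only when this leaf is actually routed to. */
    if (*map == NULL)
    {
        Relation partrel = proute->partitions[leaf_index]->ri_RelationDesc;
        Relation rootrel = rootRelInfo->ri_RelationDesc;

        *map = convert_tuples_by_name(RelationGetDescr(partrel),
                                      RelationGetDescr(rootrel),
                                      gettext_noop("could not convert row type"));

        /* Remember when the attributes matched and no map is needed. */
        proute->child_parent_tupconv_map_not_reqd[leaf_index] =
            (*map == NULL);
    }

    return *map;
}

The same helper can then serve both the COPY and the INSERT/UPDATE routing paths, which is what allows copy.c and ExecInsert() to share the on-demand behaviour.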
On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Even where partitions are present, in the usual case where there are no transition tables we won't require per-leaf map at all [1]. So I think we should keep mt_per_sub_plan_maps only for the case where per-leaf map is not allocated. And we will not allocate mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words, exactly one of the two maps will be allocated.
>
> This is turning out to be close to what's already there in the last patch versions: use a single map array, and an offsets array. The difference is : in the patch I am using the *same* variable for the two maps. Whereas, now we are talking about two different array variables for maps, but only allocating one of them.
>
> Are you ok with this ? I think the thing you were against was to have a common *variable* for two purposes. But above, I am saying we have two variables but assign a map array to only *one* of them and leave the other unused.

Yes, I'm OK with that.

> Regarding the on-demand map allocation ....
> Where mt_per_sub_plan_maps is allocated, we won't have the on-demand allocation: all the maps will be allocated initially. The reason is because the map_is_required array is only per-leaf. Or else, again, we need to keep another map_is_required array for per-subplan. Maybe we can support the on-demand stuff for subplan maps also, but only as a separate change after we are done with update-partition-key.

Sure.

> Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a bool array, and name it : mt_per_leaf_map_not_required. When it is true for a given index, it means we have already called convert_tuples_by_name() and it returned NULL; i.e. it means we are sure that map is not required. A false value means we need to call convert_tuples_by_name() if it is NULL, and then set mt_per_leaf_map_not_required to (map == NULL).

OK.

> Instead of a bool array, we can even make it a Bitmapset. But I think access would become slower as compared to array, particularly because it is going to be a heavily used function.

It probably makes little difference -- the Bitmapset will be more compact (which saves time) but involve function calls (which cost time).

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 16 January 2018 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Even where partitions are present, in the usual case where there are >> Instead of a bool array, we can even make it a Bitmapset. But I think >> access would become slower as compared to array, particularly because >> it is going to be a heavily used function. > > It probably makes little difference -- the Bitmapset will be more > compact (which saves time) but involve function calls (which cost > time). I'm not arguing in either direction, but you'd also want to factor in how Bitmapsets only allocate words for the maximum stored member, which might mean multiple realloc() calls resulting in palloc/memcpy calls. The array would just be allocated in a single chunk, although it would be more memory and would require a memset too; however, that's likely much cheaper than the palloc() anyway. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 15 January 2018 at 16:11, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > I went ahead and did the above changes. I haven't yet merged these > changes in the main patch. Instead, I have attached it as an > incremental patch to be applied on the main v36 patch. The incremental > patch is not yet quite polished, and quite a bit of cosmetic changes > might be required, plus testing. But am posting it in case I have some > early feedback.

I have now embedded the above incremental patch changes into the main patch (v37), which is attached. Because it is used heavily in case of transition tables with partitions, I have made TupConvMapForLeaf() a macro, and the actual creation of the map is in a separate function CreateTupConvMapForLeaf(), so as to reduce the macro size. Retained child_parent_map_not_required as a bool array, as against a bitmap.

To include one scenario related to on-demand map allocation that was not getting covered with the update.sql test, I added one more scenario in that file :

+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.

-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 16 January 2018 at 09:17, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 16 January 2018 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> Even where partitions are present, in the usual case where there are >>> Instead of a bool array, we can even make it a Bitmapset. But I think >>> access would become slower as compared to array, particularly because >>> it is going to be a heavily used function. >> >> It probably makes little difference -- the Bitmapset will be more >> compact (which saves time) but involve function calls (which cost >> time). > > I'm not arguing in either direction, but you'd also want to factor in > how Bitmapsets only allocate words for the maximum stored member, > which might mean multiple realloc() calls resulting in palloc/memcpy > calls. The array would just be allocated in a single chunk, although > it would be more memory and would require a memset too, however, > that's likely much cheaper than the palloc() anyway.

Right, I agree. And with a Bitmapset there would also be a function call just to know whether the map is required or not. Overall, especially because the data structure will be used heavily whenever it is set up, I think it's better to make it an array. In the latest patch, I have retained it as an array.

-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
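As a rough, hypothetical side-by-side of the two representations being weighed above (variable names such as nparts and leaf_index are placeholders for context, not fields from the patch):

/* (a) bool array: one palloc0 up front, each check is a plain indexed load */
bool       *map_not_reqd = (bool *) palloc0(nparts * sizeof(bool));
bool        skip = map_not_reqd[leaf_index];

/* (b) Bitmapset: more compact, but every check is a function call, and
 * bms_add_member() may repalloc the set as larger member numbers are added */
Bitmapset  *map_not_reqd_set = NULL;

map_not_reqd_set = bms_add_member(map_not_reqd_set, leaf_index);
skip = bms_is_member(leaf_index, map_not_reqd_set);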
On 16 January 2018 at 16:09, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > I have now embedded the above incremental patch changes into the main > patch (v37), which is attached.

The patch had to be rebased over commit dca48d145e0e ("Remove useless lookup of root partitioned rel in ExecInitModifyTable()"). In ExecInitModifyTable(), the "rel" variable was needed only for INSERT, and node->partitioned_rels is only set in UPDATE/DELETE cases, so the extra logic of getting the root partitioned rel from node->partitioned_rels was removed as part of that commit. But now, for update tuple routing, we require rel for UPDATE also, so we need to get the root partitioned rel. Rather than opening the root table from node->partitioned_rels, we can re-use the already-opened mtstate->rootResultInfo, which is the same relation as the head of partitioned_rels. I have renamed getASTriggerResultRelInfo() to getTargetResultRelInfo(), and used it to get the root partitioned table. The rename made sense because it is now a function for more general use, rather than one specific to trigger-related functionality. Attached rebased patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
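Roughly, the reuse being described might look like this in ExecInitModifyTable(). This is only a sketch under the assumption that the variable names follow this mail; it is not the committed hunk:

/* Sketch: get the root partitioned table for UPDATE tuple routing from
 * the already-opened target result rel, rather than re-opening it from
 * node->partitioned_rels. */
ResultRelInfo *targetRelInfo = getTargetResultRelInfo(mtstate);
Relation       targetRel = targetRelInfo->ri_RelationDesc;

if (update_tuple_routing_needed)
    proute = ExecSetupPartitionTupleRouting(mtstate, targetRel);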
On Fri, Jan 19, 2018 at 4:37 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached rebased patch. Committed with a bunch of mostly-cosmetic revisions. I removed the macro you added, which has a multiple evaluation hazard, and just put that logic back into the function. I don't think it's likely to matter for performance, and this way is safer. I removed an inline keyword from another static function as well; better to let the compiler decide what to do. I rearranged a few things to shorten some long lines, too. Aside from that I think all of the changes I made were to comments and documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
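For anyone unfamiliar with the hazard being referred to, here is a generic, self-contained illustration; this is not the macro that was in the patch, just the classic pattern that motivates replacing a macro with a function:

#include <stdio.h>

/* Each argument appears, and is therefore evaluated, more than once in
 * the expansion -- the classic multiple evaluation hazard. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

int
main(void)
{
    int i = 7;
    int m = MAX(i++, 5);    /* expands to ((i++) > (5) ? (i++) : (5)) */

    /* A function would return 7 and leave i at 8; the macro returns 8
     * and leaves i at 9, because i++ ran twice. */
    printf("m = %d, i = %d\n", m, i);
    return 0;
}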
Robert Haas <robertmhaas@gmail.com> writes:
> Committed with a bunch of mostly-cosmetic revisions.

Buildfarm member skink has been unhappy since this patch went in. Running the regression tests under valgrind easily reproduces the failure. Now, I might be wrong about which of the patches committed on Friday caused the unhappiness, but the valgrind backtrace sure looks like it's to do with partition routing:

==00:00:05:49.683 17549== Invalid read of size 4
==00:00:05:49.683 17549==    at 0x62A8BA: ExecCleanupTupleRouting (execPartition.c:483)
==00:00:05:49.683 17549==    by 0x6483AA: ExecEndModifyTable (nodeModifyTable.c:2682)
==00:00:05:49.683 17549==    by 0x627139: standard_ExecutorEnd (execMain.c:1604)
==00:00:05:49.683 17549==    by 0x7780AF: ProcessQuery (pquery.c:206)
==00:00:05:49.683 17549==    by 0x7782E4: PortalRunMulti (pquery.c:1286)
==00:00:05:49.683 17549==    by 0x778AAF: PortalRun (pquery.c:799)
==00:00:05:49.683 17549==    by 0x774E4C: exec_simple_query (postgres.c:1120)
==00:00:05:49.683 17549==    by 0x776C17: PostgresMain (postgres.c:4143)
==00:00:05:49.683 17549==    by 0x6FA419: PostmasterMain (postmaster.c:4412)
==00:00:05:49.683 17549==    by 0x66E51F: main (main.c:228)
==00:00:05:49.683 17549== Address 0xe25e298 is 2,088 bytes inside a block of size 32,768 alloc'd
==00:00:05:49.683 17549==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:05:49.683 17549==    by 0x89EB15: AllocSetAlloc (aset.c:945)
==00:00:05:49.683 17549==    by 0x8A7577: palloc (mcxt.c:848)
==00:00:05:49.683 17549==    by 0x671969: new_list (list.c:68)
==00:00:05:49.683 17549==    by 0x672859: lappend_oid (list.c:169)
==00:00:05:49.683 17549==    by 0x55330E: find_inheritance_children (pg_inherits.c:144)
==00:00:05:49.683 17549==    by 0x553447: find_all_inheritors (pg_inherits.c:203)
==00:00:05:49.683 17549==    by 0x62AC76: ExecSetupPartitionTupleRouting (execPartition.c:68)
==00:00:05:49.683 17549==    by 0x64949D: ExecInitModifyTable (nodeModifyTable.c:2232)
==00:00:05:49.683 17549==    by 0x62BBE8: ExecInitNode (execProcnode.c:174)
==00:00:05:49.683 17549==    by 0x627B53: standard_ExecutorStart (execMain.c:1043)
==00:00:05:49.683 17549==    by 0x778046: ProcessQuery (pquery.c:156)

(This is my local result, but skink's log looks about the same.)

regards, tom lane
On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Committed with a bunch of mostly-cosmetic revisions. > > Buildfarm member skink has been unhappy since this patch went in. > Running the regression tests under valgrind easily reproduces the > failure. Now, I might be wrong about which of the patches committed > on Friday caused the unhappiness, but the valgrind backtrace sure > looks like it's to do with partition routing: Yeah, that must be the fault of this patch. We assign to proute->subplan_partition_offsets[update_rri_index] from update_rri_index = 0 .. num_update_rri, and there's an Assert() at the bottom of this function that checks this, so probably this is indexing off the end of the array. I bet the issue happens when we find all of the UPDATE result rels while there are still partitions left; then, subplan_index will be equal to the length of the proute->subplan_partition_offsets array and we'll be indexing just off the end. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 22 January 2018 at 02:40, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Committed with a bunch of mostly-cosmetic revisions. >> >> Buildfarm member skink has been unhappy since this patch went in. >> Running the regression tests under valgrind easily reproduces the >> failure. Now, I might be wrong about which of the patches committed >> on Friday caused the unhappiness, but the valgrind backtrace sure >> looks like it's to do with partition routing: > > Yeah, that must be the fault of this patch. We assign to > proute->subplan_partition_offsets[update_rri_index] from > update_rri_index = 0 .. num_update_rri, and there's an Assert() at the > bottom of this function that checks this, so probably this is indexing > off the end of the array. I bet the issue happens when we find all of > the UPDATE result rels while there are still partitions left; then, > subplan_index will be equal to the length of the > proute->subplan_partition_offsets array and we'll be indexing just off > the end.

Yes, right, that's what is happening. It is not happening on an Assert, though (there is no assert in that function). It is happening when we try to access the array here :

if (proute->subplan_partition_offsets &&
    proute->subplan_partition_offsets[subplan_index] == i)

Attached is a fix, where I have introduced another field, PartitionTupleRouting.num_subplan_partition_offsets, so that above, we can add another condition (subplan_index < proute->num_subplan_partition_offsets) in order to stop accessing the array once we are done with all the offset array elements.

Ran the update.sql test with valgrind enabled on my laptop, and the valgrind output now does not show errors.

-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
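In other words, the guarded access ends up looking roughly like this. The bounds check is the one the fix introduces; the surrounding loop body is paraphrased for illustration, not copied from the patch:

/* Skip partitions whose ResultRelInfo is owned by an UPDATE subplan,
 * but never read past the end of the offsets array. */
if (proute->subplan_partition_offsets &&
    subplan_index < proute->num_subplan_partition_offsets &&
    proute->subplan_partition_offsets[subplan_index] == i)
{
    subplan_index++;
    continue;            /* the owning subplan will close this rel */
}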
On Mon, Jan 22, 2018 at 2:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Yes, right, that's what is happening. It is not happening on an Assert > though (there is no assert in that function). It is happening when we > try to access the array here : > > if (proute->subplan_partition_offsets && > proute->subplan_partition_offsets[subplan_index] == i) > > Attached is a fix, where I have introduced another field > PartitionTupleRouting.num_ subplan_partition_offsets, so that above, > we can add another condition (subplan_index < > proute->num_subplan_partition_offsets) in order to stop accessing the > array once we are done with all the offset array elements. > > Ran the update.sql test with valgrind enabled on my laptop, and the > valgrind output now does not show errors. Tom, do you want to double-check that this fixes it for you? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Tom, do you want to double-check that this fixes it for you? I can confirm that a valgrind run succeeded for me with the patch in place. regards, tom lane
On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Tom, do you want to double-check that this fixes it for you? > > I can confirm that a valgrind run succeeded for me with the patch > in place. Committed. Sorry for the delay. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 25, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Tom, do you want to double-check that this fixes it for you? >> >> I can confirm that a valgrind run succeeded for me with the patch >> in place. > > Committed. Sorry for the delay. FYI I'm planning to look into adding a valgrind check to the commitfest CI thing I run so we can catch these earlier without committer involvement. It's super slow because of all those pesky regression tests so I'll probably need to improve the scheduling logic a bit to make it useful first (prioritising new patches or something, since otherwise it'll take up to multiple days to get around to valgrind-testing any given patch...). -- Thomas Munro http://www.enterprisedb.com