Thread: [HACKERS] UPDATE of partition key

[HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
Currently, an update of the partition key of a partition is not allowed,
since it requires moving the row(s) into the applicable partition.

Attached is a WIP patch (update-partition-key.patch) that removes this
restriction. When an UPDATE causes a row of a partition to violate its
partition constraint, a partition that can accommodate this row is searched
for in that subtree; if found, the row is deleted from the old partition and
inserted into the new partition. If not found, an error is reported.
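
For illustration, here is roughly the behaviour the patch enables; the
table, partitions, and values below are made up for this sketch and are not
taken from the patch or its tests:

CREATE TABLE meas (city_id int, logdate date) PARTITION BY RANGE (logdate);
CREATE TABLE meas_y2016 PARTITION OF meas
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
CREATE TABLE meas_y2017 PARTITION OF meas
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

INSERT INTO meas VALUES (1, '2016-06-01');   -- lands in meas_y2016

-- Today this fails with a partition constraint violation; with the patch,
-- the row is deleted from meas_y2016 and inserted into meas_y2017.
UPDATE meas SET logdate = '2017-06-01' WHERE city_id = 1;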

There are a few things that can be discussed:

1. We can run an UPDATE using a child partition at any level in a
nested partition tree. In such a case, we should move the row only
within that child's subtree.

For example, in a tree such as:
tab ->
   t1 ->
      t1_1
      t1_2
   t2 ->
      t2_1
      t2_2

For "UPDATE t2 set col1 = 'AAA' " , if the modified tuple does not fit
in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
the UPDATE is fired using t2.
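
A concrete sketch of this tree (only the table names come from the diagram
above; the column types, list values, and range bounds are invented):

CREATE TABLE tab (col1 text, col2 int) PARTITION BY LIST (col1);

CREATE TABLE t1 PARTITION OF tab FOR VALUES IN ('AAA', 'BBB')
    PARTITION BY RANGE (col2);
CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (1) TO (100);
CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200);

CREATE TABLE t2 PARTITION OF tab FOR VALUES IN ('CCC', 'DDD')
    PARTITION BY RANGE (col2);
CREATE TABLE t2_1 PARTITION OF t2 FOR VALUES FROM (1) TO (100);
CREATE TABLE t2_2 PARTITION OF t2 FOR VALUES FROM (100) TO (200);

-- 'AAA' routes only under t1, so this UPDATE, fired on t2, reports an error
-- rather than silently moving rows out of the t2 subtree:
UPDATE t2 SET col1 = 'AAA';

-- Fired on tab instead, the same rows could legitimately move into t1:
UPDATE tab SET col1 = 'AAA' WHERE col1 = 'CCC';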

2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done so that the ROW triggers on that
(sub)partition get executed. If a user has explicitly created DELETE and
INSERT BR triggers for this partition, I think we should run those. At the
same time, another question is: what about the UPDATE trigger on the same
table? Here again, one can argue that because this UPDATE has been
transformed into a DELETE-INSERT, we should not run the UPDATE trigger for
row movement. But there can be a counter-argument. For example, if a user
needs to make sure that updates of particular columns of a row are logged,
he will expect the logging to happen even when that row was transparently
moved. In the patch, I have retained the firing of the UPDATE BR trigger.
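
To make the trigger question concrete, here is a rough sketch reusing the
hypothetical meas table from the example near the top of this mail (the
trigger names and function body are invented). With the behaviour described
above, a row-moving UPDATE fires the BR UPDATE and BR DELETE triggers on the
source partition and the BR INSERT trigger on the destination partition:

CREATE FUNCTION log_change() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER br_upd BEFORE UPDATE ON meas_y2016
    FOR EACH ROW EXECUTE PROCEDURE log_change();
CREATE TRIGGER br_del BEFORE DELETE ON meas_y2016
    FOR EACH ROW EXECUTE PROCEDURE log_change();
CREATE TRIGGER br_ins BEFORE INSERT ON meas_y2017
    FOR EACH ROW EXECUTE PROCEDURE log_change();

INSERT INTO meas VALUES (2, '2016-07-01');

-- Expected to fire br_upd and br_del on meas_y2016, then br_ins on
-- meas_y2017:
UPDATE meas SET logdate = '2017-07-01' WHERE city_id = 2;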

3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now session B decides to update this row, and
that update is going to cause row movement, which means it will delete
the row first. But when session A has finished deleting it, session B
finds that it is already deleted. In such a case, it should not go ahead
with inserting a new row as part of the row movement. For that, I have
added a new parameter 'already_delete' for ExecDelete().

Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of session B gets lost. Maybe, for a
user that cannot tolerate this, we can have a table-level option
that disallows row movement, or that causes an error to be thrown for
one of the concurrent sessions.
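
A two-session sketch of the sequence described above, again using the
hypothetical meas table from earlier (the interleaving is shown as comments;
this is not a single runnable script):

-- Session A
BEGIN;
DELETE FROM meas WHERE city_id = 1;   -- row locked, delete not yet committed

-- Session B: this partition-key update needs row movement, so it starts
-- with a delete of the same row and blocks behind session A's lock.
UPDATE meas SET logdate = '2017-06-01' WHERE city_id = 1;

-- Session A
COMMIT;

-- Session B now finds the row already deleted; with the handling described
-- above it skips the insert half of the row movement and reports UPDATE 0.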

4. ExecSetupPartitionTupleRouting() is re-used for routing the row
that is to be moved. So in ExecInitModifyTable(), we call
ExecSetupPartitionTupleRouting() even for UPDATE. We could instead do this
only at execution time, the very first time we find that we need to do a
row movement. I will think that over, but I suspect it might complicate
things compared to always doing the setup for UPDATE. Will check on that.


5. Regarding performance testing, I have compared the results of
row movement with declarative partitioning versus row movement with an
inheritance tree using triggers.  Below are the details:

Schema :

CREATE TABLE ptab (a date, b int, c int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 101)
PARTITION BY range (c);

        CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
        for values from (1) to (51);
        CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
        for values from (51) to (101);
.....
.....
        CREATE TABLE ptab_1_1_n PARTITION OF ptab_1_1
        for values from (n) to (n+m);

......
......

CREATE TABLE ptab_1_2 PARTITION OF ptab
for values from ('1900-01-01', 101) to ('1900-01-01', 201)
PARTITION BY range (c);

        CREATE TABLE ptab_1_2_1 PARTITION OF ptab_1_2
        for values from (1) to (51);
        CREATE TABLE ptab_1_2_2 PARTITION OF ptab_1_2
        for values from (51) to (101);
.....
.....
        CREATE TABLE ptab_1_2_n PARTITION OF ptab_1_2
        for values from (n) to (n+m);
.....
.....

Similarly for inheritance :

CREATE TABLE ptab_1_1
(constraint check_ptab_1_1 check (a = '1900-01-01' and b >= 1 and b <
8)) inherits (ptab);
create trigger brutrig_ptab_1_1 before update on ptab_1_1 for each row
execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_1
(constraint check_ptab_1_1_1 check (c >= 1 and c < 51))
inherits (ptab_1_1);
create trigger brutrig_ptab_1_1_1 before update on ptab_1_1_1 for each
row execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_2
(constraint check_ptab_1_1_2 check (c >= 51 and c < 101))
inherits (ptab_1_1);

create trigger brutrig_ptab_1_1_2 before update on ptab_1_1_2 for each
row execute procedure ptab_upd_trig();

I had to have a BR UPDATE trigger on each of the leaf tables.

Attached is the BR trigger function (update_trigger.sql). It generates
the destination table name assuming a fixed pattern of distribution of
data over the partitions. It first deletes the row and then inserts a
new one. I also tried skipping the deletion part, and that did not show
any significant change in the results.
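
The attached file is not reproduced in this thread; the following is only a
rough guess at the shape of such a trigger function (the routing arithmetic
and the exact naming pattern are invented for the sketch):

CREATE OR REPLACE FUNCTION ptab_upd_trig() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    target text;
BEGIN
    -- Invented arithmetic standing in for the "fixed pattern of
    -- distribution": derive the destination leaf table name from the
    -- new key values.
    target := format('ptab_%s_%s_%s',
                     extract(year from NEW.a)::int - 1899,
                     (NEW.b - 1) / 100 + 1,
                     (NEW.c - 1) / 50 + 1);

    IF target = TG_TABLE_NAME THEN
        RETURN NEW;            -- key still maps to this leaf; plain update
    END IF;

    -- Emulate row movement: delete the old version from this leaf, insert
    -- the new version into the computed leaf, and suppress the original
    -- UPDATE by returning NULL.
    EXECUTE format('DELETE FROM %I WHERE a = $1 AND b = $2 AND c = $3',
                   TG_TABLE_NAME)
        USING OLD.a, OLD.b, OLD.c;
    EXECUTE format('INSERT INTO %I VALUES ($1, $2, $3)', target)
        USING NEW.a, NEW.b, NEW.c;
    RETURN NULL;
END;
$$;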


parts    partitioned   inheritance   no. of rows   subpartitions
=====    ===========   ===========   ===========   =============

500       10 sec       3 min 02 sec   1,000,000     0
1000      10 sec       3 min 05 sec   1,000,000     0
1000     1 min 38sec   30min 50 sec  10,000,000     0
4000      28 sec       5 min 41 sec   1,000,000     10

parts : total number of partitions, including subpartitions if any.
partitioned : partitions created using declarative syntax.
inheritance : partitions created using inheritance, check constraints
and insert/update triggers.
subpartitions : number of subpartitions for each partition (in a 2-level tree)

Overall, the UPDATE on declarative partitions is 10-20 times faster than
on inheritance with triggers.

The UPDATE query moved all of the rows into another partition. It was
something like this:
update ptab set a = '1949-01-1' where a <= '1924-01-01'

For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
with 10,000,000 rows, it took 1 min 32 sec.

In general, for both partitioned and inheritance tables, the time
taken rose linearly with the number of rows.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Mon, Feb 13, 2017 at 7:01 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> parts    partitioned   inheritance   no. of rows   subpartitions
> =====    ===========   ===========   ===========   =============
>
> 500       10 sec       3 min 02 sec   1,000,000     0
> 1000      10 sec       3 min 05 sec   1,000,000     0
> 1000     1 min 38sec   30min 50 sec  10,000,000     0
> 4000      28 sec       5 min 41 sec   1,000,000     10

That's a big speedup.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
David Fetter
Date:
On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
> Currently, an update of a partition key of a partition is not
> allowed, since it requires to move the row(s) into the applicable
> partition.
> 
> Attached is a WIP patch (update-partition-key.patch) that removes
> this restriction. When an UPDATE causes the row of a partition to
> violate its partition constraint, then a partition is searched in
> that subtree that can accommodate this row, and if found, the row is
> deleted from the old partition and inserted in the new partition. If
> not found, an error is reported.

This is great!

Would it be really invasive to HINT something when the subtree is a
proper subtree?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote:
> On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
>> Currently, an update of a partition key of a partition is not
>> allowed, since it requires to move the row(s) into the applicable
>> partition.
>>
>> Attached is a WIP patch (update-partition-key.patch) that removes
>> this restriction. When an UPDATE causes the row of a partition to
>> violate its partition constraint, then a partition is searched in
>> that subtree that can accommodate this row, and if found, the row is
>> deleted from the old partition and inserted in the new partition. If
>> not found, an error is reported.
>
> This is great!
>
> Would it be really invasive to HINT something when the subtree is a
> proper subtree?

I am not quite sure I understood this question. Can you please explain
it a bit more ...



Re: [HACKERS] UPDATE of partition key

From
David Fetter
Date:
On Wed, Feb 15, 2017 at 01:06:32PM +0530, Amit Khandekar wrote:
> On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote:
> > On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
> >> Currently, an update of a partition key of a partition is not
> >> allowed, since it requires to move the row(s) into the applicable
> >> partition.
> >>
> >> Attached is a WIP patch (update-partition-key.patch) that removes
> >> this restriction. When an UPDATE causes the row of a partition to
> >> violate its partition constraint, then a partition is searched in
> >> that subtree that can accommodate this row, and if found, the row
> >> is deleted from the old partition and inserted in the new
> >> partition. If not found, an error is reported.
> >
> > This is great!
> >
> > Would it be really invasive to HINT something when the subtree is
> > a proper subtree?
> 
> I am not quite sure I understood this question. Can you please
> explain it a bit more ...

Sorry.  When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it.  Violating a uniqueness
constraint would be one example.

When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.

Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM.  If you do something
like
   UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...

it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.
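
For concreteness, the schema being described might look like this (names
and bounds here are illustrative only, not from the thread):

CREATE TABLE log (stamp timestamptz, msg text) PARTITION BY RANGE (stamp);

CREATE TABLE log_2016 PARTITION OF log
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
    PARTITION BY RANGE (stamp);
CREATE TABLE log_201611 PARTITION OF log_2016
    FOR VALUES FROM ('2016-11-01') TO ('2016-12-01');

CREATE TABLE log_2017 PARTITION OF log
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
    PARTITION BY RANGE (stamp);
CREATE TABLE log_201701 PARTITION OF log_2017
    FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');

-- Fired on log_2017, the new stamp falls outside that subtree, so the UPDATE
-- fails even with the patch; fired on log, the row could be routed into
-- log_201611.  A HINT could suggest retrying on the parent (or root) table.
UPDATE log_2017 SET stamp = '2016-11-08 23:03:00' WHERE msg = 'x';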

Does that make sense, and if so, is it super invasive to HINT that?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
> When an UPDATE can't happen, there are often ways to hint at
> what went wrong and how to correct it.  Violating a uniqueness
> constraint would be one example.
>
> When an UPDATE can't happen and the depth of the subtree is a
> plausible candidate for what prevents it, there might be a way to say
> so.
>
> Let's imagine a table called log with partitions on "stamp" log_YYYY
> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
> like
>
>     UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>
> it's possible to know that it might have worked had the UPDATE taken
> place on log rather than on log_2017.
>
> Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with
the help of pg_partitioned_table, and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/02/16 15:50, Amit Khandekar wrote:
> On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
>> When an UPDATE can't happen, there are often ways to hint at
>> what went wrong and how to correct it.  Violating a uniqueness
>> constraint would be one example.
>>
>> When an UPDATE can't happen and the depth of the subtree is a
>> plausible candidate for what prevents it, there might be a way to say
>> so.
>>
>> Let's imagine a table called log with partitions on "stamp" log_YYYY
>> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
>> like
>>
>>     UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>>
>> it's possible to know that it might have worked had the UPDATE taken
>> place on log rather than on log_2017.
>>
>> Does that make sense, and if so, is it super invasive to HINT that?
> 
> Yeah, I think it should be possible to find the root partition with

I assume you mean root *partitioned* table.

> the help of pg_partitioned_table,

The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table.  To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent().  By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.
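
For what it's worth, the walk such a function would do can be sketched at
the SQL level with a recursive query over pg_inherits (the partition name
below is hypothetical):

WITH RECURSIVE ancestry AS (
    SELECT inhrelid, inhparent
    FROM pg_inherits
    WHERE inhrelid = 'log_201701'::regclass   -- some leaf partition
  UNION ALL
    SELECT i.inhrelid, i.inhparent
    FROM pg_inherits i
    JOIN ancestry a ON i.inhrelid = a.inhparent
)
SELECT inhparent::regclass AS root_parent
FROM ancestry
WHERE inhparent NOT IN (SELECT inhrelid FROM pg_inherits);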

> and then run ExecFindPartition()
> again using the root. Will check. I am not sure right now how involved
> that would turn out to be, but I think that logic would not change the
> existing code, so in that sense it is not invasive.

I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify?  ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work.  I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 February 2017 at 12:57, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/02/16 15:50, Amit Khandekar wrote:
>> On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
>>> When an UPDATE can't happen, there are often ways to hint at
>>> what went wrong and how to correct it.  Violating a uniqueness
>>> constraint would be one example.
>>>
>>> When an UPDATE can't happen and the depth of the subtree is a
>>> plausible candidate for what prevents it, there might be a way to say
>>> so.
>>>
>>> Let's imagine a table called log with partitions on "stamp" log_YYYY
>>> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
>>> like
>>>
>>>     UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>>>
>>> it's possible to know that it might have worked had the UPDATE taken
>>> place on log rather than on log_2017.
>>>
>>> Does that make sense, and if so, is it super invasive to HINT that?
>>
>> Yeah, I think it should be possible to find the root partition with
>
> I assume you mean root *partitioned* table.
>
>> the help of pg_partitioned_table,
>
> The pg_partitioned_table catalog does not store parent-child
> relationships, just information about the partition key of a table.  To
> get the root partitioned table, you might want to create a recursive
> version of get_partition_parent(), maybe called
> get_partition_root_parent().  By the way, get_partition_parent() scans
> pg_inherits to find the inheritance parent.

Yeah. But we also want to make sure that it's part of a declarative
partition tree, and not just an inheritance tree. I am not sure
whether it is currently possible to have a mix of these two. Maybe it
is easy to prevent that from happening.

>
>> and then run ExecFindPartition()
>> again using the root. Will check. I am not sure right now how involved
>> that would turn out to be, but I think that logic would not change the
>> existing code, so in that sense it is not invasive.
>
> I couldn't understand why run ExecFindPartition() again on the root
> partitioned table, can you clarify?  ISTM, we just want to tell the user
> in the HINT that trying the same update query with root partitioned table
> might work. I'm not sure if it would work instead to find some
> intermediate partitioned table (that is, between the root and the one that
> update query was tried with) to include in the HINT.

What I had in mind was: give that hint only if there *was* a
subpartition that could accommodate that row. And if one is found, we
can include that subpartition's name in the hint.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/02/16 17:55, Amit Khandekar wrote:
> On 16 February 2017 at 12:57, Amit Langote wrote:
>> On 2017/02/16 15:50, Amit Khandekar wrote:
>>> On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
>>>> Does that make sense, and if so, is it super invasive to HINT that?
>>>
>>> Yeah, I think it should be possible to find the root partition with
>>
>> I assume you mean root *partitioned* table.
>>
>>> the help of pg_partitioned_table,
>>
>> The pg_partitioned_table catalog does not store parent-child
>> relationships, just information about the partition key of a table.  To
>> get the root partitioned table, you might want to create a recursive
>> version of get_partition_parent(), maybe called
>> get_partition_root_parent().  By the way, get_partition_parent() scans
>> pg_inherits to find the inheritance parent.
> 
> Yeah. But we also want to make sure that it's a part of declarative
> partition tree, and not just an inheritance tree ? I am not sure
> whether it is currently possible to have a mix of these two. May be it
> is easy to prevent that from happening.

It is not possible to mix declarative partitioning and regular
inheritance.  So, you cannot have a table in a declarative partitioning
tree that is not a (sub-) partition of the root table.

>>> and then run ExecFindPartition()
>>> again using the root. Will check. I am not sure right now how involved
>>> that would turn out to be, but I think that logic would not change the
>>> existing code, so in that sense it is not invasive.
>>
>> I couldn't understand why run ExecFindPartition() again on the root
>> partitioned table, can you clarify?  ISTM, we just want to tell the user
>> in the HINT that trying the same update query with root partitioned table
>> might work. I'm not sure if it would work instead to find some
>> intermediate partitioned table (that is, between the root and the one that
>> update query was tried with) to include in the HINT.
> 
> What I had in mind was : Give that hint only if there *was* a
> subpartition that could accommodate that row. And if found, we can
> only include the subpartition name.

Asking to try the update query with the root table sounds like a good
enough hint.  Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 February 2017 at 14:42, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/02/16 17:55, Amit Khandekar wrote:
>> On 16 February 2017 at 12:57, Amit Langote wrote:
>>> On 2017/02/16 15:50, Amit Khandekar wrote:
>>>> On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
>>>>> Does that make sense, and if so, is it super invasive to HINT that?
>>>>
>>>> Yeah, I think it should be possible to find the root partition with
>>>
>>> I assume you mean root *partitioned* table.
>>>
>>>> the help of pg_partitioned_table,
>>>
>>> The pg_partitioned_table catalog does not store parent-child
>>> relationships, just information about the partition key of a table.  To
>>> get the root partitioned table, you might want to create a recursive
>>> version of get_partition_parent(), maybe called
>>> get_partition_root_parent().  By the way, get_partition_parent() scans
>>> pg_inherits to find the inheritance parent.
>>
>> Yeah. But we also want to make sure that it's a part of declarative
>> partition tree, and not just an inheritance tree ? I am not sure
>> whether it is currently possible to have a mix of these two. May be it
>> is easy to prevent that from happening.
>
> It is not possible to mix declarative partitioning and regular
> inheritance.  So, you cannot have a table in a declarative partitioning
> tree that is not a (sub-) partition of the root table.

Ok, then that makes things easy.

>
>>>> and then run ExecFindPartition()
>>>> again using the root. Will check. I am not sure right now how involved
>>>> that would turn out to be, but I think that logic would not change the
>>>> existing code, so in that sense it is not invasive.
>>>
>>> I couldn't understand why run ExecFindPartition() again on the root
>>> partitioned table, can you clarify?  ISTM, we just want to tell the user
>>> in the HINT that trying the same update query with root partitioned table
>>> might work. I'm not sure if it would work instead to find some
>>> intermediate partitioned table (that is, between the root and the one that
>>> update query was tried with) to include in the HINT.
>>
>> What I had in mind was : Give that hint only if there *was* a
>> subpartition that could accommodate that row. And if found, we can
>> only include the subpartition name.
>
> Asking to try the update query with the root table sounds like a good
> enough hint.  Trying to find the exact sub-partition (I assume you mean to
> imply sub-tree here) seems like an overkill, IMHO.

Yeah ... I was thinking, anyway it's an error condition, so why not
let the server spend a bit more CPU and find the right sub-partition
for the message. If we decide to write code to find the root
partitioned table, then it's just a matter of one more
ExecFindPartition() call.

Also, I was thinking: give the hint *only* if we know there is a
right sub-partition. Otherwise, it might distract the user.

>
> Thanks,
> Amit
>
>



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Greg Stark
Date:
On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> There are a few things that can be discussed about :

If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain.  So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.

I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.

At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that in partitioned tables if you
perform updates that move tuples between partitions then you need to
ensure your updates are prepared for serialization failures.

I think this would require another bit in the tuple info mask
indicating that this is tuple is the last version before a broken ctid
chain -- i.e. that it was updated by moving it to another partition.
Maybe there's some combination of bits you could use though since this
is only needed in a particular situation.

Offhand I don't know what other behaviours are dependent on the ctid
chain. I think you need to go search the docs -- and probably the code
just to be sure -- for any references to ctid to ensure you catch
every impact of breaking the ctid chain.

-- 
greg



Re: [HACKERS] UPDATE of partition key

From
David Fetter
Date:
On Thu, Feb 16, 2017 at 03:39:30PM +0530, Amit Khandekar wrote:
> >>>> and then run ExecFindPartition()
> >>>> again using the root. Will check. I am not sure right now how involved
> >>>> that would turn out to be, but I think that logic would not change the
> >>>> existing code, so in that sense it is not invasive.
> >>>
> >>> I couldn't understand why run ExecFindPartition() again on the root
> >>> partitioned table, can you clarify?  ISTM, we just want to tell the user
> >>> in the HINT that trying the same update query with root partitioned table
> >>> might work. I'm not sure if it would work instead to find some
> >>> intermediate partitioned table (that is, between the root and the one that
> >>> update query was tried with) to include in the HINT.
> >>
> >> What I had in mind was : Give that hint only if there *was* a
> >> subpartition that could accommodate that row. And if found, we can
> >> only include the subpartition name.
> >
> > Asking to try the update query with the root table sounds like a good
> > enough hint.  Trying to find the exact sub-partition (I assume you mean to
> > imply sub-tree here) seems like an overkill, IMHO.
> Yeah ... I was thinking , anyways it's an error condition, so why not
> let the server spend a bit more CPU and get the right sub-partition
> for the message. If we decide to write code to find the root
> partition, then it's just a matter of another function
> ExecFindPartition().
> 
> Also, I was thinking : give the hint *only* if we know there is a
> right sub-partition. Otherwise, it might distract the user.

If this is relatively straight-forward, it'd be great.  More
actionable knowledge is better.

Thanks for taking this on.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:
> On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> There are a few things that can be discussed about :
>
> If you do a normal update the new tuple is linked to the old one using
> the ctid forming a chain of tuple versions. This tuple movement breaks
> that chain.  So the question I had reading this proposal is what
> behaviour depends on ctid and how is it affected by the ctid chain
> being broken.

I think this is a good question.

> I think the concurrent update case is just a symptom of this. If you
> try to update a row that's locked for a concurrent update you normally
> wait until the concurrent update finishes, then follow the ctid chain
> and recheck the where clause on the target of the link and if it still
> matches you perform the update there.

Right.  EvalPlanQual behavior, in short.

> At least you do that if you have isolation_level set to
> repeatable_read. If you have isolation level set to serializable then
> you just fail with a serialization failure. I think that's what you
> should do if you come across a row that's been updated with a broken
> ctid chain even in repeatable read mode. Just fail with a
> serialization failure and document that in partitioned tables if you
> perform updates that move tuples between partitions then you need to
> be ensure your updates are prepared for serialization failures.

Now, this part I'm not sure about.  What's pretty clear is that,
barring some redesign of the heap format, we can't keep the CTID chain
intact when the tuple moves to a different relfilenode.  What's less
clear is what to do about that.  We can either (1) give up on
EvalPlanQual behavior in this case and act just as we would if the row
had been deleted; no update happens or (2) throw a serialization
error.  You're advocating for #2, but I'm not sure that's right,
because:

1. It's a lot more work,

2. Your proposed implementation needs an on-disk format change that
uses up a scarce infomask bit, and

3. It's not obvious to me that it's clearly preferable from a user
experience standpoint.  I mean, either way the user doesn't get the
behavior that they want.  Either they're hoping for EPQ semantics and
they instead do a no-op update, or they're hoping for EPQ semantics
and they instead get an ERROR.  Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

More opinions welcome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 February 2017 at 20:53, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:
>> On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> There are a few things that can be discussed about :
>>
>> If you do a normal update the new tuple is linked to the old one using
>> the ctid forming a chain of tuple versions. This tuple movement breaks
>> that chain.  So the question I had reading this proposal is what
>> behaviour depends on ctid and how is it affected by the ctid chain
>> being broken.
>
> I think this is a good question.
>
>> I think the concurrent update case is just a symptom of this. If you
>> try to update a row that's locked for a concurrent update you normally
>> wait until the concurrent update finishes, then follow the ctid chain
>> and recheck the where clause on the target of the link and if it still
>> matches you perform the update there.
>
> Right.  EvalPlanQual behavior, in short.
>
>> At least you do that if you have isolation_level set to
>> repeatable_read. If you have isolation level set to serializable then
>> you just fail with a serialization failure. I think that's what you
>> should do if you come across a row that's been updated with a broken
>> ctid chain even in repeatable read mode. Just fail with a
>> serialization failure and document that in partitioned tables if you
>> perform updates that move tuples between partitions then you need to
>> be ensure your updates are prepared for serialization failures.
>
> Now, this part I'm not sure about.  What's pretty clear is that,
> barring some redesign of the heap format, we can't keep the CTID chain
> intact when the tuple moves to a different relfilenode.  What's less
> clear is what to do about that.  We can either (1) give up on
> EvalPlanQual behavior in this case and act just as we would if the row
> had been deleted; no update happens.

This is what the current patch has done.

> or (2) throw a serialization
> error.  You're advocating for #2, but I'm not sure that's right,
> because:
>
> 1. It's a lot more work,
>
> 2. Your proposed implementation needs an on-disk format change that
> uses up a scarce infomask bit, and
>
> 3. It's not obvious to me that it's clearly preferable from a user
> experience standpoint.  I mean, either way the user doesn't get the
> behavior that they want.  Either they're hoping for EPQ semantics and
> they instead do a no-op update, or they're hoping for EPQ semantics
> and they instead get an ERROR.  Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.
>
> More opinions welcome.

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.

Another option I was thinking of: when the UPDATE is on a partition key,
acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that SELECTs will continue to execute, but
UPDATE/DELETE will wait before opening the table for scan. An UPDATE
on the partition key is not going to be a very routine operation; it
sounds more like a DBA maintenance operation, so it does not look like
it would come in between usual transactions.



Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.

Yes we do:

https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Mon, Feb 20, 2017 at 3:36 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Generally speaking, we don't throw
>> serialization errors today at READ COMMITTED, so if we do so here,
>> that's going to be a noticeable and perhaps unwelcome change.
>
> Yes we do:
>
> https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

Oops -- please ignore, I misread that as repeatable read.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Hi Amit,

Thanks for working on this.

On 2017/02/13 21:01, Amit Khandekar wrote:
> Currently, an update of a partition key of a partition is not allowed,
> since it requires to move the row(s) into the applicable partition.
> 
> Attached is a WIP patch (update-partition-key.patch) that removes this
> restriction. When an UPDATE causes the row of a partition to violate
> its partition constraint, then a partition is searched in that subtree
> that can accommodate this row, and if found, the row is deleted from
> the old partition and inserted in the new partition. If not found, an
> error is reported.

That's clearly an improvement over what we have now.

> There are a few things that can be discussed about :
> 
> 1. We can run an UPDATE using a child partition at any level in a
> nested partition tree. In such case, we should move the row only
> within that child subtree.
> 
> For e.g. , in a tree such as :
> tab ->
>    t1 ->
>       t1_1
>       t1_2
>    t2 ->
>       t2_1
>       t2_2
> 
> For "UPDATE t2 set col1 = 'AAA' " , if the modified tuple does not fit
> in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
> the UPDATE is fired using t2.

Makes sense.  One should perform the update by specifying tab such that
the row moves from t2 to t1, before we could determine t1_1 as the target
for the new row.  Specifying t2 directly in that case is clearly the
"violates partition constraint" situation.  I wonder if that's enough a
hint for the user to try updating the parent (or better still, root
parent).  Or as we were discussing, should there be an actual HINT message
spelling that out for the user.

> 2. In the patch, as part of the row movement, ExecDelete() is called
> followed by ExecInsert(). This is done that way, because we want to
> have the ROW triggers on that (sub)partition executed. If a user has
> explicitly created DELETE and INSERT BR triggers for this partition, I
> think we should run those. While at the same time, another question
> is, what about UPDATE trigger on the same table ? Here again, one can
> argue that because this UPDATE has been transformed into a
> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
> there can be a counter-argument. For e.g. if a user needs to make sure
> about logging updates of particular columns of a row, he will expect
> the logging to happen even when that row was transparently moved. In
> the patch, I have retained the firing of UPDATE BR trigger.

What of UPDATE AR triggers?

As a comment on how row movement is being handled in the code, I wonder if
it could be made to look structurally similar to the code in ExecInsert()
that handles ON CONFLICT DO UPDATE.  That is,

if (partition constraint fails)
{
    /* row movement */
}
else
{
    /* ExecConstraints() */
    /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}

I see that ExecConstraints() won't get called on the source partition's
constraints if row movement occurs.  Maybe that's unnecessary because the
new row won't be inserted into that partition anyway.

ExecWithCheckOptions() for RLS update check does happen *before* row
movement though.

> 3. In case of a concurrent update/delete, suppose session A has locked
> the row for deleting it. Now a session B has decided to update this
> row and that is going to cause row movement, which means it will
> delete it first. But when session A is finished deleting it, session B
> finds that it is already deleted. In such case, it should not go ahead
> with inserting a new row as part of the row movement. For that, I have
> added a new parameter 'already_delete' for ExecDelete().

Makes sense.  Maybe: already_deleted -> concurrently_deleted.

> Of course, this still won't completely solve the concurrency anomaly.
> In the above case, the UPDATE of Session B gets lost. May be, for a
> user that does not tolerate this, we can have a table-level option
> that disallows row movement, or will cause an error to be thrown for
> one of the concurrent session.

Will this table-level option be specified for a partitioned table once or
for individual partitions?

> 4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
> that is to be moved. So in ExecInitModifyTable(), we call
> ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
> only during execution time for the very first time we find that we
> need to do a row movement. I will think over that, but I am thinking
> it might complicate things, as compared to always doing the setup for
> UPDATE. WIll check on that.

Hmm.  ExecSetupPartitionTupleRouting(), which does a significant amount of
setup work, is fine being called in ExecInitModifyTable() in the insert
case because there are often cases where that's a bulk-insert and hence
cost of the setup work is amortized.  Updates, OTOH, are seldom done in a
bulk manner.  So that might be an argument for doing it late only when
needed.  But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

I wonder if updates that require row movement will tend to be done
in a bulk manner (as a maintenance op), in which case the one-time tuple
routing setup seems fine.  Again, an enable_row_movement option specified
for the parent sounds like it would be nice to have.  Only do the setup if
it's turned on, which goes without saying.

> 5. Regarding performance testing, I have compared the results of
> row-movement with partition versus row-movement with inheritance tree
> using triggers.  Below are the details :
> 
> Schema :

[ ... ]

> parts    partitioned   inheritance   no. of rows   subpartitions
> =====    ===========   ===========   ===========   =============
> 
> 500       10 sec       3 min 02 sec   1,000,000     0
> 1000      10 sec       3 min 05 sec   1,000,000     0
> 1000     1 min 38sec   30min 50 sec  10,000,000     0
> 4000      28 sec       5 min 41 sec   1,000,000     10
> 
> part : total number of partitions including subparitions if any.
> partitioned : Partitions created using declarative syntax.
> inheritence : Partitions created using inheritence , check constraints
> and insert,update triggers.
> subpartitions : Number of subpartitions for each partition (in a 2-level tree)
> 
> Overall the UPDATE in partitions is faster by 10-20 times compared
> with inheritance with triggers.
>
> The UPDATE query moved all of the rows into another partition. It was
> something like this :
> update ptab set a = '1949-01-1' where a <= '1924-01-01'
> 
> For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
> with 10,000,000 rows, it took 1min 32sec.

Nice!

> In general, for both partitioned and inheritence tables, the time
> taken linearly rose with the number of rows.

Hopefully not also with the number of partitions though.

I will look more closely at the code soon.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I am inclined to at least have some option for the user to decide the
> behaviour. In the future we can even consider support for walking
> through the ctid chain across multiple relfilenodes. But till then, we
> need to decide what default behaviour to keep. My inclination is more
> towards erroring out in an unfortunate even where there is an UPDATE
> while the row-movement is happening. One option is to not get into
> finding whether the DELETE was part of partition row-movement or it
> was indeed a DELETE, and always error out the UPDATE when
> heap_update() returns HeapTupleUpdated, but only if the table is a
> leaf partition. But this obviously will cause annoyance because of
> chances of getting such errors when there are concurrent updates and
> deletes in the same partition. But we can keep a table-level option
> for determining whether to error out or silently lose the UPDATE.

I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users.  But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.

> Another option I was thinking : When the UPDATE is on a partition key,
> acquire ExclusiveLock (not AccessExclusiveLock) only on that
> partition, so that the selects will continue to execute, but
> UPDATE/DELETE will wait before opening the table for scan. The UPDATE
> on partition key is not going to be a very routine operation, it
> sounds more like a DBA maintenance operation; so it does not look like
> it would come in between usual transactions.

I think that's going to make users far more unhappy than breaking the
EPQ behavior ever would.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
"David G. Johnston"
Date:
On Friday, February 24, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> I am inclined to at least have some option for the user to decide the
>> behaviour. In the future we can even consider support for walking
>> through the ctid chain across multiple relfilenodes. But till then, we
>> need to decide what default behaviour to keep. My inclination is more
>> towards erroring out in an unfortunate even where there is an UPDATE
>> while the row-movement is happening. One option is to not get into
>> finding whether the DELETE was part of partition row-movement or it
>> was indeed a DELETE, and always error out the UPDATE when
>> heap_update() returns HeapTupleUpdated, but only if the table is a
>> leaf partition. But this obviously will cause annoyance because of
>> chances of getting such errors when there are concurrent updates and
>> deletes in the same partition. But we can keep a table-level option
>> for determining whether to error out or silently lose the UPDATE.
>
> I'm still a fan of the "do nothing and just document that this is a
> weirdness of partitioned tables" approach, because implementing
> something will be complicated, will ensure that this misses this
> release if not the next one, and may not be any better for users.  But
> probably we need to get some more opinions from other people, since I
> can imagine people being pretty unhappy if the consensus happens to be
> at odds with my own preferences.


For my own sanity - the move update would complete successfully and break every ctid chain that it touches.  Any update lined up behind it in the lock queue would discover their target record has been deleted and would experience whatever behavior their isolation level dictates for such a situation.  So multi-partition update queries will fail to update some records if they happen to move between partitions even if they would otherwise match the query's predicate.

Is there any difference in behavior between this and a SQL writeable CTE performing the same thing via delete-returning-insert?
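
For reference, the delete-returning-insert formulation being referred to,
sketched against a hypothetical partitioned table meas (city_id int,
logdate date), would be something like:

WITH moved AS (
    DELETE FROM meas
    WHERE logdate < '2017-01-01'
    RETURNING city_id
)
INSERT INTO meas
SELECT city_id, DATE '2017-06-01'
FROM moved;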

David J.


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Feb 24, 2017 at 1:18 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> For my own sanity - the move update would complete successfully and break
> every ctid chain that it touches.  Any update lined up behind it in the lock
> queue would discover their target record has been deleted and would
> experience whatever behavior their isolation level dictates for such a
> situation.  So multi-partition update queries will fail to update some
> records if they happen to move between partitions even if they would
> otherwise match the query's predicate.

Right.  That's the behavior for which I am advocating, on the grounds
that it's the simplest to implement and if we all agree on something
else more complicated later, we can do it then.

> Is there any difference in behavior between this and a SQL writeable CTE
> performing the same thing via delete-returning-insert?

Not to my knowledge.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Simon Riggs
Date:
On 24 February 2017 at 07:02, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> I am inclined to at least have some option for the user to decide the
>> behaviour. In the future we can even consider support for walking
>> through the ctid chain across multiple relfilenodes. But till then, we
>> need to decide what default behaviour to keep. My inclination is more
>> towards erroring out in an unfortunate even where there is an UPDATE
>> while the row-movement is happening. One option is to not get into
>> finding whether the DELETE was part of partition row-movement or it
>> was indeed a DELETE, and always error out the UPDATE when
>> heap_update() returns HeapTupleUpdated, but only if the table is a
>> leaf partition. But this obviously will cause annoyance because of
>> chances of getting such errors when there are concurrent updates and
>> deletes in the same partition. But we can keep a table-level option
>> for determining whether to error out or silently lose the UPDATE.
>
> I'm still a fan of the "do nothing and just document that this is a
> weirdness of partitioned tables" approach, because implementing
> something will be complicated, will ensure that this misses this
> release if not the next one, and may not be any better for users.  But
> probably we need to get some more opinions from other people, since I
> can imagine people being pretty unhappy if the consensus happens to be
> at odds with my own preferences.

I'd give the view that we cannot silently ignore this issue, bearing
in mind the point that we're expecting partitioned tables to behave
exactly like normal tables.

In my understanding the issue is that UPDATEs will fail to update a
row when a valid row exists in the case where a row moved between
partitions; that behaviour will be different to a standard table.

It is of course very good that we have something ready for this
release and can make a choice of what to do.

Thoughts

1. Reuse the tuple state HEAP_MOVED_OFF, which IIRC represents almost
exactly the same thing. An UPDATE which gets to a
HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
metadata, or I might be persuaded that in-this-release it is
acceptable to fail when this occurs with an ERROR and a retryable
SQLCODE, since the UPDATE will succeed on next execution.

2. I know that DB2 handles this by having the user specify WITH ROW
MOVEMENT to explicitly indicate they accept the issue and want update
to work even with that. We could have an explicit option to allow
that. This appears to be the only way we could avoid silent errors for
foreign table partitions.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'd give the view that we cannot silently ignore this issue, bearing
> in mind the point that we're expecting partitioned tables to behave
> exactly like normal tables.

At the risk of repeating myself, I don't expect that, and I don't
think it's a reasonable expectation.  It's reasonable to expect
partitioning to be notably better than inheritance (which I think it
already is) and to provide a good base for future work (which I think
it does), but I think getting them to behave exactly like normal
tables (except for the things we want to be different) will take
another ten years of development work.

> In my understanding the issue is that UPDATEs will fail to update a
> row when a valid row exists in the case where a row moved between
> partitions; that behaviour will be different to a standard table.

Right, when at READ COMMITTED and EvalPlanQual would have happened otherwise.

> It is of course very good that we have something ready for this
> release and can make a choice of what to do.
>
> Thoughts
>
> 1. Reuse the tuple state HEAP_MOVED_OFF which IIRC represent exactly
> almost exactly the same thing. An UPDATE which gets to a
> HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
> metadata, or I might be persuaded that in-this-release it is
> acceptable to fail when this occurs with an ERROR and a retryable
> SQLCODE, since the UPDATE will succeed on next execution.

I've got my doubts about whether we can make that bit work that way,
considering that we still support pg_upgrade (possibly in multiple
steps) from old releases that had VACUUM FULL.  We really ought to put
some work into reclaiming those old bits, but there's probably no time
for that in v10.

> 2. I know that DB2 handles this by having the user specify WITH ROW
> MOVEMENT to explicitly indicate they accept the issue and want update
> to work even with that. We could have an explicit option to allow
> that. This appears to be the only way we could avoid silent errors for
> foreign table partitions.

Yeah, that's a thought.  We could give people a choice between (a)
updates that cause rows to move between partitions just fail and (b)
such updates work but with EPQ-related deficiencies.  I had previously
thought that, given those two choices, everybody would like (b) better
than (a), but maybe not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
"David G. Johnston"
Date:
On Friday, February 24, 2017, Simon Riggs <simon@2ndquadrant.com> wrote:
> 2. I know that DB2 handles this by having the user specify WITH ROW
> MOVEMENT to explicitly indicate they accept the issue and want update
> to work even with that. We could have an explicit option to allow
> that. This appears to be the only way we could avoid silent errors for
> foreign table partitions.


This does, however, make the partitioning very non-transparent to every update query just because it is remotely possible a partition-moving update might occur concurrently.

I dislike an error.  I'd say that making partition "just work" here is material for another patch.  In this one an update of the partition key can be documented as shorthand for delete-returning-insert with all the limitations that go with that.  If someone acceptably solves the ctid following logic later it can be committed - I'm assuming there would be no complaints to making things just work in a case where they only sorta worked.

David J.

Re: [HACKERS] UPDATE of partition key

From
Greg Stark
Date:
On 24 February 2017 at 14:57, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I dislike an error.  I'd say that making partition "just work" here is
> material for another patch.  In this one an update of the partition key can
> be documented as shorthand for delete-returning-insert with all the
> limitations that go with that.  If someone acceptably solves the ctid
> following logic later it can be committed - I'm assuming there would be no
> complaints to making things just work in a case where they only sorta
> worked.

Personally I don't think there's any hope that there will ever be
cross-table ctid links. Maybe one day there will be a major new table
storage format with very different capabilities than today but in the
current architecture it seems like an impossible leap.

I would expect everyone to come to terms with the basic idea that
partition key updates are always going to be a corner case. The user
defined the partition key and the docs should carefully explain to
them the impact of that definition. As long as that explanation gives
them something they can work with and manage the consequences of,
that's going to be fine.

What I'm concerned about is that silently giving "wrong" answers in
regular queries -- not even ones doing the partition key updates -- is
something the user can't really manage. They have no way to rewrite
the query to avoid the problem if some other user or part of their
system is updating partition keys. They have no way to know the
problem is even occurring.

Just to spell it out -- it's not just "no-op updates" where the user
sees 0 records updated. If I update all records where
username='stark', perhaps to set the "user banned" flag and get back
"9 records updated" and later find out that I missed a record because
someone changed the department_id while my query was running -- how
would I even know? How could I possibly rewrite my query to avoid
that?

The reason I suggested throwing a serialization failure was because I
thought that would be the easiest short-cut to the problem. I had
imagined that having a bit pattern indicating such a move would
actually be a pretty minor change. I would consider
using a normal update bitmask with InvalidBlockId in the ctid to
indicate the tuple was updated and the target of the chain isn't
available. That may be something we'll need in the future for other
cases too.

Throwing an error means the user has to retry their query but that's
at least something they can do. Even if they don't do it automatically
the ultimate user will probably just retry whatever operation errored
out anyways. But at least their database isn't logically corrupted.

-- 
greg



Re: [HACKERS] UPDATE of partition key

From
"David G. Johnston"
Date:
On Sat, Feb 25, 2017 at 11:11 AM, Greg Stark <stark@mit.edu> wrote:
> On 24 February 2017 at 14:57, David G. Johnston
> <david.g.johnston@gmail.com> wrote:
>> I dislike an error.  I'd say that making partition "just work" here is
>> material for another patch.  In this one an update of the partition key can
>> be documented as shorthand for delete-returning-insert with all the
>> limitations that go with that.  If someone acceptably solves the ctid
>> following logic later it can be committed - I'm assuming there would be no
>> complaints to making things just work in a case where they only sorta
>> worked.
>
> Personally I don't think there's any hope that there will ever be
> cross-table ctids links. Maybe one day there will be a major new table
> storage format with very different capabilities than today but in the
> current architecture it seems like an impossible leap.

How about making it work without a physical token dynamic?  For instance, let the server recognize the serialization error but instead of returning it to the client the server itself tries again.


> I would expect everyone to come to terms with the basic idea that
> partition key updates are always going to be a corner case. The user
> defined the partition key and the docs should carefully explain to
> them the impact of that definition. As long as that explanation gives
> them something they can work with and manage the consequences of
> that's going to be fine.
>
> What I'm concerned about is that silently giving "wrong" answers in
> regular queries -- not even ones doing the partition key updates -- is
> something the user can't really manage. They have no way to rewrite
> the query to avoid the problem if some other user or part of their
> system is updating partition keys. They have no way to know the
> problem is even occurring.
>
> Just to spell it out -- it's not just "no-op updates" where the user
> sees 0 records updated. If I update all records where
> username='stark', perhaps to set the "user banned" flag and get back
> "9 records updated" and later find out that I missed a record because
> someone changed the department_id while my query was running -- how
> would I even know? How could I possibly rewrite my query to avoid
> that?

But my point is that this isn't a regression from current behavior.  If I deleted one of those starks and re-inserted them with a different department_id that brand new record wouldn't be banned.  In short, my take on this patch is that it is a performance optimization.  Making the UPDATE command actually work as part of its implementation detail is a happy byproduct.

From the POV of an external observer it doesn't have to matter whether the update or delete-insert SQL was used.  It would be nice if the UPDATE version could keep logical identity maintained but that is a feature enhancement.

Failing if the other session used the UPDATE SQL isn't wrong; and I'm not against it, I just don't believe that it is better than maintaining the status quo semantics.

That said my concurrency-fu is not that strong and I don't really have a practical reason to prefer one over the other - thus I fall back on maintaining internal consistency.

IIUC it is already possible, for those who care to do so, to get a serialization failure in this scenario by upgrading isolation to repeatable read.
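
Something along these lines, reusing the users/department_id example from
upthread (just a sketch of my understanding, not something I've tested):

-- Session 1:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM users WHERE username = 'stark';  -- snapshot is taken here

-- Session 2 commits a partition-key change to one of those rows:
UPDATE users SET department_id = 150 WHERE username = 'stark';

-- Session 1:
UPDATE users SET banned = true WHERE username = 'stark';
-- ERROR:  could not serialize access due to concurrent update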

David J.


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Sat, Feb 25, 2017 at 11:41 PM, Greg Stark <stark@mit.edu> wrote:
> What I'm concerned about is that silently giving "wrong" answers in
> regular queries -- not even ones doing the partition key updates -- is
> something the user can't really manage. They have no way to rewrite
> the query to avoid the problem if some other user or part of their
> system is updating partition keys. They have no way to know the
> problem is even occurring.

That's a reasonable concern, but it's not like EvalPlanQual works
perfectly today and never causes any application-visible
inconsistencies that end up breaking things.  As the documentation
says:

----
Because of the above rules, it is possible for an updating command to
see an inconsistent snapshot: it can see the effects of concurrent
updating commands on the same rows it is trying to update, but it does
not see effects of those commands on other rows in the database. This
behavior makes Read Committed mode unsuitable for commands that
involve complex search conditions; however, it is just right for
simpler cases.
----

Maybe I've just spent too long hanging out with Kevin Grittner, but
I've come to view our EvalPlanQual behavior as pretty rickety and
unreliable in general.  For example, consider the fact that when I
spent over a year and approximately 1 gazillion email messages trying
to hammer out how join pushdown was going to handle EPQ rechecks, we
discovered that the FDW API wasn't actually handling those correctly
even for scans of single tables, hence commit
5fc4c26db5120bd90348b6ee3101fcddfdf54800.  I'm not saying that time
and effort wasn't well-spent, but I wonder whether it's necessary to
hold partitioned tables to a higher standard than that to which the
FDW interface was held for the first 4.5 years of its life.  Perhaps
it is good for us to do that, but I'm not 100% convinced.  It seems
like we decide to worry about EvalPlanQual in some cases and not in
others more or less arbitrarily.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 23 February 2017 at 16:02, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>> 2. In the patch, as part of the row movement, ExecDelete() is called
>> followed by ExecInsert(). This is done that way, because we want to
>> have the ROW triggers on that (sub)partition executed. If a user has
>> explicitly created DELETE and INSERT BR triggers for this partition, I
>> think we should run those. While at the same time, another question
>> is, what about UPDATE trigger on the same table ? Here again, one can
>> argue that because this UPDATE has been transformed into a
>> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
>> there can be a counter-argument. For e.g. if a user needs to make sure
>> about logging updates of particular columns of a row, he will expect
>> the logging to happen even when that row was transparently moved. In
>> the patch, I have retained the firing of UPDATE BR trigger.
>
> What of UPDATE AR triggers?

I think it does not make sense running after row triggers in case of
row-movement. There is no update happened on that leaf partition. This
reasoning can also apply to BR update triggers. But the reasons for
having a BR trigger and AR triggers are quite different. Generally, a
user needs to do some modifications to the row before getting the
final NEW row into the database, and hence [s]he defines a BR trigger
for that. And we can't just silently skip this step only because the
final row went into some other partition; in fact the row-movement
itself might depend on what the BR trigger did with the row. Whereas,
AR triggers are typically written for doing some other operation once
it is made sure the row is actually updated. In case of row-movement,
it is not actually updated.
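
As a contrived sketch of that last point (hypothetical table and trigger
names; just for illustration, not taken from the patch or its tests) :

CREATE TABLE t (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE t_1 PARTITION OF t FOR VALUES FROM (1) TO (100);
CREATE TABLE t_2 PARTITION OF t FOR VALUES FROM (100) TO (200);

CREATE FUNCTION t_bump_a() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    -- The BR UPDATE trigger itself can change the partition key, and so
    -- decide whether the row moves at all and where it ends up.
    NEW.a := NEW.a + 100;
    RETURN NEW;
END;
$$;

CREATE TRIGGER t_1_br_update BEFORE UPDATE ON t_1
    FOR EACH ROW EXECUTE PROCEDURE t_bump_a();

If the row-movement code skipped this trigger, the final row (and hence
the destination partition) would not be what the user's trigger intends.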

>
> As a comment on how row-movement is being handled in code, I wonder if it
> could be be made to look similar structurally to the code in ExecInsert()
> that handles ON CONFLICT DO UPDATE.  That is,
>
> if (partition constraint fails)
> {
>     /* row movement */
> }
> else
> {
>     /* ExecConstraints() */
>     /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
> }

I guess this is what has been effectively done for row movement, no ?

Looking at that, I found that in the current patch, if there is no
row-movement happening, ExecPartitionCheck() effectively gets called
twice : First time when ExecPartitionCheck() is explicitly called for
row-movement-required check, and second time in ExecConstraints()
call. May be there should be 2 separate functions
ExecCheckConstraints() and ExecPartitionConstraints(), and also
ExecCheckConstraints() that just calls both. This way we can call the
appropriate functions() accordingly in row-movement case, and the
other callers would continue to call ExecConstraints().

>
> I see that ExecConstraint() won't get called on the source partition's
> constraints if row movement occurs.  Maybe, that's unnecessary because the
> new row won't be inserted into that partition anyway.

Yes I agree.

>
> ExecWithCheckOptions() for RLS update check does happen *before* row
> movement though.

Yes. I think we should do it anyways.

>
>> 3. In case of a concurrent update/delete, suppose session A has locked
>> the row for deleting it. Now a session B has decided to update this
>> row and that is going to cause row movement, which means it will
>> delete it first. But when session A is finished deleting it, session B
>> finds that it is already deleted. In such case, it should not go ahead
>> with inserting a new row as part of the row movement. For that, I have
>> added a new parameter 'already_delete' for ExecDelete().
>
> Makes sense.  Maybe: already_deleted -> concurrently_deleted.

Right, concurrently_deleted sounds more accurate. In the next patch, I
will change that.

>
>> Of course, this still won't completely solve the concurrency anomaly.
>> In the above case, the UPDATE of Session B gets lost. May be, for a
>> user that does not tolerate this, we can have a table-level option
>> that disallows row movement, or will cause an error to be thrown for
>> one of the concurrent session.
>
> Will this table-level option be specified for a partitioned table once or
> for individual partitions?

My opinion is, if we decide to have a table-level option, it should be on
the root partition, to keep it simple.

>
>> 4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
>> that is to be moved. So in ExecInitModifyTable(), we call
>> ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
>> only during execution time for the very first time we find that we
>> need to do a row movement. I will think over that, but I am thinking
>> it might complicate things, as compared to always doing the setup for
>> UPDATE. WIll check on that.
>
> Hmm.  ExecSetupPartitionTupleRouting(), which does significant amount of
> setup work, is fine being called in ExecInitModifyTable() in the insert
> case because there are often cases where that's a bulk-insert and hence
> cost of the setup work is amortized.  Updates, OTOH, are seldom done in a
> bulk manner.  So that might be an argument for doing it late only when
> needed.

Yes, agreed.

> But that starts to sound less attractive when one realizes that
> that will occur for every row that wants to move.

If we manage to call ExecSetupPartitionTupleRouting() during execution
phase only once for the very first time we find the update requires
row movement, then we can re-use the info.


One more thing I noticed is that, in case of update-returning, the
ExecDelete() will also generate result of RETURNING, which we are
discarding. So this is a waste. We should not even process RETURNING
in ExecDelete() called for row-movement. The RETURNING should be
processed only for ExecInsert().
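
In other words, for a row-moving statement like

UPDATE t SET a = 150 WHERE a = 50 RETURNING *;

(on a hypothetical two-partition table t), the RETURNING result should be
built only from the row inserted into the new partition; whatever
ExecDelete() projects for the old row is simply thrown away.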

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/02/26 4:01, David G. Johnston wrote:
> IIUC it is already possible, for those who care to do so, to get a
> serialization failure in this scenario by upgrading isolation to repeatable
> read.

Maybe, this can be added as a note in the documentation.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I think it does not make sense running after row triggers in case of
> row-movement. There is no update happened on that leaf partition. This
> reasoning can also apply to BR update triggers. But the reasons for
> having a BR trigger and AR triggers are quite different. Generally, a
> user needs to do some modifications to the row before getting the
> final NEW row into the database, and hence [s]he defines a BR trigger
> for that. And we can't just silently skip this step only because the
> final row went into some other partition; in fact the row-movement
> itself might depend on what the BR trigger did with the row. Whereas,
> AR triggers are typically written for doing some other operation once
> it is made sure the row is actually updated. In case of row-movement,
> it is not actually updated.

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition?  It seems weird to run BR
update triggers but not AR update triggers.  Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Hi,

On 2017/03/02 15:23, Amit Khandekar wrote:
> On 23 February 2017 at 16:02, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>>> 2. In the patch, as part of the row movement, ExecDelete() is called
>>> followed by ExecInsert(). This is done that way, because we want to
>>> have the ROW triggers on that (sub)partition executed. If a user has
>>> explicitly created DELETE and INSERT BR triggers for this partition, I
>>> think we should run those. While at the same time, another question
>>> is, what about UPDATE trigger on the same table ? Here again, one can
>>> argue that because this UPDATE has been transformed into a
>>> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
>>> there can be a counter-argument. For e.g. if a user needs to make sure
>>> about logging updates of particular columns of a row, he will expect
>>> the logging to happen even when that row was transparently moved. In
>>> the patch, I have retained the firing of UPDATE BR trigger.
>>
>> What of UPDATE AR triggers?
> 
> I think it does not make sense running after row triggers in case of
> row-movement. There is no update happened on that leaf partition. This
> reasoning can also apply to BR update triggers. But the reasons for
> having a BR trigger and AR triggers are quite different. Generally, a
> user needs to do some modifications to the row before getting the
> final NEW row into the database, and hence [s]he defines a BR trigger
> for that. And we can't just silently skip this step only because the
> final row went into some other partition; in fact the row-movement
> itself might depend on what the BR trigger did with the row. Whereas,
> AR triggers are typically written for doing some other operation once
> it is made sure the row is actually updated. In case of row-movement,
> it is not actually updated.

OK, so it'd be better to clarify in the documentation that that's the case.

>> As a comment on how row-movement is being handled in code, I wonder if it
>> could be be made to look similar structurally to the code in ExecInsert()
>> that handles ON CONFLICT DO UPDATE.  That is,
>>
>> if (partition constraint fails)
>> {
>>     /* row movement */
>> }
>> else
>> {
>>     /* ExecConstraints() */
>>     /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
>> }
> 
> I guess this is what has been effectively done for row movement, no ?

Yes, although it seems nice how the formatting of the code in ExecInsert()
makes it apparent that they are distinct code paths.  OTOH, the additional
diffs caused by the suggested formatting might confuse other reviewers.

> Looking at that, I found that in the current patch, if there is no
> row-movement happening, ExecPartitionCheck() effectively gets called
> twice : First time when ExecPartitionCheck() is explicitly called for
> row-movement-required check, and second time in ExecConstraints()
> call. May be there should be 2 separate functions
> ExecCheckConstraints() and ExecPartitionConstraints(), and also
> ExecCheckConstraints() that just calls both. This way we can call the
> appropriate functions() accordingly in row-movement case, and the
> other callers would continue to call ExecConstraints().

One random idea: we could add a bool ri_PartitionCheckOK which is set to
true after it is checked in ExecConstraints().  And modify the condition
in ExecConstraints() as follows:
    if (resultRelInfo->ri_PartitionCheck &&
+       !resultRelInfo->ri_PartitionCheckOK &&
        !ExecPartitionCheck(resultRelInfo, slot, estate))

>>> 3. In case of a concurrent update/delete, suppose session A has locked
>>> the row for deleting it. Now a session B has decided to update this
>>> row and that is going to cause row movement, which means it will
>>> delete it first. But when session A is finished deleting it, session B
>>> finds that it is already deleted. In such case, it should not go ahead
>>> with inserting a new row as part of the row movement. For that, I have
>>> added a new parameter 'already_delete' for ExecDelete().
>>
>> Makes sense.  Maybe: already_deleted -> concurrently_deleted.
> 
> Right, concurrently_deleted sounds more accurate. In the next patch, I
> will change that.

Okay, thanks.

>>> Of course, this still won't completely solve the concurrency anomaly.
>>> In the above case, the UPDATE of Session B gets lost. May be, for a
>>> user that does not tolerate this, we can have a table-level option
>>> that disallows row movement, or will cause an error to be thrown for
>>> one of the concurrent session.
>>
>> Will this table-level option be specified for a partitioned table once or
>> for individual partitions?
> 
> My opinion is, if decide to have table-level option, it should be on
> the root partition, to keep it simple.

I see.

>> But that starts to sound less attractive when one realizes that
>> that will occur for every row that wants to move.
> 
> If we manage to call ExecSetupPartitionTupleRouting() during execution
> phase only once for the very first time we find the update requires
> row movement, then we can re-use the info.

That might work, too.  But I guess we're going with initialization in
ExecInitModifyTable().

> One more thing I noticed is that, in case of update-returning, the
> ExecDelete() will also generate result of RETURNING, which we are
> discarding. So this is a waste. We should not even process RETURNING
> in ExecDelete() called for row-movement. The RETURNING should be
> processed only for ExecInsert().

I wonder if it makes sense to have ExecDeleteInternal() and
ExecInsertInternal(), which perform the core function of DELETE and
INSERT, respectively.  Such as running triggers, checking constraints,
etc.  The RETURNING part is controllable by the statement, so it will be
handled by the ExecDelete() and ExecInsert(), like it is now.

When called from ExecUpdate() as part of row-movement, they perform just
the core part and leave the rest to be done by ExecUpdate() itself.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
I haven't yet handled all points, but meanwhile, some of the important
points are discussed below ...

On 6 March 2017 at 15:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>>> But that starts to sound less attractive when one realizes that
>>> that will occur for every row that wants to move.
>>
>> If we manage to call ExecSetupPartitionTupleRouting() during execution
>> phase only once for the very first time we find the update requires
>> row movement, then we can re-use the info.
>
> That might work, too.  But I guess we're going with initialization in
> ExecInitModifyTable().

I am more worried about this: even the UPDATEs that do not involve row
movement would do the expensive setup. So do it only once when we find
that we need to move the row. Something like this :
ExecUpdate()
{
....
    if (resultRelInfo->ri_PartitionCheck &&
      !ExecPartitionCheck(resultRelInfo, slot, estate))
    {
      bool  already_deleted;

      ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
             &already_deleted, canSetTag);

      if (already_deleted)
        return NULL;
      else
      {
        /* If we haven't already built the state for INSERT
         * tuple routing, build it now */
        if (!mtstate->mt_partition_dispatch_info)
        {
          ExecSetupPartitionTupleRouting(
                    mtstate->resultRelInfo->ri_RelationDesc,
                    &mtstate->mt_partition_dispatch_info,
                    &mtstate->mt_partitions,
                    &mtstate->mt_partition_tupconv_maps,
                    &mtstate->mt_partition_tuple_slot,
                    &mtstate->mt_num_dispatch,
                    &mtstate->mt_num_partitions);
        }

        return ExecInsert(mtstate, slot, planSlot, NULL,
                  ONCONFLICT_NONE, estate, false);
      }
    }
...
}


>
>> One more thing I noticed is that, in case of update-returning, the
>> ExecDelete() will also generate result of RETURNING, which we are
>> discarding. So this is a waste. We should not even process RETURNING
>> in ExecDelete() called for row-movement. The RETURNING should be
>> processed only for ExecInsert().
>
> I wonder if it makes sense to have ExecDeleteInternal() and
> ExecInsertInternal(), which perform the core function of DELETE and
> INSERT, respectively.  Such as running triggers, checking constraints,
> etc.  The RETURNING part is controllable by the statement, so it will be
> handled by the ExecDelete() and ExecInsert(), like it is now.
>
> When called from ExecUpdate() as part of row-movement, they perform just
> the core part and leave the rest to be done by ExecUpdate() itself.

Yes, if we decide to execute only the core insert/delete operations
and skip the triggers, then there is a compelling reason to have
something like ExecDeleteInternal() and ExecInsertInternal(). In fact,
I was about to start doing the same, except for the below discussion
...

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> I think it does not make sense running after row triggers in case of
>> row-movement. There is no update happened on that leaf partition. This
>> reasoning can also apply to BR update triggers. But the reasons for
>> having a BR trigger and AR triggers are quite different. Generally, a
>> user needs to do some modifications to the row before getting the
>> final NEW row into the database, and hence [s]he defines a BR trigger
>> for that. And we can't just silently skip this step only because the
>> final row went into some other partition; in fact the row-movement
>> itself might depend on what the BR trigger did with the row. Whereas,
>> AR triggers are typically written for doing some other operation once
>> it is made sure the row is actually updated. In case of row-movement,
>> it is not actually updated.
>
> How about running the BR update triggers for the old partition and the
> AR update triggers for the new partition?  It seems weird to run BR
> update triggers but not AR update triggers.  Another option would be
> to run BR and AR delete triggers and then BR and AR insert triggers,
> emphasizing the choice to treat this update as a delete + insert, but
> (as Amit Kh. pointed out to me when we were in a room together this
> week) that precludes using the BEFORE trigger to modify the row.

I checked the trigger behaviour in case of UPSERT. Here, when there is
conflict found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means AR INSERT trigger will not
fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
and AR UPDATE triggers will be fired. So in short, when an INSERT
becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
and AR UPDATE also get fired. On the same lines, it makes sense in
case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
the original table, and then the BR and AR DELETE/INSERT triggers on
the respective tables.

So the common policy can be :
Fire the BR trigger. It can be an INSERT/UPDATE/DELETE trigger depending
upon what the statement is.
If there is a change in the operation, according to what the operation
is converted to (UPDATE->DELETE+INSERT or INSERT->UPDATE), all the
respective triggers would be fired.
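
To make that easy to eyeball, something along these lines could be used
(hypothetical names; the expected outcome is what the above policy aims
for, not what any committed code does today) :

CREATE TABLE t (a int) PARTITION BY RANGE (a);
CREATE TABLE t_1 PARTITION OF t FOR VALUES FROM (1) TO (100);
CREATE TABLE t_2 PARTITION OF t FOR VALUES FROM (100) TO (200);

CREATE FUNCTION log_trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER t1_bu BEFORE UPDATE ON t_1 FOR EACH ROW EXECUTE PROCEDURE log_trig();
CREATE TRIGGER t1_au AFTER UPDATE ON t_1 FOR EACH ROW EXECUTE PROCEDURE log_trig();
CREATE TRIGGER t1_bd BEFORE DELETE ON t_1 FOR EACH ROW EXECUTE PROCEDURE log_trig();
CREATE TRIGGER t1_ad AFTER DELETE ON t_1 FOR EACH ROW EXECUTE PROCEDURE log_trig();
CREATE TRIGGER t2_bi BEFORE INSERT ON t_2 FOR EACH ROW EXECUTE PROCEDURE log_trig();
CREATE TRIGGER t2_ai AFTER INSERT ON t_2 FOR EACH ROW EXECUTE PROCEDURE log_trig();

INSERT INTO t VALUES (50);
UPDATE t SET a = 150 WHERE a = 50;

With the policy above, the BEFORE UPDATE, BEFORE DELETE and AFTER DELETE
triggers on t_1 and the BEFORE INSERT and AFTER INSERT triggers on t_2
would fire, while the AFTER UPDATE trigger on t_1 would not.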



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 17 March 2017 at 16:07, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 6 March 2017 at 15:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>>>> But that starts to sound less attractive when one realizes that
>>>> that will occur for every row that wants to move.
>>>
>>> If we manage to call ExecSetupPartitionTupleRouting() during execution
>>> phase only once for the very first time we find the update requires
>>> row movement, then we can re-use the info.
>>
>> That might work, too.  But I guess we're going with initialization in
>> ExecInitModifyTable().
>
> I am more worried about this: even the UPDATEs that do not involve row
> movement would do the expensive setup. So do it only once when we find
> that we need to move the row. Something like this :
> ExecUpdate()
> {
> ....
>     if (resultRelInfo->ri_PartitionCheck &&
>       !ExecPartitionCheck(resultRelInfo, slot, estate))
>     {
>       bool  already_deleted;
>
>       ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
>              &already_deleted, canSetTag);
>
>       if (already_deleted)
>         return NULL;
>       else
>       {
>         /* If we haven't already built the state for INSERT
>          * tuple routing, build it now */
>         if (!mtstate->mt_partition_dispatch_info)
>         {
>           ExecSetupPartitionTupleRouting(
>                     mtstate->resultRelInfo->ri_RelationDesc,
>                     &mtstate->mt_partition_dispatch_info,
>                     &mtstate->mt_partitions,
>                     &mtstate->mt_partition_tupconv_maps,
>                     &mtstate->mt_partition_tuple_slot,
>                     &mtstate->mt_num_dispatch,
>                     &mtstate->mt_num_partitions);
>         }
>
>         return ExecInsert(mtstate, slot, planSlot, NULL,
>                   ONCONFLICT_NONE, estate, false);
>       }
>     }
> ...
> }

Attached is v2 patch which implements the above optimization. Now, for
UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
movement is needed.

We have to open an extra relation for the root partition, and keep it
opened and its handle stored in
mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
this if it is different from node->resultRelInfo->ri_RelationDesc. If
it is same as node->resultRelInfo, it should not be closed because it
gets closed as part of ExecEndPlan().


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Hi Amit,

Thanks for the updated patch.

On 2017/03/23 3:09, Amit Khandekar wrote:
> Attached is v2 patch which implements the above optimization.

Would it be better to have at least some new tests?  Also, there are a few
places in the documentation mentioning that such updates cause error,
which will need to be updated.  Perhaps also add some explanatory notes
about the mechanism (delete+insert), trigger behavior, caveats, etc.
There were some points discussed upthread that could be mentioned in the
documentation.

@@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
     HeapUpdateFailureData hufd;
     TupleTableSlot *slot = NULL;

+    if (already_deleted)
+        *already_deleted = false;
+

concurrently_deleted?

@@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
     }
     else
     {
-        LockTupleMode lockmode;
+        LockTupleMode   lockmode;

Useless hunk.

+            if (!mtstate->mt_partition_dispatch_info)
+            {

The if (pointer == NULL) style is better perhaps.

+                /* root table RT index is at the head of partitioned_rels */
+                if (node->partitioned_rels)
+                {
+                    Index   root_rti;
+                    Oid     root_oid;
+
+                    root_rti = linitial_int(node->partitioned_rels);
+                    root_oid = getrelid(root_rti, estate->es_range_table);
+                    root_rel = heap_open(root_oid, NoLock); /* locked by
InitPlan */
+                }
+                else
+                    root_rel = mtstate->resultRelInfo->ri_RelationDesc;

Some explanatory comments here might be good, for example, explain in what
situations node->partitioned_rels would not have been set and/or vice versa.

> Now, for
> UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
> movement is needed.
> 
> We have to open an extra relation for the root partition, and keep it
> opened and its handle stored in
> mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
> this if it is different from node->resultRelInfo->ri_RelationDesc. If
> it is same as node->resultRelInfo, it should not be closed because it
> gets closed as part of ExecEndPlan().

I guess you're referring to the following hunk.  Some comments:

@@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node)
      * Close all the partitioned tables, leaf partitions, and their indices
      *
      * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-     * partitioned table, which we must not try to close, because it is the
-     * main target table of the query that will be closed by ExecEndPlan().
-     * Also, tupslot is NULL for the root partitioned table.
+     * partitioned table, which should not be closed if it is the main target
+     * table of the query, which will be closed by ExecEndPlan().

The last part could be written as: because it will be closed by ExecEndPlan().

 Also, tupslot
+     * is NULL for the root partitioned table.
      */
+    if (node->mt_num_dispatch > 0)
+    {
+        Relation    root_partition;

root_relation?

+
+        root_partition = node->mt_partition_dispatch_info[0]->reldesc;
+        if (root_partition != node->resultRelInfo->ri_RelationDesc)
+            heap_close(root_partition, NoLock);
+    }

It might be a good idea to Assert inside the if block above that
node->operation != CMD_INSERT.  Perhaps, also reflect that in the comment
above so that it's clearer.

I will set the patch to Waiting on Author.

Thanks,
Amit





Re: UPDATE of partition key

From
Amit Khandekar
Date:
Thanks Amit for your review comments. I am yet to handle all of your
comments, but meanwhile, attached is an updated patch that handles
RETURNING.

Earlier it was not working because ExecInsert() did not return any
RETURNING clause. This is because the setup needed to create RETURNING
projection info for leaf partitions is done in ExecInitModifyTable()
only in case of INSERT. But because it is an UPDATE operation, we have
to do this explicitly as a one-time operation when it is determined
that row-movement is required. This is similar to how we do one-time
setup of mt_partition_dispatch_info. So in the patch, I have moved
this code into a new function ExecInitPartitionReturningProjection(),
and now this is called in ExecInitModifyTable() as well as during row
movement for ExecInsert() processing the returning clause.

Basically we need to do all that is done in ExecInitModifyTable() for
INSERT. There are a couple of other things that I suspect might
need to be done as part of the missing initialization for ExecInsert()
during row-movement:
1. Junk filter handling
2. WITH CHECK OPTION


Yet, ExecDelete() during row-movement is still returning the RETURNING
result redundantly, which I am yet to handle.

On 23 March 2017 at 07:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> Thanks for the updated patch.
>
> On 2017/03/23 3:09, Amit Khandekar wrote:
>> Attached is v2 patch which implements the above optimization.
>
> Would it be better to have at least some new tests?  Also, there are a few
> places in the documentation mentioning that such updates cause error,
> which will need to be updated.  Perhaps also add some explanatory notes
> about the mechanism (delete+insert), trigger behavior, caveats, etc.
> There were some points discussed upthread that could be mentioned in the
> documentation.

Yeah, agreed. Will do this in the subsequent patch.

>
> @@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
>      HeapUpdateFailureData hufd;
>      TupleTableSlot *slot = NULL;
>
> +    if (already_deleted)
> +        *already_deleted = false;
> +
>
> concurrently_deleted?

Done.

>
> @@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
>      }
>      else
>      {
> -        LockTupleMode lockmode;
> +        LockTupleMode   lockmode;
>
> Useless hunk.
Removed.


I am yet to handle your other comments , still working on them, but
till then , attached is the updated patch.

Attachment

Re: UPDATE of partition key

From
Amit Khandekar
Date:
On 25 March 2017 at 01:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I am yet to handle all of your comments, but meanwhile, attached is
> an updated patch that handles RETURNING.
>
> Earlier it was not working because ExecInsert() did not return any
> RETURNING clause. This is because the setup needed to create RETURNIG
> projection info for leaf partitions is done in ExecInitModifyTable()
> only in case of INSERT. But because it is an UPDATE operation, we have
> to do this explicitly as a one-time operation when it is determined
> that row-movement is required. This is similar to how we do one-time
> setup of mt_partition_dispatch_info. So in the patch, I have moved
> this code into a new function ExecInitPartitionReturningProjection(),
> and now this is called in ExecInitModifyTable() as well as during row
> movement for ExecInsert() processing the returning clause.

> Basically we need to do all that is done in ExecInitModifyTable() for
> INSERT. There are a couple of other things that I suspect that might
> need to be done as part of the missing initialization for Execinsert()
> during row-movement :
> 1. Junk filter handling
> 2. WITH CHECK OPTION

Attached is another updated patch, v4, which does WITH-CHECK-OPTION
related initialization.

So we now have below two function calls during row movement :
/* Build WITH CHECK OPTION constraints for leaf partitions */
ExecInitPartitionWithCheckOptions(mtstate, root_rel);

/* Build a projection for each leaf partition rel. */
ExecInitPartitionReturningProjection(mtstate, root_rel);

And these functions are now re-used in two places: in
ExecInitModifyTable() and in the row-movement code.
Basically whatever was not being initialized in ExecInitModifyTable()
is now done in row-movement code.

I have added relevant scenarios in sql/update.sql.

I checked the junk filter handling. I think there isn't anything that
needs to be done, because for INSERT, all that is needed is
ExecCheckPlanOutput(). And this function is anyway called in
ExecInitModifyTable() even for UPDATE, so we don't have to initialize
anything additional.

> Yet, ExecDelete() during row-movement is still returning the RETURNING
> result redundantly, which I am yet to handle this.
Done above. Now we have a new parameter in ExecDelete() which tells
whether to skip RETURNING.

On 23 March 2017 at 07:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Would it be better to have at least some new tests?
Added some more scenarios in update.sql. Also have included scenarios
for WITH-CHECK-OPTION for updatable views.
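
For illustration, a WITH-CHECK-OPTION scenario of the kind meant here could
look like this (a simplified sketch, not the exact queries added to
update.sql):

CREATE TABLE t (a int) PARTITION BY RANGE (a);
CREATE TABLE t_1 PARTITION OF t FOR VALUES FROM (1) TO (100);
CREATE TABLE t_2 PARTITION OF t FOR VALUES FROM (100) TO (200);
INSERT INTO t VALUES (50);

CREATE VIEW t_v AS SELECT * FROM t WHERE a < 150 WITH CHECK OPTION;

UPDATE t_v SET a = 120 WHERE a = 50;   -- row moves from t_1 to t_2, check passes
UPDATE t_v SET a = 180 WHERE a = 120;  -- would also move the row, but must fail
                                       -- the view's CHECK OPTION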


> Also, there are a few places in the documentation mentioning that such updates cause error,
> which will need to be updated.  Perhaps also add some explanatory notes
> about the mechanism (delete+insert), trigger behavior, caveats, etc.
> There were some points discussed upthread that could be mentioned in the
> documentation.
Yeah, I agree. Documentation needs some important changes. I am still
working on them.

> +            if (!mtstate->mt_partition_dispatch_info)
> +            {
>
> The if (pointer == NULL) style is better perhaps.
>
> +                /* root table RT index is at the head of partitioned_rels */
> +                if (node->partitioned_rels)
> +                {
> +                    Index   root_rti;
> +                    Oid     root_oid;
> +
> +                    root_rti = linitial_int(node->partitioned_rels);
> +                    root_oid = getrelid(root_rti, estate->es_range_table);
> +                    root_rel = heap_open(root_oid, NoLock); /* locked by
> InitPlan */
> +                }
> +                else
> +                    root_rel = mtstate->resultRelInfo->ri_RelationDesc;
>
> Some explanatory comments here might be good, for example, explain in what
> situations node->partitioned_rels would not have been set and/or vice versa.
Added some more comments in the relevant if conditions.

>
>> Now, for
>> UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
>> movement is needed.
>>
>> We have to open an extra relation for the root partition, and keep it
>> opened and its handle stored in
>> mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
>> this if it is different from node->resultRelInfo->ri_RelationDesc. If
>> it is same as node->resultRelInfo, it should not be closed because it
>> gets closed as part of ExecEndPlan().
>
> I guess you're referring to the following hunk.  Some comments:
>
> @@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node)
>       * Close all the partitioned tables, leaf partitions, and their indices
>       *
>       * Remember node->mt_partition_dispatch_info[0] corresponds to the root
> -     * partitioned table, which we must not try to close, because it is the
> -     * main target table of the query that will be closed by ExecEndPlan().
> -     * Also, tupslot is NULL for the root partitioned table.
> +     * partitioned table, which should not be closed if it is the main target
> +     * table of the query, which will be closed by ExecEndPlan().
>
> The last part could be written as: because it will be closed by ExecEndPlan().

Actually I later realized that the relation is not required to be kept
open until ExecEndModifyTable(). So I reverted the above changes. Now
it is immediately closed once all the row-movement-related setup is
done.
>
>  Also, tupslot
> +     * is NULL for the root partitioned table.
>       */
> +    if (node->mt_num_dispatch > 0)
> +    {
> +        Relation    root_partition;
>
> root_relation?
>
> +
> +        root_partition = node->mt_partition_dispatch_info[0]->reldesc;
> +        if (root_partition != node->resultRelInfo->ri_RelationDesc)
> +            heap_close(root_partition, NoLock);
> +    }
>
> It might be a good idea to Assert inside the if block above that
> node->operation != CMD_INSERT.  Perhaps, also reflect that in the comment
> above so that it's clearer.

This does not apply now since I reverted as mentioned above.

>
>> Looking at that, I found that in the current patch, if there is no
>> row-movement happening, ExecPartitionCheck() effectively gets called
>> twice : First time when ExecPartitionCheck() is explicitly called for
>> row-movement-required check, and second time in ExecConstraints()
>> call. May be there should be 2 separate functions
>> ExecCheckConstraints() and ExecPartitionConstraints(), and also
>> ExecCheckConstraints() that just calls both. This way we can call the
>> appropriate functions() accordingly in row-movement case, and the
>> other callers would continue to call ExecConstraints().
>
> One random idea: we could add a bool ri_PartitionCheckOK which is set to
> true after it is checked in ExecConstraints().  And modify the condition
> in ExecConstraints() as follows:
>
>    if (resultRelInfo->ri_PartitionCheck &&
>+       !resultRelInfo->ri_PartitionCheckOK &&
>        !ExecPartitionCheck(resultRelInfo, slot, estate))

I have taken out the part in ExecConstraints() where it forms and emits
the partition constraint error message, and put it in a new function
ExecPartitionCheckEmitError(). This is called in ExecConstraints()
as well as in ExecUpdate() when it finds that it is not a partitioned
table. This happens when the UPDATE has been run on a leaf partition,
and ExecPartitionCheck() fails for the leaf partition. Here, we
just need to emit the same error message that ExecConstraints() emits.

Attachment

Re: UPDATE of partition key

From
Amit Khandekar
Date:
On 27 March 2017 at 13:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Also, there are a few places in the documentation mentioning that such updates cause error,
>> which will need to be updated.  Perhaps also add some explanatory notes
>> about the mechanism (delete+insert), trigger behavior, caveats, etc.
>> There were some points discussed upthread that could be mentioned in the
>> documentation.
>> Yeah, I agree. Documentation needs some important changes. I am still
>> working on them.

Attached patch v5 has above required doc changes added.

In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
removed the caveat of not being able to update partition key. And it
is now replaced by the caveat where an update/delete operations can
silently miss a row when there is a concurrent UPDATE of partition-key
happening.

UPDATE row movement behaviour is described in :
Part VI "Reference => SQL Commands => UPDATE

> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:
>> How about running the BR update triggers for the old partition and the
>> AR update triggers for the new partition?  It seems weird to run BR
>> update triggers but not AR update triggers.  Another option would be
>> to run BR and AR delete triggers and then BR and AR insert triggers,
>> emphasizing the choice to treat this update as a delete + insert, but
>> (as Amit Kh. pointed out to me when we were in a room together this
>> week) that precludes using the BEFORE trigger to modify the row.
>
> I checked the trigger behaviour in case of UPSERT. Here, when there is
> conflict found, ExecOnConflictUpdate() is called, and then the
> function returns immediately, which means AR INSERT trigger will not
> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
> and AR UPDATE triggers will be fired. So in short, when an INSERT
> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
> and AR UPDATE also get fired. On the same lines, it makes sense in
> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
> the original table, and then the BR and AR DELETE/INSERT triggers on
> the respective tables.
>
> So the common policy can be :
> Fire the BR trigger. It can be INESRT/UPDATE/DELETE trigger depending
> upon what the statement is.
> If there is a change in the operation, according to what the operation
> is converted to (UPDATE->DELETE+INSERT or INSERT->UPDATE), all the
> respective triggers would be fired.

The current patch already has the behaviour as per above policy. So I
have included the description of this trigger related behaviour in the
"Overview of Trigger Behavior" section of the docs. This has been
derived from the way it is written for trigger behaviour for UPSERT in
the preceding section.

Attachment

Re: UPDATE of partition key

From
Amit Langote
Date:
Hi Amit,

Thanks for the updated patches.

On 2017/03/28 19:12, Amit Khandekar wrote:
> On 27 March 2017 at 13:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> Also, there are a few places in the documentation mentioning that such updates cause error,
>>> which will need to be updated.  Perhaps also add some explanatory notes
>>> about the mechanism (delete+insert), trigger behavior, caveats, etc.
>>> There were some points discussed upthread that could be mentioned in the
>>> documentation.
>>> Yeah, I agree. Documentation needs some important changes. I am still
>>> working on them.
> 
> Attached patch v5 has above required doc changes added.
> 
> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
> removed the caveat of not being able to update partition key. And it
> is now replaced by the caveat where an update/delete operations can
> silently miss a row when there is a concurrent UPDATE of partition-key
> happening.

Hmm, how about just removing the "partition-changing updates are
disallowed" caveat from the list on the 5.11 Partitioning page and explain
the concurrency-related caveats on the UPDATE reference page?

> UPDATE row movement behaviour is described in :
> Part VI "Reference => SQL Commands => UPDATE
> 
>> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:
>>> How about running the BR update triggers for the old partition and the
>>> AR update triggers for the new partition?  It seems weird to run BR
>>> update triggers but not AR update triggers.  Another option would be
>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>> emphasizing the choice to treat this update as a delete + insert, but
>>> (as Amit Kh. pointed out to me when we were in a room together this
>>> week) that precludes using the BEFORE trigger to modify the row.
>>
>> I checked the trigger behaviour in case of UPSERT. Here, when there is
>> conflict found, ExecOnConflictUpdate() is called, and then the
>> function returns immediately, which means AR INSERT trigger will not
>> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
>> and AR UPDATE triggers will be fired. So in short, when an INSERT
>> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
>> and AR UPDATE also get fired. On the same lines, it makes sense in
>> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
>> the original table, and then the BR and AR DELETE/INSERT triggers on
>> the respective tables.
>>
>> So the common policy can be :
>> Fire the BR trigger. It can be INESRT/UPDATE/DELETE trigger depending
>> upon what the statement is.
>> If there is a change in the operation, according to what the operation
>> is converted to (UPDATE->DELETE+INSERT or INSERT->UPDATE), all the
>> respective triggers would be fired.
> 
> The current patch already has the behaviour as per above policy. So I
> have included the description of this trigger related behaviour in the
> "Overview of Trigger Behavior" section of the docs. This has been
> derived from the way it is written for trigger behaviour for UPSERT in
> the preceding section.

I tested how various row-level triggers behave and it all seems to work as
you have described in your various messages, which the latest patch also
documents.

Some comments on the patch itself:

-      An <command>UPDATE</> that causes a row to move from one partition to
-      another fails, because the new value of the row fails to satisfy the
-      implicit partition constraint of the original partition.  This might
-      change in future releases.
+      An <command>UPDATE</> causes a row to move from one partition to
another
+      if the new value of the row fails to satisfy the implicit partition
<snip>

As mentioned above, we can simply remove this item from the list of
caveats on ddl.sgml.  The new text can be moved to the Notes portion of
the UPDATE reference page.

+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it is possible that all row-level
+    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
+    triggers are applied on the respective partitions in a way that is
+    apparent from the final state of the updated row.

How about dropping "it is possible that" from this sentence?

+    <command>UPDATE</command> is done by doing a <command>DELETE</command> on

How about: s/is done/is performed/g

+    triggers are not applied because the <command>UPDATE</command> is
+    converted to a <command>DELETE</command> and <command>UPDATE</command>.

I think you meant DELETE and INSERT.

+        if (resultRelInfo->ri_PartitionCheck &&
+            !ExecPartitionCheck(resultRelInfo, slot, estate))
+        {

How about a one-line comment what this block of code does?

-         * Check the constraints of the tuple.  Note that we pass the same
+         * Check the constraints of the tuple. Note that we pass the same

I think that this hunk is not necessary.  (I've heard that two spaces
after a sentence-ending period is not a problem [1].)

+         * We have already run partition constraints above, so skip them
below.

How about: s/run/checked the/g?

@@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node)
         heap_close(pd->reldesc, NoLock);
         ExecDropSingleTupleTableSlot(pd->tupslot);
     }
+
     for (i = 0; i < node->mt_num_partitions; i++)
     {
         ResultRelInfo *resultRelInfo = node->mt_partitions + i;

Needless hunk.


Overall, I think the patch looks good now.  Thanks again for working on it.

Thanks,
Amit

[1] https://www.python.org/dev/peps/pep-0008/#comments





Re: UPDATE of partition key

From
Amit Khandekar
Date:
For some reason, my reply got sent to only Amit Langote instead of
reply-to-all. Below is the mail reply. Thanks Amit Langote for
bringing this to my notice.

On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/03/28 19:12, Amit Khandekar wrote:
>>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
>>> removed the caveat of not being able to update partition key. And it
>>> is now replaced by the caveat where an update/delete operations can
>>> silently miss a row when there is a concurrent UPDATE of partition-key
>>> happening.
>>
>> Hmm, how about just removing the "partition-changing updates are
>> disallowed" caveat from the list on the 5.11 Partitioning page and explain
>> the concurrency-related caveats on the UPDATE reference page?
>
> IMHO this caveat is better placed in Partitioning chapter to emphasize
> that it is a drawback specifically in presence of partitioning.
>
>> +    If an <command>UPDATE</command> on a partitioned table causes a row to
>> +    move to another partition, it is possible that all row-level
>> +    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
>> +    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
>> +    triggers are applied on the respective partitions in a way that is
>> apparent
>> +    from the final state of the updated row.
>>
>> How about dropping "it is possible that" from this sentence?
>
> What the statement means is : "It is true that all triggers are
> applied on the respective partitions; but it is possible that they are
> applied in a way that is apparent from final state of the updated
> row". So "possible" applies to "in a way that is apparent..". It
> means, the user should be aware that all the triggers can change the
> row and so the final row will be affected by all those triggers.
> Actually, we have a similar statement for UPSERT involved with
> triggers in the preceding section. I have taken the statement from
> there.
>
>>
>> +    <command>UPDATE</command> is done by doing a <command>DELETE</command> on
>>
>> How about: s/is done/is performed/g
>
> Done.
>
>>
>> +    triggers are not applied because the <command>UPDATE</command> is
>> converted
>> +    to a <command>DELETE</command> and <command>UPDATE</command>.
>>
>> I think you meant DELETE and INSERT.
>
> Oops. Corrected.
>
>>
>> +        if (resultRelInfo->ri_PartitionCheck &&
>> +            !ExecPartitionCheck(resultRelInfo, slot, estate))
>> +        {
>>
>> How about a one-line comment what this block of code does?
>
> Yes, this was needed. Added a comment.
>
>>
>> -         * Check the constraints of the tuple.  Note that we pass the same
>> +         * Check the constraints of the tuple. Note that we pass the same
>>
>> I think that this hunk is not necessary.  (I've heard that two spaces
>> after a sentence-ending period is not a problem [1].)
>
> Actually I accidentally removed one space, thinking that it was one of
> my own comments. Reverted back this change, since it is a needless
> hunk.
>
>>
>> +         * We have already run partition constraints above, so skip them below.
>>
>> How about: s/run/checked the/g?
>
> Done.
>
>> @@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node)
>>          heap_close(pd->reldesc, NoLock);
>>          ExecDropSingleTupleTableSlot(pd->tupslot);
>>      }
>> +
>>      for (i = 0; i < node->mt_num_partitions; i++)
>>      {
>>          ResultRelInfo *resultRelInfo = node->mt_partitions + i;
>>
>> Needless hunk.
>
> Right. Removed.
>
>>
>> Overall, I think the patch looks good now.  Thanks again for working on it.
>
> Thanks Amit for your efforts in reviewing the patch. Attached is v6
> patch that contains above points handled.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: UPDATE of partition key

From
Amit Langote
Date:
Hi Amit,

Thanks for updating the patch.  Since ddl.sgml got updated on Saturday,
patch needs a rebase.

> On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2017/03/28 19:12, Amit Khandekar wrote:
>>>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
>>>> removed the caveat of not being able to update partition key. And it
>>>> is now replaced by the caveat where an update/delete operations can
>>>> silently miss a row when there is a concurrent UPDATE of partition-key
>>>> happening.
>>>
>>> Hmm, how about just removing the "partition-changing updates are
>>> disallowed" caveat from the list on the 5.11 Partitioning page and explain
>>> the concurrency-related caveats on the UPDATE reference page?
>>
>> IMHO this caveat is better placed in Partitioning chapter to emphasize
>> that it is a drawback specifically in presence of partitioning.

I mean we fixed things for declarative partitioning so it's no longer a
caveat like it is for partitioning implemented using inheritance (in that
the former doesn't require user-defined triggers to implement
row-movement).  Seeing the first sentence, that is:

An <command>UPDATE</> causes a row to move from one partition to another
if the new value of the row fails to satisfy the implicit partition
constraint of the original partition but there is another partition which
can fit this row.

which clearly seems to suggest that row-movement, if required, is handled
by the system.  So it's not clear why it's in this list.  If we want to
describe the limitations of the current implementation, we'll need to
rephrase it a bit.  How about something like:

For an <command>UPDATE</> that causes a row to move from one partition to
another due to the partition key being updated, the following caveats exist:
<a brief description of the possibility of surprising results in the
presence of concurrent manipulation of the row in question>

>>> +    If an <command>UPDATE</command> on a partitioned table causes a row to
>>> +    move to another partition, it is possible that all row-level
>>> +    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
>>> +    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
>>> +    triggers are applied on the respective partitions in a way that is
>>> apparent
>>> +    from the final state of the updated row.
>>>
>>> How about dropping "it is possible that" from this sentence?
>>
>> What the statement means is : "It is true that all triggers are
>> applied on the respective partitions; but it is possible that they are
>> applied in a way that is apparent from final state of the updated
>> row". So "possible" applies to "in a way that is apparent..". It
>> means, the user should be aware that all the triggers can change the
>> row and so the final row will be affected by all those triggers.
>> Actually, we have a similar statement for UPSERT involved with
>> triggers in the preceding section. I have taken the statement from
>> there.

Where it appears in that sentence made me think it could be
confusing to some.  How about reordering sentences in that paragraph so
that the whole paragraphs reads as follows:

If an UPDATE on a partitioned table causes a row to move to another
partition, it will be performed as a DELETE from the original partition
followed by INSERT into the new partition. In this case, all row-level
BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired
on the original partition. Then all row-level BEFORE INSERT triggers are
fired on the destination partition. The possibility of surprising outcomes
should be considered when all these triggers affect the row being moved.
As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT
triggers are applied; but AFTER UPDATE triggers are not applied because
the UPDATE has been converted to a DELETE and INSERT. None of the DELETE
and INSERT statement-level triggers are fired, even if row movement
occurs; only the UPDATE triggers of the target table used in the UPDATE
statement will be fired.
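
To make that concrete, here is a minimal sketch of what the above would
mean in practice (table, partition and trigger names are made up, the
trigger function is just a placeholder, and this assumes the patch's
behaviour as described) :

create function trg_func() returns trigger language plpgsql as
$$ begin raise notice '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
   return coalesce(new, old); end $$;

create table ptab (a int) partition by range (a);
create table ptab_1 partition of ptab for values from (1) to (100);
create table ptab_2 partition of ptab for values from (100) to (200);

create trigger t_bu before update on ptab_1 for each row execute procedure trg_func();  -- fires
create trigger t_bd before delete on ptab_1 for each row execute procedure trg_func();  -- fires
create trigger t_bi before insert on ptab_2 for each row execute procedure trg_func();  -- fires
create trigger t_au after update on ptab_1 for each row execute procedure trg_func();   -- does not fire

insert into ptab values (50);
update ptab set a = 150 where a = 50;   -- row moves from ptab_1 to ptab_2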

Finally, I forgot to mention during the last review that the new parameter
'returning' to ExecDelete() could be called 'process_returning'.

Thanks,
Amit





Re: UPDATE of partition key

From
Amit Khandekar
Date:
On 3 April 2017 at 17:13, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> Thanks for updating the patch.  Since ddl.sgml got updated on Saturday,
> patch needs a rebase.

Rebased now.

>
>> On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> On 2017/03/28 19:12, Amit Khandekar wrote:
>>>>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
>>>>> removed the caveat of not being able to update partition key. And it
>>>>> is now replaced by the caveat where an update/delete operations can
>>>>> silently miss a row when there is a concurrent UPDATE of partition-key
>>>>> happening.
>>>>
>>>> Hmm, how about just removing the "partition-changing updates are
>>>> disallowed" caveat from the list on the 5.11 Partitioning page and explain
>>>> the concurrency-related caveats on the UPDATE reference page?
>>>
>>> IMHO this caveat is better placed in Partitioning chapter to emphasize
>>> that it is a drawback specifically in presence of partitioning.
>
> I mean we fixed things for declarative partitioning so it's no longer a
> caveat like it is for partitioning implemented using inheritance (in that
> the former doesn't require user-defined triggers to implement
> row-movement).  Seeing the first sentence, that is:
>
> An <command>UPDATE</> causes a row to move from one partition to another
> if the new value of the row fails to satisfy the implicit partition
> constraint of the original partition but there is another partition which
> can fit this row.
>
> which clearly seems to suggest that row-movement, if required, is handled
> by the system.  So it's not clear why it's in this list.  If we want to
> describe the limitations of the current implementation, we'll need to
> rephrase it a bit.

Yes I agree.

> How about something like:
> For an <command>UPDATE</> that causes a row to move from one partition to
> another due to the partition key being updated, the following caveats exist:
> <a brief description of the possibility of surprising results in the
> presence of concurrent manipulation of the row in question>

Now with the slightly changed doc structuring for partitioning in the
latest master, I have added the following note at the end of section
"5.10.2. Declarative Partitioning" :

---

"Updating the partition key of a row might cause it to be moved into a
different partition where this row satisfies its partition
constraint."

---

And then in the Limitations section, I have replaced the earlier
can't-update-partition-key limitation with this new limitation as
below :

"When an UPDATE causes a row to move from one partition to another,
there is a chance that another concurrent UPDATE or DELETE misses this
row. Suppose, during the row movement, the row is still visible to
the concurrent session, and it is about to do an UPDATE or DELETE
operation on the same row. This DML operation can silently miss this
row if the row now gets deleted from the partition by the first
session as part of its UPDATE row movement. In such a case, the
concurrent UPDATE/DELETE, being unaware of the row movement,
interprets that the row has just been deleted so there is nothing to
be done for this row. Whereas, in the usual case where the table is
not partitioned, or where there is no row movement, the second session
would have identified the newly updated row and carried out the
UPDATE/DELETE on this new row version."

---
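
To make the scenario a bit more concrete, here is a rough two-session
timeline (the schema here is just an illustration; assume ptab(a int, b int)
partitioned by range on a, with partitions ptab_1 and ptab_2, and one row
(50, 1) currently sitting in ptab_1) :

Session 1:  begin;
Session 1:  update ptab set a = 150 where b = 1;
            -- the row moves: deleted from ptab_1, inserted into ptab_2
Session 2:  update ptab set b = 2 where b = 1;
            -- blocks, waiting on session 1's lock on the old row version
Session 1:  commit;
            -- without row movement, session 2 would follow the update chain,
            -- recheck b = 1 on the new row version and update it; with row
            -- movement it only sees that the old version was deleted, so it
            -- silently updates zero rows and its change is lost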

Further, in the Notes section of update.sgml, I have kept a link to
the above limitations section like this :

"In the case of a partitioned table, updating a row might cause it to
no longer satisfy the partition constraint of the containing
partition. In that case, if there is some other partition in the
partition tree for which this row satisfies its partition constraint,
then the row is moved to that partition. If there isn't such a
partition, an error will occur. The error will also occur when
updating a partition directly. Behind the scenes, the row movement is
actually a DELETE and INSERT operation. However, there is a
possibility that a concurrent UPDATE or DELETE on the same row may
miss this row. For details, see Section 5.10.2.3."

>
>>>> +    If an <command>UPDATE</command> on a partitioned table causes a row to
>>>> +    move to another partition, it is possible that all row-level
>>>> +    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
>>>> +    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
>>>> +    triggers are applied on the respective partitions in a way that is
>>>> apparent
>>>> +    from the final state of the updated row.
>>>>
>>>> How about dropping "it is possible that" from this sentence?
>>>
>>> What the statement means is : "It is true that all triggers are
>>> applied on the respective partitions; but it is possible that they are
>>> applied in a way that is apparent from final state of the updated
>>> row". So "possible" applies to "in a way that is apparent..". It
>>> means, the user should be aware that all the triggers can change the
>>> row and so the final row will be affected by all those triggers.
>>> Actually, we have a similar statement for UPSERT involved with
>>> triggers in the preceding section. I have taken the statement from
>>> there.
>
> The place where it appears in that sentence made me think it could be
> confusing to some.  How about reordering the sentences in that paragraph so
> that the whole paragraph reads as follows:
>
> If an UPDATE on a partitioned table causes a row to move to another
> partition, it will be performed as a DELETE from the original partition
> followed by INSERT into the new partition. In this case, all row-level
> BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired
> on the original partition. Then all row-level BEFORE INSERT triggers are
> fired on the destination partition. The possibility of surprising outcomes
> should be considered when all these triggers affect the row being moved.
> As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT
> triggers are applied; but AFTER UPDATE triggers are not applied because
> the UPDATE has been converted to a DELETE and INSERT. None of the DELETE
> and INSERT statement-level triggers are fired, even if row movement
> occurs; only the UPDATE triggers of the target table used in the UPDATE
> statement will be fired.

Yeah, most of the above makes sense to me. I have kept the phrase "as
far as statement-level triggers are concerned".

>
> Finally, I forgot to mention during the last review that the new parameter
> 'returning' to ExecDelete() could be called 'process_returning'.

Done, thanks.

Attached updated patch v7 has the above changes.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: UPDATE of partition key

From
Amit Langote
Date:
Hi Amit,

On 2017/04/04 20:11, Amit Khandekar wrote:
> On 3 April 2017 at 17:13, Amit Langote wrote:
>>>> On 31 March 2017 at 14:04, Amit Langote wrote:
>> How about something like:
>> For an <command>UPDATE</> that causes a row to move from one partition to
>> another due to the partition key being updated, the following caveats exist:
>> <a brief description of the possibility of surprising results in the
>> presence of concurrent manipulation of the row in question>
> 
> Now with the slightly changed doc structuring for partitioning in
> latest master, I have described in the end of section "5.10.2.
> Declarative Partitioning" this note :
> 
> ---
> 
> "Updating the partition key of a row might cause it to be moved into a
> different partition where this row satisfies its partition
> constraint."
> 
> ---
> 
> And then in the Limitations section, I have replaced the earlier
> can't-update-partition-key limitation with this new limitation as
> below :
> 
> "When an UPDATE causes a row to move from one partition to another,
> there is a chance that another concurrent UPDATE or DELETE misses this
> row. Suppose, during the row movement, the row is still visible to
> the concurrent session, and it is about to do an UPDATE or DELETE
> operation on the same row. This DML operation can silently miss this
> row if the row now gets deleted from the partition by the first
> session as part of its UPDATE row movement. In such a case, the
> concurrent UPDATE/DELETE, being unaware of the row movement,
> interprets that the row has just been deleted so there is nothing to
> be done for this row. Whereas, in the usual case where the table is
> not partitioned, or where there is no row movement, the second session
> would have identified the newly updated row and carried out the
> UPDATE/DELETE on this new row version."
> 
> ---

OK.

> Further, in the Notes section of update.sgml, I have kept a link to
> the above limitations section like this :
> 
> "In the case of a partitioned table, updating a row might cause it to
> no longer satisfy the partition constraint of the containing
> partition. In that case, if there is some other partition in the
> partition tree for which this row satisfies its partition constraint,
> then the row is moved to that partition. If there isn't such a
> partition, an error will occur. The error will also occur when
> updating a partition directly. Behind the scenes, the row movement is
> actually a DELETE and INSERT operation. However, there is a
> possibility that a concurrent UPDATE or DELETE on the same row may
> miss this row. For details, see Section 5.10.2.3."

OK, too.  It seems to me that the details in 5.10.2.3 provide more or less
the same information as "concurrent UPDATE or DELETE looking at the moved
row will miss this row", but maybe that's fine.

>> If an UPDATE on a partitioned table causes a row to move to another
>> partition, it will be performed as a DELETE from the original partition
>> followed by INSERT into the new partition. In this case, all row-level
>> BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired
>> on the original partition. Then all row-level BEFORE INSERT triggers are
>> fired on the destination partition. The possibility of surprising outcomes
>> should be considered when all these triggers affect the row being moved.
>> As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT
>> triggers are applied; but AFTER UPDATE triggers are not applied because
>> the UPDATE has been converted to a DELETE and INSERT. None of the DELETE
>> and INSERT statement-level triggers are fired, even if row movement
>> occurs; only the UPDATE triggers of the target table used in the UPDATE
>> statement will be fired.
> 
> Yeah, most of the above makes sense to me. I have kept the phrase "as
> far as statement-level triggers are concerned".

OK, sure.

>> Finally, I forgot to mention during the last review that the new parameter
>> 'returning' to ExecDelete() could be called 'process_returning'.
> 
> Done, thanks.
> 
> Attached updated patch v7 has the above changes.

Marked as ready for committer.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Apr 5, 2017 at 5:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Marked as ready for committer.

Andres seems to have changed the status of this patch to "Needs
review" and then, 30 seconds later, to "Waiting on author", but
there's no actual email on the thread explaining what his concerns
were.  I'm going to set this back to "Ready for Committer" and push it
out to the next CommitFest.  I think this would be a great feature,
but I think it's not entirely clear that we have consensus on the
design, so let's revisit it for next release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Andres Freund
Date:
On 2017-04-07 13:55:51 -0400, Robert Haas wrote:
> On Wed, Apr 5, 2017 at 5:54 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > Marked as ready for committer.
> 
> Andres seems to have changed the status of this patch to "Needs
> review" and then, 30 seconds later, to "Waiting on author", but
> there's no actual email on the thread explaining what his concerns
> were.  I'm going to set this back to "Ready for Committer" and push it
> out to the next CommitFest.  I think this would be a great feature,
> but I think it's not entirely clear that we have consensus on the
> design, so let's revisit it for next release.

I was kind of looking for the appropriate status of "not entirely clear
that we have consensus on the design" - which isn't really
ready-for-committer, but not waiting-on-author either...

Greetings,

Andres Freund



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached updated patch v7 has the above changes.

This no longer applies.  Please rebase.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 2 May 2017 at 18:17, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Attached updated patch v7 has the above changes.
>
> This no longer applies.  Please rebase.

Thanks Robert for informing about this.

My patch has a separate function for emitting an error message when a
partition constraint fails. And the recent commit c0a8ae7be3 has
changes to correct the way the tuple is formed for display in the
error message. Hence there were some code-level conflicts.

Attached is the rebased patch, which resolves the above conflicts.


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> I think it does not make sense running after row triggers in case of
>>> row-movement. There is no update happened on that leaf partition. This
>>> reasoning can also apply to BR update triggers. But the reasons for
>>> having a BR trigger and AR triggers are quite different. Generally, a
>>> user needs to do some modifications to the row before getting the
>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>> for that. And we can't just silently skip this step only because the
>>> final row went into some other partition; in fact the row-movement
>>> itself might depend on what the BR trigger did with the row. Whereas,
>>> AR triggers are typically written for doing some other operation once
>>> it is made sure the row is actually updated. In case of row-movement,
>>> it is not actually updated.
>>
>> How about running the BR update triggers for the old partition and the
>> AR update triggers for the new partition?  It seems weird to run BR
>> update triggers but not AR update triggers.  Another option would be
>> to run BR and AR delete triggers and then BR and AR insert triggers,
>> emphasizing the choice to treat this update as a delete + insert, but
>> (as Amit Kh. pointed out to me when we were in a room together this
>> week) that precludes using the BEFORE trigger to modify the row.
>>

I also find the current behavior with respect to triggers quite odd.
The two points that appear odd are: (a) executing both before-row
update and delete triggers on the original partition sounds quite odd;
(b) it seems inconsistent to consider the behavior for row and
statement triggers differently.

>
> I checked the trigger behaviour in case of UPSERT. Here, when there is
> conflict found, ExecOnConflictUpdate() is called, and then the
> function returns immediately, which means AR INSERT trigger will not
> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
> and AR UPDATE triggers will be fired. So in short, when an INSERT
> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
> and AR UPDATE also get fired. On the same lines, it makes sense in
> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
> the original table, and then the BR and AR DELETE/INSERT triggers on
> the respective tables.
>

I am not sure if it is a good idea to compare it with "Insert On
Conflict Do Update", but even if we go that way, I think Insert On
Conflict is consistent for statement-level triggers, which means it will
fire both Insert and Update statement-level triggers (as per the below
note in the docs), whereas the documentation in the patch indicates that
this patch will only fire Update statement-level triggers, which is
odd.

Note in the docs about Insert On Conflict :
"Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both
INSERT and UPDATE statement level trigger will be fired."



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Wed, May 3, 2017 at 11:22 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 2 May 2017 at 18:17, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> Attached updated patch v7 has the above changes.
>>
>
> Attached is the rebased patch, which resolves the above conflicts.
>

Few comments:
1.
Operating directly on a partition doesn't allow the update to move the row.
Refer to the below example:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR:  new row for relation "t1_part_1" violates partition constraint
DETAIL:  Failing row contains (122).

2.
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+  Relation root_rel);

Spurious line delete.

3.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur.

Doesn't this error case indicate that this needs to be integrated with
the default partition patch of Rahila, or that that patch needs to take
care of this error case?
Basically, if there is no matching partition, then move it to the default partition.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, May 11, 2017 at 7:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few comments:
> 1.
> Operating directly on partition doesn't allow update to move row.
> Refer below example:
> create table t1(c1 int) partition by range(c1);
> create table t1_part_1 partition of t1 for values from (1) to (100);
> create table t1_part_2 partition of t1 for values from (100) to (200);
> insert into t1 values(generate_series(1,11));
> insert into t1 values(generate_series(110,120));
>
> postgres=# update t1_part_1 set c1=122 where c1=11;
> ERROR:  new row for relation "t1_part_1" violates partition constraint
> DETAIL:  Failing row contains (122).

I think that's correct behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>> I think it does not make sense running after row triggers in case of
>>>> row-movement. There is no update happened on that leaf partition. This
>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>> user needs to do some modifications to the row before getting the
>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>> for that. And we can't just silently skip this step only because the
>>>> final row went into some other partition; in fact the row-movement
>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>> AR triggers are typically written for doing some other operation once
>>>> it is made sure the row is actually updated. In case of row-movement,
>>>> it is not actually updated.
>>>
>>> How about running the BR update triggers for the old partition and the
>>> AR update triggers for the new partition?  It seems weird to run BR
>>> update triggers but not AR update triggers.  Another option would be
>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>> emphasizing the choice to treat this update as a delete + insert, but
>>> (as Amit Kh. pointed out to me when we were in a room together this
>>> week) that precludes using the BEFORE trigger to modify the row.
>>>
>
> I also find the current behavior with respect to triggers quite odd.
> The two points that appears odd are (a) Executing both before row
> update and delete triggers on original partition sounds quite odd.
Note that a *before* trigger gets fired *before* the update happens. The
actual update may not even happen, depending upon what the trigger
does. And then in our case, the update does not happen; not just that,
it is transformed into delete-insert. So then we should fire the
before-delete trigger.

> (b) It seems inconsistent to consider behavior for row and statement
> triggers differently

I am not sure whether we should compare row and statement triggers.
Statement triggers are anyway fired only per-statement, depending upon
whether it is update or insert or delete. It has nothing to do with
how the rows are modified.


>
>>
>> I checked the trigger behaviour in case of UPSERT. Here, when there is
>> conflict found, ExecOnConflictUpdate() is called, and then the
>> function returns immediately, which means AR INSERT trigger will not
>> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
>> and AR UPDATE triggers will be fired. So in short, when an INSERT
>> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
>> and AR UPDATE also get fired. On the same lines, it makes sense in
>> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
>> the original table, and then the BR and AR DELETE/INSERT triggers on
>> the respective tables.
>>
>
> I am not sure if it is good idea to compare it with "Insert On
> Conflict Do Update", but  even if we want that way, I think Insert On
> Conflict is consistent in statement level triggers which means it will
> fire both Insert and Update statement level triggres (as per below
> note in docs) whereas the documentation in the patch indicates that
> this patch will only fire Update statement level triggers which is
> odd
>
> Note in docs about Insert On Conflict
> "Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both
> INSERT and UPDATE statement level trigger will be fired.

I guess the reason this behaviour is kept for UPSERT is that the
statement itself suggests both insert and update.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few comments:
> 1.
> Operating directly on partition doesn't allow update to move row.
> Refer below example:
> create table t1(c1 int) partition by range(c1);
> create table t1_part_1 partition of t1 for values from (1) to (100);
> create table t1_part_2 partition of t1 for values from (100) to (200);
> insert into t1 values(generate_series(1,11));
> insert into t1 values(generate_series(110,120));
>
> postgres=# update t1_part_1 set c1=122 where c1=11;
> ERROR:  new row for relation "t1_part_1" violates partition constraint
> DETAIL:  Failing row contains (122).

Yes, as Robert said, this is expected behaviour. We move the row only
within the partition subtree that has the table used in the UPDATE as
its root. In this case, that is the leaf partition itself.

>
> 3.
> +   longer satisfy the partition constraint of the containing partition. In that
> +   case, if there is some other partition in the partition tree for which this
> +   row satisfies its partition constraint, then the row is moved to that
> +   partition. If there isn't such a partition, an error will occur.
>
> Doesn't this error case indicate that this needs to be integrated with
> Default partition patch of Rahila or that patch needs to take care
> this error case?
> Basically, if there is no matching partition, then move it to default partition.

Will have a look at this. Thanks for pointing this out.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>>> I think it does not make sense running after row triggers in case of
>>>>> row-movement. There is no update happened on that leaf partition. This
>>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>>> user needs to do some modifications to the row before getting the
>>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>>> for that. And we can't just silently skip this step only because the
>>>>> final row went into some other partition; in fact the row-movement
>>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>>> AR triggers are typically written for doing some other operation once
>>>>> it is made sure the row is actually updated. In case of row-movement,
>>>>> it is not actually updated.
>>>>
>>>> How about running the BR update triggers for the old partition and the
>>>> AR update triggers for the new partition?  It seems weird to run BR
>>>> update triggers but not AR update triggers.  Another option would be
>>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>>> emphasizing the choice to treat this update as a delete + insert, but
>>>> (as Amit Kh. pointed out to me when we were in a room together this
>>>> week) that precludes using the BEFORE trigger to modify the row.
>>>>
>>
>> I also find the current behavior with respect to triggers quite odd.
>> The two points that appears odd are (a) Executing both before row
>> update and delete triggers on original partition sounds quite odd.
> Note that *before* trigger gets fired *before* the update happens. The
> actual update may not even happen, depending upon what the trigger
> does. And then in our case, the update does not happen; not just that,
> it is transformed into delete-insert. So then we should fire
> before-delete trigger.
>

Sure, I am aware of that point, but it doesn't seem obvious that both
update and delete BR triggers get fired for the original partition.  As a
developer, it might be obvious to you that, since you have used the delete
and insert interfaces, it is okay that the corresponding BR/AR triggers
get fired; however, it is not so obvious to others, rather it appears
quite odd.  If we try to compare it with a non-partitioned update,
there also it is internally a delete and insert operation, but we
don't fire triggers for those.

>> (b) It seems inconsistent to consider behavior for row and statement
>> triggers differently
>
> I am not sure whether we should compare row and statement triggers.
> Statement triggers are anyway fired only per-statement, depending upon
> whether it is update or insert or delete. It has nothing to do with
> how the rows are modified.
>

Okay.  The broader point I was trying to convey was that the way this
patch defines the behavior of triggers doesn't sound good to me.  It
appears to me that in this thread multiple people have raised points
around trigger behavior, and you should try to consider those.  Apart
from the options Robert has suggested, another option could be that
we allow firing BR-AR update triggers for the original partition and
BR-AR insert triggers for the new partition.  In this case, one can argue
that we have not actually updated the row in the original partition,
so there is no need to fire AR update triggers, but I feel that is
what we do for a non-partitioned table update and it should be okay here
as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Few comments:
>> 1.
>> Operating directly on partition doesn't allow update to move row.
>> Refer below example:
>> create table t1(c1 int) partition by range(c1);
>> create table t1_part_1 partition of t1 for values from (1) to (100);
>> create table t1_part_2 partition of t1 for values from (100) to (200);
>> insert into t1 values(generate_series(1,11));
>> insert into t1 values(generate_series(110,120));
>>
>> postgres=# update t1_part_1 set c1=122 where c1=11;
>> ERROR:  new row for relation "t1_part_1" violates partition constraint
>> DETAIL:  Failing row contains (122).
>
> Yes, as Robert said, this is expected behaviour. We move the row only
> within the partition subtree that has the update table as its root. In
> this case, it's the leaf partition.
>

Okay, but what is the technical reason behind it?  Is it because the
current design doesn't support it or is it because of something very
fundamental to partitions?  Is it because we can't find the root partition
from a leaf partition?

+ is_partitioned_table =
+ root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+ if (is_partitioned_table)
+ ExecSetupPartitionTupleRouting(
+ root_rel,
+ /* Build WITH CHECK OPTION constraints for leaf partitions */
+ ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+ /* Build a projection for each leaf partition rel. */
+ ExecInitPartitionReturningProjection(mtstate, root_rel);
..
+ /* It's not a partitioned table after all; error out. */
+ ExecPartitionCheckEmitError(resultRelInfo, slot, estate);

When we are anyway going to give an error if the table is not a
partitioned table, then isn't it better to give it early, when we first
identify that?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Few comments:
>>> 1.
>>> Operating directly on partition doesn't allow update to move row.
>>> Refer below example:
>>> create table t1(c1 int) partition by range(c1);
>>> create table t1_part_1 partition of t1 for values from (1) to (100);
>>> create table t1_part_2 partition of t1 for values from (100) to (200);
>>> insert into t1 values(generate_series(1,11));
>>> insert into t1 values(generate_series(110,120));
>>>
>>> postgres=# update t1_part_1 set c1=122 where c1=11;
>>> ERROR:  new row for relation "t1_part_1" violates partition constraint
>>> DETAIL:  Failing row contains (122).
>>
>> Yes, as Robert said, this is expected behaviour. We move the row only
>> within the partition subtree that has the update table as its root. In
>> this case, it's the leaf partition.
>>
>
> Okay, but what is the technical reason behind it?  Is it because the
> current design doesn't support it or is it because of something very
> fundamental to partitions?
>

One plausible theory is that as SELECTs on partitions just return
the rows of that partition, the update should also behave in the same way.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 May 2017 at 08:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>> On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>>>> I think it does not make sense running after row triggers in case of
>>>>>> row-movement. There is no update happened on that leaf partition. This
>>>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>>>> user needs to do some modifications to the row before getting the
>>>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>>>> for that. And we can't just silently skip this step only because the
>>>>>> final row went into some other partition; in fact the row-movement
>>>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>>>> AR triggers are typically written for doing some other operation once
>>>>>> it is made sure the row is actually updated. In case of row-movement,
>>>>>> it is not actually updated.
>>>>>
>>>>> How about running the BR update triggers for the old partition and the
>>>>> AR update triggers for the new partition?  It seems weird to run BR
>>>>> update triggers but not AR update triggers.  Another option would be
>>>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>>>> emphasizing the choice to treat this update as a delete + insert, but
>>>>> (as Amit Kh. pointed out to me when we were in a room together this
>>>>> week) that precludes using the BEFORE trigger to modify the row.
>>>>>
>>>
>>> I also find the current behavior with respect to triggers quite odd.
>>> The two points that appears odd are (a) Executing both before row
>>> update and delete triggers on original partition sounds quite odd.
>> Note that *before* trigger gets fired *before* the update happens. The
>> actual update may not even happen, depending upon what the trigger
>> does. And then in our case, the update does not happen; not just that,
>> it is transformed into delete-insert. So then we should fire
>> before-delete trigger.
>>
>
> Sure, I am aware of that point, but it doesn't seem obvious that both
> update and delete BR triggers get fired for original partition.  As a
> developer, it might be obvious to you that as you have used delete and
> insert interface, it is okay that corresponding BR/AR triggers get
> fired, however, it is not so obvious for others, rather it appears
> quite odd.

I agree that it seems a bit odd that we are firing both update and
delete triggers on the same partition. But if you look at it from the
perspective that the update=>delete+insert is a user-aware operation,
it does not seem that odd.

> If we try to compare it with the non-partitioned update,
> there also it is internally a delete and insert operation, but we
> don't fire triggers for those.

For a non-partitioned table, the delete+insert is internal, whereas
for a partitioned table, it is completely visible to the user.

>
>>> (b) It seems inconsistent to consider behavior for row and statement
>>> triggers differently
>>
>> I am not sure whether we should compare row and statement triggers.
>> Statement triggers are anyway fired only per-statement, depending upon
>> whether it is update or insert or delete. It has nothing to do with
>> how the rows are modified.
>>
>
> Okay.  The broader point I was trying to convey was that the way this
> patch defines the behavior of triggers doesn't sound good to me.  It
> appears to me that in this thread multiple people have raised points
> around trigger behavior and you should try to consider those.

I understand that there is no single solution which will provide
completely intuitive trigger behaviour. Skipping the BR delete trigger
should be fine. But then, for consistency, we should skip the BR insert
trigger as well, the theory being that the delete+insert are not issued
by the user, so we should not fire their triggers. But I feel both should
be fired to avoid any consequences unexpected to the user who has
installed those triggers.

The only specific concern of yours is that of firing *both* update as
well as insert triggers on the same table, right ? My explanation for
this was : we have done this before for UPSERT, and we had documented
the same. We can do it here also.

>  Apart from the options, Robert has suggested, another option could be that
> we allow firing BR-AR update triggers for original partition and BR-AR
> insert triggers for the new partition.  In this case, one can argue
> that we have not actually updated the row in the original partition,
> so there is no need to fire AR update triggers,

Yes, that's what I think. If no update has happened, then the AR
update trigger should not be executed. AR triggers are only for
scenarios where it is guaranteed that the DML operation has happened
when the trigger is executed.

> but I feel that is what we do for non-partitioned table update and it should be okay here
> as well.

I don't think so. For example, if a BR trigger returns NULL, the update
does not happen, and then the AR trigger does not fire either. Do you
see any other scenario for non-partitioned tables where AR triggers
do fire when the update does not happen ?
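
For reference, a minimal sketch of what I mean (the table and function
names here are made up just for illustration) :

create table plain (a int);
insert into plain values (1);

-- a BR trigger that returns NULL cancels the operation for that row
create function skip_update() returns trigger language plpgsql as
$$ begin return null; end $$;

create trigger br_skip before update on plain
for each row execute procedure skip_update();

update plain set a = 2;
-- the BR trigger suppresses the update for the row, and consequently
-- no AR update trigger would fire for it either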


Overall, I am also open to skipping both the insert and delete BR triggers,
but I am trying to argue above that this might not be as odd as it
sounds, especially if we clearly document why we have done it this way.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 May 2017 at 10:01, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> Few comments:
>>>> 1.
>>>> Operating directly on partition doesn't allow update to move row.
>>>> Refer below example:
>>>> create table t1(c1 int) partition by range(c1);
>>>> create table t1_part_1 partition of t1 for values from (1) to (100);
>>>> create table t1_part_2 partition of t1 for values from (100) to (200);
>>>> insert into t1 values(generate_series(1,11));
>>>> insert into t1 values(generate_series(110,120));
>>>>
>>>> postgres=# update t1_part_1 set c1=122 where c1=11;
>>>> ERROR:  new row for relation "t1_part_1" violates partition constraint
>>>> DETAIL:  Failing row contains (122).
>>>
>>> Yes, as Robert said, this is expected behaviour. We move the row only
>>> within the partition subtree that has the update table as its root. In
>>> this case, it's the leaf partition.
>>>
>>
>> Okay, but what is the technical reason behind it?  Is it because the
>> current design doesn't support it or is it because of something very
>> fundamental to partitions?
No, we can do that if we decide to update some table outside the
partition subtree. The reason is more one of semantics. I think the user
who is running UPDATE on a partitioned table should not necessarily be
aware of the structure of the complete partition tree outside of the
current subtree. It is always safer to return an error instead of moving
the data outside of the subtree silently.

>>
>
> One plausible theory is that as Select's on partitions just returns
> the rows of that partition, the update should also behave in same way.

Yes, right. Similarly, even inserts fail if we try to insert data that
does not fit into the current subtree.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
>> 3.
>> +   longer satisfy the partition constraint of the containing partition. In that
>> +   case, if there is some other partition in the partition tree for which this
>> +   row satisfies its partition constraint, then the row is moved to that
>> +   partition. If there isn't such a partition, an error will occur.
>>
>> Doesn't this error case indicate that this needs to be integrated with
>> Default partition patch of Rahila or that patch needs to take care
>> this error case?
>> Basically, if there is no matching partition, then move it to default partition.
>
> Will have a look on this. Thanks for pointing this out.

I tried update row movement with both my patch and the default-partition
patch applied, and it looks like it works as expected :

1. When an update changes the partition key such that the row does
not fit into any of the non-default partitions, the row is moved to
the default partition.
2. If the row does fit into a non-default partition, the row moves
into that partition.
3. If a row from the default partition is updated such that it fits into
one of the non-default partitions, it moves into that partition. I
think we can debate whether the row should stay in the default
partition or move. I think it should be moved, since now the row has a
suitable partition. A rough example session is sketched below.
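
For reference, the kind of thing I tried looks roughly like this (I am
showing the default-partition syntax loosely, just for illustration; the
exact syntax is whatever that patch ends up with) :

create table t1 (c1 int) partition by range (c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_def partition of t1 default;

insert into t1 values (50), (500);       -- 500 lands in t1_def

update t1 set c1 = 150 where c1 = 50;
-- no non-default partition accepts 150, so the row moves into t1_def (case 1)

update t1 set c1 = 60 where c1 = 500;
-- 60 fits t1_part_1, so the row moves out of t1_def into t1_part_1 (case 3)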



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Fri, Feb 24, 2017 at 3:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> It is of course very good that we have something ready for this
>> release and can make a choice of what to do.
>>
>> Thoughts
>>
>> 1. Reuse the tuple state HEAP_MOVED_OFF which IIRC represent exactly
>> almost exactly the same thing. An UPDATE which gets to a
>> HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
>> metadata, or I might be persuaded that in-this-release it is
>> acceptable to fail when this occurs with an ERROR and a retryable
>> SQLCODE, since the UPDATE will succeed on next execution.
>
> I've got my doubts about whether we can make that bit work that way,
> considering that we still support pg_upgrade (possibly in multiple
> steps) from old releases that had VACUUM FULL.  We really ought to put
> some work into reclaiming those old bits, but there's probably no time
> for that in v10.
>

I agree with you that it might not be straightforward to make it work,
but now that the earliest it can go in is v11, do we want to try doing
something other than just documenting it?  What I could read from this
e-mail thread is that you are leaning towards just documenting it
for the first cut of this feature. However, both Greg and Simon are of
the opinion that we should do something about this, and even the patch
author (Amit Khandekar) has shown some inclination to do something about
this point (return an error to the user in some way), so I think we can't
ignore this point.

I think now that we have some more time, it is better to try something
based on a couple of ideas floating in this thread to address this
point and see if we can come up with something doable without a big
architecture change.

What is your take on this point now?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Fri, May 12, 2017 at 10:49 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 12 May 2017 at 08:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
>> If we try to compare it with the non-partitioned update,
>> there also it is internally a delete and insert operation, but we
>> don't fire triggers for those.
>
> For a non-partitioned table, the delete+insert is internal, whereas
> for partitioned table, it is completely visible to the user.
>

If the user has executed an update on the root table, then it is
transparent.  I think we can consider it user-visible only if there
is some user-visible syntax like "Update ... Move Row If
Constraint Not Satisfied".

>>
>>>> (b) It seems inconsistent to consider behavior for row and statement
>>>> triggers differently
>>>
>>> I am not sure whether we should compare row and statement triggers.
>>> Statement triggers are anyway fired only per-statement, depending upon
>>> whether it is update or insert or delete. It has nothing to do with
>>> how the rows are modified.
>>>
>>
>> Okay.  The broader point I was trying to convey was that the way this
>> patch defines the behavior of triggers doesn't sound good to me.  It
>> appears to me that in this thread multiple people have raised points
>> around trigger behavior and you should try to consider those.
>
> I understand that there is no single solution which will provide
> completely intuitive trigger behaviour. Skipping BR delete trigger
> should be fine. But then for consistency, we should skip BR insert
> trigger as well, the theory being that the delete+insert are not fired
> by the user so we should not fire them. But I feel both should be
> fired to avoid any consequences unexpected to the user who has
> installed those triggers.
>
> The only specific concern of yours is that of firing *both* update as
> well as insert triggers on the same table, right ? My explanation for
> this was : we have done this before for UPSERT, and we had documented
> the same. We can do it here also.
>
>>  Apart from the options, Robert has suggested, another option could be that
>> we allow firing BR-AR update triggers for original partition and BR-AR
>> insert triggers for the new partition.  In this case, one can argue
>> that we have not actually updated the row in the original partition,
>> so there is no need to fire AR update triggers,
>
> Yes that's what I think. If there is no update happened, then AR
> update trigger should not be executed. AR triggers are only for
> scenarios where it is guaranteed that the DML operation has happened
> when the trigger is being executed.
>
>> but I feel that is what we do for non-partitioned table update and it should be okay here
>> as well.
>
> I don't think so. For e.g. if a BR trigger returns NULL, the update
> does not happen, and then the AR trigger does not fire as well. Do you
> see any other scenarios for non-partitioned tables, where AR triggers
> do fire when the update does not happen ?
>

No, but here also it can be considered as an update of the original partition.

>
> Overall, I am also open to skipping both insert+delete BR trigger,
>

I think it might be better to summarize all the options discussed
including what the patch has and see what most people consider as
sensible.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 May 2017 at 14:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think it might be better to summarize all the options discussed
> including what the patch has and see what most people consider as
> sensible.

Yes, makes sense. Here are the options that were discussed so far for
ROW triggers :

Option 1 : (the patch follows this option)
----------
BR Update trigger for source partition.
BR,AR Delete trigger for source partition.
BR,AR Insert trigger for destination partition.
No AR Update trigger.

Rationale :

The BR Update trigger should be fired because that trigger can modify
the row, and that can itself result in a partition key update even though
the UPDATE statement is not updating the partition key.

Also, fire the delete/insert triggers on the respective partitions since
the rows are about to be deleted/inserted. The AR update trigger should
not be fired because that requires an actual update to have happened.



Option 2
----------
BR Update trigger for source partition.
AR Update trigger on destination partition.
No insert/delete triggers.

Rationale :

Since it's an UPDATE statement, only update triggers should be fired.
The update ends up moving the row into another partition, so AR Update
trigger should be fired on this partition rather than the original
partition.

Option 3
--------

BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.



Option 4
--------

BR-AR update triggers for source partition
BR-AR insert triggers for destination partition

Rationale :
Since it is an update statement, both the BR and AR UPDATE triggers should
be fired on the original partition.
Since the update is converted to delete+insert, the corresponding triggers
should be fired, but since we are already firing the UPDATE triggers on the
original partition, skip the delete triggers; otherwise both UPDATE and
DELETE triggers would get fired on the same partition.


----------------

For statement triggers, I think it should be based on the
documentation recently checked in for partitions in general.

+    A statement that targets a parent table in an inheritance or partitioning
+    hierarchy does not cause the statement-level triggers of affected child
+    tables to be fired; only the parent table's statement-level triggers are
+    fired.  However, row-level triggers of any affected child tables will be
+    fired.

Based on that, for row movement as well, the statement triggers should be
fired only for the table referred to in the UPDATE statement, and not for
any child tables, or for any partitions to which the rows were moved. The
doc in this row-movement patch also matches this behaviour.
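
So, for instance, with something like this (log_stmt() is just an assumed
no-op statement-trigger function; all the names are made up for
illustration) :

create function log_stmt() returns trigger language plpgsql as
$$ begin raise notice 'statement-level % on %', TG_OP, TG_TABLE_NAME;
   return null; end $$;

create table p (a int) partition by range (a);
create table p1 partition of p for values from (1) to (100);
create table p2 partition of p for values from (100) to (200);

create trigger p_upd after update on p for each statement execute procedure log_stmt();
create trigger p1_del after delete on p1 for each statement execute procedure log_stmt();
create trigger p2_ins after insert on p2 for each statement execute procedure log_stmt();

insert into p values (50);
update p set a = 150 where a = 50;
-- per the behaviour described above, only p_upd fires; p1_del and p2_ins
-- do not fire even though the row physically moved from p1 to p2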



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, May 12, 2017 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I agree with you that it might not be straightforward to make it work,
> but now that earliest it can go is v11, do we want to try doing
> something other than just documenting it.  What I could read from this
> e-mail thread is that you are intending towards just documenting it
> for the first cut of this feature. However, both Greg and Simon are of
> opinion that we should do something about this and even patch Author
> (Amit Khandekar) has shown some inclination to do something about this
> point (return error to the user in some way), so I think we can't
> ignore this point.
>
> I think now that we have some more time, it is better to try something
> based on a couple of ideas floating in this thread to address this
> point and see if we can come up with something doable without a big
> architecture change.
>
> What is your take on this point now?

I still don't think it's worth spending a bit on this, especially not
with WARM probably gobbling up multiple bits.  Reclaiming the bits
seems like a good idea, but spending one on this still seems to me
like it's probably not the best use of our increasingly-limited supply
of infomask bits.  Now, Simon and Greg may still feel otherwise, of
course.

I could get behind providing an option to turn this behavior on and
off at the level of the partitioned table.  That would use a reloption
rather than an infomask bit, so no scarce resource is being consumed.
I suspect that most people don't update the partition keys at all (so
they don't care either way) and the ones who do are probably either
depending on EPQ (in which case they most likely want to just disallow
all UPDATE-row-movement) or not (in which case they again don't care).
If I understand correctly, the only people who will benefit from
consuming an infomask bit are the people who update their partition
keys AND depend on EPQ BUT only for non-key updates AND need the
system to make sure that they don't accidentally rely on it for the
case of an EPQ update.  That seems (to me, anyway) like it's got to be
a really small percentage of actual users, but I just work here.
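
E.g. something along these lines (the reloption name is purely
hypothetical, just to illustrate the idea) :

alter table ptab set (allow_row_movement = off);
update ptab set a = 150 where a = 50;
-- with the option off, this would raise an error instead of moving the row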

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Option 3
> --------
>
> BR, AR delete triggers on source partition
> BR, AR insert triggers on destination partition.
>
> Rationale :
> Since the update is converted to delete+insert, just skip the update
> triggers completely.

+1 to option3
Generally, BR triggers are used for updating the ROW value and AR
triggers to VALIDATE the row or to modify some other tables.  So it
seems that we can fire the triggers according to the actual operation
happening at the partition level.

For the source partition, it's only a delete operation (no update
happened), so we fire delete triggers, and for the destination it's only
an insert operation, so we fire only insert triggers.  That will keep
things simple.  And, it will also be in sync with the actual partition
level delete/insert operations.

We may argue that the user might have declared only update triggers and,
as he has executed an update operation, he may expect those triggers to
get fired.  But, I think this behaviour can be documented with the
rationale that if the user is updating the partition key then he
must be ready with the delete/insert triggers as well; he cannot rely
only upon update-level triggers.

Earlier I thought that option1 is better, but later I realised that it
can complicate the situation, as we first fire the BR update and then the
BR delete, which can change the row multiple times, and defining such
behaviour can be complicated.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Option 3
>> --------
>>
>> BR, AR delete triggers on source partition
>> BR, AR insert triggers on destination partition.
>>
>> Rationale :
>> Since the update is converted to delete+insert, just skip the update
>> triggers completely.
>
> +1 to option3
>
..
> Earlier I thought that option1 is better but later I think that this
> can complicate the situation as we are firing first BR update then BR
> delete and can change the row multiple time and defining such
> behaviour can be complicated.
>

If we have to go by this theory, then the option you have preferred
will still execute BR triggers for both delete and insert, so input
row can still be changed twice.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Mon, May 15, 2017 at 5:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 12, 2017 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I agree with you that it might not be straightforward to make it work,
>> but now that earliest it can go is v11, do we want to try doing
>> something other than just documenting it.  What I could read from this
>> e-mail thread is that you are intending towards just documenting it
>> for the first cut of this feature. However, both Greg and Simon are of
>> opinion that we should do something about this and even patch Author
>> (Amit Khandekar) has shown some inclination to do something about this
>> point (return error to the user in some way), so I think we can't
>> ignore this point.
>>
>> I think now that we have some more time, it is better to try something
>> based on a couple of ideas floating in this thread to address this
>> point and see if we can come up with something doable without a big
>> architecture change.
>>
>> What is your take on this point now?
>
> I still don't think it's worth spending a bit on this, especially not
> with WARM probably gobbling up multiple bits.  Reclaiming the bits
> seems like a good idea, but spending one on this still seems to me
> like it's probably not the best use of our increasingly-limited supply
> of infomask bits.
>

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

>  Now, Simon and Greg may still feel otherwise, of
> course.
>
> I could get behind providing an option to turn this behavior on and
> off at the level of the partitioned table.
>

Sure that sounds like a viable option and we can set the default value
as false.  However, it might be better if we can detect the same
internally without big changes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Earlier I thought that option1 is better but later I think that this
>> can complicate the situation as we are firing first BR update then BR
>> delete and can change the row multiple time and defining such
>> behaviour can be complicated.
>>
>
> If we have to go by this theory, then the option you have preferred
> will still execute BR triggers for both delete and insert, so input
> row can still be changed twice.

Yeah, right; as per my theory above, option 3 has the same problem.

But after putting in some more thought, I realised that the row can be
changed only by a "Before Update" or a "Before Insert" trigger. Correct
me if I am assuming something wrong.

So now option 3 again makes more sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think we can do this even without using an additional infomask bit.
> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
> indicate such an update.

Hmm.  How would that work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Rushabh Lathia
Date:


On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Option 3
> --------
>
> BR, AR delete triggers on source partition
> BR, AR insert triggers on destination partition.
>
> Rationale :
> Since the update is converted to delete+insert, just skip the update
> triggers completely.

+1 to option3
Generally, BR triggers are used for updating the ROW value and AR
triggers to VALIDATE the row or to modify some other tables.  So it
seems that we can fire the triggers what is actual operation is
happening at the partition level.

For source partition, it's only the delete operation (no update
happened) so we fire delete triggers and for the destination only
insert operations so fire only inserts triggers.  That will keep the
things simple.  And, it will also be in sync with the actual partition
level delete/insert operations.

We may argue that user might have declared only update triggers and as
he has executed the update operation he may expect those triggers to
get fired.  But, I think this behaviour can be documented with the
proper logic that if the user is updating the partition key then he
must be ready with the Delete/Insert triggers also, he can not rely
only upon update level triggers.


Right, that is my concern as well. The user might have declared only update
triggers, and when executing an UPDATE would expect them to get called - but
with option 3 that does not happen.

In terms of consistency, option 1 looks better. It does the same as what
has been implemented for UPSERT - so the user might already be aware of
that trigger behaviour. Plus, if we document the behaviour then it
sounds correct -

- Original command was UPDATE, so BR update
- Later it is found to be a ROW movement - so BR delete followed by AR delete
- Then insert into the new partition - so BR INSERT followed by AR insert.

But again I am not quite sure how good it will be to compare the partition
behaviour with the UPSERT.


 
Earlier I thought that option1 is better but later I think that this
can complicate the situation as we are firing first BR update then BR
delete and can change the row multiple time and defining such
behaviour can be complicated.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com





--
Rushabh Lathia

Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
 On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think we can do this even without using an additional infomask bit.
>> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> indicate such an update.
>
> Hmm.  How would that work?
>

We can pass a flag, say row_moved (or require_row_movement), to
heap_delete which will in turn set InvalidBlockId in ctid instead of
setting it to self. Then ExecUpdate needs to check for the same and
return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated).  Can you explain what difficulty you are
envisioning?
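
To illustrate the idea (just a rough sketch, not code from any patch;
the row_moved parameter and its placement are assumptions), the tail of
heap_delete() could do something like:

    /* row_moved is the hypothetical new parameter discussed above */
    if (row_moved)
    {
        /* mark the ctid so readers can tell the row moved to another partition */
        ItemPointerSetBlockNumber(&tp.t_data->t_ctid, InvalidBlockNumber);
    }
    else
    {
        /* Make sure there is no forward chain link in t_ctid */
        tp.t_data->t_ctid = tp.t_self;
    }

and the concurrent session, on getting HeapTupleUpdated back, would
check the returned ctid for an invalid block number and raise an error
instead of treating it as an ordinary delete.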

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 17 May 2017 at 17:29, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
>
>
> On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
>> wrote:
>> > Option 3
>> > --------
>> >
>> > BR, AR delete triggers on source partition
>> > BR, AR insert triggers on destination partition.
>> >
>> > Rationale :
>> > Since the update is converted to delete+insert, just skip the update
>> > triggers completely.
>>
>> +1 to option3
>> Generally, BR triggers are used for updating the ROW value and AR
>> triggers to VALIDATE the row or to modify some other tables.  So it
>> seems that we can fire the triggers what is actual operation is
>> happening at the partition level.
>>
>> For source partition, it's only the delete operation (no update
>> happened) so we fire delete triggers and for the destination only
>> insert operations so fire only inserts triggers.  That will keep the
>> things simple.  And, it will also be in sync with the actual partition
>> level delete/insert operations.
>>
>> We may argue that user might have declared only update triggers and as
>> he has executed the update operation he may expect those triggers to
>> get fired.  But, I think this behaviour can be documented with the
>> proper logic that if the user is updating the partition key then he
>> must be ready with the Delete/Insert triggers also, he can not rely
>> only upon update level triggers.
>>
>
> Right, that is even my concern. That user might had declared only update
> triggers and when user executing UPDATE its expect it to get call - but
> with option 3 its not happening.

Yes, that's the issue with option 3. A user wants to make sure update
triggers run, and here we are skipping the BEFORE UPDATE triggers. And
the user might even be modifying the row in those triggers.

Now regarding the AR update triggers .... The user might be more
concerned with the non-partition-key columns, and an UPDATE of the
partition key typically would update only the partition key and not
the other columns. So for the typical case, it makes sense to skip the
AR UPDATE trigger. But if the UPDATE contains both partition key and
other column updates, it makes sense to fire the AR UPDATE trigger.
One thing we can do is restrict an UPDATE from having both partition
key and non-partition key column updates. That way we could always
skip the AR update triggers for row-movement updates; or else maybe
fire AR UPDATE triggers *only* if they are created using
"AFTER UPDATE OF <column_name>" and the column is the partition key.

Between skipping delete-insert triggers versus skipping update
triggers, I would go for skipping delete-insert triggers. I think we
cannot skip BR update triggers because that would be a correctness
issue.

From the user's perspective, I think the user would like to install a
trigger that fires if any of the child tables gets modified. But
because there is no provision to install such a common trigger, the
user has to install the same trigger on every child table. In that
sense, it might not matter whether we fire the AR UPDATE trigger on the
old partition or the new partition.

>
> In term of consistency option 1 looks better. Its doing the same what
> its been implemented for the UPSERT - so that user might be already
> aware of trigger behaviour. Plus if we document the behaviour then it
> sounds correct -
>
> - Original command was UPDATE so BR update
> - Later found that its ROW movement - so BR delete followed by AR delete
> - Then Insert in new partition - so BR INSERT followed by AR Insert.
>
> But again I am not quite sure how good it will be to compare the partition
> behaviour with the UPSERT.
>
>
>
>>
>> Earlier I thought that option1 is better but later I think that this
>> can complicate the situation as we are firing first BR update then BR
>> delete and can change the row multiple time and defining such
>> behaviour can be complicated.
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
>>
>>
>
>
>
>
> --
> Rushabh Lathia



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Earlier I thought that option1 is better but later I think that this
>>> can complicate the situation as we are firing first BR update then BR
>>> delete and can change the row multiple time and defining such
>>> behaviour can be complicated.
>>>
>>
>> If we have to go by this theory, then the option you have preferred
>> will still execute BR triggers for both delete and insert, so input
>> row can still be changed twice.
>
> Yeah, right as per my theory above Option3 have the same problem.
>
> But after putting some more thought I realised that only for "Before
> Update" or the "Before Insert" trigger row can be changed.
>

Before Row Delete triggers can suppress the delete operation itself
which is kind of unintended in this case.  I think without the user
being aware it doesn't seem advisable to execute multiple BR triggers.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 18 May 2017 at 16:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> Earlier I thought that option1 is better but later I think that this
>>>> can complicate the situation as we are firing first BR update then BR
>>>> delete and can change the row multiple time and defining such
>>>> behaviour can be complicated.
>>>>
>>>
>>> If we have to go by this theory, then the option you have preferred
>>> will still execute BR triggers for both delete and insert, so input
>>> row can still be changed twice.
>>
>> Yeah, right as per my theory above Option3 have the same problem.
>>
>> But after putting some more thought I realised that only for "Before
>> Update" or the "Before Insert" trigger row can be changed.
>>
>
> Before Row Delete triggers can suppress the delete operation itself
> which is kind of unintended in this case.  I think without the user
> being aware it doesn't seem advisable to execute multiple BR triggers.

By now, the majority of opinions are against two triggers getting
fired on a single update. Amit, do you consider option 2 a valid
option? That is, fire only UPDATE triggers: BR on the source
partition, and AR on the destination partition. Do you agree that
firing the BR update trigger is essential, since it can modify the row
and even prevent the update from happening?

Also, since a user does not have a provision to install a common
UPDATE row trigger, (s)he installs it on each of the leaf partitions.
And then, when an update causes row movement, option 3 would end
up not firing update triggers on any of the partitions. So I prefer
option 2 over option 3, i.e. make sure to fire BR and AR update
triggers. Actually, option 2 is what Robert had proposed in the
beginning.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 May 2017 at 09:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> + is_partitioned_table =
> + root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
> +
> + if (is_partitioned_table)
> + ExecSetupPartitionTupleRouting(
> + root_rel,
> + /* Build WITH CHECK OPTION constraints for leaf partitions */
> + ExecInitPartitionWithCheckOptions(mtstate, root_rel);
> + /* Build a projection for each leaf partition rel. */
> + ExecInitPartitionReturningProjection(mtstate, root_rel);
> ..
> + /* It's not a partitioned table after all; error out. */
> + ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
>
> When we are anyway going to give error if table is not a partitioned
> table, then isn't it better to give it early when we first identify
> that.

Yeah that's right, fixed. Moved the partitioned table check early.
This also showed that there is no need for is_partitioned_table
variable. Accordingly adjusted the code.


> -
> +static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
> +  Relation root_rel);
> Spurious line delete.

Done.

Also rebased the patch over latest code.

Attached v8 patch.


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Wed, May 24, 2017 at 2:47 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 18 May 2017 at 16:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>> Earlier I thought that option1 is better but later I think that this
>>>>> can complicate the situation as we are firing first BR update then BR
>>>>> delete and can change the row multiple time and defining such
>>>>> behaviour can be complicated.
>>>>>
>>>>
>>>> If we have to go by this theory, then the option you have preferred
>>>> will still execute BR triggers for both delete and insert, so input
>>>> row can still be changed twice.
>>>
>>> Yeah, right as per my theory above Option3 have the same problem.
>>>
>>> But after putting some more thought I realised that only for "Before
>>> Update" or the "Before Insert" trigger row can be changed.
>>>
>>
>> Before Row Delete triggers can suppress the delete operation itself
>> which is kind of unintended in this case.  I think without the user
>> being aware it doesn't seem advisable to execute multiple BR triggers.
>
> By now, majority of the opinions have shown that they do not favour
> two triggers getting fired on a single update. Amit, do you consider
> option 2 as a valid option ?
>

Sounds sensible to me.

> That is, fire only UPDATE triggers. BR on
> source partition, and AR on destination partition. Do you agree that
> firing BR update trigger is essential since it can modify the row and
> even prevent the update from happening ?
>

Agreed.

Apart from the above, there is one open issue [1] related to generating
an error for a concurrent delete of a row, for which I have mentioned a
possible way of getting it done. Do you want to try that option and see
if you face any issue in making progress along those lines?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, May 24, 2017 at 2:47 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>
>> By now, majority of the opinions have shown that they do not favour
>> two triggers getting fired on a single update. Amit, do you consider
>> option 2 as a valid option ?
>>
>
> Sounds sensible to me.
>
>> That is, fire only UPDATE triggers. BR on
>> source partition, and AR on destination partition. Do you agree that
>> firing BR update trigger is essential since it can modify the row and
>> even prevent the update from happening ?
>>
>
> Agreed.
>
> Apart from above, there is one open issue [1]
>

Forgot to mention the link, doing it now.

[1] - https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 24 May 2017 at 20:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Apart from above, there is one open issue [1]
>>
>
> Forget to mention the link, doing it now.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com

I am not sure right now whether making the t_ctid of such tuples
invalid would be the right option, especially because I think there may
already be some other meaning when t_ctid is not valid. But maybe we
can check this more.

If we decide to error out in some way, I would be inclined towards
re-using some combination of infomask bits (like HEAP_MOVED_OFF, as
suggested upthread) rather than using an invalid t_ctid value.

But I think, we can also take step-by-step approach even for v11. If
we agree that it is ok to silently do the updates as long as we
document the behaviour, we can go ahead and do this, and then as a
second step, implement error handling as a separate patch. If that
patch does not materialize, we at least have the current behaviour
documented.

Ideally, I think we would have liked to somehow make the row-movement
UPDATE itself abort if it finds any normal updates waiting for it to
finish, rather than making the normal updates fail because a
row-movement occurred. But I think we will have to live with it.



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Mon, May 29, 2017 at 11:20 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 24 May 2017 at 20:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Apart from above, there is one open issue [1]
>>>
>>
>> Forget to mention the link, doing it now.
>>
>> [1] - https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
>
> I am not sure right now whether making the t_ctid of such tuples to
> Invalid would be a right option, especially because I think there can
> be already some other meaning if t_ctid is not valid.
>

AFAIK, this is used to point to the current tuple itself or to a newer
version of the tuple, or is used in speculative inserts (refer to the
comments above HeapTupleHeaderData in htup_details.h).  Can you mention
what other meaning you are referring to here for InvalidBlockId in
t_ctid?

> But may be we
> can check this more.
>
> If we decide to error out using some way, I would be inclined towards
> considering re-using some combinations of infomask bits (like
> HEAP_MOVED_OFF as suggested upthread) rather than using invalid t_ctid
> value.
>
> But I think, we can also take step-by-step approach even for v11. If
> we agree that it is ok to silently do the updates as long as we
> document the behaviour, we can go ahead and do this, and then as a
> second step, implement error handling as a separate patch. If that
> patch does not materialize, we at least have the current behaviour
> documented.
>

I think that is a sensible approach if we find that the second step
involves big or complicated changes.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Mon, May 29, 2017 at 5:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> But I think, we can also take step-by-step approach even for v11. If
>> we agree that it is ok to silently do the updates as long as we
>> document the behaviour, we can go ahead and do this, and then as a
>> second step, implement error handling as a separate patch. If that
>> patch does not materialize, we at least have the current behaviour
>> documented.
>
> I think that is sensible approach if we find the second step involves
> big or complicated changes.

I think it is definitely a good idea to separate the two patches.
UPDATE tuple routing without any special handling for the EPQ issue is
just a partitioning feature.  The proposed handling for the EPQ issue
is an *on-disk format change*.  That turns a patch which is subject
only to routine bugs into one which can eat your data permanently --
so having the "can eat your data permanently" separated out for both
review and commit seems only prudent.  For me, it's not a matter of
which patch is big or complicated, but rather a matter of one of them
being a whole lot riskier than the other.  Even UPDATE tuple routing
could mess things up pretty seriously if we end up with tuples in the
wrong partition, of course, but the other thing is still worse.

In terms of a development plan, I think we would need to have both
patches before either could be committed.  I believe that everyone
other than me who has expressed an opinion on this issue has said that
it's unacceptable to just ignore the issue, so it doesn't sound like
there will be much appetite for having #1 go into the tree without #2.
I'm still really concerned about that approach because we do not have
very much bit space left and WARM wants to use quite a bit of it.  I
think it's quite possible that we'll be sad in the future if we find
that we can't implement feature XYZ because of the bit-space consumed
by this feature.  However, I don't have the only vote here and I'm not
going to try to shove this into the tree over multiple objections
(unless there are a lot more votes the other way, but so far there's
no sign of that).

Greg/Amit's idea of using the CTID field rather than an infomask bit
seems like a possibly promising approach.  Not everything that needs
bit-space can use the CTID field, so using it is a little less likely
to conflict with something else we want to do in the future than using
a precious infomask bit.  However, I'm worried about this:
    /* Make sure there is no forward chain link in t_ctid */
    tp.t_data->t_ctid = tp.t_self;

The comment does not say *why* we need to make sure that there is no
forward chain link, but it implies that some code somewhere in the
system does or at one time did depend on no forward link existing.
Any such code that still exists will need to be updated.  Anybody know
what code that might be, exactly?

The other potential issue I see here is that I know the WARM code also
tries to use the bit-space in the CTID field; in particular, it uses
the CTID field of the last tuple in a HOT chain to point back to the
root of the chain.  That seems like it could conflict with the usage
proposed here, but I'm not totally sure.  Has anyone investigated this
issue?

Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this.  I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people.  However, there seems to be a
discrepancy between the approach that got the most votes and the one
that is implemented by the v8 patch, so that seems like something to
fix.

For what it's worth, in the future, I imagine that we might allow
adding a trigger to a partitioned table and having that cascade down
to all descendant tables.  In that world, firing the BR UPDATE trigger
for the old partition and the AR UPDATE trigger for the new partition
will look a lot like the behavior the user would expect on an
unpartitioned table, which could be viewed as a good thing.  On the
other hand, it's still going to be a DELETE+INSERT under the hood for
the foreseeable future, so firing the delete triggers and then the
insert triggers is also defensible.  Is there any big difference
between these approaches in terms of how much code is required to make
this work?

In terms of the approach taken by the patch itself, it seems
surprising to me that the patch only calls
ExecSetupPartitionTupleRouting when an update fails the partition
constraint.  Note that in the insert case, we call that function at
the start of execution; calling it in the middle seems to involve
additional hazards; for example, is it really safe to add additional
ResultRelInfos midway through the operation?  Is it safe to take more
locks midway through the operation? It seems like it might be a lot
safer to decide at the beginning of the operation whether this is
needed -- we can skip it if none of the columns involved in the
partition key (or partition key expressions) are mentioned in the
update.  (There's also the issue of triggers, but I'm not sure that
it's sensible to allow a trigger on an individual partition to reroute
an update to another partition; what if we get an infinite loop?)
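
As a rough sketch of that up-front test (the helper and field names
here are just illustrative, not from the patch), something along these
lines could run once at executor startup:

    /* Does this UPDATE mention any partition key column of the root rel? */
    Bitmapset   *updatedCols = GetUpdatedColumns(resultRelInfo, estate);
    PartitionKey key = RelationGetPartitionKey(rootRelation);
    bool         key_may_change = false;
    int          i;

    for (i = 0; i < key->partnatts; i++)
    {
        AttrNumber  attno = key->partattrs[i];

        /* attno == 0 means an expression column; assume it may change */
        if (attno == InvalidAttrNumber ||
            bms_is_member(attno - FirstLowInvalidHeapAttributeNumber,
                          updatedCols))
        {
            key_may_change = true;
            break;
        }
    }
    /* set up UPDATE tuple routing only if key_may_change */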

+            if (concurrently_deleted)
+                return NULL;

I don't understand the motivation for this change, and there are no
comments explaining it that I can see.

Perhaps the concurrency-related (i.e. EPQ) behavior here could be
tested via the isolation tester.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 1 June 2017 at 03:25, Robert Haas <robertmhaas@gmail.com> wrote:
> Greg/Amit's idea of using the CTID field rather than an infomask bit
> seems like a possibly promising approach.  Not everything that needs
> bit-space can use the CTID field, so using it is a little less likely
> to conflict with something else we want to do in the future than using
> a precious infomask bit.  However, I'm worried about this:
>
>     /* Make sure there is no forward chain link in t_ctid */
>     tp.t_data->t_ctid = tp.t_self;
>
> The comment does not say *why* we need to make sure that there is no
> forward chain link, but it implies that some code somewhere in the
> system does or at one time did depend on no forward link existing.
> Any such code that still exists will need to be updated.  Anybody know
> what code that might be, exactly?

I am going to take an overall look at this approach, and at code
elsewhere that might be assuming that t_ctid cannot be invalid.

> Regarding the trigger issue, I can't claim to have a terribly strong
> opinion on this.  I think that practically anything we do here might
> upset somebody, but probably any halfway-reasonable thing we choose to
> do will be OK for most people.  However, there seems to be a
> discrepancy between the approach that got the most votes and the one
> that is implemented by the v8 patch, so that seems like something to
> fix.

Yes, I have started working on updating the patch to use that approach
(BR and AR update triggers on the source and destination partition
respectively, instead of delete+insert). The approach taken by the
patch (BR update + delete+insert triggers) didn't require any changes
in the way ExecDelete() and ExecInsert() were called. Now we would
need to skip the delete/insert triggers, so some flags need to be
passed to these functions, or else we need stripped-down versions of
ExecDelete() and ExecInsert() which don't do other things like
RETURNING handling and firing triggers.

>
> For what it's worth, in the future, I imagine that we might allow
> adding a trigger to a partitioned table and having that cascade down
> to all descendant tables.  In that world, firing the BR UPDATE trigger
> for the old partition and the AR UPDATE trigger for the new partition
> will look a lot like the behavior the user would expect on an
> unpartitioned table, which could be viewed as a good thing.  On the
> other hand, it's still going to be a DELETE+INSERT under the hood for
> the foreseeable future, so firing the delete triggers and then the
> insert triggers is also defensible.

Ok, I was assuming that there wouldn't be any plans to support triggers
on a partitioned table, but yes, I had imagined how the behaviour
would be in that world. Currently, users who want to have triggers on
a table that happens to be a partitioned table have to install the
same trigger on each of the leaf partitions, since there is no other
choice. But we would never know whether a trigger on a leaf partition
was actually meant to be specifically on that individual partition, or
whether it was actually meant to be a trigger on the root partitioned
table. Hence the difficulty of deciding the right behaviour in the case
of triggers with row movement.

If we have an AR UPDATE trigger on the root table, then during row
movement it does not matter whether we fire the trigger on the source
or the destination, because it is the same single trigger cascaded onto
both partitions. If there is a trigger installed specifically on a leaf
partition, then we know that it should not be fired on other
partitions, since it is made specifically for this one. And the same
applies to delete and insert triggers: if installed on the parent,
don't involve them in row movement; only fire them if installed on leaf
partitions (regardless of whether it was an internally generated
delete+insert due to row movement). Similarly we can think about BR
triggers.

Of course, DBAs should be aware of triggers that are already
installed on the table's ancestors before installing a new one on a
child table.

Overall, it becomes much clearer what to do if we decide to allow
triggers on partitioned tables.

> Is there any big difference between these appraoches in terms
> of how much code is required to make this work?

You mean if we allow triggers on partitioned tables? I think we would
have to keep some flag in the trigger data (or somewhere else)
indicating that the trigger actually belongs to an upper partitioned
table, and so, for delete+insert, not fire such a trigger. Other than
that, we don't have to decide in any special way which trigger to fire
on which table.

>
> In terms of the approach taken by the patch itself, it seems
> surprising to me that the patch only calls
> ExecSetupPartitionTupleRouting when an update fails the partition
> constraint.  Note that in the insert case, we call that function at
> the start of execution;

> calling it in the middle seems to involve additional hazards;
> for example, is it really safe to add additional
> ResultRelInfos midway through the operation?

I thought since the additional ResultRelInfos go into
mtstate->mt_partitions which is independent of
estate->es_result_relations, that should be safe.

> Is it safe to take more locks midway through the operation?

I can imagine some rows already updated, when other tasks like ALTER
TABLE or CREATE INDEX happen on other partitions which are still
unlocked, and then for row movement we try to lock these other
partitions and wait for the DDL tasks to complete. But I didn't see
any particular issues with that. But correct me if you suspect a
possible issue. One issue can be if we were able to modify the table
attributes, but I believe we cannot do that for inherited columns.

> It seems like it might be a lot
> safer to decide at the beginning of the operation whether this is
> needed -- we can skip it if none of the columns involved in the
> partition key (or partition key expressions) are mentioned in the
> update.
> (There's also the issue of triggers,

The reason I thought it cannot be done at the start of execution
is that even if we know that the update is not modifying the partition
key column, we are not certain that the final NEW row has its
partition key column unchanged, because of triggers. I understand it
might be weird for a user to want to modify a partition key value in a
trigger, but if a user does that, it will result in a crash because we
won't have the partition routing set up, thinking that there is no
partition key column in the UPDATE.

And we also cannot unconditionally set up the partition routing on all
updates, for performance reasons.

> I'm not sure that it's sensible to allow a trigger on an
> individual partition to reroute an update to another partition
> what if we get an infinite loop?)

You mean, if the other table has another trigger that will again route
to the original partition ? But this infinite loop problem could occur
even for 2 normal tables ?

>
> +            if (concurrently_deleted)
> +                return NULL;
>
> I don't understand the motivation for this change, and there are no
> comments explaining it that I can see.

Yeah, comments, I think, are missing. I thought they were there in
ExecDelete(), but they are not.
If a concurrent delete has already deleted the row, we should not
bother about moving the row; hence the above code.


> Perhaps the concurrency-related (i.e. EPQ) behavior here could be
> tested via the isolation tester.
Will check.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Regarding the trigger issue, I can't claim to have a terribly strong
>> opinion on this.  I think that practically anything we do here might
>> upset somebody, but probably any halfway-reasonable thing we choose to
>> do will be OK for most people.  However, there seems to be a
>> discrepancy between the approach that got the most votes and the one
>> that is implemented by the v8 patch, so that seems like something to
>> fix.
>
> Yes, I have started working on updating the patch to use that approach
> (BR and AR update triggers on source and destination partition
> respectively, instead of delete+insert) The approach taken by the
> patch (BR update + delete+insert triggers) didn't require any changes
> in the way ExecDelete() and ExecInsert() were called. Now we would
> require to skip the delete/insert triggers, so some flags need to be
> passed to these functions, or else have stripped down versions of
> ExecDelete() and ExecInsert() which don't do other things like
> RETURNING handling and firing triggers.

See, that strikes me as a pretty good argument for firing the
DELETE+INSERT triggers...

I'm not wedded to that approach, but "what makes the code simplest?"
is not a bad tiebreak, other things being equal.

>> In terms of the approach taken by the patch itself, it seems
>> surprising to me that the patch only calls
>> ExecSetupPartitionTupleRouting when an update fails the partition
>> constraint.  Note that in the insert case, we call that function at
>> the start of execution;
>
>> calling it in the middle seems to involve additional hazards;
>> for example, is it really safe to add additional
>> ResultRelInfos midway through the operation?
>
> I thought since the additional ResultRelInfos go into
> mtstate->mt_partitions which is independent of
> estate->es_result_relations, that should be safe.

I don't know.  That sounds scary to me, but it might be OK.  Probably
needs more study.

>> Is it safe to take more locks midway through the operation?
>
> I can imagine some rows already updated, when other tasks like ALTER
> TABLE or CREATE INDEX happen on other partitions which are still
> unlocked, and then for row movement we try to lock these other
> partitions and wait for the DDL tasks to complete. But I didn't see
> any particular issues with that. But correct me if you suspect a
> possible issue. One issue can be if we were able to modify the table
> attributes, but I believe we cannot do that for inherited columns.

It's just that it's very unlike what we do anywhere else.  I don't
have a real specific idea in mind about what might totally break, but
at a minimum it could certainly cause behavior that can't happen
today.  Today, if you run a query on some tables, it will block
waiting for any locks at the beginning of the query, and the query
won't begin executing until it has all of the locks.  With this
approach, you might block midway through; you might even deadlock
midway through.  Maybe that's not overtly broken, but it's at least
got the possibility of being surprising.

Now, I'd actually kind of like to have behavior like this for other
cases, too.  If we're inserting one row, can't we just lock the one
partition into which it needs to get inserted, rather than all of
them?  But I'm wary of introducing such behavior incidentally in a
patch whose main goal is to allow UPDATE row movement.  Figuring out
what could go wrong and fixing it seems like a substantial project all
of its own.

> The reason I thought it cannot be done at the start of the execution,
> is because even if we know that update is not modifying the partition
> key column, we are not certain that the final NEW row has its
> partition key column unchanged, because of triggers. I understand it
> might be weird for a user requiring to modify a partition key value,
> but if a user does that, it will result in crash because we won't have
> the partition routing setup, thinking that there is no partition key
> column in the UPDATE.

I think we could avoid that issue.  Suppose we select the target
partition based only on the original NEW tuple.  If a trigger on that
partition subsequently modifies the tuple so that it no longer
satisfies the partition constraint for that partition, just let it
ERROR out normally.  Actually, it seems like that's probably the
*easiest* behavior to implement.  Otherwise, you might fire triggers,
discover that you need to re-route the tuple, and then ... fire
triggers again on the new partition, which might reroute it again?

>> I'm not sure that it's sensible to allow a trigger on an
>> individual partition to reroute an update to another partition
>> what if we get an infinite loop?)
>
> You mean, if the other table has another trigger that will again route
> to the original partition ? But this infinite loop problem could occur
> even for 2 normal tables ?

How?  For a normal trigger, nothing it does can change which table is targeted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> Regarding the trigger issue, I can't claim to have a terribly strong
>>> opinion on this.  I think that practically anything we do here might
>>> upset somebody, but probably any halfway-reasonable thing we choose to
>>> do will be OK for most people.  However, there seems to be a
>>> discrepancy between the approach that got the most votes and the one
>>> that is implemented by the v8 patch, so that seems like something to
>>> fix.
>>
>> Yes, I have started working on updating the patch to use that approach
>> (BR and AR update triggers on source and destination partition
>> respectively, instead of delete+insert) The approach taken by the
>> patch (BR update + delete+insert triggers) didn't require any changes
>> in the way ExecDelete() and ExecInsert() were called. Now we would
>> require to skip the delete/insert triggers, so some flags need to be
>> passed to these functions, or else have stripped down versions of
>> ExecDelete() and ExecInsert() which don't do other things like
>> RETURNING handling and firing triggers.
>
> See, that strikes me as a pretty good argument for firing the
> DELETE+INSERT triggers...
>
> I'm not wedded to that approach, but "what makes the code simplest?"
> is not a bad tiebreak, other things being equal.

Yes, that sounds good to me. But I think we should wait for others'
opinions, because it is quite understandable that two triggers firing
on the same partition sounds odd.

>
>>> In terms of the approach taken by the patch itself, it seems
>>> surprising to me that the patch only calls
>>> ExecSetupPartitionTupleRouting when an update fails the partition
>>> constraint.  Note that in the insert case, we call that function at
>>> the start of execution;
>>
>>> calling it in the middle seems to involve additional hazards;
>>> for example, is it really safe to add additional
>>> ResultRelInfos midway through the operation?
>>
>> I thought since the additional ResultRelInfos go into
>> mtstate->mt_partitions which is independent of
>> estate->es_result_relations, that should be safe.
>
> I don't know.  That sounds scary to me, but it might be OK.  Probably
> needs more study.
>
>>> Is it safe to take more locks midway through the operation?
>>
>> I can imagine some rows already updated, when other tasks like ALTER
>> TABLE or CREATE INDEX happen on other partitions which are still
>> unlocked, and then for row movement we try to lock these other
>> partitions and wait for the DDL tasks to complete. But I didn't see
>> any particular issues with that. But correct me if you suspect a
>> possible issue. One issue can be if we were able to modify the table
>> attributes, but I believe we cannot do that for inherited columns.
>
> It's just that it's very unlike what we do anywhere else.  I don't
> have a real specific idea in mind about what might totally break, but
> at a minimum it could certainly cause behavior that can't happen
> today.  Today, if you run a query on some tables, it will block
> waiting for any locks at the beginning of the query, and the query
> won't begin executing until it has all of the locks.  With this
> approach, you might block midway through; you might even deadlock
> midway through.  Maybe that's not overtly broken, but it's at least
> got the possibility of being surprising.
>
> Now, I'd actually kind of like to have behavior like this for other
> cases, too.  If we're inserting one row, can't we just lock the one
> partition into which it needs to get inserted, rather than all of
> them?  But I'm wary of introducing such behavior incidentally in a
> patch whose main goal is to allow UPDATE row movement.  Figuring out
> what could go wrong and fixing it seems like a substantial project all
> of its own.

Yes, I agree it makes sense to try, as far as possible, to avoid
introducing something we haven't tried before in this patch.

>
>> The reason I thought it cannot be done at the start of the execution,
>> is because even if we know that update is not modifying the partition
>> key column, we are not certain that the final NEW row has its
>> partition key column unchanged, because of triggers. I understand it
>> might be weird for a user requiring to modify a partition key value,
>> but if a user does that, it will result in crash because we won't have
>> the partition routing setup, thinking that there is no partition key
>> column in the UPDATE.
>
> I think we could avoid that issue.  Suppose we select the target
> partition based only on the original NEW tuple.  If a trigger on that
> partition subsequently modifies the tuple so that it no longer
> satisfies the partition constraint for that partition, just let it
> ERROR out normally.

Ok, so you are saying: don't allow a partition trigger to initiate the
row movement. I think we should keep this as a documented restriction.
Actually, it would be unfortunate if we had to keep this restriction
only because of an implementation issue.

So, according to that, below would be the logic :

Run partition constraint check on the original NEW row.
If it succeeds :
{
    Fire BR UPDATE trigger on the original partition.
    Run partition constraint check again with the modified NEW row
    (may be do this only if the trigger modified the partition key)
    If it fails,
        abort.
    Else
        proceed with the usual local update.
}
else
{
    Fire BR UPDATE trigger on original partition.
    Find the right partition for the modified NEW row.
    If it is the same partition,
        proceed with the usual local update.
    else
        do the row movement.
}


> Actually, it seems like that's probably the
> *easiest* behavior to implement.  Otherwise, you might fire triggers,
> discover that you need to re-route the tuple, and then ... fire
> triggers again on the new partition, which might reroute it again?

Why would update BR trigger fire on the new partition ? On the new
partition, only BR INSERT trigger would fire if at all we decide to
fire delete+insert triggers. And insert trigger would not again cause
the tuple to be re-routed because it's an insert.

>
>>> I'm not sure that it's sensible to allow a trigger on an
>>> individual partition to reroute an update to another partition
>>> what if we get an infinite loop?)
>>
>> You mean, if the other table has another trigger that will again route
>> to the original partition ? But this infinite loop problem could occur
>> even for 2 normal tables ?
>
> How?  For a normal trigger, nothing it does can change which table is targeted.

I thought you were considering the possibility that on the new
partition, the trigger function itself runs another update statement,
which is also possible for normal tables.

But now I think you are saying, the row that is being inserted into
the new partition might get again modified by the INSERT trigger on
the new partition, which might in turn cause it to fail the new
partition constraint. But in that case, it will not cause another row
movement, because in the new partition, it's an INSERT, not an UPDATE,
so the operation would end there, aborted.

But correct me if you were thinking of a different scenario that can
cause an infinite loop.


-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>> Regarding the trigger issue, I can't claim to have a terribly strong
>>>> opinion on this.  I think that practically anything we do here might
>>>> upset somebody, but probably any halfway-reasonable thing we choose to
>>>> do will be OK for most people.  However, there seems to be a
>>>> discrepancy between the approach that got the most votes and the one
>>>> that is implemented by the v8 patch, so that seems like something to
>>>> fix.
>>>
>>> Yes, I have started working on updating the patch to use that approach
>>> (BR and AR update triggers on source and destination partition
>>> respectively, instead of delete+insert) The approach taken by the
>>> patch (BR update + delete+insert triggers) didn't require any changes
>>> in the way ExecDelete() and ExecInsert() were called. Now we would
>>> require to skip the delete/insert triggers, so some flags need to be
>>> passed to these functions,
>>>

I thought you already need to pass an additional flag for special
handling of ctid in the Delete case.  For Insert, a new flag needs to
be passed, and we need to have a check for that in a few places.

> or else have stripped down versions of
>>> ExecDelete() and ExecInsert() which don't do other things like
>>> RETURNING handling and firing triggers.
>>
>> See, that strikes me as a pretty good argument for firing the
>> DELETE+INSERT triggers...
>>
>> I'm not wedded to that approach, but "what makes the code simplest?"
>> is not a bad tiebreak, other things being equal.
>
> Yes, that sounds good to me.
>

I am okay if we want to go ahead with firing BR UPDATE + DELETE +
INSERT triggers for an UPDATE statement (when row movement happens) on
the grounds of code simplicity, but it sounds like slightly odd
behavior.

> But I think we want to wait for other's
> opinion because it is quite understandable that two triggers firing on
> the same partition sounds odd.
>

Yeah, but I think we have to rely on docs in this case as behavior is
not intuitive.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Thu, Jun 1, 2017 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 29, 2017 at 5:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> But I think, we can also take step-by-step approach even for v11. If
>>> we agree that it is ok to silently do the updates as long as we
>>> document the behaviour, we can go ahead and do this, and then as a
>>> second step, implement error handling as a separate patch. If that
>>> patch does not materialize, we at least have the current behaviour
>>> documented.
>>
>> I think that is sensible approach if we find the second step involves
>> big or complicated changes.
>
> I think it is definitely a good idea to separate the two patches.
> UPDATE tuple routing without any special handling for the EPQ issue is
> just a partitioning feature.  The proposed handling for the EPQ issue
> is an *on-disk format change*.  That turns a patch which is subject
> only to routine bugs into one which can eat your data permanently --
> so having the "can eat your data permanently" separated out for both
> review and commit seems only prudent.  For me, it's not a matter of
> which patch is big or complicated, but rather a matter of one of them
> being a whole lot riskier than the other.  Even UPDATE tuple routing
> could mess things up pretty seriously if we end up with tuples in the
> wrong partition, of course, but the other thing is still worse.
>
> In terms of a development plan, I think we would need to have both
> patches before either could be committed.  I believe that everyone
> other than me who has expressed an opinion on this issue has said that
> it's unacceptable to just ignore the issue, so it doesn't sound like
> there will be much appetite for having #1 go into the tree without #2.
> I'm still really concerned about that approach because we do not have
> very much bit space left and WARM wants to use quite a bit of it.  I
> think it's quite possible that we'll be sad in the future if we find
> that we can't implement feature XYZ because of the bit-space consumed
> by this feature.  However, I don't have the only vote here and I'm not
> going to try to shove this into the tree over multiple objections
> (unless there are a lot more votes the other way, but so far there's
> no sign of that).
>
> Greg/Amit's idea of using the CTID field rather than an infomask bit
> seems like a possibly promising approach.  Not everything that needs
> bit-space can use the CTID field, so using it is a little less likely
> to conflict with something else we want to do in the future than using
> a precious infomask bit.  However, I'm worried about this:
>
>     /* Make sure there is no forward chain link in t_ctid */
>     tp.t_data->t_ctid = tp.t_self;
>
> The comment does not say *why* we need to make sure that there is no
> forward chain link, but it implies that some code somewhere in the
> system does or at one time did depend on no forward link existing.
>

I think it is to ensure that the EvalPlanQual mechanism gets invoked in
the right case.  The visibility routine will return HeapTupleUpdated
both when the tuple is deleted and when it is updated (updated - has a
newer version of the tuple), so we use ctid to decide whether we need
to follow the tuple chain for a newer version of the tuple.
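
For reference, this is roughly the check the callers do today
(paraphrased from the HeapTupleUpdated handling in
ExecUpdate()/ExecDelete(); variable names follow that code):

    case HeapTupleUpdated:
        /* ... */
        if (!ItemPointerEquals(tupleid, &hufd.ctid))
        {
            /* ctid points elsewhere: a newer version exists, chase it via EvalPlanQual */
        }
        else
        {
            /* ctid points at the tuple itself: it was simply deleted */
        }

So any caller that follows the chain this way would need to learn about
a deliberately invalidated ctid before trying to fetch the "next"
version of the tuple.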

> Any such code that still exists will need to be updated.
>

Yeah.

> The other potential issue I see here is that I know the WARM code also
> tries to use the bit-space in the CTID field; in particular, it uses
> the CTID field of the last tuple in a HOT chain to point back to the
> root of the chain.  That seems like it could conflict with the usage
> proposed here, but I'm not totally sure.
>

The proposed change in the WARM tuple patch uses the ip_posid field of
the CTID, and we are planning to use the ip_blkid field.  Here is the
relevant text and code from the WARM tuple patch:

"Store the root line pointer of the WARM chain in the t_ctid.ip_posid
field of the last tuple in the chain and mark the tuple header with
HEAP_TUPLE_LATEST flag to record that fact."

+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)

For further details, refer patch 0001-Track-root-line-pointer-v23_v26
in the below e-mail:
https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 5 June 2017 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>>> Regarding the trigger issue, I can't claim to have a terribly strong
>>>>> opinion on this.  I think that practically anything we do here might
>>>>> upset somebody, but probably any halfway-reasonable thing we choose to
>>>>> do will be OK for most people.  However, there seems to be a
>>>>> discrepancy between the approach that got the most votes and the one
>>>>> that is implemented by the v8 patch, so that seems like something to
>>>>> fix.
>>>>
>>>> Yes, I have started working on updating the patch to use that approach
>>>> (BR and AR update triggers on source and destination partition
>>>> respectively, instead of delete+insert) The approach taken by the
>>>> patch (BR update + delete+insert triggers) didn't require any changes
>>>> in the way ExecDelete() and ExecInsert() were called. Now we would
>>>> require to skip the delete/insert triggers, so some flags need to be
>>>> passed to these functions,
>>>>
>
> I thought you already need to pass an additional flag for special
> handling of ctid in Delete case.

Yeah that was unavoidable.

> For Insert, a new flag needs to be
> passed and need to have a check for that in few places.

For skipping the delete and insert triggers, we need to include still
another flag, plus checks in both ExecDelete() and ExecInsert() for
skipping both the BR and AR triggers, and then, in ExecUpdate(), again
a call to ExecARUpdateTriggers() before quitting.
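
To make that concrete, a minimal sketch of what the ExecDelete() side
could look like (skip_triggers is a made-up flag name; the trigger call
sites are paraphrased from the existing code):

    /* inside ExecDelete(), with a hypothetical skip_triggers parameter */
    if (!skip_triggers &&
        resultRelInfo->ri_TrigDesc &&
        resultRelInfo->ri_TrigDesc->trig_delete_before_row)
    {
        if (!ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
                                  tupleid, oldtuple))
            return NULL;        /* trigger suppressed the delete */
    }

    /* ... perform the actual heap_delete() ... */

    if (!skip_triggers)
        ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);

ExecInsert() would need the same treatment for its BR/AR INSERT trigger
calls, and ExecUpdate() would then fire the AR UPDATE triggers itself
once the delete + insert have both succeeded.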

>
>> or else have stripped down versions of
>>>> ExecDelete() and ExecInsert() which don't do other things like
>>>> RETURNING handling and firing triggers.
>>>
>>> See, that strikes me as a pretty good argument for firing the
>>> DELETE+INSERT triggers...
>>>
>>> I'm not wedded to that approach, but "what makes the code simplest?"
>>> is not a bad tiebreak, other things being equal.
>>
>> Yes, that sounds good to me.
>>
>
> I am okay if we want to go ahead with firing BR UPDATE + DELETE +
> INSERT triggers for an Update statement (when row movement happens) on
> the argument of code simplicity, but it sounds slightly odd behavior.

Ok. I will keep this behaviour, which is already present in the patch.
I also feel that code simplicity can be used as a tie-breaker if we
cannot agree upon a single behaviour that completely satisfies all
aspects.

>
>> But I think we want to wait for other's
>> opinion because it is quite understandable that two triggers firing on
>> the same partition sounds odd.
>>
>
> Yeah, but I think we have to rely on docs in this case as behavior is
> not intuitive.

Agreed. The doc changes in the patch have already explained this
behaviour in detail.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> So, according to that, below would be the logic :
>
> Run partition constraint check on the original NEW row.
> If it succeeds :
> {
>     Fire BR UPDATE trigger on the original partition.
>     Run partition constraint check again with the modified NEW row
> (may be do this only if the trigger modified the partition key)
>     If it fails,
>         abort.
>     Else
>         proceed with the usual local update.
> }
> else
> {
>     Fire BR UPDATE trigger on original partition.
>     Find the right partition for the modified NEW row.
>     If it is the same partition,
>         proceed with the usual local update.
>     else
>         do the row movement.
> }

Sure, that sounds about right, although the "Fire BR UPDATE trigger on
the original partition." is the same in both branches, so I'm not
quite sure why you have that in the "if" block.

>> Actually, it seems like that's probably the
>> *easiest* behavior to implement.  Otherwise, you might fire triggers,
>> discover that you need to re-route the tuple, and then ... fire
>> triggers again on the new partition, which might reroute it again?
>
> Why would update BR trigger fire on the new partition ? On the new
> partition, only BR INSERT trigger would fire if at all we decide to
> fire delete+insert triggers. And insert trigger would not again cause
> the tuple to be re-routed because it's an insert.

OK, sure, that makes sense.  I guess it's really the insert case that
I was worried about -- if we have a BEFORE ROW INSERT trigger and it
changes the tuple and we reroute it, I think we'd have to fire the
BEFORE ROW INSERT trigger on the new partition, which might change the
tuple again and cause yet another reroute, and in the worst case this
is an infinite loop.  But it sounds like we're going to fix that
problem -- I think correctly -- by only ever allowing the tuple to be
routed once.  If some trigger tries to change the tuple after that
such that re-routing is required, it gets an error.  And what you are
describing here seems like it will be fine.

> But now I think you are saying, the row that is being inserted into
> the new partition might get again modified by the INSERT trigger on
> the new partition, which might in turn cause it to fail the new
> partition constraint. But in that case, it will not cause another row
> movement, because in the new partition, it's an INSERT, not an UPDATE,
> so the operation would end there, aborted.

Yeah, that's what I was worried about.  I didn't want a row movement
to be able to trigger another row movement and so on ad infinitum.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Greg/Amit's idea of using the CTID field rather than an infomask bit
>> seems like a possibly promising approach.  Not everything that needs
>> bit-space can use the CTID field, so using it is a little less likely
>> to conflict with something else we want to do in the future than using
>> a precious infomask bit.  However, I'm worried about this:
>>
>>     /* Make sure there is no forward chain link in t_ctid */
>>     tp.t_data->t_ctid = tp.t_self;
>>
>> The comment does not say *why* we need to make sure that there is no
>> forward chain link, but it implies that some code somewhere in the
>> system does or at one time did depend on no forward link existing.
>
> I think it is to ensure that EvalPlanQual mechanism gets invoked in
> the right case.   The visibility routine will return HeapTupleUpdated
> both when the tuple is deleted or updated (updated - has a newer
> version of the tuple), so we use ctid to decide if we need to follow
> the tuple chain for a newer version of the tuple.

That would explain why we need to make sure that there *is* a forward
chain link in t_ctid for an update, but it doesn't explain why we need
to make sure that there *isn't* a forward link for a delete.

> The proposed change in WARM tuple patch uses ip_posid field of CTID
> and we are planning to use ip_blkid field.  Here is the relevant text
> and code from WARM tuple patch:
>
> "Store the root line pointer of the WARM chain in the t_ctid.ip_posid
> field of the last tuple in the chain and mark the tuple header with
> HEAP_TUPLE_LATEST flag to record that fact."
>
> +#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
> +do { \
> + AssertMacro(OffsetNumberIsValid(offnum)); \
> + (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
> + ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
> +} while (0)
>
> For further details, refer patch 0001-Track-root-line-pointer-v23_v26
> in the below e-mail:
> https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Tue, Jun 6, 2017 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Greg/Amit's idea of using the CTID field rather than an infomask bit
>>> seems like a possibly promising approach.  Not everything that needs
>>> bit-space can use the CTID field, so using it is a little less likely
>>> to conflict with something else we want to do in the future than using
>>> a precious infomask bit.  However, I'm worried about this:
>>>
>>>     /* Make sure there is no forward chain link in t_ctid */
>>>     tp.t_data->t_ctid = tp.t_self;
>>>
>>> The comment does not say *why* we need to make sure that there is no
>>> forward chain link, but it implies that some code somewhere in the
>>> system does or at one time did depend on no forward link existing.
>>
>> I think it is to ensure that EvalPlanQual mechanism gets invoked in
>> the right case.   The visibility routine will return HeapTupleUpdated
>> both when the tuple is deleted or updated (updated - has a newer
>> version of the tuple), so we use ctid to decide if we need to follow
>> the tuple chain for a newer version of the tuple.
>
> That would explain why need to make sure that there *is* a forward
> chain link in t_ctid for an update, but it doesn't explain why we need
> to make sure that there *isn't* a forward link for delete.
>

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done.  For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

Also a similar check in ExecLockRows.  Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.
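
To illustrate the distinction this check draws, here is an annotated
fragment (illustrative only, simplified from the checks in
ExecUpdate()/ExecDelete()/ExecLockRows()):

/* After a HeapTupleUpdated result, hufd.ctid tells us which case we hit. */
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
    /*
     * Row was updated: hufd.ctid points to the newer version of the
     * tuple, so EvalPlanQual can follow the update chain.
     */
}
else
{
    /*
     * Row was deleted: t_ctid points to itself, so there is no newer
     * version and nothing more needs to be done for this row.
     */
}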

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 6 June 2017 at 23:52, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> So, according to that, below would be the logic :
>>
>> Run partition constraint check on the original NEW row.
>> If it succeeds :
>> {
>>     Fire BR UPDATE trigger on the original partition.
>>     Run partition constraint check again with the modified NEW row
>> (may be do this only if the trigger modified the partition key)
>>     If it fails,
>>         abort.
>>     Else
>>         proceed with the usual local update.
>> }
>> else
>> {
>>     Fire BR UPDATE trigger on original partition.
>>     Find the right partition for the modified NEW row.
>>     If it is the same partition,
>>         proceed with the usual local update.
>>     else
>>         do the row movement.
>> }
>
> Sure, that sounds about right, although the "Fire BR UPDATE trigger on
> the original partition." is the same in both branches, so I'm not
> quite sure why you have that in the "if" block.

Actually after coding this logic, it looks a bit different. See
ExecUpdate() in the attached file  trigger_related_changes.patch

----

Now that we are making sure a trigger won't change the partition of
the tuple, the next thing we need to do is make sure the tuple routing
setup is done *only* if the UPDATE is modifying partition keys.
Otherwise, this will degrade normal update performance.

Below is the logic I am implementing for determining whether the
UPDATE is modifying partition keys.

In ExecInitModifyTable() ...
Call GetUpdatedColumns(mtstate->rootResultRelInfo, estate) to get
updated_columns.
For each of the updated_columns :
{
    Check if the column is part of partition key quals of any of
    the relations in mtstate->resultRelInfo[] array.
    /*
     * mtstate->resultRelInfo[] contains exactly those leaf partitions
     * which qualify the update quals.
     */

    If (it is part of partition key quals of at least one of the relations)
    {
       Do ExecSetupPartitionTupleRouting() for the root partition.
       break;
    }
}
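
In C, the check might look roughly like the sketch below (names such
as IsPartitionKeyUpdate(), result_rels and num_rels are placeholders,
and GetUpdatedColumns() stands for fetching updatedCols from the
relation's RangeTblEntry; the final patch may differ):

static bool
IsPartitionKeyUpdate(EState *estate, ResultRelInfo *result_rels, int num_rels)
{
    int         i;

    for (i = 0; i < num_rels; i++)
    {
        ResultRelInfo *resultRelInfo = &result_rels[i];
        Relation       rel = resultRelInfo->ri_RelationDesc;
        Bitmapset     *expr_attrs = NULL;

        /* Collect attnos referenced in this partition's constraint quals. */
        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);

        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
            return true;
    }

    return false;
}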

Few things need to be considered :

Use Relation->rd_partcheck to get partition check quals of each of the
relations in mtstate->resultRelInfo[].

The Relation->rd_partcheck of the leaf partitions would include the
ancestors' partition quals as well. So we are good: we don't have to
explicitly get the upper partition constraints. Note that an UPDATE
can modify a column which is not used in the partition constraint
expressions of any of the partitions or partitioned tables in the
subtree, but that column may still be used in the partition constraint
of a partitioned table in an upper subtree.

All of the relations in mtstate->resultRelInfo are already open. So we
don't need to re-open any more relations to get the partition quals.

The column bitmapset returned by GetUpdatedColumns() refers to
attribute numbers w.r.t. the root partition, whereas the
mtstate->resultRelInfo[] entries have attnos w.r.t. the leaf
partitions. So we need to do something similar to
map_partition_varattnos() to map the updated columns' attnos to the
leaf partitions, and walk down the partition constraint expressions to
find whether those attnos are present there.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> The column bitmap set returned by GetUpdatedColumns() refer to
> attribute numbers w.r.t. to the root partition. And the
> mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So
> we need to do something similar to map_partition_varattnos() to change
> the updated columns attnos to the leaf partitions

I was wrong about this. Each of the mtstate->resultRelInfo[] has its
own corresponding RangeTblEntry with its own updatedCols, with attnos
accordingly adjusted to refer to its own table attributes. So we don't
have to do the mapping; we need to get the modified columns separately
for each ResultRelInfo, rather than from the root relinfo.

> and walk down the
> partition constraint expressions to find if the attnos are present
> there.

But this we will need to do.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As far as I understand, it is to ensure that for deleted rows, nothing
> more needs to be done.  For example, see the below check in
> ExecUpdate/ExecDelete.
> if (!ItemPointerEquals(tupleid, &hufd.ctid))
> {
> ..
> }
> ..
>
> Also a similar check in ExecLockRows.  Now for deleted rows, if the
> t_ctid wouldn't point to itself, then in the mentioned functions, we
> were not in a position to conclude that the row is deleted.

Right, so we would have to find all such checks and change them to use
some other method to conclude that the row is deleted.  What method
would we use?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> As far as I understand, it is to ensure that for deleted rows, nothing
>> more needs to be done.  For example, see the below check in
>> ExecUpdate/ExecDelete.
>> if (!ItemPointerEquals(tupleid, &hufd.ctid))
>> {
>> ..
>> }
>> ..
>>
>> Also a similar check in ExecLockRows.  Now for deleted rows, if the
>> t_ctid wouldn't point to itself, then in the mentioned functions, we
>> were not in a position to conclude that the row is deleted.
>
> Right, so we would have to find all such checks and change them to use
> some other method to conclude that the row is deleted.  What method
> would we use?
>

I think before doing above check we can simply check if ctid.ip_blkid
contains InvalidBlockNumber, then return an error.
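
A sketch of what that guard could look like, placed just before the
ItemPointerEquals() test (the error code and message here are only
illustrative, assuming the row-movement patch stores InvalidBlockNumber
in the old tuple's t_ctid.ip_blkid):

if (!BlockNumberIsValid(ItemPointerGetBlockNumber(&hufd.ctid)))
    ereport(ERROR,
            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
             errmsg("tuple to be updated was already moved to another partition due to concurrent update")));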

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> As far as I understand, it is to ensure that for deleted rows, nothing
>>> more needs to be done.  For example, see the below check in
>>> ExecUpdate/ExecDelete.
>>> if (!ItemPointerEquals(tupleid, &hufd.ctid))
>>> {
>>> ..
>>> }
>>> ..
>>>
>>> Also a similar check in ExecLockRows.  Now for deleted rows, if the
>>> t_ctid wouldn't point to itself, then in the mentioned functions, we
>>> were not in a position to conclude that the row is deleted.
>>
>> Right, so we would have to find all such checks and change them to use
>> some other method to conclude that the row is deleted.  What method
>> would we use?
>
> I think before doing above check we can simply check if ctid.ip_blkid
> contains InvalidBlockNumber, then return an error.

Hmm, OK.  That case never happens today?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 7 June 2017 at 20:19, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> The column bitmap set returned by GetUpdatedColumns() refer to
>> attribute numbers w.r.t. to the root partition. And the
>> mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So
>> we need to do something similar to map_partition_varattnos() to change
>> the updated columns attnos to the leaf partitions
>
> I was wrong about this. Each of the mtstate->resultRelInfo[] has its
> own corresponding RangeTblEntry with its own updatedCols having attnos
> accordingly adjusted to refer its own table attributes. So we don't
> have to do the mapping; we need to get modifedCols separately for each
> of the ResultRelInfo, rather than the root relinfo.
>
>> and walk down the
>> partition constraint expressions to find if the attnos are present
>> there.
>
> But this we will need to do.

Attached is the v9 patch. This covers the two parts discussed upthread:
1. Prevent triggers from causing the row movement.
2. Set up the tuple routing in ExecInitModifyTable(), but only if a
partition key is modified. Check the new function IsPartitionKeyUpdate().

Have rebased the patch to consider changes done in commit
15ce775faa428dc9 to prevent triggers from violating partition
constraints. There, for the call to ExecFindPartition() in ExecInsert,
we need to fetch the mtstate->rootResultRelInfo in case the operation
is part of update row movement. This is because the root partition is
not available in the resultRelInfo for UPDATE.


Added many more test scenarios in update.sql that cover the above.

I am yet to test the concurrency part using isolation tester.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> As far as I understand, it is to ensure that for deleted rows, nothing
>>>> more needs to be done.  For example, see the below check in
>>>> ExecUpdate/ExecDelete.
>>>> if (!ItemPointerEquals(tupleid, &hufd.ctid))
>>>> {
>>>> ..
>>>> }
>>>> ..
>>>>
>>>> Also a similar check in ExecLockRows.  Now for deleted rows, if the
>>>> t_ctid wouldn't point to itself, then in the mentioned functions, we
>>>> were not in a position to conclude that the row is deleted.
>>>
>>> Right, so we would have to find all such checks and change them to use
>>> some other method to conclude that the row is deleted.  What method
>>> would we use?
>>
>> I think before doing above check we can simply check if ctid.ip_blkid
>> contains InvalidBlockNumber, then return an error.
>
> Hmm, OK.  That case never happens today?
>

As per my understanding that case doesn't exist.  I will verify again
once the patch is available.  I can take a crack at it if Amit
Khandekar is busy with something else or is not comfortable in this
area.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 9 June 2017 at 19:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>> As far as I understand, it is to ensure that for deleted rows, nothing
>>>>> more needs to be done.  For example, see the below check in
>>>>> ExecUpdate/ExecDelete.
>>>>> if (!ItemPointerEquals(tupleid, &hufd.ctid))
>>>>> {
>>>>> ..
>>>>> }
>>>>> ..
>>>>>
>>>>> Also a similar check in ExecLockRows.  Now for deleted rows, if the
>>>>> t_ctid wouldn't point to itself, then in the mentioned functions, we
>>>>> were not in a position to conclude that the row is deleted.
>>>>
>>>> Right, so we would have to find all such checks and change them to use
>>>> some other method to conclude that the row is deleted.  What method
>>>> would we use?
>>>
>>> I think before doing above check we can simply check if ctid.ip_blkid
>>> contains InvalidBlockNumber, then return an error.
>>
>> Hmm, OK.  That case never happens today?
>>
>
> As per my understanding that case doesn't exist.  I will verify again
> once the patch is available.  I can take a crack at it if Amit
> Khandekar is busy with something else or is not comfortable in this
> area.

Amit, I was going to have a look at this, once I finish with the other
part. I was busy on getting that done first. But your comments/help
are always welcome.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Fri, Jun 9, 2017 at 7:48 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 9 June 2017 at 19:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> I think before doing above check we can simply check if ctid.ip_blkid
>>>> contains InvalidBlockNumber, then return an error.
>>>
>>> Hmm, OK.  That case never happens today?
>>>
>>
>> As per my understanding that case doesn't exist.  I will verify again
>> once the patch is available.  I can take a crack at it if Amit
>> Khandekar is busy with something else or is not comfortable in this
>> area.
>
> Amit, I was going to have a look at this, once I finish with the other
> part.
>

Sure, will wait for your patch to be available.  I can help by
reviewing the same.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
While rebasing my patch for the below recent commit, I realized that a
similar issue exists for the update-tuple-routing patch as well :

commit 78a030a441966d91bc7e932ef84da39c3ea7d970
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Mon Jun 12 23:29:44 2017 -0400
   Fix confusion about number of subplans in partitioned INSERT setup.

The above issue was about incorrectly using 'i' in
mtstate->mt_plans[i] while handling WITH CHECK OPTIONS in
ExecInitModifyTable(), where 'i' was actually meant to refer to
positions in the mtstate->mt_partitions[] array (of length
mtstate->mt_num_partitions). Actually for INSERT, there is only a
single plan element in the mtstate->mt_plans[] array.

Similarly, for update-tuple routing, we cannot use
mtstate->mt_plans[i], because 'i' refers to a position in
mtstate->mt_partitions[], whereas mtstate->mt_plans is not at all in
the order of mtstate->mt_partitions; in fact mt_plans has only the
plans for the unpruned partitions that are to be scanned, so it can
well be a small subset of the total partitions.

I am working on an updated patch to fix the above.



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> While rebasing my patch for the below recent commit, I realized that a
> similar issue exists for the uptate-tuple-routing patch as well :
>
> commit 78a030a441966d91bc7e932ef84da39c3ea7d970
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Mon Jun 12 23:29:44 2017 -0400
>
>     Fix confusion about number of subplans in partitioned INSERT setup.
>
> The above issue was about incorrectly using 'i' in
> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
> ExecInitModifyTable(), where 'i' was actually meant to refer to the
> positions in mtstate->mt_num_partitions. Actually for INSERT, there is
> only a single plan element in mtstate->mt_plans[] array.
>
> Similarly, for update-tuple routing, we cannot use
> mtstate->mt_plans[i], because 'i' refers to position in
> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
> order of mtstate->mt_partitions; in fact mt_plans has only the plans
> that are to be scanned on pruned partitions; so it can well be a small
> subset of total partitions.
>
> I am working on an updated patch to fix the above.

Attached patch v10 fixes the above. The existing code that builds WCO
constraints for each leaf partition is now, with the patch, applicable
to row-movement updates as well, so the assertions in that code are
updated to allow this. Secondly, the mapping for each of the leaf
partitions was constructed using the root partition attributes. Now in
the patch, mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is
used as the reference. So effectively, map_partition_varattnos() now
represents not just a parent-to-partition mapping, but rather a
mapping between any two partitions/partitioned tables. It's done this
way so that we can have common WCO-building code for inserts as well
as updates. E.g. for inserts, the first (and only) WCO belongs to
node->nominalRelation, so nominalRelation is used for
map_partition_varattnos(), whereas for updates, the first WCO belongs
to the first resultRelInfo, which is not the same as nominalRelation.
So in the patch, in both cases, we use the first resultRelInfo and the
WCO of the first resultRelInfo for map_partition_varattnos().

A similar thing is done for RETURNING expressions.

---------

Another change in the patch is: for ExecInitQual() for WCO quals,
mtstate->ps is used as the parent, rather than the first plan. For
updates, the first plan does not belong to the parent partition. In
fact, I think in all cases we should use mtstate->ps as the parent;
mtstate->mt_plans[0] doesn't look like it should be considered the
parent of these expressions. Maybe it does not matter to which parent
we link these quals, because there is no ReScan for ExecModifyTable().

Note that for RETURNING projection expressions, we do use mtstate->ps.

--------

There is another issue I discovered. The row-movement works fine if
the destination leaf partition has different attribute ordering than
the root : the existing insert-tuple-routing mapping handles that. But
if the source partition has different ordering w.r.t. the root, it has
a problem : there is no mapping in the opposite direction, i.e. from
the leaf to root. And we require that because the tuple of source leaf
partition needs to be converted to root partition tuple descriptor,
since ExecFindPartition() starts with root.

To fix this, I have introduced another mapping array
mtstate->mt_resultrel_maps[]. This corresponds to the
mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
because the update result relations are pruned subset of the total
leaf partitions.

So in ExecInsert, before calling ExecFindPartition(), we need to
convert the leaf partition tuple to root using this reverse mapping.
Since we need to convert the tuple here, and again after
ExecFindPartition() for the found leaf partition, I have replaced the
common code by new function ConvertPartitionTupleSlot().
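
For reference, the helper is essentially a thin wrapper around
do_convert_tuple() along these lines (a rough sketch; the actual
signature in the patch may differ):

/*
 * If a conversion map exists, convert 'tuple' to the rowtype described
 * by map->outdesc and store it in the dedicated slot 'new_slot';
 * otherwise leave the tuple and slot untouched.
 */
static HeapTuple
ConvertPartitionTupleSlot(TupleConversionMap *map, HeapTuple tuple,
                          TupleTableSlot *new_slot, TupleTableSlot **p_slot)
{
    if (map == NULL)
        return tuple;

    tuple = do_convert_tuple(tuple, map);

    /* Point the caller's slot at the converted tuple. */
    ExecSetSlotDescriptor(new_slot, map->outdesc);
    ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
    *p_slot = new_slot;

    return tuple;
}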

-------

Used a new flag is_partitionkey_update in ExecInitModifyTable(), which
can be re-used in subsequent sections, rather than calling the
IsPartitionKeyUpdate() function again.

-------

Some more test scenarios added that cover above changes. Basically
partitions that have different tuple descriptors than parents.


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
When I tested partition-key-update on a partitioned table having no
child partitions, it crashed. This is because there is an
Assert(mtstate->mt_num_partitions > 0) for creating the
partition-to-root map, which fails if there are no partitions under
the partitioned table. Actually we should skip creating this map if
there are no partitions under the partitioned table on which UPDATE is
run. So the attached patch has this new change to fix it (and an
appropriate additional test case added):

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2006,15 +2006,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
          * descriptor of a source partition does not match the root partition
          * descriptor. In such case we need to convert tuples to the root partition
          * tuple descriptor, because the search for destination partition starts
-         * from the root.
+         * from the root. Skip this setup if it's not a partition key update or if
+         * there are no partitions below this partitioned table.
          */
-        if (is_partitionkey_update)
+        if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
         {
                 TupleConversionMap **tup_conv_maps;
                 TupleDesc               outdesc;

-                Assert(mtstate->mt_num_partitions > 0);
-
                 mtstate->mt_resultrel_maps =
                 (TupleConversionMap **) palloc0(sizeof(TupleConversionMap*) * nplans);

On 15 June 2017 at 23:06, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> While rebasing my patch for the below recent commit, I realized that a
>> similar issue exists for the uptate-tuple-routing patch as well :
>>
>> commit 78a030a441966d91bc7e932ef84da39c3ea7d970
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   Mon Jun 12 23:29:44 2017 -0400
>>
>>     Fix confusion about number of subplans in partitioned INSERT setup.
>>
>> The above issue was about incorrectly using 'i' in
>> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
>> ExecInitModifyTable(), where 'i' was actually meant to refer to the
>> positions in mtstate->mt_num_partitions. Actually for INSERT, there is
>> only a single plan element in mtstate->mt_plans[] array.
>>
>> Similarly, for update-tuple routing, we cannot use
>> mtstate->mt_plans[i], because 'i' refers to position in
>> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
>> order of mtstate->mt_partitions; in fact mt_plans has only the plans
>> that are to be scanned on pruned partitions; so it can well be a small
>> subset of total partitions.
>>
>> I am working on an updated patch to fix the above.
>
> Attached patch v10 fixes the above. In the existing code, where it
> builds WCO constraints for each leaf partition; with the patch, that
> code now is applicable to row-movement-updates as well. So the
> assertions in the code are now updated to allow the same. Secondly,
> the mapping for each of the leaf partitions was constructed using the
> root partition attributes. Now in the patch, the
> mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as
> reference. So effectively, map_partition_varattnos() now represents
> not just parent-to-partition mapping, but rather, mapping between any
> two partitions/partitioned_tables. It's done this way, so that we can
> have a common WCO building code for inserts as well as updates. For
> e.g. for inserts, the first (and only) WCO belongs to
> node->nominalRelation so nominalRelation is used for
> map_partition_varattnos(), whereas for updates, first WCO belongs to
> the first resultRelInfo which is not same as nominalRelation. So in
> the patch, in both cases, we use the first resultRelInfo and the WCO
> of the first resultRelInfo for map_partition_varattnos().
>
> Similar thing is done for Returning expressions.
>
> ---------
>
> Another change in the patch is : for ExecInitQual() for WCO quals,
> mtstate->ps is used as parent, rather than first plan. For updates,
> first plan does not belong to the parent partition. In fact, I think
> in all cases, we should use mtstate->ps as the parent.
> mtstate->mt_plans[0] don't look like they should be considered parent
> of these expressions. May be it does not matter to which parent we
> link these quals, because there is no ReScan for ExecModifyTable().
>
> Note that for RETURNING projection expressions, we do use mtstate->ps.
>
> --------
>
> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has different attribute ordering than
> the root : the existing insert-tuple-routing mapping handles that. But
> if the source partition has different ordering w.r.t. the root, it has
> a problem : there is no mapping in the opposite direction, i.e. from
> the leaf to root. And we require that because the tuple of source leaf
> partition needs to be converted to root partition tuple descriptor,
> since ExecFindPartition() starts with root.
>
> To fix this, I have introduced another mapping array
> mtstate->mt_resultrel_maps[]. This corresponds to the
> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
> because the update result relations are pruned subset of the total
> leaf partitions.
>
> So in ExecInsert, before calling ExecFindPartition(), we need to
> convert the leaf partition tuple to root using this reverse mapping.
> Since we need to convert the tuple here, and again after
> ExecFindPartition() for the found leaf partition, I have replaced the
> common code by new function ConvertPartitionTupleSlot().
>
> -------
>
> Used a new flag is_partitionkey_update in ExecInitModifyTable(), which
> can be re-used in subsequent sections , rather than again calling
> IsPartitionKeyUpdate() function again.
>
> -------
>
> Some more test scenarios added that cover above changes. Basically
> partitions that have different tuple descriptors than parents.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Fri, Jun 16, 2017 at 5:36 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has different attribute ordering than
> the root : the existing insert-tuple-routing mapping handles that. But
> if the source partition has different ordering w.r.t. the root, it has
> a problem : there is no mapping in the opposite direction, i.e. from
> the leaf to root. And we require that because the tuple of source leaf
> partition needs to be converted to root partition tuple descriptor,
> since ExecFindPartition() starts with root.
>
> To fix this, I have introduced another mapping array
> mtstate->mt_resultrel_maps[]. This corresponds to the
> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
> because the update result relations are pruned subset of the total
> leaf partitions.

Hi Amit & Amit,

Just a thought: If I understand correctly this new array of tuple
conversion maps is the same as mtstate->mt_transition_tupconv_maps in
my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
to be committed to close a PG10 open item).  In my patch I bounce
transition tuples from child relations up to the named relation's
triggers, and in this patch you bounce child tuples up to the named
relation for rerouting, so the conversion requirement is the same.
Perhaps we could consider refactoring to build a common struct member
on demand for the row movement patch at some point in the future if it
makes the code cleaner.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached patch v10 fixes the above. In the existing code, where it
> builds WCO constraints for each leaf partition; with the patch, that
> code now is applicable to row-movement-updates as well.

I guess I don't see why it should work like this.  In the INSERT case,
we must build withCheckOption objects for each partition because those
partitions don't appear in the plan otherwise -- but in the UPDATE
case, they're already there, so why do we need to build anything at
all?  Similarly for RETURNING projections.  How are the things we need
for those cases not already getting built, associated with the
relevant resultRelInfos?  Maybe there's a concern if some children got
pruned - they could turn out later to be the children into which
tuples need to be routed.  But the patch makes no distinction between
possibly-pruned children and any others.

> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has different attribute ordering than
> the root : the existing insert-tuple-routing mapping handles that. But
> if the source partition has different ordering w.r.t. the root, it has
> a problem : there is no mapping in the opposite direction, i.e. from
> the leaf to root. And we require that because the tuple of source leaf
> partition needs to be converted to root partition tuple descriptor,
> since ExecFindPartition() starts with root.

Seems reasonable, but...

> To fix this, I have introduced another mapping array
> mtstate->mt_resultrel_maps[]. This corresponds to the
> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
> because the update result relations are pruned subset of the total
> leaf partitions.

... I don't understand how you can *not* need a per-leaf-partition
mapping.  I mean, maybe you only need the mapping for the *unpruned*
leaf partitions but you certainly need a separate mapping for each one
of those.

It's possible to imagine driving the tuple routing off of just the
partition key attributes, extracted from wherever they are inside the
tuple at the current level, rather than converting to the root's tuple
format.  However, that's not totally straightforward because there
could be multiple levels of partitioning throughout the tree and
different attributes might be needed at different levels.  Moreover,
in most cases, the mappings are going to end up being no-ops because
the column order will be the same, so it's probably not worth
complicating the code to try to avoid a double conversion that usually
won't happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 20 June 2017 at 03:42, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Just a thought: If I understand correctly this new array of tuple
> conversion maps is the same as mtstate->mt_transition_tupconv_maps in
> my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
> to be committed to close a PG10 open item).  In my patch I bounce
> transition tuples from child relations up to the named relation's
> triggers, and in this patch you bounce child tuples up to the named
> relation for rerouting, so the conversion requirement is the same.
> Perhaps we could consider refactoring to build a common struct member
> on demand for the row movement patch at some point in the future if it
> makes the code cleaner.

I agree; thanks for bringing this to my attention. The conversion maps
in my patch and yours do sound like they are exactly the same. And
even in the case where both update-row-movement and transition tables
are playing together, the same map should serve the purpose of both. I
will keep a watch on your patch, and check how I can adjust my patch
so that I don't have to refactor the mapping.

One difference I see is: in your patch, in ExecModifyTable() we
advance the current map position for each successive subplan, whereas
in my patch, in ExecInsert() we deduce the position of the right map
to fetch from the position of the current resultRelInfo in the
mtstate->resultRelInfo[] array. I think your way is more consistent
with the existing code.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 20 June 2017 at 03:46, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Attached patch v10 fixes the above. In the existing code, where it
>> builds WCO constraints for each leaf partition; with the patch, that
>> code now is applicable to row-movement-updates as well.
>
> I guess I don't see why it should work like this.  In the INSERT case,
> we must build withCheckOption objects for each partition because those
> partitions don't appear in the plan otherwise -- but in the UPDATE
> case, they're already there, so why do we need to build anything at
> all?  Similarly for RETURNING projections.  How are the things we need
> for those cases not already getting built, associated with the
> relevant resultRelInfos?  Maybe there's a concern if some children got
> pruned - they could turn out later to be the children into which
> tuples need to be routed. But the patch makes no distinction
> between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think typically for updates, a very small subset of the total leaf
partitions will be there in the plans, others would get pruned. IMHO,
it would not be worth having an optimization where it opens only those
leaf partitions which are not already there in the subplans. Without
the optimization, we are able to re-use the INSERT infrastructure
without additional changes.


>
>> There is another issue I discovered. The row-movement works fine if
>> the destination leaf partition has different attribute ordering than
>> the root : the existing insert-tuple-routing mapping handles that. But
>> if the source partition has different ordering w.r.t. the root, it has
>> a problem : there is no mapping in the opposite direction, i.e. from
>> the leaf to root. And we require that because the tuple of source leaf
>> partition needs to be converted to root partition tuple descriptor,
>> since ExecFindPartition() starts with root.
>
> Seems reasonable, but...
>
>> To fix this, I have introduced another mapping array
>> mtstate->mt_resultrel_maps[]. This corresponds to the
>> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
>> because the update result relations are pruned subset of the total
>> leaf partitions.
>
> ... I don't understand how you can *not* need a per-leaf-partition
> mapping.  I mean, maybe you only need the mapping for the *unpruned*
> leaf partitions

Yes, we need the mapping only for the unpruned leaf partitions, and
those partitions are available in the per-subplan resultRelInfo's.

> but you certainly need a separate mapping for each one of those.

You mean *each* of the leaf partitions ? I didn't get why we would
need it for each one. The tuple targeted for update belongs to one of
the per-subplan resultInfos. And this tuple is to be routed to another
leaf partition. So the reverse mapping is for conversion from the
source resultRelinfo to the root partition. I am unable to figure out
a scenario where we would require this reverse mapping for partitions
on which UPDATE is *not* going to be executed.

>
> It's possible to imagine driving the tuple routing off of just the
> partition key attributes, extracted from wherever they are inside the
> tuple at the current level, rather than converting to the root's tuple
> format.  However, that's not totally straightforward because there
> could be multiple levels of partitioning throughout the tree and
> different attributes might be needed at different levels.

Yes, the conversion anyway occurs at each of these levels even for
insert, specifically because there can be different partition
attributes each time. For update, it's only one additional conversion.
But yes, this new mapping would be required for this one single
conversion.

> Moreover,
> in most cases, the mappings are going to end up being no-ops because
> the column order will be the same, so it's probably not worth
> complicating the code to try to avoid a double conversion that usually
> won't happen.

I agree.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> I guess I don't see why it should work like this.  In the INSERT case,
>> we must build withCheckOption objects for each partition because those
>> partitions don't appear in the plan otherwise -- but in the UPDATE
>> case, they're already there, so why do we need to build anything at
>> all?  Similarly for RETURNING projections.  How are the things we need
>> for those cases not already getting built, associated with the
>> relevant resultRelInfos?  Maybe there's a concern if some children got
>> pruned - they could turn out later to be the children into which
>> tuples need to be routed. But the patch makes no distinction
>> between possibly-pruned children and any others.
>
> Yes, only a subset of the partitions appear in the UPDATE subplans. I
> think typically for updates, a very small subset of the total leaf
> partitions will be there in the plans, others would get pruned. IMHO,
> it would not be worth having an optimization where it opens only those
> leaf partitions which are not already there in the subplans. Without
> the optimization, we are able to re-use the INSERT infrastructure
> without additional changes.

Well, that is possible, but certainly not guaranteed.  I mean,
somebody could do a whole-table UPDATE, or an UPDATE that hits a
smattering of rows in every partition; e.g. the table is partitioned
on order number, and you do UPDATE lineitem SET product_code = 'K372B'
WHERE product_code = 'K372'.

Leaving that aside, the point here is that you're rebuilding
withCheckOptions and returningLists that have already been built in
the planner.  That's bad for two reasons.  First, it's inefficient,
especially if there are many partitions.  Second, it will amount to a
functional bug if you get a different answer than the planner did.
Note this comment in the existing code:
    /*
     * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
     * that we didn't build the withCheckOptionList for each partition within
     * the planner, but simple translation of the varattnos for each partition
     * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
     * cases are handled above.
     */

The comment "UPDATE/DELETE cases are handled above" is referring to
the code that initializes the WCOs generated by the planner.  You've
modified the comment in your patch, but not the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above.  And
then they are handled again here.  Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway.  I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken?  If it works
for some reason, the comments don't explain what that reason is.

>> ... I don't understand how you can *not* need a per-leaf-partition
>> mapping.  I mean, maybe you only need the mapping for the *unpruned*
>> leaf partitions
>
> Yes, we need the mapping only for the unpruned leaf partitions, and
> those partitions are available in the per-subplan resultRelInfo's.

OK.

>> but you certainly need a separate mapping for each one of those.
>
> You mean *each* of the leaf partitions ? I didn't get why we would
> need it for each one. The tuple targeted for update belongs to one of
> the per-subplan resultInfos. And this tuple is to be routed to another
> leaf partition. So the reverse mapping is for conversion from the
> source resultRelinfo to the root partition. I am unable to figure out
> a scenario where we would require this reverse mapping for partitions
> on which UPDATE is *not* going to be executed.

I agree - the reverse mapping is only needed for the partitions in
which UPDATE will be executed.

Some other things:

+             * The row was already deleted by a concurrent DELETE. So we don't
+             * have anything to update.

I find this explanation, and the surrounding comments, inadequate.  It
doesn't really explain why we're doing this.  I think it should say
something like this: For a normal UPDATE, the case where the tuple has
been the subject of a concurrent UPDATE or DELETE would be handled by
the EvalPlanQual machinery, but for an UPDATE that we've translated
into a DELETE from this partition and an INSERT into some other
partition, that's not available, because CTID chains can't span
relation boundaries.  We mimic the semantics to a limited extent by
skipping the INSERT if the DELETE fails to find a tuple.  This ensures
that two concurrent attempts to UPDATE the same tuple at the same time
can't turn one tuple into two, and that an UPDATE of a just-deleted
tuple can't resurrect it.
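
In other words, the row-movement path would look roughly like this
(the extra tuple_deleted output flag and its exact placement are
assumptions, not the actual patch code):

bool        tuple_deleted = false;

/* Delete the old version from the source partition. */
ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
           false /* canSetTag */, &tuple_deleted /* assumed flag */);

/*
 * If a concurrent session already deleted (or moved) the row, skip the
 * INSERT so that the UPDATE can neither resurrect the row nor turn one
 * row into two.
 */
if (tuple_deleted)
    slot = ExecInsert(mtstate, slot, planSlot, NULL, ONCONFLICT_NONE,
                      estate, false /* canSetTag */);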

+            bool        partition_check_passed_with_trig_tuple;
+
+            partition_check_passed =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+            partition_check_passed_with_trig_tuple =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+            if (partition_check_passed)
+            {
+                /*
+                 * If it's the trigger that is causing partition constraint
+                 * violation, abort. We don't want a trigger to cause tuple
+                 * routing.
+                 */
+                if (!partition_check_passed_with_trig_tuple)
+                    ExecPartitionCheckEmitError(resultRelInfo,
+                                                trig_slot, estate);
+            }
+            else
+            {
+                /*
+                 * Partition constraint failed with original NEW tuple. But the
+                 * trigger might even have modifed the tuple such that it fits
+                 * back into the partition. So partition constraint check
+                 * should be based on *final* NEW tuple.
+                 */
+                partition_check_passed = partition_check_passed_with_trig_tuple;
+            }

Maybe I inadvertently gave the contrary impression in some prior
review, but this logic doesn't seem right to me.  I don't think
there's any problem with a BR UPDATE trigger causing tuple routing.
What I want to avoid is repeatedly rerouting the same tuple, but I
don't think that could happen even without this guard. We've now fixed
insert tuple routing so that a BR INSERT trigger can't cause the
partition constraint to be violated (cf. commit
15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for
update tuple routing to trigger additional BR UPDATE triggers.  So I
don't see the point of checking the constraints twice here.  I think
what you want to do is get rid of all the changes here and instead
adjust the logic just before ExecConstraints() to invoke
ExecPartitionCheck() on the post-trigger version of the tuple.

Parenthetically, if we decided to keep this logic as you have it, the
code that sets partition_check_passed and
partition_check_passed_with_trig_tuple doesn't need to check
resultRelInfo->ri_PartitionCheck because the surrounding "if" block
already did.

+    for (i = 0; i < num_rels; i++)
+    {
+        ResultRelInfo *resultRelInfo = &result_rels[i];
+        Relation        rel = resultRelInfo->ri_RelationDesc;
+        Bitmapset     *expr_attrs = NULL;
+
+        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+            return true;
+    }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others; or in other words, why do we have to loop over
all of the result relations?  It seems to me that the user has typed
something like:

UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
AND thunk = ...

If either thingy or whatsit is a partitioning column, UPDATE tuple
routing might be needed - and it should be able to test that by a
*single* comparison between the set of columns being updated and the
partitioning columns, without needing to repeat for every partitions.
Perhaps that test needs to be done at plan time and saved in the plan,
rather than performed here -- or maybe it's easy enough to do it here.
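
For instance, a single check against just the target table's own
partition key could look like this (a simplification that ignores
expression keys and lower-level partition keys; the function name is
made up):

static bool
UpdateTouchesPartitionKey(Relation targetRel, Bitmapset *updatedCols)
{
    PartitionKey key = RelationGetPartitionKey(targetRel);
    Bitmapset  *keycols = NULL;
    int         i;

    for (i = 0; i < key->partnatts; i++)
    {
        AttrNumber  attno = key->partattrs[i];

        if (attno != 0)         /* skip expression keys in this sketch */
            keycols = bms_add_member(keycols,
                                attno - FirstLowInvalidHeapAttributeNumber);
    }

    /* updatedCols is also offset by FirstLowInvalidHeapAttributeNumber. */
    return bms_overlap(keycols, updatedCols);
}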

One problem is that, if BR UPDATE triggers are in fact allowed to
cause tuple routing as I proposed above, the presence of a BR UPDATE
trigger for any partition could necessitate UPDATE tuple routing for
queries that wouldn't otherwise need it.  But even if you end up
inserting a test for that case, it can surely be a lot cheaper than
this, since it only involves checking a boolean flag, not a bitmapset.
It could be argued that we ought to prohibit BR UPDATE triggers from
causing tuple routing so that we don't have to do this test at all,
but I'm not sure that's a good trade-off.  It seems to necessitate
checking the partition constraint twice per tuple instead of once per
tuple, which seems like a very heavy price.

+#define GetUpdatedColumns(relinfo, estate) \
+    (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)

I think this should be moved to a header file (and maybe turned into a
static inline function) rather than copy-pasting the definition into a
new file.
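
Something along these lines, for example (placement in a header such
as executor/executor.h is just a suggestion):

static inline Bitmapset *
GetUpdatedColumns(ResultRelInfo *relinfo, EState *estate)
{
    RangeTblEntry *rte = rt_fetch(relinfo->ri_RangeTableIndex,
                                  estate->es_range_table);

    return rte->updatedCols;
}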

-            List       *mapped_wcoList;
+            List       *mappedWco;
             List       *wcoExprs = NIL;
             ListCell   *ll;

-            /* varno = node->nominalRelation */
-            mapped_wcoList = map_partition_varattnos(wcoList,
-                                                     node->nominalRelation,
-                                                     partrel, rel);
-            foreach(ll, mapped_wcoList)
+            mappedWco = map_partition_varattnos(firstWco, firstVarno,
+                                                partrel, firstResultRel);
+            foreach(ll, mappedWco)
             {
                 WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
                 ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-                                                   plan);
+                                                   &mtstate->ps);
                 wcoExprs = lappend(wcoExprs, wcoExpr);
             }

-            resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+            resultRelInfo->ri_WithCheckOptions = mappedWco;

Renaming the variable looks fairly pointless, unless I'm missing something?

Regarding the tests, it seems like you've got a test case where you
update a sub-partition and it fails because the tuple would need to be
moved out of a sub-tree, which is good.  But I think it would also be
good to have a case where you update a sub-partition and it succeeds
in moving the tuple within the subtree.  I don't see one like that
presently; it seems all the others update the topmost root or the
leaf.  I also think it would be a good idea to make sub_parted's
column order different from both list_parted and its own children, and
maybe use a diversity of data types (e.g. int4, int8, text instead of
making everything int).

+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;

The extra space before the comma looks strange.

Also, please make a habit of checking patches for whitespace errors
using git diff --check.

[rhaas pgsql]$ git diff --check
src/backend/executor/nodeModifyTable.c:384: indent with spaces.
+                        tuple, &slot);
src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
+                IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));

You will notice these kinds of things if you read the diff you are
submitting before you press send, because git highlights them in
bright red.  That's a good practice for many other reasons, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/06/21 3:53, Robert Haas wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed. But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans. I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans, others would get pruned. IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans. Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
> 
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition; e.g. the table is partitioned
> on order number, and you do UPDATE lineitem SET product_code = 'K372B'
> WHERE product_code = 'K372'.
> 
> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.  Second, it will amount to a
> functional bug if you get a different answer than the planner did.
> Note this comment in the existing code:
> 
>     /*
>      * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>      * that we didn't build the withCheckOptionList for each partition within
>      * the planner, but simple translation of the varattnos for each partition
>      * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>      * cases are handled above.
>      */
> 
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the comment in your patch, but the associated code: your
> updated comment says that only "DELETEs and local UPDATES are handled
> above", but in reality, *all* updates are still handled above.  And
> then they are handled again here.  Similarly for returning lists.
> It's certainly not OK for the comment to be inaccurate, but I think
> it's also bad to redo the work which the planner has already done,
> even if it makes the patch smaller.

I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
seems like WCOs are being initialized for the leaf partitions
(ResultRelInfos in the mt_partitions array) that are in turn initialized
for the aforementioned INSERT.  That's why the term "...local UPDATEs"
appears in the new comment text.

If that's true, I wonder if it makes sense to apply what would be
WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
by calling ExecInsert()?

> Also, I feel like it's probably not correct to use the first result
> relation as the nominal relation for building WCOs and returning lists
> anyway.  I mean, if the first result relation has a different column
> order than the parent relation, isn't this just broken?  If it works
> for some reason, the comments don't explain what that reason is.

Yep, it's more appropriate to use
ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
is, if answer to the question I raised above is positive.

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The comment "UPDATE/DELETE cases are handled above" is referring to
>> the code that initializes the WCOs generated by the planner.  You've
>> modified the comment in your patch, but the associated code: your
>> updated comment says that only "DELETEs and local UPDATES are handled
>> above", but in reality, *all* updates are still handled above.  And
>> then they are handled again here.  Similarly for returning lists.
>> It's certainly not OK for the comment to be inaccurate, but I think
>> it's also bad to redo the work which the planner has already done,
>> even if it makes the patch smaller.
>
> I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
> seems like WCOs are being initialized for the leaf partitions
> (ResultRelInfos in the mt_partitions array) that are in turn are
> initialized for the aforementioned INSERT.  That's why the term "...local
> UPDATEs" in the new comment text.
>
> If that's true, I wonder if it makes sense to apply what would be
> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
> by calling ExecInsert()?

I think we probably should apply the insert policy, just as we're
executing the insert trigger.

>> Also, I feel like it's probably not correct to use the first result
>> relation as the nominal relation for building WCOs and returning lists
>> anyway.  I mean, if the first result relation has a different column
>> order than the parent relation, isn't this just broken?  If it works
>> for some reason, the comments don't explain what that reason is.
>
> Yep, it's more appropriate to use
> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
> is, if answer to the question I raised above is positive.

The questions appear to me to be independent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 21 June 2017 at 00:23, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed. But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans. I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans, others would get pruned. IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans. Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
>
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition;

I am not saying that it's guaranteed to be a small subset. I am saying
that it would typically be a small subset for the
update-of-partition-key case. It seems unlikely that a user would cause
update-row-movement across multiple partitions at the same time.
Generally it would be an administrative task where some/all of the
rows of a partition need to have their partition key updated so that
they move to a different partition, and so there would probably be a
WHERE clause that narrows down the update to that particular
partition, because without the WHERE clause the update is anyway
slower and it's redundant to scan all the other partitions.

But, point taken, that there can always be certain cases involving
multiple table partition-key updates.

> e.g. the table is partitioned on order number, and you do UPDATE
> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update the order number, so there is no
partition-key update here. Are you thinking that the patch is generating
the per-leaf-partition WCO expressions even for an update that does not
involve a partition key?

>
> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.

Yeah, I agree that this becomes more and more redundant if the update
involves more partitions.

> Second, it will amount to a functional bug if you get a
> different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfos are meant for the
resultRelInfos used for the UPDATE plans. So it should not
matter whether they look the same or different, because they are fired
at different objects. Those objects can, however, happen to be the same
relations.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both for the same
relation. I am possibly missing something. The planner has already
generated the withCheckOptions for each of the resultRelInfos. And then
we are using one of those to re-generate the WCO for a leaf partition
by only adjusting the attnos. If there is already a WCO generated in
the planner for that leaf partition (because that partition was
present in mtstate->resultRelInfo), then the re-built WCO should look
exactly the same as the earlier one, because they are the same
relation, and so the attnos generated in them would be the same, since the
Relation TupleDesc is the same.

> Note this comment in the existing code:
>
>     /*
>      * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>      * that we didn't build the withCheckOptionList for each partition within
>      * the planner, but simple translation of the varattnos for each partition
>      * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>      * cases are handled above.
>      */
>
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the comment in your patch, but the associated code: your
> updated comment says that only "DELETEs and local UPDATES are handled
> above", but in reality, *all* updates are still handled above.  And

Actually I meant, "the above works only for local updates. For
row-movement updates, we need per-leaf-partition WCOs, because when
the row is inserted into the target partition, that partition may not
be included in the above planner resultRelInfos, so we need WCOs for
all partitions". I think such a comment should be sufficient if I
add it in the code?

> then they are handled again here.
> Similarly for returning lists.
> It's certainly not OK for the comment to be inaccurate, but I think
> it's also bad to redo the work which the planner has already done,
> even if it makes the patch smaller.
>
> Also, I feel like it's probably not correct to use the first result
> relation as the nominal relation for building WCOs and returning lists
> anyway.  I mean, if the first result relation has a different column
> order than the parent relation, isn't this just broken?  If it works
> for some reason, the comments don't explain what that reason is.

Not sure why the parent relation should come into the picture. As long as
the first result relation belongs to one of the partitions in the whole
partition tree, we should be able to use it to build the WCOs of any
other partition, because they have a common set of attributes having
the same names. So we are bound to find each of the attributes of the
first resultRelInfo in the other leaf partitions during attno mapping.

> Some other things:
>
> +             * The row was already deleted by a concurrent DELETE. So we don't
> +             * have anything to update.
>
> I find this explanation, and the surrounding comments, inadequate.  It
> doesn't really explain why we're doing this.  I think it should say
> something like this: For a normal UPDATE, the case where the tuple has
> been the subject of a concurrent UPDATE or DELETE would be handled by
> the EvalPlanQual machinery, but for an UPDATE that we've translated
> into a DELETE from this partition and an INSERT into some other
> partition, that's not available, because CTID chains can't span
> relation boundaries.  We mimic the semantics to a limited extent by
> skipping the INSERT if the DELETE fails to find a tuple.  This ensures
> that two concurrent attempts to UPDATE the same tuple at the same time
> can't turn one tuple into two, and that an UPDATE of a just-deleted
> tuple can't resurrect it.

Thanks, will put that comment in the next patch.
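
For reference, the shape of that code path is roughly as below. This is
only a sketch: the real ExecDelete()/ExecInsert() signatures in
nodeModifyTable.c differ a bit, and "tuple_deleted" stands in for whatever
output flag the patch ends up using.

bool        tuple_deleted = false;

/* DELETE part of the row movement; reports whether we deleted the tuple */
ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
           &tuple_deleted, canSetTag);

/*
 * If some other session concurrently deleted (or updated and thereby
 * moved) the tuple, skip the INSERT; otherwise one tuple could become
 * two, or a just-deleted tuple could get resurrected.
 */
if (!tuple_deleted)
    return NULL;

/* INSERT part: route the new tuple into the chosen leaf partition */
return ExecInsert(mtstate, slot, planSlot, NULL, ONCONFLICT_NONE,
                  estate, canSetTag);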

>
> +            bool        partition_check_passed_with_trig_tuple;
> +
> +            partition_check_passed =
> +                (resultRelInfo->ri_PartitionCheck &&
> +                 ExecPartitionCheck(resultRelInfo, slot, estate));
> +
> +            partition_check_passed_with_trig_tuple =
> +                (resultRelInfo->ri_PartitionCheck &&
> +                 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
> +            if (partition_check_passed)
> +            {
> +                /*
> +                 * If it's the trigger that is causing partition constraint
> +                 * violation, abort. We don't want a trigger to cause tuple
> +                 * routing.
> +                 */
> +                if (!partition_check_passed_with_trig_tuple)
> +                    ExecPartitionCheckEmitError(resultRelInfo,
> +                                                trig_slot, estate);
> +            }
> +            else
> +            {
> +                /*
> +                 * Partition constraint failed with original NEW tuple. But the
> +                 * trigger might even have modifed the tuple such that it fits
> +                 * back into the partition. So partition constraint check
> +                 * should be based on *final* NEW tuple.
> +                 */
> +                partition_check_passed =
> partition_check_passed_with_trig_tuple;
> +            }
>
> Maybe I inadvertently gave the contrary impression in some prior
> review, but this logic doesn't seem right to me.  I don't think
> there's any problem with a BR UPDATE trigger causing tuple routing.
> What I want to avoid is repeatedly rerouting the same tuple, but I
> don't think that could happen even without this guard. We've now fixed
> insert tuple routing so that a BR INSERT trigger can't cause the
> partition constraint to be violated (cf. commit
> 15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for
> update tuple routing to trigger additional BR UPDATE triggers.  So I
> don't see the point of checking the constraints twice here.  I think
> what you want to do is get rid of all the changes here and instead
> adjust the logic just before ExecConstraints() to invoke
> ExecPartitionCheck() on the post-trigger version of the tuple.

When I came up with this code, the intention was to make sure a BR
UPDATE trigger does not cause tuple routing. But yeah, I can't recall
what made me think that the above changes would be needed to prevent a
BR UPDATE trigger from causing tuple routing. With the latest code, it
indeed looks like we can get rid of these changes and still prevent
that.

BTW, that code was not to avoid repeated re-routing.

Above, you seem to say that there's no problem with a BR UPDATE trigger
causing tuple routing. But when none of the partition-key columns
are used in the UPDATE, we don't set up for update tuple routing, so with
no partition-key update, tuple routing will not occur even if a BR
UPDATE trigger would have caused UPDATE tuple routing. This is one
restriction we have to live with, because we decide beforehand whether
to do the tuple-routing setup based on the columns modified in the
UPDATE query.

>
> Parenthetically, if we decided to keep this logic as you have it, the
> code that sets partition_check_passed and
> partition_check_passed_with_trig_tuple doesn't need to check
> resultRelInfo->ri_PartitionCheck because the surrounding "if" block
> already did.

Yes.

>
> +    for (i = 0; i < num_rels; i++)
> +    {
> +        ResultRelInfo *resultRelInfo = &result_rels[i];
> +        Relation        rel = resultRelInfo->ri_RelationDesc;
> +        Bitmapset     *expr_attrs = NULL;
> +
> +        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
> +
> +        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
> +        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
> +            return true;
> +    }
>
> This seems like an awfully expensive way of performing this test.
> Under what circumstances could this be true for some result relations
> and false for others;

One resultRelInfo can have no partition key column used in its quals,
but the next resultRelInfo can have quite different quals, and those
quals can refer to a partition key. This is possible if the two of
them have different parents that have different partition-key columns.

> or in other words, why do we have to loop over all of the result
> relations?  It seems to me that the user has typed something like:
>
> UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
> AND thunk = ...
>
> If either thingy or whatsit is a partitioning column, UPDATE tuple
> routing might be needed

So, in the above code, bms_overlap() would return true if either
thingy or whatsit is a partitioning column.

> - and it should be able to test that by a
> *single* comparison between the set of columns being updated and the
> partitioning columns, without needing to repeat for every partitions.

If bms_overlap() returns true for the very first resultRelInfo, it
will return immediately. But yes, if none of the relations use a
partition key, we will have to scan all of them. But again, note
that these are the leaf partitions left after pruning; they typically will
not include all the leaf partitions.

> Perhaps that test needs to be done at plan time and saved in the plan,
> rather than performed here -- or maybe it's easy enough to do it here.

Hmm, it looks convenient here because mtstate->resultRelInfo gets set only here.

>
> One problem is that, if BR UPDATE triggers are in fact allowed to
> cause tuple routing as I proposed above, the presence of a BR UPDATE
> trigger for any partition could necessitate UPDATE tuple routing for
> queries that wouldn't otherwise need it.

You mean always set up update tuple routing if there's a BR UPDATE
trigger? Actually I was going for disallowing a BR UPDATE trigger from
initiating tuple routing, as I described above.

> But even if you end up
> inserting a test for that case, it can surely be a lot cheaper than
> this,

I didn't exactly get why the bms_overlap() test needs to be
compared with the presence-of-trigger test.

> since it only involves checking a boolean flag, not a bitmapset.
> It could be argued that we ought to prohibit BR UPDATE triggers from
> causing tuple routing so that we don't have to do this test at all,
> but I'm not sure that's a good trade-off.
> It seems to necessitate checking the partition constraint twice per
> tuple instead of once per tuple, which seems like a very heavy price.

I think I didn't quite understand this paragraph as a whole. Can you
state the trade-off here again?

>
> +#define GetUpdatedColumns(relinfo, estate) \
> +    (rt_fetch((relinfo)->ri_RangeTableIndex,
> (estate)->es_range_table)->updatedCols)
>
> I think this should be moved to a header file (and maybe turned into a
> static inline function) rather than copy-pasting the definition into a
> new file.

Will do that.
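
Something along these lines, I suppose (header placement is open; this is
just a sketch of the static-inline version of the same macro):

/* e.g. in executor/executor.h; rt_fetch() comes from parser/parsetree.h */
static inline Bitmapset *
GetUpdatedColumns(ResultRelInfo *relinfo, EState *estate)
{
    return rt_fetch(relinfo->ri_RangeTableIndex,
                    estate->es_range_table)->updatedCols;
}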

>
> -            List       *mapped_wcoList;
> +            List       *mappedWco;
>              List       *wcoExprs = NIL;
>              ListCell   *ll;
>
> -            /* varno = node->nominalRelation */
> -            mapped_wcoList = map_partition_varattnos(wcoList,
> -                                                     node->nominalRelation,
> -                                                     partrel, rel);
> -            foreach(ll, mapped_wcoList)
> +            mappedWco = map_partition_varattnos(firstWco, firstVarno,
> +                                                partrel, firstResultRel);
> +            foreach(ll, mappedWco)
>              {
>                  WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
>                  ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
> -                                                   plan);
> +                                                   &mtstate->ps);
>
>                  wcoExprs = lappend(wcoExprs, wcoExpr);
>              }
>
> -            resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
> +            resultRelInfo->ri_WithCheckOptions = mappedWco;
>
> Renaming the variable looks fairly pointless, unless I'm missing something?

We are converting from firstWco to mappedWco. So firstWco => mappedWco
looks like a more natural pairing than firstWco => mapped_wcoList.

And I renamed wcoList to firstWco because I wanted to emphasize that it
is the first WCO list out of node->withCheckOptionLists. In the
existing code this was only for INSERT, where withCheckOptionLists is a
single-element list, so the name firstWco didn't seem suitable; but with
multiple elements, it is worth naming it firstWco to emphasize that we
take the first one irrespective of whether it is an UPDATE or an INSERT.

>
> Regarding the tests, it seems like you've got a test case where you
> update a sub-partition and it fails because the tuple would need to be
> moved out of a sub-tree, which is good.  But I think it would also be
> good to have a case where you update a sub-partition and it succeeds
> in moving the tuple within the subtree.  I don't see one like that
> presently; it seems all the others update the topmost root or the
> leaf.  I also think it would be a good idea to make sub_parted's
> column order different from both list_parted and its own children, and
> maybe use a diversity of data types (e.g. int4, int8, text instead of
> making everything int).
>
> +select tableoid::regclass , * from list_parted where a = 2 order by 1;
> +update list_parted set b = c + a where a = 2;
> +select tableoid::regclass , * from list_parted where a = 2 order by 1;
>
> The extra space before the comma looks strange.

Will do the above changes, thanks.

>
> Also, please make a habit of checking patches for whitespace errors
> using git diff --check.
>
> [rhaas pgsql]$ git diff --check
> src/backend/executor/nodeModifyTable.c:384: indent with spaces.
> +                        tuple, &slot);
> src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
> +                IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));
>
> You will notice these kinds of things if you read the diff you are
> submitting before you press send, because git highlights them in
> bright red.  That's a good practice for many other reasons, too.

Yeah, somehow I think I missed these because I must have checked only
the incremental diffs w.r.t. the earlier patch where I must have
introduced them. Your point is well taken that we should make it a
habit to check the complete patch with the --check option, or to apply
the patch myself.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 21 June 2017 at 20:14, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> The comment "UPDATE/DELETE cases are handled above" is referring to
>>> the code that initializes the WCOs generated by the planner.  You've
>>> modified the comment in your patch, but the associated code: your
>>> updated comment says that only "DELETEs and local UPDATES are handled
>>> above", but in reality, *all* updates are still handled above.  And
>>> then they are handled again here.  Similarly for returning lists.
>>> It's certainly not OK for the comment to be inaccurate, but I think
>>> it's also bad to redo the work which the planner has already done,
>>> even if it makes the patch smaller.
>>
>> I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
>> seems like WCOs are being initialized for the leaf partitions
>> (ResultRelInfos in the mt_partitions array) that are in turn are
>> initialized for the aforementioned INSERT.  That's why the term "...local
>> UPDATEs" in the new comment text.
>>
>> If that's true, I wonder if it makes sense to apply what would be
>> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
>> by calling ExecInsert()?
>
> I think we probably should apply the insert policy, just as we're
> executing the insert trigger.

Yes, the RLS quals should execute during tuple routing according to
whether it is an update or whether it has been converted to an insert. I
think the tests don't quite cover the insert part. Will check.

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.
>>
>> Yep, it's more appropriate to use
>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>> is, if answer to the question I raised above is positive.

From what I had checked earlier when coding that part,
rootResultRelInfo is NULL in case of inserts, unless something has
changed in later commits. That's the reason I decided to use the first
resultRelInfo.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Jun 21, 2017 at 1:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> e.g. the table is partitioned on order number, and you do UPDATE
>> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.
>
> This query does not update order number, so here there is no
> partition-key-update. Are you thinking that the patch is generating
> the per-leaf-partition WCO expressions even for a update not involving
> a partition key ?

No, it just wasn't a great example.  Sorry.

>> Second, it will amount to a functional bug if you get a
>> different answer than the planner did.
>
> Actually, the per-leaf WCOs are meant to be executed on the
> destination partitions where the tuple is moved, while the WCOs
> belonging to the per-subplan resultRelInfo are meant for the
> resultRelinfo used for the UPDATE plans. So actually it should not
> matter whether they look same or different, because they are fired at
> different objects. Now these objects can happen to be the same
> relations though.
>
> But in any case, it's not clear to me how the mapped WCO and the
> planner's WCO would yield a different answer if they are both the same
> relation. I am possibly missing something. The planner has already
> generated the withCheckOptions for each of the resultRelInfo. And then
> we are using one of those to re-generate the WCO for a leaf partition
> by only adjusting the attnos. If there is already a WCO generated in
> the planner for that leaf partition (because that partition was
> present in mtstate->resultRelInfo), then the re-built WCO should be
> exactly look same as the earlier one, because they are the same
> relations, and so the attnos generated in them would be same since the
> Relation TupleDesc is the same.

If the planner's WCOs and mapped WCOs are always the same, then I
think we should try to avoid generating both.  If they can be
different, but that's intentional and correct, then there's no
substantive problem with the patch but the comments need to make it
clear why we are generating both.

> Actually I meant, "above works for only local updates. For
> row-movement-updates, we need per-leaf partition WCOs, because when
> the row is inserted into target partition, that partition may be not
> be included in the above planner resultRelInfo, so we need WCOs for
> all partitions". I think this said comment should be sufficient if I
> add this in the code ?

Let's not get too focused on updating the comment until we are in
agreement about what the code ought to be doing.  I'm not clear
whether you accept the point that the patch needs to be changed to
avoid generating the same WCOs and returning lists in both the planner
and the executor.

>> Also, I feel like it's probably not correct to use the first result
>> relation as the nominal relation for building WCOs and returning lists
>> anyway.  I mean, if the first result relation has a different column
>> order than the parent relation, isn't this just broken?  If it works
>> for some reason, the comments don't explain what that reason is.
>
> Not sure why parent relation should come into picture. As long as the
> first result relation belongs to one of the partitions in the whole
> partition tree, we should be able to use that to build WCOs of any
> other partitions, because they have a common set of attributes having
> the same name. So we are bound to find each of the attributes of first
> resultRelInfo in the other leaf partitions during attno mapping.

Well, at least for returning lists, we've got to generate the
returning lists so that they all match the column order of the parent,
not the parent's first child.  Otherwise, for example, UPDATE
parent_table ... RETURNING * will not work correctly.  The tuples
returned by the returning clause have to have the attribute order of
parent_table, not the attribute order of parent_table's first child.
I'm not sure whether WCOs have the same issue, but it's not clear to
me why they wouldn't: they contain a qual which is an expression tree,
and presumably there are Var nodes in there someplace, and if so, then
they have varattnos that have to be right for the purpose for which
they're going to be used.

>> +    for (i = 0; i < num_rels; i++)
>> +    {
>> +        ResultRelInfo *resultRelInfo = &result_rels[i];
>> +        Relation        rel = resultRelInfo->ri_RelationDesc;
>> +        Bitmapset     *expr_attrs = NULL;
>> +
>> +        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>> +
>> +        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
>> +        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
>> +            return true;
>> +    }
>>
>> This seems like an awfully expensive way of performing this test.
>> Under what circumstances could this be true for some result relations
>> and false for others;
>
> One resultRelinfo can have no partition key column used in its quals,
> but the next resultRelinfo can have quite different quals, and these
> quals can have partition key referred. This is possible if the two of
> them have different parents that have different partition-key columns.

Hmm, true.  So if we have a table foo that is partitioned by list (a),
and one of its children is a table bar that is partitioned by list
(b), then we need to consider doing tuple-routing if either column a
is modified, or if column b is modified for a partition which is a
descendant of bar.  But visiting that only requires looking at the
partitioned table and those children that are also partitioned, not
all of the leaf partitions as the patch does.
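
To be concrete, I'm imagining something like the sketch below.  It assumes
we can get at the OIDs of the partitioned (non-leaf) tables in the tree,
and it ignores partition keys that use expressions, so it's illustrative
only, not what the patch should literally do:

static bool
is_partition_key_update(List *partitionedTableOids, Bitmapset *updatedCols)
{
    Bitmapset  *keyCols = NULL;
    ListCell   *lc;

    /*
     * Collect the simple-column partition key attributes of every
     * partitioned table in the tree, offset the same way updatedCols is.
     */
    foreach(lc, partitionedTableOids)
    {
        Relation        rel = heap_open(lfirst_oid(lc), AccessShareLock);
        PartitionKey    key = RelationGetPartitionKey(rel);
        int             i;

        for (i = 0; i < key->partnatts; i++)
        {
            AttrNumber  attno = key->partattrs[i];

            if (attno != InvalidAttrNumber)
                keyCols = bms_add_member(keyCols,
                            attno - FirstLowInvalidHeapAttributeNumber);
        }
        heap_close(rel, AccessShareLock);
    }

    /* One overlap test, instead of one per result relation. */
    return bms_overlap(keyCols, updatedCols);
}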

>> - and it should be able to test that by a
>> *single* comparison between the set of columns being updated and the
>> partitioning columns, without needing to repeat for every partitions.
>
> If bms_overlap() returns true for the very first resultRelinfo, it
> will return immediately. But yes, if there are no relations using
> partition key, we will have to scan all of these relations. But again,
> note that these are pruned leaf partitions, they typically will not
> contain all the leaf partitions.

But they might, and then this will be inefficient.  Just because the
patch doesn't waste many cycles in the case where most partitions are
pruned doesn't mean that it's OK for it to waste cycles when few
partitions are pruned.

>> One problem is that, if BR UPDATE triggers are in fact allowed to
>> cause tuple routing as I proposed above, the presence of a BR UPDATE
>> trigger for any partition could necessitate UPDATE tuple routing for
>> queries that wouldn't otherwise need it.
>
> You mean always setup update tuple routing if there's a BR UPDATE
> trigger ?

Yes.

> Actually I was going for disallowing BR update trigger to
> initiate tuple routing, as I described above.

I know that!  But as I said before, that requires evaluating every
partition key constraint twice per tuple, which seems very expensive.
I'm very doubtful that's a good approach.

>> But even if you end up
>> inserting a test for that case, it can surely be a lot cheaper than
>> this,
>
> I didn't exactly get why the bitmap_overlap() test needs to be
> compared with the presence-of-trigger test.

My point was: If you always set up tuple routing when a BR UPDATE
trigger is present, then you don't need to check the partition
constraint twice per tuple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> Yep, it's more appropriate to use
>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>>> is, if answer to the question I raised above is positive.
>
> From what I had checked earlier when coding that part,
> rootResultRelInfo is NULL in case of inserts, unless something has
> changed in later commits. That's the reason I decided to use the first
> resultRelInfo.

We're just going around in circles here.  Saying that you decided to
use the first child's resultRelInfo because you didn't have a
resultRelInfo for the parent is an explanation of why you wrote the
code the way you did, but that doesn't make it correct.  I want to
know why you think it's correct.

I think it's probably wrong, because it seems to me that if the INSERT
code needs to use the parent's ResultRelInfo rather than the first
child's ResultRelInfo, the UPDATE code probably needs to do the same.
Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
resultRelInfos for non-leaf partitions, and commit
e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
for the topmost parent, because otherwise it didn't work correctly.
If every partition in the hierarchy has a different attribute
ordering, then it seems to me that it must surely matter which of
those attribute orderings we pick.  It's hard to imagine that we can
pick *either* the parent's attribute ordering *or* that of the first
child and nothing will be different - the attribute numbers inside the
returning lists and WCOs we create have got to get used somehow, so
surely it matters which attribute numbers we use, doesn't it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Second, it will amount to a functional bug if you get a
>>> different answer than the planner did.
>>
>> Actually, the per-leaf WCOs are meant to be executed on the
>> destination partitions where the tuple is moved, while the WCOs
>> belonging to the per-subplan resultRelInfo are meant for the
>> resultRelinfo used for the UPDATE plans. So actually it should not
>> matter whether they look same or different, because they are fired at
>> different objects. Now these objects can happen to be the same
>> relations though.
>>
>> But in any case, it's not clear to me how the mapped WCO and the
>> planner's WCO would yield a different answer if they are both the same
>> relation. I am possibly missing something. The planner has already
>> generated the withCheckOptions for each of the resultRelInfo. And then
>> we are using one of those to re-generate the WCO for a leaf partition
>> by only adjusting the attnos. If there is already a WCO generated in
>> the planner for that leaf partition (because that partition was
>> present in mtstate->resultRelInfo), then the re-built WCO should be
>> exactly look same as the earlier one, because they are the same
>> relations, and so the attnos generated in them would be same since the
>> Relation TupleDesc is the same.
>
> If the planner's WCOs and mapped WCOs are always the same, then I
> think we should try to avoid generating both.  If they can be
> different, but that's intentional and correct, then there's no
> substantive problem with the patch but the comments need to make it
> clear why we are generating both.
>
>> Actually I meant, "above works for only local updates. For
>> row-movement-updates, we need per-leaf partition WCOs, because when
>> the row is inserted into target partition, that partition may be not
>> be included in the above planner resultRelInfo, so we need WCOs for
>> all partitions". I think this said comment should be sufficient if I
>> add this in the code ?
>
> Let's not get too focused on updating the comment until we are in
> agreement about what the code ought to be doing.  I'm not clear
> whether you accept the point that the patch needs to be changed to
> avoid generating the same WCOs and returning lists in both the planner
> and the executor.

Yes, we can re-use the WCOs generated in the planner, as an
optimization, since the ones we re-generate for the same relations will
look exactly the same. The WCOs generated by the planner (in
inheritance_planner) are generated when (in adjust_appendrel_attrs())
we change the attnos used in the query to refer to the child RTEs, which
adjusts the attnos of the WCOs of the child RTEs. So the WCOs of the
subplan resultRelInfos are actually the parent table's WCOs, with only the
attnos changed. And in ExecInitModifyTable() we do the same thing for
leaf partitions, although using a different function,
map_variable_attnos().

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.

One thing I didn't mention earlier about the WCOs is that for child
rels, we don't use the WCOs defined for the child rels. We only
inherit the WCO expressions defined for the root rel. That's the
reason they are the same expressions, with only the attnos changed to
match the respective relation's tupledesc. If the WCOs of each of the
subplan resultRelInfos were different, then it would definitely not be
possible to use the first resultRelInfo to generate the other leaf
partitions' WCOs, because the WCO defined for relation A is independent
of that defined for relation B.

So, since the WCOs of all the relations are actually those of the
parent, we only need to adjust the attnos of any one of these
resultRelInfos.

For example, if the root rel's WCO is defined as "col > 5" where col is the
4th column, the expression will look like "var_1.attno_4 > 5". And the
WCO that is generated for a subplan resultRelInfo will look something
like "var_n.attno_2 > 5" if col is the 2nd column in that table.

All of the above logic assumes that we never use the WCOs defined for
the child relation. At least that's how it looks from the way we
generate WCOs in ExecInitModifyTable() for INSERTs, as well as from
the code in inheritance_planner() for UPDATEs. In both these places,
we never use the WCOs defined for child tables.

So suppose we define the tables and their WCOs like this :

CREATE TABLE range_parted ( a text, b int, c int) partition by range (a, b);

ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
GRANT ALL ON range_parted TO PUBLIC ;
create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

create table part_b_10_b_20 partition of range_parted for values from
('b', 10) to ('b', 20) partition by range (c);

create table part_c_1_100 (b int, c int, a text);
alter table part_b_10_b_20 attach partition part_c_1_100 for values
from (1) to (100);
create table part_c_100_200 (c int, a text, b int);
alter table part_b_10_b_20 attach partition part_c_100_200 for values
from (100) to (200);

GRANT ALL ON part_c_100_200 TO PUBLIC ;
ALTER TABLE part_c_100_200 ENABLE ROW LEVEL SECURITY;
create policy seeall ON part_c_100_200 as PERMISSIVE for SELECT using ( true);

insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
insert into part_c_100_200 (a, b, c) values ('b', 17, 105);

-- For root table, allow updates only if NEW.c is even number.
create policy pu on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
-- For this table, allow updates only if NEW.c is divisible by 4.
create policy pu on part_c_100_200 for UPDATE USING (true) WITH CHECK
(c % 4 = 0);


Now, if we try to update the child table using an UPDATE on the root table,
it will allow setting c to a number which would otherwise violate the WCO
constraint of the child table, had the query been run on the child
table directly:

postgres=# set role user1;
SET
postgres=> select tableoid::regclass, * from range_parted where b = 17;
    tableoid    | a | b  |  c
----------------+---+----+-----
 part_c_100_200 | b | 17 | 105
-- root table does not allow updating c to odd numbers
postgres=> update range_parted set c = 107 where a = 'b' and b = 17 ;
ERROR:  new row violates row-level security policy for table "range_parted"
-- child table does not allow modifying it to 106 because 106 is not
-- divisible by 4.
postgres=> update part_c_100_200 set c = 106 where a = 'b' and b = 17 ;
ERROR:  new row violates row-level security policy for table "part_c_100_200"
-- But we can update it to 106 by running the update on the root table,
-- because here the child table's WCOs won't get used.
postgres=> update range_parted set c = 106 where a = 'b' and b = 17 ;
UPDATE 1
postgres=> select tableoid::regclass, * from range_parted where b = 17;
    tableoid    | a | b  |  c
----------------+---+----+-----
 part_c_100_200 | b | 17 | 106

The same applies to INSERTs. I hope this is expected behaviour. Initially
I found this weird, but then saw that it is consistent for both
inserts and updates.

>>
>> Not sure why parent relation should come into picture. As long as the
>> first result relation belongs to one of the partitions in the whole
>> partition tree, we should be able to use that to build WCOs of any
>> other partitions, because they have a common set of attributes having
>> the same name. So we are bound to find each of the attributes of first
>> resultRelInfo in the other leaf partitions during attno mapping.
>
> Well, at least for returning lists, we've got to generate the
> returning lists so that they all match the column order of the parent,
> not the parent's first child.
> Otherwise, for example, UPDATE
> parent_table ... RETURNING * will not work correctly.  The tuples
> returned by the returning clause have to have the attribute order of
> parent_table, not the attribute order of parent_table's first child.
> I'm not sure whether WCOs have the same issue, but it's not clear to
> me why they wouldn't: they contain a qual which is an expression tree,
> and presumably there are Var nodes in there someplace, and if so, then
> they have varattnos that have to be right for the purpose for which
> they're going to be used.

So once we get the attnos right according to the child relation's
tupdesc, the rest of generating the final RETURNING expressions
as per the root table's column order is taken care of by the returning
projection, no?

This scenario is included in the update.sql regression test here :
-- ok (row movement, with subset of rows moved into different partition)
update range_parted set b = b - 6 where c > 116 returning a, b + c;


>
>>> +    for (i = 0; i < num_rels; i++)
>>> +    {
>>> +        ResultRelInfo *resultRelInfo = &result_rels[i];
>>> +        Relation        rel = resultRelInfo->ri_RelationDesc;
>>> +        Bitmapset     *expr_attrs = NULL;
>>> +
>>> +        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>>> +
>>> +        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
>>> +        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
>>> +            return true;
>>> +    }
>>>
>>> This seems like an awfully expensive way of performing this test.
>>> Under what circumstances could this be true for some result relations
>>> and false for others;
>>
>> One resultRelinfo can have no partition key column used in its quals,
>> but the next resultRelinfo can have quite different quals, and these
>> quals can have partition key referred. This is possible if the two of
>> them have different parents that have different partition-key columns.
>
> Hmm, true.  So if we have a table foo that is partitioned by list (a),
> and one of its children is a table bar that is partitioned by list
> (b), then we need to consider doing tuple-routing if either column a
> is modified, or if column b is modified for a partition which is a
> descendant of bar.  But visiting that only requires looking at the
> partitioned table and those children that are also partitioned, not
> all of the leaf partitions as the patch does.
>

Will give this some thought and get back on this and the remaining points.



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 26 June 2017 at 08:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Second, it will amount to a functional bug if you get a
>>>> different answer than the planner did.
>>>
>>> Actually, the per-leaf WCOs are meant to be executed on the
>>> destination partitions where the tuple is moved, while the WCOs
>>> belonging to the per-subplan resultRelInfo are meant for the
>>> resultRelinfo used for the UPDATE plans. So actually it should not
>>> matter whether they look same or different, because they are fired at
>>> different objects. Now these objects can happen to be the same
>>> relations though.
>>>
>>> But in any case, it's not clear to me how the mapped WCO and the
>>> planner's WCO would yield a different answer if they are both the same
>>> relation. I am possibly missing something. The planner has already
>>> generated the withCheckOptions for each of the resultRelInfo. And then
>>> we are using one of those to re-generate the WCO for a leaf partition
>>> by only adjusting the attnos. If there is already a WCO generated in
>>> the planner for that leaf partition (because that partition was
>>> present in mtstate->resultRelInfo), then the re-built WCO should be
>>> exactly look same as the earlier one, because they are the same
>>> relations, and so the attnos generated in them would be same since the
>>> Relation TupleDesc is the same.
>>
>> If the planner's WCOs and mapped WCOs are always the same, then I
>> think we should try to avoid generating both.  If they can be
>> different, but that's intentional and correct, then there's no
>> substantive problem with the patch but the comments need to make it
>> clear why we are generating both.
>>
>>> Actually I meant, "above works for only local updates. For
>>> row-movement-updates, we need per-leaf partition WCOs, because when
>>> the row is inserted into target partition, that partition may be not
>>> be included in the above planner resultRelInfo, so we need WCOs for
>>> all partitions". I think this said comment should be sufficient if I
>>> add this in the code ?
>>
>> Let's not get too focused on updating the comment until we are in
>> agreement about what the code ought to be doing.  I'm not clear
>> whether you accept the point that the patch needs to be changed to
>> avoid generating the same WCOs and returning lists in both the planner
>> and the executor.
>
> Yes, we can re-use the WCOs generated in the planner, as an
> optimization, since those we re-generate for the same relations will
> look exactly the same. The WCOs generated by planner (in
> inheritance_planner) are generated when (in adjust_appendrel_attrs())
> we change attnos used in the query to refer to the child RTEs and this
> adjusts the attnos of the WCOs of the child RTEs. So the WCOs of
> subplan resultRelInfo are actually the parent table WCOs, but only the
> attnos changed. And in ExecInitModifyTable() we do the same thing for
> leaf partitions, although using different function
> map_variable_attnos().

In the attached patch v12, during UPDATE tuple routing setup, for each
leaf partition we now check if it is already present in one of the
UPDATE per-subplan resultrels. If it is, we re-use it rather than
creating a new one and opening the table again.

So mtstate->mt_partitions is now an array of ResultRelInfo
pointers. Each pointer points to either the UPDATE per-subplan result
rel or a newly allocated ResultRelInfo.

For each of the leaf partitions, we have to search through the
per-subplan resultRelInfo OIDs to check if there is a match. To do
this, I have created a temporary hash table which stores the OIDs and
ResultRelInfo pointers of the mtstate->resultRelInfo array, and which can
be used to look up the OID of each leaf partition.
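
Roughly, the hash table setup looks like the below (a simplified sketch;
the struct and function names in the actual patch may differ):

typedef struct SubplanResultRelHashElem
{
    Oid             relid;      /* hash key -- must be first */
    ResultRelInfo  *rri;        /* the per-subplan result rel for relid */
} SubplanResultRelHashElem;

static HTAB *
build_subplan_resultrel_hash(ResultRelInfo *subplan_rels, int nplans)
{
    HASHCTL     ctl;
    HTAB       *htab;
    int         i;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(SubplanResultRelHashElem);
    htab = hash_create("update subplan result rels", nplans, &ctl,
                       HASH_ELEM | HASH_BLOBS);

    for (i = 0; i < nplans; i++)
    {
        Oid         relid = RelationGetRelid(subplan_rels[i].ri_RelationDesc);
        bool        found;
        SubplanResultRelHashElem *elem;

        elem = (SubplanResultRelHashElem *)
            hash_search(htab, &relid, HASH_ENTER, &found);
        elem->rri = &subplan_rels[i];
    }

    return htab;
}

While routing, each leaf partition's OID is looked up with HASH_FIND; on a
hit we reuse that ResultRelInfo, otherwise a new one is created.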

This patch version has handled only the above discussion point. I will
follow up with the other points separately.



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 22 June 2017 at 01:57, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>> Yep, it's more appropriate to use
>>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>>>> is, if answer to the question I raised above is positive.
>>
>> From what I had checked earlier when coding that part,
>> rootResultRelInfo is NULL in case of inserts, unless something has
>> changed in later commits. That's the reason I decided to use the first
>> resultRelInfo.
>
> We're just going around in circles here.  Saying that you decided to
> use the first child's resultRelInfo because you didn't have a
> resultRelInfo for the parent is an explanation of why you wrote the
> code the way you did, but that doesn't make it correct.  I want to
> know why you think it's correct.

Yeah, that was just an FYI on how I decided to use the first
resultRelInfo; it was not meant to explain why using the first
resultRelInfo is correct. I have tried to explain that upthread.

>
> I think it's probably wrong, because it seems to me that if the INSERT
> code needs to use the parent's ResultRelInfo rather than the first
> child's ResultRelInfo, the UPDATE code probably needs to do the same.
> Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
> resultRelInfos for non-leaf partitions, and commit
> e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
> for the topmost parent, because otherwise it didn't work correctly.



Regarding rootResultRelInfo, it would have been good if
rootResultRelInfo were set for both insert and update, but it isn't set
for inserts...

For inserts:
In ExecInitModifyTable(), ModifyTableState->rootResultRelInfo remains
NULL because ModifyTable->rootResultRelIndex is -1:

/* If modifying a partitioned table, initialize the root table info */
if (node->rootResultRelIndex >= 0)
    mtstate->rootResultRelInfo = estate->es_root_result_relations +
        node->rootResultRelIndex;


ModifyTable->rootResultRelIndex is -1 because it does not get set, since
ModifyTable->partitioned_rels is NIL:

/*
 * If the main target relation is a partitioned table, the
 * following list contains the RT indexes of partitioned child
 * relations including the root, which are not included in the
 * above list.  We also keep RT indexes of the roots
 * separately to be identified as such during the executor
 * initialization.
 */
if (splan->partitioned_rels != NIL)
{
    root->glob->nonleafResultRelations =
        list_concat(root->glob->nonleafResultRelations,
                    list_copy(splan->partitioned_rels));
    /* Remember where this root will be in the global list. */
    splan->rootResultRelIndex = list_length(root->glob->rootResultRelations);
    root->glob->rootResultRelations =
        lappend_int(root->glob->rootResultRelations,
                    linitial_int(splan->partitioned_rels));
}

ModifyTable->partitioned_rels is NIL because inheritance_planner()
does not get called for INSERTs; instead, grouping_planner() gets
called:

subquery_planner()
{
    /*
     * Do the main planning.  If we have an inherited target relation, that
     * needs special processing, else go straight to grouping_planner.
     */
    if (parse->resultRelation &&
        rt_fetch(parse->resultRelation, parse->rtable)->inh)
        inheritance_planner(root);
    else
        grouping_planner(root, false, tuple_fraction);
}

Above, inh is false in case of inserts.



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Hi Amit,

On 2017/06/28 20:43, Amit Khandekar wrote:
> In attached patch v12

The patch no longer applies and fails to compile after the following
commit was made yesterday:

commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
Author: Andrew Gierth <rhodiumtoad@postgresql.org>
Date:   Wed Jun 28 18:55:03 2017 +0100

    Fix transition tables for partition/inheritance.

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 29 June 2017 at 07:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> On 2017/06/28 20:43, Amit Khandekar wrote:
>> In attached patch v12
>
> The patch no longer applies and fails to compile after the following
> commit was made yesterday:
>
> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
> Author: Andrew Gierth <rhodiumtoad@postgresql.org>
> Date:   Wed Jun 28 18:55:03 2017 +0100
>
>     Fix transition tables for partition/inheritance.

Thanks for informing, Amit.

As Thomas mentioned upthread, the above commit already uses a tuple
conversion mapping from leaf partition to root partitioned table
(mt_transition_tupconv_maps), which serves the same purpose as the
mapping used in the update-partition-key patch during update tuple
routing (mt_resultrel_maps).

We need to try to merge these two into a general-purpose mapping array
such as mt_leaf_root_maps. I haven't done that in the rebased patch
(attached), so currently it has both mapping fields.

For transition tables, this map is per leaf partition in the case of
inserts, whereas it is per subplan result rel for updates. For
update tuple routing, the mapping is required to be per subplan. Now,
for update row movement in the presence of transition tables, we would
require both the per-subplan mapping and the per-leaf-partition
mapping, which can't be done if we have a single mapping field, unless
we have some way to identify which of the per-leaf-partition mapping
elements belong to per-subplan rels.

So, it's not immediately possible to merge them.
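
To illustrate why (only a sketch; the array names are the ones mentioned
above, but the construction shown here is simplified, with leaf_rels and
root_rel as stand-in variables):

/* transition tables, INSERT case: one map per leaf partition, to root */
for (i = 0; i < num_leaf_parts; i++)
    mtstate->mt_transition_tupconv_maps[i] =
        convert_tuples_by_name(RelationGetDescr(leaf_rels[i]),
                               RelationGetDescr(root_rel),
                               gettext_noop("could not convert row type"));

/* update row movement: one map per UPDATE subplan result rel, also to root */
for (i = 0; i < mtstate->mt_nplans; i++)
    mtstate->mt_resultrel_maps[i] =
        convert_tuples_by_name(RelationGetDescr(mtstate->resultRelInfo[i].ri_RelationDesc),
                               RelationGetDescr(root_rel),
                               gettext_noop("could not convert row type"));

The two arrays are indexed differently (per leaf partition vs. per
subplan), so a single field can serve both only if we can tell which
per-leaf entries correspond to subplan rels.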



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote:
>>> +    for (i = 0; i < num_rels; i++)
>>> +    {
>>> +        ResultRelInfo *resultRelInfo = &result_rels[i];
>>> +        Relation        rel = resultRelInfo->ri_RelationDesc;
>>> +        Bitmapset     *expr_attrs = NULL;
>>> +
>>> +        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>>> +
>>> +        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
>>> +        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
>>> +            return true;
>>> +    }
>>>
>>> This seems like an awfully expensive way of performing this test.
>>> Under what circumstances could this be true for some result relations
>>> and false for others;
>>
>> One resultRelinfo can have no partition key column used in its quals,
>> but the next resultRelinfo can have quite different quals, and these
>> quals can have partition key referred. This is possible if the two of
>> them have different parents that have different partition-key columns.
>
> Hmm, true.  So if we have a table foo that is partitioned by list (a),
> and one of its children is a table bar that is partitioned by list
> (b), then we need to consider doing tuple-routing if either column a
> is modified, or if column b is modified for a partition which is a
> descendant of bar.  But visiting that only requires looking at the
> partitioned table and those children that are also partitioned, not
> all of the leaf partitions as the patch does.

The main concern is that the non-leaf partitions are not open (except
root), so we would need to open them in order to get the partition key
of the parents of update resultrels (or get only the partition key
atts and exprs from pg_partitioned_table).

There can be multiple approaches to finding partition key columns.

Approach 1 : When there are a few update result rels and a large
partition tree, we traverse from each of the result rels up to its
ancestors, opening each ancestor (via get_partition_parent()) to get
its partition key columns. For result rels having common parents, do
this only once.

Approach 2 : If there are only a few partitioned tables and a large
number of update result rels, it would be easier to just open all the
partitioned tables and form the partition key column bitmap out of all
their partition keys. If the bitmap does not have updated columns,
that's not a partition-key-update. So for typical non-partition-key
updates, just opening the partitioned tables will suffice, and so that
would not affect performance of normal updates.

But if the bitmap does contain updated columns, we can't yet conclude
that it's a partition-key update; that could be a false positive. We
then need to further check whether the update result rels actually
belong to ancestors whose partition keys are among the updated columns.

Approach 3 : In RelationData, in a new bitmap field (rd_partcheckattrs
?), store partition key attrs that are used in rd_partcheck. Populate
this field during generate_partition_qual().
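
A minimal sketch of approach 3 (rd_partcheckattrs is the proposed new
field; this is illustrative only, not code from the patch):

    /* In generate_partition_qual(), once rel->rd_partcheck has been built: */
    {
        Bitmapset  *attrs = NULL;

        pull_varattnos((Node *) rel->rd_partcheck, 1, &attrs);
        /* offset by FirstLowInvalidHeapAttributeNumber, as usual */
        rel->rd_partcheckattrs = attrs;
    }

    /* Then, per update result rel, the check reduces to: */
    if (bms_overlap(resultRelInfo->ri_RelationDesc->rd_partcheckattrs,
                    GetUpdatedColumns(resultRelInfo, estate)))
        return true;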

So to conclude, I think, we can do this :

Scenario 1 :
Only one partitioned table : the root; rest all are leaf partitions.
In this case, it is definitely efficient to just check the root
partition key, which will be sufficient.

Scenario 2 :
There are a few non-leaf partitioned tables (3-4) :
Open those tables, and follow the 2nd approach above: if we don't find
any updated partition keys in any of them, well and good. If we do
find some, fall back to approach 3: for each of the update resultrels,
use the new rd_partcheckattrs bitmap to know whether it uses any of
the updated columns. This would be faster than pulling up attrs from
the quals the way it was done in the patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Jun 29, 2017 at 3:52 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> So to conclude, I think, we can do this :
>
> Scenario 1 :
> Only one partitioned table : the root; rest all are leaf partitions.
> In this case, it is definitely efficient to just check the root
> partition key, which will be sufficient.
>
> Scenario 2 :
> There are few non-leaf partitioned tables (3-4) :
> Open those tables, and follow 2nd approach above: If we don't find any
> updated partition-keys in any of them, well and good. If we do find,
> failover to approach 3 : For each of the update resultrels, use the
> new rd_partcheckattrs bitmap to know if it uses any of the updated
> columns. This would be faster than pulling up attrs from the quals
> like how it was done in the patch.

I think we should just have the planner figure out a list of which
columns are partitioning columns either for the named relation or some
descendant, and set a flag if that set of columns overlaps the set of
columns updated.  At execution time, update tuple routing is needed if
either that flag is set or if some partition included in the plan has
a BR UPDATE trigger.  Attached is a draft patch implementing that
approach.
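
For illustration, the planner-side collection could look roughly like
this (the helper name is made up here, and the sketch glosses over
attribute-number translation between parent and children as well as
expression partition keys):

static void
get_all_partition_cols(Relation rel, Bitmapset **partcols)
{
    PartitionKey  key = RelationGetPartitionKey(rel);
    PartitionDesc partdesc = RelationGetPartitionDesc(rel);
    int           i;

    for (i = 0; i < key->partnatts; i++)
        *partcols = bms_add_member(*partcols,
                                   key->partattrs[i] -
                                   FirstLowInvalidHeapAttributeNumber);

    /* Recurse only into partitioned children, not into leaf partitions. */
    for (i = 0; i < partdesc->nparts; i++)
    {
        Relation    childrel = heap_open(partdesc->oids[i], AccessShareLock);

        if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
            get_all_partition_cols(childrel, partcols);
        heap_close(childrel, AccessShareLock);
    }
}

/* Update tuple routing is then needed if bms_overlap(partcols, updatedCols). */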

This could be made more accurate.  Suppose table foo is
partitioned by a and some but not all of the partitions partitioned by
b.  If it so happens that, in a query which only updates b, constraint
exclusion eliminates all of the partitions that are subpartitioned by
b, it would be unnecessary to enable update tuple routing (unless BR
UPDATE triggers are present) but this patch will not figure that out.
I don't think that optimization is critical for the first version of
this feature; there will be a limited number of users with
asymmetrical subpartitioning setups, and if one of them has an idea
how to improve this without hurting anything else, they are free to
contribute a patch.  Other optimizations are possible too, but I don't
really see any of them as critical either.

I don't think the approach of building a hash table to figure out
which result rels have already been created is a good one.  That too
feels like something that the planner should be figuring out and the
executor should just be implementing what the planner decided.  I
haven't figured out exactly how that should work yet, but it seems
like it ought to be doable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Fri, Jun 30, 2017 at 12:01 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 29 June 2017 at 07:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Hi Amit,
>>
>> On 2017/06/28 20:43, Amit Khandekar wrote:
>>> In attached patch v12
>>
>> The patch no longer applies and fails to compile after the following
>> commit was made yesterday:
>>
>> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
>> Author: Andrew Gierth <rhodiumtoad@postgresql.org>
>> Date:   Wed Jun 28 18:55:03 2017 +0100
>>
>>     Fix transition tables for partition/inheritance.
>
> Thanks for informing Amit.
>
> As Thomas mentioned upthread, the above commit already uses a tuple
> conversion mapping from leaf partition to root partitioned table
> (mt_transition_tupconv_maps), which serves the same purpose as that of
> the mapping used in the update-partition-key patch during update tuple
> routing (mt_resultrel_maps).
>
> We need to try to merge these two into a general-purpose mapping array
> such as mt_leaf_root_maps. I haven't done that in the rebased patch
> (attached), so currently it has both these mapping fields.
>
> For transition tables, this map is per-leaf-partition in case of
> inserts, whereas it is per-subplan result rel for updates. For
> update-tuple routing, the mapping is required to be per-subplan. Now,
> for update-row-movement in presence of transition tables, we would
> require both per-subplan mapping as well as per-leaf-partition
> mapping, which can't be done if we have a single mapping field, unless
> we have some way to identify which of the per-leaf partition mapping
> elements belong to per-subplan rels.
>
> So, it's not immediately possible to merge them.

Would make sense to have a set of functions with names like
GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
m_convertors_{from,to}_by_{subplan,leaf} the first time they need
them?
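
Something along those lines might look as below (all of these field and
function names are just the suggested or assumed ones, not existing
code; how the root relation is obtained is also an assumption):

static TupleConversionMap *
GetConvertorFromLeaf(ModifyTableState *mtstate, int leaf_index)
{
    /* m_convertors_from_by_leaf is a hypothetical lazily-built array */
    if (mtstate->m_convertors_from_by_leaf == NULL)
        mtstate->m_convertors_from_by_leaf = (TupleConversionMap **)
            palloc0(mtstate->mt_num_partitions * sizeof(TupleConversionMap *));

    if (mtstate->m_convertors_from_by_leaf[leaf_index] == NULL)
    {
        Relation    leafrel = mtstate->mt_partitions[leaf_index].ri_RelationDesc;
        Relation    rootrel = getRootRel(mtstate);  /* hypothetical helper */

        mtstate->m_convertors_from_by_leaf[leaf_index] =
            convert_tuples_by_name(RelationGetDescr(leafrel),
                                   RelationGetDescr(rootrel),
                                   gettext_noop("could not convert row type"));
    }
    return mtstate->m_convertors_from_by_leaf[leaf_index];
}

with the _to_ and _by_subplan variants following the same pattern.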

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think the approach of building a hash table to figure out
> which result rels have already been created is a good one.  That too
> feels like something that the planner should be figuring out and the
> executor should just be implementing what the planner decided.  I
> haven't figured out exactly how that should work yet, but it seems
> like it ought to be doable.

I was imagining when I wrote the above that the planner should somehow
compute a list of relations that it has excluded so that the executor
can skip building ResultRelInfos for exactly those relations, but on
further study, that's not particularly easy to achieve and wouldn't
really save anything anyway, because the list of OIDs is coming
straight out of the partition descriptor, so it's pretty much free.
However, I still think it would be a nifty idea if we could avoid
needing the hash table to deduplicate.  The reason we need that is, I
think, that expand_inherited_rtentry() is going to expand the
inheritance hierarchy in whatever order the scan(s) of pg_inherits
return the descendant tables, whereas the partition descriptor is
going to put them in a canonical order.

But that seems like it wouldn't be too hard to fix: let's have
expand_inherited_rtentry() expand the partitioned table in the same
order that will be used by ExecSetupPartitionTupleRouting().  That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.  Then - I think -
ExecSetupPartitionTupleRouting() doesn't need the hash table; it can
just scan through the return value of RelationGetPartitionDispatchInfo()
and the list of already-created ResultRelInfo structures in parallel -
the order must be the same, but the latter can be missing some
elements, so it can just create the missing ones.
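
Concretely, the expansion change might look roughly like this (a sketch
only, assuming the current RelationGetPartitionDispatchInfo() signature;
the partitioned tables themselves would still have to be added to the
list, and the rest of the expansion logic is elided):

    /* in expand_inherited_rtentry() */
    if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        int     num_parted;

        /* same canonical (bound) order that tuple routing will use */
        (void) RelationGetPartitionDispatchInfo(oldrelation, lockmode,
                                                &num_parted, &inhOIDs);
    }
    else
        inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);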

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/07/02 20:10, Robert Haas wrote:
> On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think the approach of building a hash table to figure out
>> which result rels have already been created is a good one.  That too
>> feels like something that the planner should be figuring out and the
>> executor should just be implementing what the planner decided.  I
>> haven't figured out exactly how that should work yet, but it seems
>> like it ought to be doable.
> 
> I was imagining when I wrote the above that the planner should somehow
> compute a list of relations that it has excluded so that the executor
> can skip building ResultRelInfos for exactly those relations, but on
> further study, that's not particularly easy to achieve and wouldn't
> really save anything anyway, because the list of OIDs is coming
> straight out of the partition descriptor, so it's pretty much free.
> However, I still think it would be a nifty idea if we could avoid
> needing the hash table to deduplicate.  The reason we need that is, I
> think, that expand_inherited_rtentry() is going to expand the
> inheritance hierarchy in whatever order the scan(s) of pg_inherits
> return the descendant tables, whereas the partition descriptor is
> going to put them in a canonical order.
> 
> But that seems like it wouldn't be too hard to fix: let's have
> expand_inherited_rtentry() expand the partitioned table in the same
> order that will be used by ExecSetupPartitionTupleRouting().  That
> seems pretty easy to do - just have expand_inherited_rtentry() notice
> that it's got a partitioned table and call
> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
> produce the list of OIDs.  Then - I think -
> ExecSetupPartitionTupleRouting() doesn't need the hash table; it can
> just scan through the return value of RelationGetPartitionDispatchInfo()
> and the list of already-created ResultRelInfo structures in parallel -
> the order must be the same, but the latter can be missing some
> elements, so it can just create the missing ones.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work.  Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables.  We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:
        /*
         * We keep the partitioned ones open until we're done using the
         * information being collected here (for example, see
         * ExecEndModifyTable).
         */

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

From
Etsuro Fujita
Date:
On 2017/07/03 18:54, Amit Langote wrote:
> On 2017/07/02 20:10, Robert Haas wrote:

>> But that seems like it wouldn't be too hard to fix: let's have
>> expand_inherited_rtentry() expand the partitioned table in the same
>> order that will be used by ExecSetupPartitionTupleRouting().

That's really what I wanted when updating the patch for tuple-routing to 
foreign partitions.  (I don't understand the issue discussed here, though.)

>> That
>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>> that it's got a partitioned table and call
>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>> produce the list of OIDs.
Seems like a good idea.

> Interesting idea.
> 
> If we are going to do this, I think we may need to modify
> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
> does not do as much work.  Currently, it assumes that it's only ever
> called by ExecSetupPartitionTupleRouting() and hence also generates
> PartitionDispatchInfo objects for partitioned child tables.  We don't need
> that if called from within the planner.
> 
> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
> with its usage within the executor, because there is this comment:
> 
>          /*
>           * We keep the partitioned ones open until we're done using the
>           * information being collected here (for example, see
>           * ExecEndModifyTable).
>           */

Yeah, we need some refactoring work.  Is anyone working on that?

Best regards,
Etsuro Fujita




Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/07/04 17:25, Etsuro Fujita wrote:
> On 2017/07/03 18:54, Amit Langote wrote:
>> On 2017/07/02 20:10, Robert Haas wrote:
>>> That
>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>> that it's got a partitioned table and call
>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>> produce the list of OIDs.
> Seems like a good idea.
> 
>> Interesting idea.
>>
>> If we are going to do this, I think we may need to modify
>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>> does not do as much work.  Currently, it assumes that it's only ever
>> called by ExecSetupPartitionTupleRouting() and hence also generates
>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>> that if called from within the planner.
>>
>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>> with its usage within the executor, because there is this comment:
>>
>>          /*
>>           * We keep the partitioned ones open until we're done using the
>>           * information being collected here (for example, see
>>           * ExecEndModifyTable).
>>           */
> 
> Yeah, we need some refactoring work.  Is anyone working on that?

I would like to take a shot at that if someone else hasn't already cooked
up a patch.  Working on making RelationGetPartitionDispatchInfo() a
routine callable from both within the planner and the executor should be a
worthwhile effort.

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/07/04 17:25, Etsuro Fujita wrote:
>> On 2017/07/03 18:54, Amit Langote wrote:
>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>> That
>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>> that it's got a partitioned table and call
>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>> produce the list of OIDs.
>> Seems like a good idea.
>>
>>> Interesting idea.
>>>
>>> If we are going to do this, I think we may need to modify
>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>> does not do as much work.  Currently, it assumes that it's only ever
>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>> that if called from within the planner.
>>>
>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>> with its usage within the executor, because there is this comment:
>>>
>>>          /*
>>>           * We keep the partitioned ones open until we're done using the
>>>           * information being collected here (for example, see
>>>           * ExecEndModifyTable).
>>>           */
>>
>> Yeah, we need some refactoring work.  Is anyone working on that?
>
> I would like to take a shot at that if someone else hasn't already cooked
> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
> routine callable from both within the planner and the executor should be a
> worthwhile effort.

What I am currently working on is to see if we can call
find_all_inheritors() or find_inheritance_children() instead of
generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
Possibly we don't have to refactor it completely.
find_inheritance_children() needs to return the oids in canonical
order. So find_inheritance_children() needs to re-use the part of
RelationBuildPartitionDesc() where it generates those oids in that
order. I am checking this part, and am going to come up with an
approach based on findings.

Also, need to investigate whether *always* sorting the oids in
canonical order is going to be much more expensive than the current sorting
using oids. But I guess it won't be that expensive.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>> On 2017/07/03 18:54, Amit Langote wrote:
>>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>>> That
>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>>> that it's got a partitioned table and call
>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>>> produce the list of OIDs.
>>> Seems like a good idea.
>>>
>>>> Interesting idea.
>>>>
>>>> If we are going to do this, I think we may need to modify
>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>>> does not do as much work.  Currently, it assumes that it's only ever
>>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>>> that if called from within the planner.
>>>>
>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>>> with its usage within the executor, because there is this comment:
>>>>
>>>>          /*
>>>>           * We keep the partitioned ones open until we're done using the
>>>>           * information being collected here (for example, see
>>>>           * ExecEndModifyTable).
>>>>           */
>>>
>>> Yeah, we need some refactoring work.  Is anyone working on that?
>>
>> I would like to take a shot at that if someone else hasn't already cooked
>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>> routine callable from both within the planner and the executor should be a
>> worthwhile effort.
>
> What I am currently working on is to see if we can call
> find_all_inheritors() or find_inheritance_children() instead of
> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
> Possibly we don't have to refactor it completely.
> find_inheritance_children() needs to return the oids in canonical
> order. So in find_inheritance_children () need to re-use part of
> RelationBuildPartitionDesc() where it generates those oids in that
> order. I am checking this part, and am going to come up with an
> approach based on findings.

The other approach is to make canonical ordering only in
find_all_inheritors() by replacing the call to find_inheritance_children()
with the refactored/modified RelationGetPartitionDispatchInfo(). But
that would mean that the callers of find_inheritance_children() would
have one ordering, while the callers of find_all_inheritors() would
have a different ordering; that raises the chance of deadlocks. That's
why I think we need to consider modifying the common function
find_inheritance_children(), so that we would be consistent with the
ordering. And then use find_inheritance_children() or
find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
there would be some modifications to
RelationGetPartitionDispatchInfo().

>
> Also, need to investigate whether *always* sorting the oids in
> canonical order is going to be much expensive than the current sorting
> using oids. But I guess it won't be that expensive.
>
>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 4 July 2017 at 15:23, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>>> On 2017/07/03 18:54, Amit Langote wrote:
>>>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>>>> That
>>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>>>> that it's got a partitioned table and call
>>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>>>> produce the list of OIDs.
>>>> Seems like a good idea.
>>>>
>>>>> Interesting idea.
>>>>>
>>>>> If we are going to do this, I think we may need to modify
>>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>>>> does not do as much work.  Currently, it assumes that it's only ever
>>>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>>>> that if called from within the planner.
>>>>>
>>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>>>> with its usage within the executor, because there is this comment:
>>>>>
>>>>>          /*
>>>>>           * We keep the partitioned ones open until we're done using the
>>>>>           * information being collected here (for example, see
>>>>>           * ExecEndModifyTable).
>>>>>           */
>>>>
>>>> Yeah, we need some refactoring work.  Is anyone working on that?
>>>
>>> I would like to take a shot at that if someone else hasn't already cooked
>>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>>> routine callable from both within the planner and the executor should be a
>>> worthwhile effort.
>>
>> What I am currently working on is to see if we can call
>> find_all_inheritors() or find_inheritance_children() instead of
>> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
>> Possibly we don't have to refactor it completely.
>> find_inheritance_children() needs to return the oids in canonical
>> order. So in find_inheritance_children () need to re-use part of
>> RelationBuildPartitionDesc() where it generates those oids in that
>> order. I am checking this part, and am going to come up with an
>> approach based on findings.
>
> The other approach is to make canonical ordering only in
> find_all_inheritors() by replacing call to find_inheritance_children()
> with the refactored/modified RelationGetPartitionDispatchInfo(). But
> that would mean that the callers of find_inheritance_children() would
> have one ordering, while the callers of find_all_inheritors() would
> have a different ordering; that brings up chances of deadlocks. That's
> why I think, we need to think about modifying the common function
> find_inheritance_children(), so that we would be consistent with the
> ordering. And then use find_inheritance_children() or
> find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
> there would be some modifications to
> RelationGetPartitionDispatchInfo().
>
>>
>> Also, need to investigate whether *always* sorting the oids in
>> canonical order is going to be much expensive than the current sorting
>> using oids. But I guess it won't be that expensive.


Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns oids in
canonical order, that will change the order in which child tables
get locked, which increases the chance of deadlock. Because, then the
callers of find_all_inheritors() will lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table children sorted by
canonical ordering, (i.e. using the partition bound values) we need to
either use the partition bounds to sort the oids like the way it is
done in RelationBuildPartitionDesc(), or open the parent table and get
its Relation->rd_partdesc->oids[], which are already sorted in
canonical order. So if we generate oids using this way in
find_all_inheritors() and find_inheritance_children(), that will
generate consistent ordering everywhere. But this method is quite
expensive as compared to the way oids are generated and sorted using
oid values in find_inheritance_children().

In both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(), each of the child tables are
opened.

So, in both of these functions, what we can do is : call a new
function partition_tree_walker() which does following :
1. Lock the children using the existing order (i.e. sorted by oid
values) using the same function find_all_inheritors(). Rename
find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
which returns the oid list only if requested.
2. And then scan through each of the partitions in canonical order, by
opening the parent table, then opening the partition descriptor oids,
and then doing whatever needs to be done with that partition rel.

partition_tree_walker() will look something like this :

void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
                       void (*walker_func) (), void *context)
{
    Relation parentrel;
    List *rels_list;
    ListCell *cell;

    (void) lock_all_inheritors(parentOid, lockmode,
                           false /* don't generate oids */);

    parentrel = heap_open(parentOid, NoLock);
    rels_list = append_rel_partition_oids(NIL, parentrel);

    /* Scan through all partitioned rels, and at the
     * same time append their children. */
    foreach(cell, rels_list)
    {
        /* Open partrel without locking; lock_all_inheritors() has locked it */
        Relation    partrel = heap_open(lfirst_oid(cell), NoLock);

        /* Append the children of a partitioned rel to the same list
         * that we are iterating on */
        if (RelationGetPartitionDesc(partrel))
            rels_list = append_rel_partition_oids(rels_list, partrel);

        /*
         * Do whatever processing needs to be done on this partel.
         * The walker function is free to either close the partel
         * or keep it opened, but it needs to make sure the opened
         * ones are closed later
         */
        walker_func(partrel, context);
    }
}

List *append_rel_partition_oids(List *rel_list, Relation rel)
{
    int i;
    for (i = 0; i < rel->rd_partdesc->nparts; i++)
        rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);

    return rel_list;
}


So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
where expand_rte_walker_func() will do all the work done in the for
loop for each of the partition rels.

Similarly, in RelationGetPartitionDispatchInfo() the initial part
where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
partition_tree_walker(rel, dispatch_info_walkerfunc) where
dispatch_info_walkerfunc() will generate the oids, or maybe populate
the complete PartitionDispatchData structure. 'pd' variable can be
passed as context to the partition_tree_walker(..., context)

Generating the resultrels in canonical order by opening the tables
using the above way wouldn't be more expensive than the existing code,
because even currently we anyway have to open all the tables in both
of these functions.

-Amit Khandekar



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 5 July 2017 at 15:12, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Like I mentioned upthread... in expand_inherited_rtentry(), if we
> replace find_all_inheritors() with something else that returns oids in
> canonical order, that will change the order in which children tables
> get locked, which increases the chance of deadlock. Because, then the
> callers of find_all_inheritors() will lock them in one order, while
> callers of expand_inherited_rtentry() will lock them in a different
> order. Even in the current code, I think there is a chance of
> deadlocks because RelationGetPartitionDispatchInfo() and
> find_all_inheritors() have different lock ordering.
>
> Now, to get the oids of a partitioned table children sorted by
> canonical ordering, (i.e. using the partition bound values) we need to
> either use the partition bounds to sort the oids like the way it is
> done in RelationBuildPartitionDesc() or, open the parent table and get
> it's Relation->rd_partdesc->oids[] which are already sorted in
> canonical order. So if we generate oids using this way in
> find_all_inheritors() and find_inheritance_children(), that will
> generate consistent ordering everywhere. But this method is quite
> expensive as compared to the way oids are generated and sorted using
> oid values in find_inheritance_children().
>
> In both expand_inherited_rtentry() and
> RelationGetPartitionDispatchInfo(), each of the child tables are
> opened.
>
> So, in both of these functions, what we can do is : call a new
> function partition_tree_walker() which does following :
> 1. Lock the children using the existing order (i.e. sorted by oid
> values) using the same function find_all_inheritors(). Rename
> find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
> which returns the oid list only if requested.
> 2. And then scan through each of the partitions in canonical order, by
> opening the parent table, then opening the partition descriptor oids,
> and then doing whatever needs to be done with that partition rel.
>
> partition_tree_walker() will look something like this :
>
> void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
>                        void (*walker_func) (), void *context)
> {
>     Relation parentrel;
>     List *rels_list;
>     ListCell *cell;
>
>     (void) lock_all_inheritors(parentOid, lockmode,
>                            false /* don't generate oids */);
>
>     parentrel = heap_open(parentOid, NoLock);
>     rels_list = append_rel_partition_oids(NIL, parentrel);
>
>     /* Scan through all partitioned rels, and at the
>      * same time append their children. */
>     foreach(cell, rels_list)
>     {
>         /* Open partrel without locking; lock_all_inheritors() has locked it */
>         Relation    partrel = heap_open(lfirst_oid(cell), NoLock);
>
>         /* Append the children of a partitioned rel to the same list
>          * that we are iterating on */
>         if (RelationGetPartitionDesc(partrel))
>             rels_list = append_rel_partition_oids(rels_list, partrel);
>
>         /*
>          * Do whatever processing needs to be done on this partel.
>          * The walker function is free to either close the partel
>          * or keep it opened, but it needs to make sure the opened
>          * ones are closed later
>          */
>         walker_func(partrel, context);
>     }
> }
>
> List *append_rel_partition_oids(List *rel_list, Relation rel)
> {
>     int i;
>     for (i = 0; i < rel->rd_partdesc->nparts; i++)
>         rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);
>
>     return rel_list;
> }
>
>
> So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
> replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
> where expand_rte_walker_func() will do all the work done in the for
> loop for each of the partition rels.
>
> Similarly, in RelationGetPartitionDispatchInfo() the initial part
> where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
> partition_tree_walker(rel, dispatch_info_walkerfunc) where
> dispatch_info_walkerfunc() will generate the oids, or may be populate
> the complete PartitionDispatchData structure. 'pd' variable can be
> passed as context to the partition_tree_walker(..., context)
>
> Generating the resultrels in canonical order by opening the tables
> using the above way wouldn't be more expensive than the existing code,
> because even currently we anyways have to open all the tables in both
> of these functions.
>

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

In this patch, rather than a partition_tree_walker() called with a
context, I have provided a function partition_walker_next(), using
which we iterate over all the partitions in canonical order.
partition_walker_next() takes care of appending oids from the
partition descriptors.
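
A caller would then look roughly like this (partition_walker_init and
the exact signatures are assumed here for illustration; see the
attached patch for the real API):

    PartitionWalker walker;
    Relation        partrel;

    partition_walker_init(&walker, parentrel);
    for (partrel = partition_walker_next(&walker);
         partrel != NULL;
         partrel = partition_walker_next(&walker))
    {
        /* process partrel; children of partitioned rels get appended to the walk */
    }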

Now, to generate consistent oid ordering in
RelationGetPartitionDispatchInfo() and expand_inherited_rtentry(), we
could have very well skipped using the partition_walker API in
expand_inherited_rtentry() and just had it iterate over the partition
descriptors the way it is done in RelationGetPartitionDispatchInfo().
But I think it's better to have some common function to traverse the
partition tree in a consistent order, hence the usage of
partition_walker_next() in both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(). In
RelationGetPartitionDispatchInfo(), though, this function is used only
to generate the partitioned table list. But even to generate the
partitioned tables in the correct order, it is better to use
partition_walker_next(), so that we make sure to finally generate a
consistent order of leaf oids.

I considered the option where RelationGetPartitionDispatchInfo() would
directly build the pd[] array over each iteration of
partition_walker_next(). But that was turning out to be clumsy,
because then we need to keep track of which pd[] element each of the
oids would go into by having a current position of pd[]. Rather than
this, it is best to keep building of pd array separate, as done in the
existing code.

Didn't do any renaming for find_all_inheritors(). Just called it in
both the functions, and ignored the list returned. Like mentioned
upthread, it is important to lock in this order  so as to be
consistent with the lock ordering in other places where
find_inheritance_children() is called. Hence, called
find_all_inheritors() in RelationGetPartitionDispatchInfo() as well.

Note that this patch does not attempt to make
RelationGetPartitionDispatchInfo() work in planner. That I think
should be done once we finalise how to generate common oid ordering,
and is not in the scope of this project.

Once I merge this in the update-partition-key patch, in
ExecSetupPartitionTupleRouting(), I will be able to search for the
leaf partitions in this ordered resultrel list, without having to
build a hash table of result rels the way it is currently done in the
update-partition-key patch.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 13 July 2017 at 22:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
> the result rels in canonical order. This patch is kept separate from
> the update-partition-key patch, and can be applied on master branch.

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

So now that the per-subplan result rels and the leaf partition oids
that are generated for tuple routing are both known to have the same
(canonical) order, in ExecSetupPartitionTupleRouting() we look for
the per-subplan result rels without the need for a hash table. Instead
of the hash table, we iterate over the leaf partition oids and at the
same time keep shifting a position over the per-subplan resultrels
whenever the resultrel at the position is found to be present in the
leaf partitions list. Since the two lists are in the same order, we
never have to scan again the portion of the lists that has already
been scanned.
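
The lookup then becomes a simple merge-style scan, roughly like this
(illustrative pseudo-C; the variable names and the helper at the end
are assumptions, not the exact patch code):

    ListCell   *cell;
    int         i = 0;          /* position over the UPDATE's per-subplan result rels */
    int         part_index = 0;

    foreach(cell, leaf_part_oid_list)
    {
        Oid     leaf_oid = lfirst_oid(cell);

        if (i < num_update_rels &&
            RelationGetRelid(update_rels[i].ri_RelationDesc) == leaf_oid)
        {
            /* reuse the ResultRelInfo already set up for this subplan */
            leaf_part_rri[part_index] = &update_rels[i];
            i++;
        }
        else
        {
            /* not an UPDATE result rel; build a fresh one for tuple routing */
            leaf_part_rri[part_index] = build_routing_result_rel(leaf_oid);
        }
        part_index++;
    }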

I considered whether the issue behind this recent commit might be
relevant for update tuple-routing as well :
commit f81a91db4d1c2032632aa5df9fc14be24f5fe5ec
Author: Robert Haas <rhaas@postgresql.org>
Date:   Mon Jul 17 21:29:45 2017 -0400
    Use a real RT index when setting up partition tuple routing.

Since we know that using a dummy 1 value for the tuple routing result
rels is not correct, I am checking another possibility: now in the
latest patch, the tuple routing partitions would have a mix of a)
existing update result rels, and b) new partition resultrels. The 'b'
resultrels would have the RT index of nominalRelation, but the
existing 'a' resultrels would have their own different RT indexes. I
suspect this might surface a similar issue to the one fixed by the
above commit, e.g. with a WITH query having UPDATE subqueries doing
tuple routing. Will check that.

This patch also has Robert's changes in the planner to decide whether
to do update tuple routing.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Rajkumar Raghuwanshi
Date:
On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.


I have applied the attached patch and got the below observation.

Observation : if the join produces multiple output rows for a given row to be modified, I am seeing that it updates the row and also inserts rows into the target table; hence after the update the total row count of the table has increased.

below are steps:
postgres=# create table part_upd (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_upd1 partition of part_upd for values from (minvalue) to (-10);
CREATE TABLE
postgres=# create table part_upd2 partition of part_upd for values from (-10) to (0);
CREATE TABLE
postgres=# create table part_upd3 partition of part_upd for values from (0) to (10);
CREATE TABLE
postgres=# create table part_upd4 partition of part_upd for values from (10) to (maxvalue);
CREATE TABLE
postgres=# insert into part_upd select i,i from generate_series(-30,30,3)i;
INSERT 0 21
postgres=# select count(*) from part_upd;
 count
-------
    21
(1 row)

postgres=#
postgres=# create table non_part_upd (a int);
CREATE TABLE
postgres=# insert into non_part_upd select i%2 from generate_series(-30,30,5)i;
INSERT 0 13
postgres=# update part_upd t1 set a = (t2.a+10) from non_part_upd t2 where t2.a = t1.b;
UPDATE 7
postgres=# select count(*) from part_upd;
 count
-------
    27
(1 row)

postgres=# select tableoid::regclass,* from part_upd;
 tableoid  |  a  |  b 
-----------+-----+-----
 part_upd1 | -30 | -30
 part_upd1 | -27 | -27
 part_upd1 | -24 | -24
 part_upd1 | -21 | -21
 part_upd1 | -18 | -18
 part_upd1 | -15 | -15
 part_upd1 | -12 | -12
 part_upd2 |  -9 |  -9
 part_upd2 |  -6 |  -6
 part_upd2 |  -3 |  -3
 part_upd3 |   3 |   3
 part_upd3 |   6 |   6
 part_upd3 |   9 |   9
 part_upd4 |  12 |  12
 part_upd4 |  15 |  15
 part_upd4 |  18 |  18
 part_upd4 |  21 |  21
 part_upd4 |  24 |  24
 part_upd4 |  27 |  27
 part_upd4 |  30 |  30
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
(27 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
> wrote:
>>
>>
>> Attached update-partition-key_v13.patch now contains this
>> make_resultrels_ordered.patch changes.
>>
>
> I have applied attach patch and got below observation.
>
> Observation :  if join producing multiple output rows for a given row to be
> modified. I am seeing here it is updating a row and also inserting rows in
> target table. hence after update total count of table got incremented.

Thanks for catching this, Rajkumar.

So after the row to be updated has already been moved to another
partition, when the next join output row corresponds to that same
moved row, that row is now deleted, so ExecDelete()=>heap_delete()
gets HeapTupleSelfUpdated, and this is not handled. So even when
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), so a new row is inserted.  In ExecDelete(), we should
indicate that the row is already deleted. In the existing patch, there
is a parameter concurrently_deleted for ExecDelete() which indicates
that the row was concurrently deleted. I think we can use this
parameter for both of these purposes so as to avoid ExecInsert() in
both these scenarios. Will work on a patch.
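
The row-movement path would then look roughly like this (a sketch
assuming ExecDelete()'s existing flag is reused; the parameter name and
the exact signatures follow the description above, not final code):

    bool    already_deleted = false;

    /* delete the row from its current partition first */
    ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
               &already_deleted, canSetTag);

    /*
     * If the old row was already gone -- deleted concurrently, or by an
     * earlier join output row of this same UPDATE -- skip the insert half
     * of the row movement.
     */
    if (already_deleted)
        return NULL;

    /* route the new version of the row into the appropriate partition */
    return ExecInsert(mtstate, slot, planSlot, NULL, ONCONFLICT_NONE,
                      estate, canSetTag);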



Re: [HACKERS] UPDATE of partition key

From
Rajkumar Raghuwanshi
Date:
On Tue, Jul 25, 2017 at 3:54 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
> wrote:
>>
>>
>> Attached update-partition-key_v13.patch now contains this
>> make_resultrels_ordered.patch changes.
>>
>
> I have applied attach patch and got below observation.
>
> Observation :  if join producing multiple output rows for a given row to be
> modified. I am seeing here it is updating a row and also inserting rows in
> target table. hence after update total count of table got incremented.

Thanks for catching this Rajkumar.

So after the row to be updated is already moved to another partition,
when the next join output row corresponds to the same row which is
moved, that row is now deleted, so ExecDelete()=>heap_delete() gets
HeapTupleSelfUpdated, and this is not handled. So even when
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), so a new row is inserted.  In ExecDelete(), we should
indicate that the row is already deleted. In the existing patch, there
is a parameter concurrenty_deleted for ExecDelete() which indicates
that the row is concurrently deleted. I think we can make this
parameter for both of these purposes so as to avoid ExecInsert() for
both these scenarios. Will work on a patch.

Thanks Amit.

Got one more observation : UPDATE ... RETURNING is not working with a whole-row reference. Please take a look.

postgres=# create table part (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_p1 partition of part for values from (minvalue) to (0);
CREATE TABLE
postgres=# create table part_p2 partition of part for values from (0) to (maxvalue);
CREATE TABLE
postgres=# insert into part values (10,1);
INSERT 0 1
postgres=# insert into part values (20,2);
INSERT 0 1
postgres=# update part t1 set a = b returning t1;
ERROR:  unexpected whole-row reference found in partition key

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
> the result rels in canonical order. This patch is kept separate from
> the update-partition-key patch, and can be applied on master branch.

Hmm, I like the approach you've taken here in general, but I think it
needs cleanup.

+typedef struct ParentChild

This is a pretty generic name.  Pick something more specific and informative.

+static List *append_rel_partition_oids(List *rel_list, Relation rel);

One could be forgiven for thinking that this function was just going
to append OIDs, but it actually appends ParentChild structures, so I
think the name needs work.

+List *append_rel_partition_oids(List *rel_list, Relation rel)

Style.  Please pgindent your patches.

+#ifdef DEBUG_PRINT_OIDS
+    print_oids(*leaf_part_oids);
+#endif

I'd just rip out this debug stuff once you've got this working, but if
we keep it, it certainly can't have a name as generic as print_oids()
when it's actually doing something with a list of ParentChild
structures.  Also, it prints names, not OIDs.  And DEBUG_PRINT_OIDS is
no good for the same reasons.

+    if (RelationGetPartitionDesc(rel))
+        walker->rels_list = append_rel_partition_oids(walker->rels_list, rel);

Every place that calls append_rel_partition_oids guards that call with
if (RelationGetPartitionDesc(...)).  It seems to me that it would be
simpler to remove those tests and instead just replace the
Assert(partdesc) inside that function with if (!partdesc) return;

Is there any real benefit in this "walker" interface?  It looks to me
like it might be simpler to just change things around so that it
returns a list of OIDs, like find_all_inheritors, but generated
differently.  Then if you want bound-ordering rather than
OID-ordering, you just do this:

list_free(inhOids);
inhOids = get_partition_oids_in_bound_order(rel);

That'd remove the need for some if/then logic as you've currently got
in get_next_child().

+    is_partitioned_resultrel =
+        (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE
+         && rti == parse->resultRelation);

I suspect this isn't correct for a table that contains wCTEs, because
there would in that case be multiple result relations.

I think we should always expand in bound order rather than only when
it's a result relation. I think for partition-wise join, we're going
to want to do it this way for all relations in the query, or at least
for all relations in the query that might possibly be able to
participate in a partition-wise join.  If there are multiple cases
that are going to need this ordering, it's hard for me to accept the
idea that it's worth the complexity of trying to keep track of when we
expanded things in one order vs. another.  There are other
applications of having things in bound order too, like MergeAppend ->
Append strength-reduction (which might not be legal anyway if there
are list partitions with multiple, non-contiguous list bounds or if
any NULL partition doesn't end up in the right place in the order, but
there will be lots of cases where it can work).

On another note, did you do anything about the suggestion Thomas made
in http://postgr.es/m/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com
?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/07/26 6:07, Robert Haas wrote:
> On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
>> the result rels in canonical order. This patch is kept separate from
>> the update-partition-key patch, and can be applied on master branch.
>
> I suspect this isn't correct for a table that contains wCTEs, because
> there would in that case be multiple result relations.
> 
> I think we should always expand in bound order rather than only when
> it's a result relation. I think for partition-wise join, we're going
> to want to do it this way for all relations in the query, or at least
> for all relations in the query that might possibly be able to
> participate in a partition-wise join.  If there are multiple cases
> that are going to need this ordering, it's hard for me to accept the
> idea that it's worth the complexity of trying to keep track of when we
> expanded things in one order vs. another.  There are other
> applications of having things in bound order too, like MergeAppend ->
> Append strength-reduction (which might not be legal anyway if there
> are list partitions with multiple, non-contiguous list bounds or if
> any NULL partition doesn't end up in the right place in the order, but
> there will be lots of cases where it can work).

Sorry to be responding this late to Amit's make_resultrel_ordered
patch itself, but I agree that we should teach the planner to *always*
expand partitioned tables in the partition bound order.

When working on something else, I ended up writing a prerequisite patch
that refactors RelationGetPartitionDispatchInfo() to not be too tied to
its current usage for tuple-routing, so that it can now be used in the
planner (for example, in expand_inherited_rtentry(), instead of
find_all_inheritors()).  If we could adopt that patch, we can focus on the
update partition row movement issues more closely on this thread, rather
than the concerns about the order that planner puts partitions into.

I checked that we get the same result relation order with both the
patches, but I would like to highlight a notable difference here between
the approaches taken by our patches.  In my patch, I have now taught
RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
in the tree, because we need to look at its partition descriptor to
collect partition OIDs and bounds.  We can defer locking (and opening the
relation descriptor of) leaf partitions to a point where planner has
determined that the partition will be accessed after all (not pruned),
which will be done in a separate patch of course.

Sorry again that I didn't share this patch sooner.

Thanks,
Amit


Attachment

Re: [HACKERS] UPDATE of partition key

From
Etsuro Fujita
Date:
On 2017/07/26 6:07, Robert Haas wrote:
> On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
>> the result rels in canonical order. This patch is kept separate from
>> the update-partition-key patch, and can be applied on master branch.

Thank you for working on this, Amit!

> Hmm, I like the approach you've taken here in general,

+1 for the approach.

> Is there any real benefit in this "walker" interface?  It looks to me
> like it might be simpler to just change things around so that it
> returns a list of OIDs, like find_all_inheritors, but generated
> differently.  Then if you want bound-ordering rather than
> OID-ordering, you just do this:
> 
> list_free(inhOids);
> inhOids = get_partition_oids_in_bound_order(rel);
> 
> That'd remove the need for some if/then logic as you've currently got
> in get_next_child().

Yeah, that would make the code much simple, so +1 for Robert's idea.

> I think we should always expand in bound order rather than only when
> it's a result relation. I think for partition-wise join, we're going
> to want to do it this way for all relations in the query, or at least
> for all relations in the query that might possibly be able to
> participate in a partition-wise join.  If there are multiple cases
> that are going to need this ordering, it's hard for me to accept the
> idea that it's worth the complexity of trying to keep track of when we
> expanded things in one order vs. another.  There are other
> applications of having things in bound order too, like MergeAppend ->
> Append strength-reduction (which might not be legal anyway if there
> are list partitions with multiple, non-contiguous list bounds or if
> any NULL partition doesn't end up in the right place in the order, but
> there will be lots of cases where it can work).

+1 for that as well.  Another benefit from that would be EXPLAIN; we 
could display partitions for a partitioned table in the same order for 
Append and ModifyTable (ie, SELECT/UPDATE/DELETE), which I think would 
make the EXPLAIN result much more readable.

Best regards,
Etsuro Fujita




Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/07/25 21:55, Rajkumar Raghuwanshi wrote:
> Got one more observation :  update... returning is not working with whole
> row reference. please take a look.
> 
> postgres=# create table part (a int, b int) partition by range(a);
> CREATE TABLE
> postgres=# create table part_p1 partition of part for values from
> (minvalue) to (0);
> CREATE TABLE
> postgres=# create table part_p2 partition of part for values from (0) to
> (maxvalue);
> CREATE TABLE
> postgres=# insert into part values (10,1);
> INSERT 0 1
> postgres=# insert into part values (20,2);
> INSERT 0 1
> postgres=# update part t1 set a = b returning t1;
> ERROR:  unexpected whole-row reference found in partition key

That looks like a bug which exists in HEAD too.  I posted a patch in a
dedicated thread to address the same [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/9a39df80-871e-6212-0684-f93c83be4097%40lab.ntt.co.jp




Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 26 July 2017 at 02:37, Robert Haas <robertmhaas@gmail.com> wrote:
> Is there any real benefit in this "walker" interface?  It looks to me
> like it might be simpler to just change things around so that it
> returns a list of OIDs, like find_all_inheritors, but generated
> differently.  Then if you want bound-ordering rather than
> OID-ordering, you just do this:
>
> list_free(inhOids);
> inhOids = get_partition_oids_in_bound_order(rel);
>
> That'd remove the need for some if/then logic as you've currently got
> in get_next_child().

Yes, I had considered that; i.e., first generating just a list of
bound-ordered oids. But that consequently needs all the child tables
to be opened and closed twice: once during the list generation, and
then again while expanding the partitioned table. Agreed, the second
time heap_open() would not be that expensive because the tables would
be cached, but it would still require getting the cached relation
handle from the hash table. Since we anyway want to open the tables,
it is better to have a *next() function to go get the next partition
in a fixed order.

Actually, there isn't much that the walker next() function does. Any
code that wants to traverse bound-wise can do that on its own. The
walker function is just a convenient way to make sure everyone
traverses in the same order.

Yet to go over other things including your review comments, and Amit
Langote's patch on refactoring RelationGetPartitionDispatchInfo().

> On another note, did you do anything about the suggestion Thomas made
> in http://postgr.es/m/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com
> ?

This is still pending on me; plus I think there are some more points.
I need to go over those and consolidate a list of todos.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry to be responding this late to the Amit's make_resultrel_ordered
> patch itself, but I agree that we should teach the planner to *always*
> expand partitioned tables in the partition bound order.

Sounds like we have unanimous agreement on that point.  Yesterday, I
was discussing with Beena Emerson, who is working on run-time
partition pruning, that it would also be useful for that purpose, if
you're trying to prune based on a range query.

> I checked that we get the same result relation order with both the
> patches, but I would like to highlight a notable difference here between
> the approaches taken by our patches.  In my patch, I have now taught
> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
> in the tree, because we need to look at its partition descriptor to
> collect partition OIDs and bounds.  We can defer locking (and opening the
> relation descriptor of) leaf partitions to a point where planner has
> determined that the partition will be accessed after all (not pruned),
> which will be done in a separate patch of course.

That's very desirable, but I believe it introduces a deadlock risk
which Amit's patch avoids.  A transaction using the code you've
written here is eventually going to lock all partitions, BUT it's
going to move the partitioned ones to the front of the locking order
vs. what find_all_inheritors would do.  So, when multi-level
partitioning is in use, I think it could happen that some other
transaction is accessing the table using a different code path that
uses the find_all_inheritors order without modification.  If those
locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Unfortunately I don't see any easy way around that problem, but maybe
somebody else has an idea.
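
To make the difference in lock ordering concrete, here is a small sketch
(hypothetical table names); the comments show the two orders being
compared, which is what the deadlock concern is about:

CREATE TABLE tab (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE tab_0 PARTITION OF tab FOR VALUES FROM (0) TO (100);       -- leaf
CREATE TABLE tab_1 PARTITION OF tab FOR VALUES FROM (100) TO (200)
    PARTITION BY RANGE (b);                                             -- partitioned
CREATE TABLE tab_1_1 PARTITION OF tab_1 FOR VALUES FROM (0) TO (100);   -- leaf

-- find_all_inheritors order (level by level, OID order within a level):
--     tab, tab_0, tab_1, tab_1_1
-- Order with partitioned tables locked first (leaf locking deferred):
--     tab, tab_1, ... then tab_0, tab_1_1 once they are known to be used
-- Two sessions taking conflicting locks in these two different orders
-- (e.g. a query vs. a DROP) is what could get us into trouble.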

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Sorry to be responding this late to the Amit's make_resultrel_ordered
>> patch itself, but I agree that we should teach the planner to *always*
>> expand partitioned tables in the partition bound order.
>
> Sounds like we have unanimous agreement on that point.

I too agree.

>
>> I checked that we get the same result relation order with both the
>> patches, but I would like to highlight a notable difference here between
>> the approaches taken by our patches.  In my patch, I have now taught
>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>> in the tree, because we need to look at its partition descriptor to
>> collect partition OIDs and bounds.  We can defer locking (and opening the
>> relation descriptor of) leaf partitions to a point where planner has
>> determined that the partition will be accessed after all (not pruned),
>> which will be done in a separate patch of course.

With Amit Langote's patch, we can very well do the locking beforehand
using find_all_inheritors(), and then run
RelationGetPartitionDispatchInfo() with NoLock, so as to remove the
deadlock problem. But I think we should keep these two tasks separate,
i.e. expanding the partition tree in bound order, and making
RelationGetPartitionDispatchInfo() work for the planner.

Regarding building the PartitionDispatchInfo in the planner, we should
do that only after it is known that the partition key columns are
updated, so it can't be done in expand_inherited_rtentry() because that
would be too soon. For planner setup, RelationGetPartitionDispatchInfo()
should just build the tupmap for each partitioned table, and then
initialize the rest of the fields like tupslot, reldesc, etc. later
during execution.

So for now, I feel we should just do the changes for making sure the
order is the same, and then on top of that, separately modify
RelationGetPartitionDispatchInfo() for the planner.

>
> That's very desirable, but I believe it introduces a deadlock risk
> which Amit's patch avoids.  A transaction using the code you've
> written here is eventually going to lock all partitions, BUT it's
> going to move the partitioned ones to the front of the locking order
> vs. what find_all_inheritors would do.  So, when multi-level
> partitioning is in use, I think it could happen that some other
> transaction is accessing the table using a different code path that
> uses the find_all_inheritors order without modification.  If those
> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Yes, I agree. Even with single-level partitioning, find_all_inheritors()
orders the leaf partitions by OID, so that's also going to be ordered
differently.

>
> Unfortunately I don't see any easy way around that problem, but maybe
> somebody else has an idea.

One approach I had considered was to have find_inheritance_children()
itself lock the children in bound order, so that everyone will have
bound-ordered oids, but that would be too expensive since it requires
opening all partitioned tables to initialize partition descriptors. In
find_inheritance_children(), we get all oids without opening any
tables. But now that I think more about it, it's only the partitioned
tables that we have to open, not the leaf partitions; and furthermore,
I didn't see calls to find_inheritance_children() and
find_all_inheritors() in performance-critical code, except in
expand_inherited_rtentry(). All of them are in DDL commands; but yes,
that can change in the future.

Regarding dynamically locking specific partitions as and when needed,
I think this method inherently has the issue of deadlock because the
order would be random. So it feels like there is no way around it other
than to lock all partitions beforehand.

----------------

Regarding using the first resultrel for mapping RETURNING and WCO, I think
we can use (a renamed) getASTriggerResultRelInfo() to get the root
result relation, and use the WCO and RETURNING expressions of that
relation to do the mapping for child rels. This way, there won't be
insert/update-specific code, and we don't need to use the first result
relation.

While checking the whole-row bug on the other thread [1], I noticed
that the RETURNING/WCO expressions for the per-subplan result rels are
formed by considering not just simple vars, but also whole-row vars
and other nodes. So for update tuple routing, some result rels' WCOs
would be formed using adjust_appendrel_attrs(), while for others they
would be built using map_partition_varattnos(), which only considers
simple vars. So the bug in [1] would be present for
update-partition-key as well, when the tuple is routed into a newly
built resultrel. Maybe, while fixing the bug in [1], this will be
automatically solved.

----------------

Below are the TODOs at this point:

Fix for bug reported by Rajkumar about update with join.
Do something about two separate mapping tables for Transition tables
and update tuple-routing.
GetUpdatedColumns() to be moved to header file.
More test scenarios in regression tests.
Need to check/test whether we are correctly applying insert policies
(and not update) while inserting a routed tuple.
Use getASTriggerResultRelInfo() for attrno mapping, rather than first
resultrel, for generating child WCO/RETURNING expression.
Address Robert's review comments on make_resultrel_ordered.patch.
pgindent.

[1] https://www.postgresql.org/message-id/d86d27ea-cc9d-5dbe-b131-e7dec4017983%40lab.ntt.co.jp

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/07/29 2:45, Amit Khandekar wrote:
> On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:
>>> I checked that we get the same result relation order with both the
>>> patches, but I would like to highlight a notable difference here between
>>> the approaches taken by our patches.  In my patch, I have now taught
>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>>> in the tree, because we need to look at its partition descriptor to
>>> collect partition OIDs and bounds.  We can defer locking (and opening the
>>> relation descriptor of) leaf partitions to a point where planner has
>>> determined that the partition will be accessed after all (not pruned),
>>> which will be done in a separate patch of course.
>>
>> That's very desirable, but I believe it introduces a deadlock risk
>> which Amit's patch avoids.  A transaction using the code you've
>> written here is eventually going to lock all partitions, BUT it's
>> going to move the partitioned ones to the front of the locking order
>> vs. what find_all_inheritors would do.  So, when multi-level
>> partitioning is in use, I think it could happen that some other
>> transaction is accessing the table using a different code path that
>> uses the find_all_inheritors order without modification.  If those
>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.
> 
> Yes, I agree. Even with single-level partitioning, the leaf partitions
> ordered by find_all_inheritors() is by oid values, so that's also
> going to be differently ordered.

We do require the parent to be locked first in any case.  Doesn't that
prevent deadlocks by imparting an implicit order on locking for
operations whose locks conflict?

Having said that, I think it would be desirable for all code paths to
manipulate partitions in the same order.  For partitioned tables, I think
we can make it the partition bound order by replacing all calls to
find_all_inheritors and find_inheritance_children on partitioned table
parents with something else that reads partition OIDs from the relcache
(PartitionDesc) and traverses the partition tree in a breadth-first manner.

>> Unfortunately I don't see any easy way around that problem, but maybe
>> somebody else has an idea.
> 
> One approach I had considered was to have find_inheritance_children()
> itself lock the children in bound order, so that everyone will have
> bound-ordered oids, but that would be too expensive since it requires
> opening all partitioned tables to initialize partition descriptors. In
> find_inheritance_children(), we get all oids without opening any
> tables. But now that I think more of it, it's only the partitioned
> tables that we have to open, not the leaf partitions; and furthermore,
> I didn't see calls to find_inheritance_children() and
> find_all_inheritors() in performance-critical code, except in
> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
> that can change in the future.

This approach more or less amounts to calling the new
RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
which I posted upthread).  Maybe we can add a wrapper on top, say,
get_all_partition_oids(), which throws away the other things that
RelationGetPartitionDispatchInfo() returns.  In addition, it would lock
all the partitions that are returned, rather than only the partitioned
ones, which is what RelationGetPartitionDispatchInfo() has been taught to do.

> Regarding dynamically locking specific partitions as and when needed,
> I think this method inherently has the issue of deadlock because the
> order would be random. So it feels like there is no way around other
> than to lock all partitions beforehand.

I'm not sure why the order has to be random.  If and when we decide to
open and lock a subset of partitions for a given query, it will be done in
some canonical order as far as I can imagine.  Do you have some specific
example in mind?

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/07/29 2:45, Amit Khandekar wrote:
>> On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:
>>>> I checked that we get the same result relation order with both the
>>>> patches, but I would like to highlight a notable difference here between
>>>> the approaches taken by our patches.  In my patch, I have now taught
>>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>>>> in the tree, because we need to look at its partition descriptor to
>>>> collect partition OIDs and bounds.  We can defer locking (and opening the
>>>> relation descriptor of) leaf partitions to a point where planner has
>>>> determined that the partition will be accessed after all (not pruned),
>>>> which will be done in a separate patch of course.
>>>
>>> That's very desirable, but I believe it introduces a deadlock risk
>>> which Amit's patch avoids.  A transaction using the code you've
>>> written here is eventually going to lock all partitions, BUT it's
>>> going to move the partitioned ones to the front of the locking order
>>> vs. what find_all_inheritors would do.  So, when multi-level
>>> partitioning is in use, I think it could happen that some other
>>> transaction is accessing the table using a different code path that
>>> uses the find_all_inheritors order without modification.  If those
>>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.
>>
>> Yes, I agree. Even with single-level partitioning, the leaf partitions
>> ordered by find_all_inheritors() is by oid values, so that's also
>> going to be differently ordered.
>
> We do require to lock the parent first in any case.  Doesn't that prevent
> deadlocks by imparting an implicit order on locking by operations whose
> locks conflict.

Yes, maybe, but I am not too sure at this point. find_all_inheritors()
locks only the children, and the parent is already locked separately.
find_all_inheritors() does not require locking the children with the
same lockmode as the parent.

> Having said that, I think it would be desirable for all code paths to
> manipulate partitions in the same order.  For partitioned tables, I think
> we can make it the partition bound order by replacing all calls to
> find_all_inheritors and find_inheritance_children on partitioned table
> parents with something else that reads partition OIDs from the relcache
> (PartitionDesc) and traverses the partition tree breadth-first manner.
>
>>> Unfortunately I don't see any easy way around that problem, but maybe
>>> somebody else has an idea.
>>
>> One approach I had considered was to have find_inheritance_children()
>> itself lock the children in bound order, so that everyone will have
>> bound-ordered oids, but that would be too expensive since it requires
>> opening all partitioned tables to initialize partition descriptors. In
>> find_inheritance_children(), we get all oids without opening any
>> tables. But now that I think more of it, it's only the partitioned
>> tables that we have to open, not the leaf partitions; and furthermore,
>> I didn't see calls to find_inheritance_children() and
>> find_all_inheritors() in performance-critical code, except in
>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>> that can change in the future.
>
> This approach more or less amounts to calling the new
> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
> which I posted upthread.)  Maybe we can add a wrapper on top, say,
> get_all_partition_oids() which throws away other things that
> RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
> partitions that are returned, unlike only the partitioned ones, which is
> what RelationGetPartitionDispatchInfo() has been taught to do.

So there are three different task items here:
1. Arrange the OIDs in a consistent order everywhere.
2. Prepare the Partition Dispatch Info data structure in the planner
rather than during execution.
3. For update tuple routing, assume that the result rels are ordered
consistently to make the searching efficient.

#3 depends on #1. So for that, I have come up with a minimal set of
changes to have expand_inherited_rtentry() generate the rels in bound
order. When we do #2, we may need to redo my changes in
expand_inherited_rtentry(), but those changes are minimal. We may even
end up using the walker function in multiple places, but that is not
certain right now.

So, I think we can continue the discussion about #1 and #2 in a separate thread.

>
>> Regarding dynamically locking specific partitions as and when needed,
>> I think this method inherently has the issue of deadlock because the
>> order would be random. So it feels like there is no way around other
>> than to lock all partitions beforehand.
>
> I'm not sure why the order has to be random.  If and when we decide to
> open and lock a subset of partitions for a given query, it will be done in
> some canonical order as far as I can imagine.  Do you have some specific
> example in mind?

Partitioned table t1 has partitions t1p1 and t1p2
Partitioned table t2 at the same level has partitions t2p1 and t2p2
Tuple routing causes the first row to insert into t2p2, so t2p2 is locked.
Next insert locks t1p1 because it inserts into t1p1.
But at the same time, somebody does DDL on some parent common to t1
and t2, so it locks the leaf partitions in a fixed specific order,
which would be different from the insert lock order, because that order
depended on the order of the tables that the insert rows were routed to.
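
For illustration, a rough SQL sketch of that scenario (the lazy, per-row
locking in session 1 is hypothetical; the current code locks all
partitions up front):

CREATE TABLE root (k int, v int) PARTITION BY RANGE (k);
CREATE TABLE t1 PARTITION OF root FOR VALUES FROM (0) TO (100)
    PARTITION BY RANGE (v);
CREATE TABLE t1p1 PARTITION OF t1 FOR VALUES FROM (0) TO (50);
CREATE TABLE t1p2 PARTITION OF t1 FOR VALUES FROM (50) TO (100);
CREATE TABLE t2 PARTITION OF root FOR VALUES FROM (100) TO (200)
    PARTITION BY RANGE (v);
CREATE TABLE t2p1 PARTITION OF t2 FOR VALUES FROM (0) TO (50);
CREATE TABLE t2p2 PARTITION OF t2 FOR VALUES FROM (50) TO (100);

-- Session 1: the routing order of the incoming rows would decide the
-- (hypothetical) lazy lock order: t2p2 first, then t1p1.
INSERT INTO root VALUES (150, 60);   -- routed to t2p2
INSERT INTO root VALUES (10, 10);    -- routed to t1p1

-- Session 2: DDL on the common parent, which locks all leaf partitions
-- in a fixed order, e.g. t1p1, t1p2, t2p1, t2p2:
--   ALTER TABLE root ADD COLUMN note text;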


>
> Thanks,
> Amit
>



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/08/02 19:49, Amit Khandekar wrote:
> On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> One approach I had considered was to have find_inheritance_children()
>>> itself lock the children in bound order, so that everyone will have
>>> bound-ordered oids, but that would be too expensive since it requires
>>> opening all partitioned tables to initialize partition descriptors. In
>>> find_inheritance_children(), we get all oids without opening any
>>> tables. But now that I think more of it, it's only the partitioned
>>> tables that we have to open, not the leaf partitions; and furthermore,
>>> I didn't see calls to find_inheritance_children() and
>>> find_all_inheritors() in performance-critical code, except in
>>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>>> that can change in the future.
>>
>> This approach more or less amounts to calling the new
>> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
>> which I posted upthread.)  Maybe we can add a wrapper on top, say,
>> get_all_partition_oids() which throws away other things that
>> RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
>> partitions that are returned, unlike only the partitioned ones, which is
>> what RelationGetPartitionDispatchInfo() has been taught to do.
> 
> So there are three different task items here :
> 1. Arrange the oids in consistent order everywhere.
> 2. Prepare the Partition Dispatch Info data structure in the planner
> as against during execution.
> 3. For update tuple routing, assume that the result rels are ordered
> consistently to make the searching efficient.

That's a good breakdown.

> #3 depends on #1. So for that, I have come up with a minimum set of
> changes to have expand_inherited_rtentry() generate the rels in bound
> order. When we do #2 , it may be possible that we may need to re-do my
> changes in expand_inherited_rtentry(), but those are minimum. We may
> even end up having the walker function being used at multiple places,
> but right now it is not certain.

So AFAICS:

For performance reasons, we want the order in which leaf partition
sub-plans appear in the ModifyTable node (and subsequently the leaf
partition ResultRelInfos in ModifyTableState) to be some known canonical
order.  That's because we want to map partitions in the insert
tuple-routing data structure (which appear in a known canonical order as
determined by RelationGetPartitionDispatchInfo) to those appearing in the
ModifyTableState.  That's so that we can reuse the planner-generated WCO
and RETURNING lists in the insert code path when update tuple-routing
invokes that path.

To implement that, the planner should retrieve the list of leaf partition
OIDs in the same order as ExecSetupPartitionTupleRouting() retrieves them.
Because the latter calls RelationGetPartitionDispatchInfo on the root
partitioned table, maybe the planner should do that too, instead of its
current method of getting OIDs using find_all_inheritors().  But that's
currently not possible due to the way RelationGetPartitionDispatchInfo()
and the data structures involved are designed.

One way forward I see is to invent new interface functions:
  List *get_all_partition_oids(Oid, LOCKMODE)
  List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users make sure they are called only for
partitioned tables.  Needless to mention, OIDs are returned in a
canonical order determined by that of the partition bounds and the
partition tree structure.  We replace all calls of the old interface
functions with the respective new ones.  That means
expand_inherited_rtentry (among others) now calls get_all_partition_oids()
if the RTE is for a partitioned table and find_all_inheritors() otherwise.

> So, I think we can continue the discussion about #1 and #2 in a separate thread.

I have started a new thread named "expanding inheritance in partition
bound order" and posted a couple of patches [1].

After applying those patches, you can write code for #3 without having to
worry about the concerns of partition order, which I guess you've already
done.

>>> Regarding dynamically locking specific partitions as and when needed,
>>> I think this method inherently has the issue of deadlock because the
>>> order would be random. So it feels like there is no way around other
>>> than to lock all partitions beforehand.
>>
>> I'm not sure why the order has to be random.  If and when we decide to
>> open and lock a subset of partitions for a given query, it will be done in
>> some canonical order as far as I can imagine.  Do you have some specific
>> example in mind?
> 
> Partitioned table t1 has partitions t1p1 and t1p2
> Partitioned table t2 at the same level has partitions t2p1 and t2p2
> Tuple routing causes the first row to insert into t2p2, so t2p2 is locked.
> Next insert locks t1p1 because it inserts into t1p1.
> But at the same time, somebody does DDL on some parent common to t1
> and t2, so it locks the leaf partitions in a fixed specific order,
> which would be different than the insert lock order because that order
> depended upon the order of tables that the insert rows were routed to.

Note that we don't currently do this.  That is, lock partitions in an
order determined by incoming rows.  ExecSetupPartitionTupleRouting() locks
(RowExclusiveLock) all the partitions beforehand in the partition bound
order.  Any future patch that wants to delay locking and opening the
relation descriptor of a leaf partition to when a tuple is actually routed
to it will have to think hard about the deadlock problem you illustrate above.

Aside from the insert case, let's consider locking order when planning a
select on a partitioned table.  We currently lock all the partitions in
advance in expand_inherited_rtentry().  When replacing the current method
by some new way, we will first determine all the partitions that satisfy a
given query, collect them in an ordered list (some fixed canonical order),
and lock them in that order.

But maybe I misunderstood what you said?

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp




Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
>
> Below are the TODOS at this point :
>
> Fix for bug reported by Rajkumar about update with join.

I had explained the root issue of this bug here: [1]

The attached patch includes the fix, which is explained below.
Currently in the patch, there is a check for whether the tuple has been
concurrently deleted by another session, i.e. when heap_update() returns
HeapTupleUpdated. In such a case we set the concurrently_deleted output
param to true. We should also do the same for the HeapTupleSelfUpdated
return value.

In fact, there are other places in ExecDelete() where it can return
without doing anything. For example, if a BR DELETE trigger prevents the
delete from happening, ExecBRDeleteTriggers() returns false, in which
case ExecDelete() returns.

So what the fix does is: rename the concurrently_deleted parameter to
delete_skipped so as to indicate a more general status: whether the
delete actually happened or was skipped. And set this param to true
only after the delete happens. This allows us to avoid adding a
new row for the trigger case also.

Added test scenario for UPDATE with JOIN case, and also TRIGGER case.
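
For reference, the trigger case looks roughly like this (a minimal sketch
with invented names; the expected outcome is what the fix aims for: a
skipped delete must also skip the insert, so the row is neither moved nor
duplicated):

CREATE TABLE pt (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE pt_neg PARTITION OF pt FOR VALUES FROM (MINVALUE) TO (0);
CREATE TABLE pt_pos PARTITION OF pt FOR VALUES FROM (0) TO (MAXVALUE);
INSERT INTO pt VALUES (5, 1);

-- BR DELETE trigger on the source partition that suppresses the delete.
CREATE FUNCTION keep_row() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RETURN NULL; END; $$;

CREATE TRIGGER pt_pos_keep BEFORE DELETE ON pt_pos
    FOR EACH ROW EXECUTE PROCEDURE keep_row();

-- This would normally move the row from pt_pos to pt_neg; with the delete
-- skipped by the trigger, the insert into pt_neg must be skipped too.
UPDATE pt SET a = -5 WHERE a = 5;
SELECT tableoid::regclass, * FROM pt;   -- the row should still be in pt_pos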

> Do something about two separate mapping tables for Transition tables
> and update tuple-routing.
On 1 July 2017 at 03:15, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Would make sense to have a set of functions with names like
> GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
> m_convertors_{from,to}_by_{subplan,leaf} the first time they need
> them?

This was discussed here: [2]. I think even if we have them built when
needed, in the presence of both tuple routing and transition tables we
still need separate arrays. So rather than dynamic arrays, we can have
static arrays whose elements point to a shared TupleConversionMap
structure whenever possible.
As already done in the patch, in case of insert/update tuple routing,
there is a per-leaf-partition mt_transition_tupconv_maps array for
transition tables, and a separate per-subplan array mt_resultrel_maps
for update tuple routing. *But*, what I am proposing is: for a
mt_transition_tupconv_maps[] element whose leaf partition also exists
as a per-subplan result rel, that array element and the corresponding
mt_resultrel_maps[] element will point to the same TupleConversionMap
structure.

This is quite similar to how we are re-using the per-subplan
resultrels for the per-leaf result rels. We will re-use the
per-subplan TupleConversionMap for the per-leaf
mt_transition_tupconv_maps[] elements.

Not yet implemented this.

> GetUpdatedColumns() to be moved to header file.

Done. I have moved it to execnodes.h.

> More test scenarios in regression tests.
> Need to check/test whether we are correctly applying insert policies
> (ant not update) while inserting a routed tuple.

Yet to do above two.

> Use getASTriggerResultRelInfo() for attrno mapping, rather than first
> resultrel, for generating child WCO/RETURNING expression.
>

Regarding generating child WithCheckOption and Returning expressions
using those of the root result relation, ModifyTablePath and
ModifyTable should have new fields rootReturningList (and
rootWithCheckOptions) which would be derived from
root->parse->returningList in inheritance_planner(). But then, similar
to the per-subplan returningList, rootReturningList would have to pass
through set_plan_refs()=>set_returning_clause_references(), which
requires the subplan targetlist to be passed. Because of this, for
rootReturningList we require a subplan for the root partition, which is
not there currently; we have subplans only for child rels. That means
we would have to create such a plan only for the sake of generating
rootReturningList.

The other option is to do what the patch currently does in the
executor: use the returningList of the first per-subplan result rel to
generate the other children's returningList (and WithCheckOption).
This works by applying map_partition_varattnos() to the first
returningList. But now that we have realized that whole-row vars need
special handling, map_partition_varattnos() would need some changes to
convert whole-row vars differently for child-rel-to-child-rel mapping.
For childrel-to-childrel conversion, the whole-row var is already
wrapped by ConvertRowtypeExpr, but we need to change its Var->vartype
to the new child vartype.

I think the second option looks easier, but I am open to suggestions,
and I am still checking the first one myself.

> Address Robert's review comments on make_resultrel_ordered.patch.
>
> +typedef struct ParentChild
>
> This is a pretty generic name.  Pick something more specific and informative.

I have used ChildPartitionInfo. But suggestions welcome.

>
> +static List *append_rel_partition_oids(List *rel_list, Relation rel);
>
> One could be forgiven for thinking that this function was just going
> to append OIDs, but it actually appends ParentChild structures, so I
> think the name needs work.

Renamed it to append_child_partitions().

>
> +List *append_rel_partition_oids(List *rel_list, Relation rel)
>
> Style.  Please pgindent your patches.

I have pgindent'ed the changes in nodeModifyTable.c and partition.c;
yet to do that for the others.

>
> +#ifdef DEBUG_PRINT_OIDS
> +    print_oids(*leaf_part_oids);
> +#endif
>
> I'd just rip out this debug stuff once you've got this working, but if
> we keep it, it certainly can't have a name as generic as print_oids()
> when it's actually doing something with a list of ParentChild
> structures.  Also, it prints names, not OIDs.  And DEBUG_PRINT_OIDS is
> no good for the same reasons.

Now that I have tested it, I have removed this. Also, the ordered
subplans printed in the EXPLAIN output serve the same purpose.

>
> +    if (RelationGetPartitionDesc(rel))
> +        walker->rels_list = append_rel_partition_oids(walker->rels_list, rel);
>
> Every place that calls append_rel_partition_oids guards that call with
> if (RelationGetPartitionDesc(...)).  It seems to me that it would be
> simpler to remove those tests and instead just replace the
> Assert(partdesc) inside that function with if (!partdesc) return;

Done.

>
> Is there any real benefit in this "walker" interface?  It looks to me
> like it might be simpler to just change things around so that it
> returns a list of OIDs, like find_all_inheritors, but generated
> differently.  Then if you want bound-ordering rather than
> OID-ordering, you just do this:
>
> list_free(inhOids);
> inhOids = get_partition_oids_in_bound_order(rel);
>
> That'd remove the need for some if/then logic as you've currently got
> in get_next_child().

I have explained this here:
https://www.postgresql.org/message-id/CAJ3gD9dQ2FKes8pP6aM-4Tx3ngqWvD8oyOJiDRxLVoQiY76t0A%40mail.gmail.com
I am aware that this might change once we check in the separate
patch just floated to expand inheritance in bound order.

>
> +    is_partitioned_resultrel =
> +        (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE
> +         && rti == parse->resultRelation);
>
> I suspect this isn't correct for a table that contains wCTEs, because
> there would in that case be multiple result relations.
>
> I think we should always expand in bound order rather than only when
> it's a result relation.
I have changed it to always expand partitioned tables in bound order.


[1]. https://www.postgresql.org/message-id/CAKcux6%3Dz38gH4K6YAFi%2BYvo5tHTwBL4tam4VM33CAPZ5dDMk1Q%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com


Attachment

Re: [HACKERS] UPDATE of partition key

From
Rajkumar Raghuwanshi
Date:

On Fri, Aug 4, 2017 at 10:28 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> Below are the TODOS at this point :
>
> Fix for bug reported by Rajkumar about update with join.

> I had explained the root issue of this bug here : [1]
>
> Attached patch includes the fix, which is explained below.

Hi Amit,

I have applied the v14 patch and tested it from my side; everything looks good to me. Attaching some test cases and output files for reference.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>
>> Below are the TODOS at this point :
>>
>> Do something about two separate mapping tables for Transition tables
>> and update tuple-routing.
> On 1 July 2017 at 03:15, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Would make sense to have a set of functions with names like
>> GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
>> m_convertors_{from,to}_by_{subplan,leaf} the first time they need
>> them?
>
> This was discussed here : [2]. I think even if we have them built when
> needed, still in presence of both tuple routing and transition tables,
> we do need separate arrays. So I think rather than dynamic arrays, we
> can have static arrays but their elements will point to  a shared
> TupleConversionMap structure whenever possible.
> As already in the patch, in case of insert/update tuple routing, there
> is a per-leaf partition mt_transition_tupconv_maps array for
> transition tables, and a separate per-subplan arry mt_resultrel_maps
> for update tuple routing. *But*, what I am proposing is: for the
> mt_transition_tupconv_maps[] element for which the leaf partition also
> exists as a per-subplan result, that array element and the
> mt_resultrel_maps[] element will point to the same TupleConversionMap
> structure.
>
> This is quite similar to how we are re-using the per-subplan
> resultrels for the per-leaf result rels. We will re-use the
> per-subplan TupleConversionMap for the per-leaf
> mt_transition_tupconv_maps[] elements.
>
> Not yet implemented this.

The attached patch has the needed changes described above. We now have
the following map arrays in ModifyTableState (the earlier naming was
confusing, so I renamed them):
mt_perleaf_parentchild_maps: used for converting insert/update routed
tuples from the root to the destination leaf partition.
mt_perleaf_childparent_maps: used by transition tables for converting
tuples back from a leaf partition to the root.
mt_persubplan_childparent_maps: used by both transition tables and
update row movement, each for its own purpose, during UPDATEs.

I also had to add another partition slot, mt_rootpartition_tuple_slot,
alongside mt_partition_tuple_slot. For update row movement, in
ExecInsert(), we used to have a common slot for the root partition's
tuple as well as the leaf partition tuple, so the former was a
transient tuple. But mtstate->mt_transition_capture->tcs_original_insert_tuple
requires the tuple to be valid, so we could not pass a transient
tuple. Hence another partition slot.

-------

But in the first place, while testing transition table behaviour with
update row movement, I found out that the transition tables OLD TABLE
and NEW TABLE don't get populated with the rows that are moved to
another partition. This is because the operation is ExecDelete() plus
ExecInsert(), which don't run the transition-related triggers for
updates. Even though transition-table triggers are statement-level,
the AR ROW trigger-related functions like ExecARUpdateTriggers() do
get run for each row so that the tables get populated, and they skip
the usual row-level trigger stuff. For update row movement, we need to
teach ExecARUpdateTriggers() to run the transition-related processing
for the DELETE+INSERT operation as well. But since the delete and insert
happen on different tables, we cannot call ExecARUpdateTriggers() at a
single place. We need to call it once after ExecDelete() for loading
the OLD row, and then after ExecInsert() for loading the NEW row.
Also, currently ExecARUpdateTriggers() does not allow a NULL old tuple
or new tuple, but we need to allow that for the above transition table
processing.

The attached patch has the above needed changes.
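
As an illustration of the intended behaviour, a sketch with invented names
(assuming statement-level triggers with transition tables are allowed on
the partitioned parent); after the change, the moved row should show up in
both transition tables:

CREATE TABLE pt (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (0) TO (10);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (10) TO (20);
INSERT INTO pt VALUES (5, 'five');

CREATE FUNCTION report_counts() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE 'old table: % row(s), new table: % row(s)',
        (SELECT count(*) FROM old_rows), (SELECT count(*) FROM new_rows);
    RETURN NULL;
END;
$$;

CREATE TRIGGER pt_upd_stmt AFTER UPDATE ON pt
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE report_counts();

-- Moves the row from pt1 to pt2; both counts should be 1 even though the
-- move is executed internally as a DELETE plus an INSERT.
UPDATE pt SET a = 15 WHERE a = 5;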

>
>> Use getASTriggerResultRelInfo() for attrno mapping, rather than first
>> resultrel, for generating child WCO/RETURNING expression.
>>
>
> Regarding generating child WithCheckOption and Returning expressions
> using those of the root result relation, ModifyTablePath and
> ModifyTable should have new fields rootReturningList (and
> rootWithCheckOptions) which would be derived from
> root->parse->returningList in inheritance_planner(). But then, similar
> to per-subplan returningList, rootReturningList would have to pass
> through set_plan_refs()=>set_returning_clause_references() which
> requires the subplan targetlist to be passed. Because of this, for
> rootReturningList, we require a subplan for root partition, which is
> not there currently; we have subpans only for child rels. That means
> we would have to create such plan only for the sake of generating
> rootReturningList.
>
> The other option is to do the way the patch is currently doing in the
> executor by using the returningList of the first per-subplan result
> rel to generate the other child returningList (and WithCheckOption).
> This is working by applying map_partition_varattnos() to the first
> returningList. But now that we realized that we have to specially
> handle whole-row vars, map_partition_varattnos() would need some
> changes to convert whole row vars differently for
> child-rel-to-child-rel mapping. For childrel-to-childrel conversion,
> the whole-row var is already wrapped by ConvertRowtypeExpr, but we
> need to change its Var->vartype to the new child vartype.
>
> I think the second option looks easier, but I am open to suggestions,
> and I am myself still checking the first one.

I have done the changes using the second option above. In the attached
patch, the same map_partition_varattnos() is called for child-to-child
mapping. But in that case, the source child partition already has a
ConvertRowtypeExpr node, so another ConvertRowtypeExpr node is not
added; just the contained Var node is updated with the new composite
type. In the regression test, I have included different types like
numeric, int and text for the partition key columns, so as to test
this.
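
Roughly, this is the kind of case being exercised (invented names; the
second partition is attached with different attribute numbers, so the
whole-row var genuinely needs the child-to-child conversion):

CREATE TABLE pt (a numeric, b int, c text) PARTITION BY RANGE (a, b);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (1, 1) TO (1, 100);

-- Create the second partition standalone with an extra column, drop that
-- column, then attach it, so its attribute numbers differ from the parent's.
CREATE TABLE pt2 (dummy int, a numeric, b int, c text);
ALTER TABLE pt2 DROP COLUMN dummy;
ALTER TABLE pt ATTACH PARTITION pt2 FOR VALUES FROM (2, 1) TO (2, 100);

INSERT INTO pt VALUES (1, 10, 'one');

-- Row movement from pt1 to pt2 with a whole-row reference in RETURNING;
-- the per-child RETURNING expression must carry the correctly converted
-- row type for whichever partition the tuple ends up in.
UPDATE pt SET a = 2 WHERE a = 1 RETURNING pt, a, b, c;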

>> More test scenarios in regression tests.
>> Need to check/test whether we are correctly applying insert policies
>> (ant not update) while inserting a routed tuple.
>
> Yet to do above two.

This is still to do.


Attachment

Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>

I am planning to review and test this patch. It seems the patch needs
to be rebased.

[dilip@localhost postgresql]$ patch -p1 <
../patches/update-partition-key_v15.patch
patching file doc/src/sgml/ddl.sgml
patching file doc/src/sgml/ref/update.sgml
patching file doc/src/sgml/trigger.sgml
patching file src/backend/catalog/partition.c
Hunk #3 succeeded at 910 (offset -1 lines).
Hunk #4 succeeded at 924 (offset -1 lines).
Hunk #5 succeeded at 934 (offset -1 lines).
Hunk #6 succeeded at 994 (offset -1 lines).
Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines).
Hunk #8 FAILED at 1023.
Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines).
Hunk #10 succeeded at 2069 (offset 2 lines).
Hunk #11 succeeded at 2406 (offset 2 lines).
1 out of 11 hunks FAILED -- saving rejects to file
src/backend/catalog/partition.c.rej
patching file src/backend/commands/copy.c
Hunk #2 FAILED at 1426.
Hunk #3 FAILED at 1462.
Hunk #4 succeeded at 2616 (offset 7 lines).
Hunk #5 succeeded at 2726 (offset 8 lines).
Hunk #6 succeeded at 2846 (offset 8 lines).
2 out of 6 hunks FAILED -- saving rejects to file
src/backend/commands/copy.c.rej
patching file src/backend/commands/trigger.c
Hunk #4 succeeded at 5261 with fuzz 2.
patching file src/backend/executor/execMain.c
Hunk #1 succeeded at 65 (offset 1 line).
Hunk #2 succeeded at 103 (offset 1 line).
Hunk #3 succeeded at 1829 (offset 20 lines).
Hunk #4 succeeded at 1860 (offset 20 lines).
Hunk #5 succeeded at 1927 (offset 20 lines).
Hunk #6 succeeded at 2044 (offset 21 lines).
Hunk #7 FAILED at 3210.
Hunk #8 FAILED at 3244.
Hunk #9 succeeded at 3289 (offset 26 lines).
Hunk #10 FAILED at 3340.
Hunk #11 succeeded at 3387 (offset 29 lines).
Hunk #12 succeeded at 3424 (offset 29 lines).
3 out of 12 hunks FAILED -- saving rejects to file
src/backend/executor/execMain.c.rej
patching file src/backend/executor/execReplication.c
patching file src/backend/executor/nodeModifyTable.c

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
Thanks Dilip. I am working on rebasing the patch. In particular, the
partition walker in my patch depended on the fact that all the tables
get opened (and then closed) while creating the tuple routing info.
But in HEAD, now only the partitioned tables get opened, so some
changes are needed in my patch.

The partition walker related changes are going to be inapplicable once
the other thread [1] commits the changes for expansion of inheritance
in bound order, but till then I would have to rebase the partition
walker changes over HEAD.

[1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp


On 31 August 2017 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>>
>
> I am planning to review and test this patch, Seems like this patch
> needs to be rebased.
>
> [dilip@localhost postgresql]$ patch -p1 <
> ../patches/update-partition-key_v15.patch
> patching file doc/src/sgml/ddl.sgml
> patching file doc/src/sgml/ref/update.sgml
> patching file doc/src/sgml/trigger.sgml
> patching file src/backend/catalog/partition.c
> Hunk #3 succeeded at 910 (offset -1 lines).
> Hunk #4 succeeded at 924 (offset -1 lines).
> Hunk #5 succeeded at 934 (offset -1 lines).
> Hunk #6 succeeded at 994 (offset -1 lines).
> Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines).
> Hunk #8 FAILED at 1023.
> Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines).
> Hunk #10 succeeded at 2069 (offset 2 lines).
> Hunk #11 succeeded at 2406 (offset 2 lines).
> 1 out of 11 hunks FAILED -- saving rejects to file
> src/backend/catalog/partition.c.rej
> patching file src/backend/commands/copy.c
> Hunk #2 FAILED at 1426.
> Hunk #3 FAILED at 1462.
> Hunk #4 succeeded at 2616 (offset 7 lines).
> Hunk #5 succeeded at 2726 (offset 8 lines).
> Hunk #6 succeeded at 2846 (offset 8 lines).
> 2 out of 6 hunks FAILED -- saving rejects to file
> src/backend/commands/copy.c.rej
> patching file src/backend/commands/trigger.c
> Hunk #4 succeeded at 5261 with fuzz 2.
> patching file src/backend/executor/execMain.c
> Hunk #1 succeeded at 65 (offset 1 line).
> Hunk #2 succeeded at 103 (offset 1 line).
> Hunk #3 succeeded at 1829 (offset 20 lines).
> Hunk #4 succeeded at 1860 (offset 20 lines).
> Hunk #5 succeeded at 1927 (offset 20 lines).
> Hunk #6 succeeded at 2044 (offset 21 lines).
> Hunk #7 FAILED at 3210.
> Hunk #8 FAILED at 3244.
> Hunk #9 succeeded at 3289 (offset 26 lines).
> Hunk #10 FAILED at 3340.
> Hunk #11 succeeded at 3387 (offset 29 lines).
> Hunk #12 succeeded at 3424 (offset 29 lines).
> 3 out of 12 hunks FAILED -- saving rejects to file
> src/backend/executor/execMain.c.rej
> patching file src/backend/executor/execReplication.c
> patching file src/backend/executor/nodeModifyTable.c
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Thanks Dilip. I am working on rebasing the patch. Particularly, the
> partition walker in my patch depended on the fact that all the tables
> get opened (and then closed) while creating the tuple routing info.
> But in HEAD, now only the partitioned tables get opened. So need some
> changes in my patch.
>
> The partition walker related changes are going to be inapplicable once
> the other thread [1] commits the changes for expansion of inheritence
> in bound-order, but till then I would have to rebase the partition
> walker changes over HEAD.
>
> [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp
>

After the recent commit 30833ba154, partitions are now expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code, so in the attached patch I have
removed all the partition walker changes. But
RelationGetPartitionDispatchInfo() traverses in breadth-first order,
which is different from the update result rels order (because
inheritance expansion order is depth-first). So, in order to make the
tuple-routing-related leaf partitions appear in the same order as the
update result rels, we would have to make changes in
RelationGetPartitionDispatchInfo(), and I am not sure whether that is
going to be done as part of the thread "expanding inheritance in
partition bound order" [1]. For now, in the attached patch, I have
reverted back to the hash table method to find the leaf partitions in
the update result rels.

[1] https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

Thanks
-Amit Khandekar



Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Sun, Sep 3, 2017 at 5:10 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Thanks Dilip. I am working on rebasing the patch. Particularly, the
>> partition walker in my patch depended on the fact that all the tables
>> get opened (and then closed) while creating the tuple routing info.
>> But in HEAD, now only the partitioned tables get opened. So need some
>> changes in my patch.
>>
>> The partition walker related changes are going to be inapplicable once
>> the other thread [1] commits the changes for expansion of inheritence
>> in bound-order, but till then I would have to rebase the partition
>> walker changes over HEAD.
>>
>> [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp
>>
>
> After recent commit 30833ba154, now the partitions are expanded in
> depth-first order. It didn't seem worthwhile rebasing my partition
> walker changes onto the latest code. So in the attached patch, I have
> removed all the partition walker changes.
>

It seems you have forgotten to attach the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Sep 3, 2017 at 5:10 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> Thanks Dilip. I am working on rebasing the patch. Particularly, the
>>> partition walker in my patch depended on the fact that all the tables
>>> get opened (and then closed) while creating the tuple routing info.
>>> But in HEAD, now only the partitioned tables get opened. So need some
>>> changes in my patch.
>>>
>>> The partition walker related changes are going to be inapplicable once
>>> the other thread [1] commits the changes for expansion of inheritence
>>> in bound-order, but till then I would have to rebase the partition
>>> walker changes over HEAD.
>>>
>>> [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp
>>>
>>
>> After recent commit 30833ba154, now the partitions are expanded in
>> depth-first order. It didn't seem worthwhile rebasing my partition
>> walker changes onto the latest code. So in the attached patch, I have
>> removed all the partition walker changes.
>>
>
> It seems you have forgotten to attach the patch.

Oops sorry. Now attached.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Oops sorry. Now attached.

I have done some basic testing and an initial review of the patch.  I
have some comments/doubts.  I will continue the review.

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,

For passing an invalid ItemPointer we are using InvalidOid, which seems
a bit odd to me. Are we using a similar convention somewhere else? I
think it would be better to just pass 0.

------

- if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
- (event == TRIGGER_EVENT_UPDATE && update_old_table))
+ if (oldtup != NULL &&
+ ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+ (event == TRIGGER_EVENT_UPDATE && update_old_table)))
  {
  Tuplestorestate *old_tuplestore;

- Assert(oldtup != NULL);

Only in the TRIGGER_EVENT_UPDATE case is it possible for oldtup to be
NULL, so we have added an extra check for oldtup and removed the
Assert; but for TRIGGER_EVENT_DELETE we never expect it to be NULL.

Would it be better to put the Assert outside the condition check
(Assert(oldtup != NULL || event == TRIGGER_EVENT_UPDATE))?
Same for newtup.

I think we should also explain in comments why oldtup or newtup
can be NULL in the TRIGGER_EVENT_UPDATE case.

-------

+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>.

The above comment says that the AR UPDATE trigger is not fired, but the
code below calls ExecARUpdateTriggers():

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
+ NULL,
+ tuple,
+ NULL,
+ mtstate->mt_transition_capture);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Oops sorry. Now attached.
>
> I have done some basic testing and initial review of the patch.  I

Thanks for taking this up for review. Attached is the updated patch
v17, which covers the points below.

> + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
> + ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
>
> For passing invalid ItemPointer we are using InvalidOid, this seems
> bit odd to me
> are we using simmilar convention some other place? I think it would be better to
> just pass 0?

Yes, that's right. I replaced InvalidOid with NULL, since ItemPointer is a pointer.

>
> ------
>
> - if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
> - (event == TRIGGER_EVENT_UPDATE && update_old_table))
> + if (oldtup != NULL &&
> + ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
> + (event == TRIGGER_EVENT_UPDATE && update_old_table)))
>   {
>   Tuplestorestate *old_tuplestore;
>
> - Assert(oldtup != NULL);
>
> Only if TRIGGER_EVENT_UPDATE it is possible that oldtup can be NULL,
> so we have added an extra
> check for oldtup and removed the Assert, but if  TRIGGER_EVENT_DELETE
> we never expect it to be NULL.
>
> Is it better to put Assert outside the condition check (Assert(oldtup
> != NULL || event == TRIGGER_EVENT_UPDATE)) ?
> same for the newtup.
>
> I think we should also explain in comments about why oldtup or newtup
> can be NULL in case of if
> TRIGGER_EVENT_UPDATE

I have done all of the above. Added two separate Asserts, one for DELETE
and the other for INSERT.

>
> -------
>
> +    triggers affect the row being moved. As far as <literal>AFTER ROW</>
> +    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
> +    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
> +    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
> +    because the <command>UPDATE</command> has been converted to a
> +    <command>DELETE</command> and <command>INSERT</command>.
>
> Above comments says that ARUpdate trigger is not fired but below code call
> ARUpdateTrigger
>
> + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
> + ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
> + NULL,
> + tuple,
> + NULL,
> + mtstate->mt_transition_capture);

Actually, since transition tables came in, functions like
ExecARUpdateTriggers() and ExecARInsertTriggers() have the additional
purpose of capturing transition table rows, so that the images of the
tables are visible when statement triggers that refer to these
transition tables are fired. So in the above code, these functions only
capture rows; they do not add any event for firing ROW triggers.
AfterTriggerSaveEvent() returns without adding any event if it's
called only for transition capture. So even if UPDATE row triggers are
defined, they won't get fired in case of row movement, although the
updated rows would be captured if transition tables are referenced in
those triggers or in the statement triggers.
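
A quick way to see this (a sketch with invented names, describing the
behaviour expected with the patch rather than verified output):

CREATE TABLE pt (a int) PARTITION BY RANGE (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (0) TO (10);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (10) TO (20);
INSERT INTO pt VALUES (5);

CREATE FUNCTION note_update() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RAISE NOTICE 'AFTER UPDATE row trigger fired'; RETURN NULL; END; $$;

-- Row-level AFTER UPDATE trigger on the source leaf partition.
CREATE TRIGGER pt1_au AFTER UPDATE ON pt1
    FOR EACH ROW EXECUTE PROCEDURE note_update();

-- The row moves from pt1 to pt2: the AFTER UPDATE row trigger above does
-- not fire (the move runs as DELETE+INSERT), though the updated row would
-- still be captured if a transition table referenced it.
UPDATE pt SET a = 15 WHERE a = 5;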

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 3 September 2017 at 17:10, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

> After recent commit 30833ba154, now the partitions are expanded in
> depth-first order. It didn't seem worthwhile rebasing my partition
> walker changes onto the latest code. So in the attached patch, I have
> removed all the partition walker changes. But
> RelationGetPartitionDispatchInfo() traverses in breadth-first order,
> which is different than the update result rels order (because
> inheritance expansion order is depth-first). So, in order to make the
> tuple-routing-related leaf partitions in the same order as that of the
> update result rels, we would have to make changes in
> RelationGetPartitionDispatchInfo(), which I am not sure whether it is
> going to be done as part of the thread "expanding inheritance in
> partition bound order" [1]. For now, in the attached patch, I have
> reverted back to the hash table method to find the leaf partitions in
> the update result rels.
>
> [1] https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

As mentioned by Amit Langote in the above mail thread, he is going to
make changes so that RelationGetPartitionDispatchInfo() returns the
leaf partitions in depth-first order. Once that is done, I will
remove the hash table method for finding leaf partitions in the update
result rels, and instead use the earlier, more efficient method that
takes advantage of the fact that the update result rels and leaf
partitions are in the same order.

>
> Thanks
> -Amit Khandekar



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
Attached is the patch rebased on the latest HEAD.

Thanks
-Amit Khandekar


Attachment

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Sep 7, 2017 at 6:17 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 3 September 2017 at 17:10, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> After recent commit 30833ba154, now the partitions are expanded in
>> depth-first order. It didn't seem worthwhile rebasing my partition
>> walker changes onto the latest code. So in the attached patch, I have
>> removed all the partition walker changes. But
>> RelationGetPartitionDispatchInfo() traverses in breadth-first order,
>> which is different than the update result rels order (because
>> inheritance expansion order is depth-first). So, in order to make the
>> tuple-routing-related leaf partitions in the same order as that of the
>> update result rels, we would have to make changes in
>> RelationGetPartitionDispatchInfo(), which I am not sure whether it is
>> going to be done as part of the thread "expanding inheritance in
>> partition bound order" [1]. For now, in the attached patch, I have
>> reverted back to the hash table method to find the leaf partitions in
>> the update result rels.
>>
>> [1] https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com
>
> As mentioned by Amit Langote in the above mail thread, he is going to
> do changes for making RelationGetPartitionDispatchInfo() return the
> leaf partitions in depth-first order. Once that is done, I will then
> remove the hash table method for finding leaf partitions in update
> result rels, and instead use the earlier efficient method that takes
> advantage of the fact that update result rels and leaf partitions are
> in the same order.

Has he posted that patch yet?  I don't think I saw it, but maybe I
missed something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/09/08 18:57, Robert Haas wrote:
>> As mentioned by Amit Langote in the above mail thread, he is going to
>> do changes for making RelationGetPartitionDispatchInfo() return the
>> leaf partitions in depth-first order. Once that is done, I will then
>> remove the hash table method for finding leaf partitions in update
>> result rels, and instead use the earlier efficient method that takes
>> advantage of the fact that update result rels and leaf partitions are
>> in the same order.
> 
> Has he posted that patch yet?  I don't think I saw it, but maybe I
> missed something.

I will post on that thread in a moment.

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

From
amul sul
Date:
On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I think we can do this even without using an additional infomask bit.
>>> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>>> indicate such an update.
>>
>> Hmm.  How would that work?
>>
>
> We can pass a flag say row_moved (or require_row_movement) to
> heap_delete which will in turn set InvalidBlockId in ctid instead of
> setting it to self. Then the ExecUpdate needs to check for the same
> and return an error when heap_update is not successful (result !=
> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
> envisioning?


Attaching a WIP patch that incorporates the above logic, although I am yet to
check all the code for places which might be using ip_blkid.  I have a small
query here: do we need an error in the HeapTupleSelfUpdated case as well?

Note that the patch should be applied on top of Amit Khandekar's latest patch
(v17_rebased).

Regards,
Amul

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Kapila
Date:
On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>  On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >> I think we can do this even without using an additional infomask bit.
>> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> >> indicate such an update.
>> >
>> > Hmm.  How would that work?
>> >
>>
>> We can pass a flag say row_moved (or require_row_movement) to
>> heap_delete which will in turn set InvalidBlockId in ctid instead of
>> setting it to self. Then the ExecUpdate needs to check for the same
>> and return an error when heap_update is not successful (result !=
>> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
>> envisioning?
>>
>
> Attaching WIP patch incorporates the above logic, although I am yet to check
> all the code for places which might be using ip_blkid.  I have got a small
> query here,
> do we need an error on HeapTupleSelfUpdated case as well?
>

No, because that case is anyway a no-op (or an error, depending on whether
the tuple is updated/deleted by the same command or a later command).
Basically, even if the row wouldn't have been moved to another partition, we
would not have allowed the command to proceed with the update.  This handling
is to make commands fail rather than be a no-op in cases where otherwise
(when the tuple is not moved to another partition) the command would have
succeeded.
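
To illustrate the case being discussed at the SQL level (this is only a
sketch with invented names; the exact error message is not decided here):

CREATE TABLE pt (a int, b text) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
INSERT INTO pt VALUES (1, 'x');

-- Session 1:
BEGIN;
UPDATE pt SET a = 2 WHERE a = 1;    -- moves the row from pt1 to pt2

-- Session 2, while session 1 is still open:
UPDATE pt SET b = 'y' WHERE a = 1;  -- blocks on the old row version

-- Session 1:
COMMIT;

-- With ip_blkid set to InvalidBlockId by the delete half of the move,
-- session 2 would now raise an error instead of silently updating
-- nothing.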

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Actually, since transition tables came in, the functions like
> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
> purpose of capturing transition table rows, so that the images of the
> tables are visible when statement triggers are fired that refer to
> these transition tables. So in the above code, these functions only
> capture rows, they do not add any event for firing any ROW triggers.
> AfterTriggerSaveEvent() returns without adding any event if it's
> called only for transition capture. So even if UPDATE row triggers are
> defined, they won't get fired in case of row movement, although the
> updated rows would be captured if transition tables are referenced in
> these triggers or in the statement triggers.
>

Ok, then I have one more question.

With transition tables, we can only support statement-level triggers, and
for an UPDATE statement we are only going to execute the UPDATE
statement-level trigger. So is there any point in making transition table
entries for the DELETE/INSERT triggers, as those transition tables will
never be used? Or am I missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 11 September 2017 at 21:12, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>> Actually, since transition tables came in, the functions like
>> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
>> purpose of capturing transition table rows, so that the images of the
>> tables are visible when statement triggers are fired that refer to
>> these transition tables. So in the above code, these functions only
>> capture rows, they do not add any event for firing any ROW triggers.
>> AfterTriggerSaveEvent() returns without adding any event if it's
>> called only for transition capture. So even if UPDATE row triggers are
>> defined, they won't get fired in case of row movement, although the
>> updated rows would be captured if transition tables are referenced in
>> these triggers or in the statement triggers.
>>
>
> Ok then I have one more question,
>
> With transition table, we can only support statement level trigger

Yes, we don't support row triggers with transition tables if the table
is a partition.

> and for update
> statement, we are only going to execute UPDATE statement level
> trigger? so is there
> any point of making transition table entry for DELETE/INSERT trigger
> as those transition
> table will never be used.

But the statement level trigger function can refer to OLD TABLE and
NEW TABLE, which will contain all the OLD rows and NEW rows
respectively. So the updated rows of the partitions (including the
moved ones) need to be captured. So for OLD TABLE, we need to capture
the deleted row, and for NEW TABLE, we need to capture the inserted
row.

In the regression test update.sql, check how the statement trigger
trans_updatetrig prints all the updated rows, including the moved
ones.
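
A rough sketch of that behaviour (invented names; the actual test in
update.sql may differ):

CREATE TABLE pt (a int, b text) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
INSERT INTO pt VALUES (1, 'x');

CREATE FUNCTION dump_upd() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE 'OLD TABLE: % row(s), NEW TABLE: % row(s)',
    (SELECT count(*) FROM old_rows), (SELECT count(*) FROM new_rows);
  RETURN NULL;
END $$;

CREATE TRIGGER pt_upd_stmt AFTER UPDATE ON pt
REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
FOR EACH STATEMENT EXECUTE PROCEDURE dump_upd();

-- Even though the row moves from pt1 to pt2, both transition tables
-- should still show exactly one row for it.
UPDATE pt SET a = 2 WHERE a = 1;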


>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

> But the statement level trigger function can refer to OLD TABLE and
> NEW TABLE, which will contain all the OLD rows and NEW rows
> respectively. So the updated rows of the partitions (including the
> moved ones) need to be captured. So for OLD TABLE, we need to capture
> the deleted row, and for NEW TABLE, we need to capture the inserted
> row.

Yes, I agree.  So in ExecDelete, for the OLD TABLE we only need to call
ExecARUpdateTriggers, which will make the entry in the OLD TABLE only if
a transition table is there, and otherwise do nothing; I guess this part
already exists in your patch.  And we are also calling
ExecARDeleteTriggers, which I guess is to fire the ROW-LEVEL delete
trigger, and that is also fine.  What I don't understand is that if
there is no "ROW-LEVEL delete trigger" and there is only a "statement
level delete trigger" with a transition table, we are still making the
entry in the transition table of the delete trigger, and that will never
be used.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
>> But the statement level trigger function can refer to OLD TABLE and
>> NEW TABLE, which will contain all the OLD rows and NEW rows
>> respectively. So the updated rows of the partitions (including the
>> moved ones) need to be captured. So for OLD TABLE, we need to capture
>> the deleted row, and for NEW TABLE, we need to capture the inserted
>> row.
>
> Yes, I agree.  So in ExecDelete for OLD TABLE we only need to call
> ExecARUpdateTriggers which will make the entry in OLD TABLE only if
> transition table is there otherwise nothing and I guess this part
> already exists in your patch.  And, we are also calling
> ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete
> trigger and that is also fine.  What I don't understand is that if
> there is no "ROW- LEVEL delete trigger" and there is only a "statement
> level delete trigger" with transition table still we are making the
> entry in transition table of the delete trigger and that will never be
> used.

Hmm, ok, that might be happening, since we are calling
ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL,
and so the deleted tuple gets captured even when there is no UPDATE
statement trigger defined, which looks redundant. Will check this.
Thanks.

>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 8 September 2017 at 15:21, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is the patch rebased on latest HEAD.

The patch got bit-rotted again. The rebased version v17_rebased_2.patch
also has some scenarios added in update.sql that cover UPDATE row
movement from a non-default to the default partition and vice versa.
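
Just as an illustration of that kind of scenario (invented names; the
real test cases are in update.sql):

CREATE TABLE pt (a int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt_def PARTITION OF pt DEFAULT;

INSERT INTO pt VALUES (1);        -- stored in pt1
UPDATE pt SET a = 5 WHERE a = 1;  -- moves from pt1 to the default partition
UPDATE pt SET a = 1 WHERE a = 5;  -- moves back from the default partition to pt1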

>
> Thanks
> -Amit Khandekar



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
amul sul
Date:


On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote:
> > On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> >>
> >>  On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com>
> >> wrote:
> >> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> > wrote:
> >> >> I think we can do this even without using an additional infomask bit.
> >> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
> >> >> indicate such an update.
> >> >
> >> > Hmm.  How would that work?
> >> >
> >>
> >> We can pass a flag say row_moved (or require_row_movement) to
> >> heap_delete which will in turn set InvalidBlockId in ctid instead of
> >> setting it to self. Then the ExecUpdate needs to check for the same
> >> and return an error when heap_update is not successful (result !=
> >> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
> >> envisioning?
> >>
> >
> > Attaching WIP patch incorporates the above logic, although I am yet to check
> > all the code for places which might be using ip_blkid.  I have got a small
> > query here,
> > do we need an error on HeapTupleSelfUpdated case as well?
> >
>
> No, because that case is anyway a no-op (or error depending on whether
> is updated/deleted by same command or later command).  Basically, even
> if the row wouldn't have been moved to another partition, we would not
> have allowed the command to proceed with the update.  This handling is
> to make commands fail rather than a no-op where otherwise (when the
> tuple is not moved to another partition) the command would have
> succeeded.

Thank you.

I've rebased the patch against Amit Khandekar's latest patch
(v17_rebased_2). Also added an ip_blkid validation check in the
heap_get_latest_tid(), rewrite_heap_tuple() & rewrite_heap_tuple()
functions, because an ItemPointerEquals() check alone is no longer
sufficient after this patch.

Regards,
Amul 

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>
>>> But the statement level trigger function can refer to OLD TABLE and
>>> NEW TABLE, which will contain all the OLD rows and NEW rows
>>> respectively. So the updated rows of the partitions (including the
>>> moved ones) need to be captured. So for OLD TABLE, we need to capture
>>> the deleted row, and for NEW TABLE, we need to capture the inserted
>>> row.
>>
>> Yes, I agree.  So in ExecDelete for OLD TABLE we only need to call
>> ExecARUpdateTriggers which will make the entry in OLD TABLE only if
>> transition table is there otherwise nothing and I guess this part
>> already exists in your patch.  And, we are also calling
>> ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete
>> trigger and that is also fine.  What I don't understand is that if
>> there is no "ROW- LEVEL delete trigger" and there is only a "statement
>> level delete trigger" with transition table still we are making the
>> entry in transition table of the delete trigger and that will never be
>> used.
>
> Hmm, ok, that might be happening, since we are calling
> ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL,
> and so the deleted tuple gets captured even when there is no UPDATE
> statement trigger defined, which looks redundant. Will check this.
> Thanks.

I found out that, in the case where there is a DELETE statement trigger
using transition tables, it's not only an issue of redundancy; it's a
correctness issue. Since, for transition tables, both DELETE and UPDATE
use the same old-row tuplestore for capturing the OLD table, that table
gets duplicate rows: one from ExecARDeleteTriggers() and another from
ExecARUpdateTriggers(). In the presence of an INSERT statement trigger
using transition tables, the INSERT and UPDATE events have separate
tuplestores, so duplicate rows don't show up in the UPDATE NEW table.
But, nevertheless, we need to prevent NEW rows from being collected in
the INSERT event tuplestore, and capture the NEW rows only in the
UPDATE event tuplestore.

In the attached patch, we first call ExecARUpdateTriggers(), and while
doing that, we first save the info that a NEW row is already captured
(mtstate->mt_transition_capture->tcs_update_old_table == true). If it is
captured, we pass a NULL transition_capture pointer to
ExecARDeleteTriggers() (and ExecARInsertTriggers()) so that it does not
again capture an extra row.

Modified a testcase in update.sql by including DELETE statement
trigger that uses transition tables.
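
The shape of that scenario, as a sketch with invented names (the real
test is in update.sql):

CREATE TABLE pt (a int, b text) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
INSERT INTO pt VALUES (1, 'x');

CREATE FUNCTION count_old() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RAISE NOTICE '% OLD TABLE rows: %', TG_OP,
   (SELECT count(*) FROM old_rows); RETURN NULL; END $$;

-- Both a DELETE and an UPDATE statement trigger reference an OLD table.
CREATE TRIGGER pt_del_stmt AFTER DELETE ON pt
REFERENCING OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE PROCEDURE count_old();
CREATE TRIGGER pt_upd_stmt AFTER UPDATE ON pt
REFERENCING OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE PROCEDURE count_old();

-- Row movement: without the fix described above, the single moved row
-- could show up twice in the UPDATE trigger's OLD table, once via
-- ExecARDeleteTriggers() and once via ExecARUpdateTriggers().
UPDATE pt SET a = 2 WHERE a = 1;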

-------

After commit 77b6b5e9c, the leaf partitions returned by
RelationGetPartitionDispatchInfo() and the UPDATE result rels are in the
same order. Earlier, because of the different orders, I had to use a
hash table to search for the leaf partitions in the update result rels,
so that we could re-use the per-subplan UPDATE ResultRelInfos. Now that
the order is the same, in the attached patch I have removed the hash
table method and instead iterate over the leaf partition OIDs, at the
same time shifting a position over the per-subplan result rels whenever
the result rel at that position is found to be present in the leaf
partitions list.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>
> I found out that, in case when there is a DELETE statement trigger
> using transition tables, it's not only an issue of redundancy; it's a
> correctness issue. Since for transition tables both DELETE and UPDATE
> use the same old row tuplestore for capturing OLD table, that table
> gets duplicate rows: one from ExecARDeleteTriggers() and another from
> ExecARUpdateTriggers(). In presence of INSERT statement trigger using
> transition tables, both INSERT and UPDATE events have separate
> tuplestore, so duplicate rows don't show up in the UPDATE NEW table.
> But, nevertheless, we need to prevent NEW rows to be collected in the
> INSERT event tuplestore, and capture the NEW rows only in the UPDATE
> event tuplestore.
>
> In the attached patch, we first call ExecARUpdateTriggers(), and while
> doing that, we first save the info that a NEW row is already captured
> (mtstate->mt_transition_capture->tcs_update_old_table == true). If it
> captured, we pass NULL transition_capture pointer to
> ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not
> again capture an extra row.
>
> Modified a testcase in update.sql by including DELETE statement
> trigger that uses transition tables.

Ok, this fix looks correct to me, I will review the latest patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Mon, Sep 18, 2017 at 11:29 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>>

>> In the attached patch, we first call ExecARUpdateTriggers(), and while
>> doing that, we first save the info that a NEW row is already captured
>> (mtstate->mt_transition_capture->tcs_update_old_table == true). If it
>> captured, we pass NULL transition_capture pointer to
>> ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not
>> again capture an extra row.
>>
>> Modified a testcase in update.sql by including DELETE statement
>> trigger that uses transition tables.
>
> Ok, this fix looks correct to me, I will review the latest patch.

Please find few more comments.

+ * in which they appear in the PartitionDesc. Also, extract the
+ * partition key columns of the root partitioned table. Those of the
+ * child partitions would be collected during recursive expansion.
*/
+ pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, lockmode, &root->append_rel_list,
+   &all_part_cols,

pcinfo->all_part_cols is only used in the case of update; I think we can
call pull_child_partition_columns only if the rte has updateCols?

@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo

Index parent_relid;
List   *child_rels;
+ Bitmapset  *all_part_cols;
} PartitionedChildRelInfo;

I might be missing something, but do we really need to store
all_part_cols inside the PartitionedChildRelInfo? Can't we call
pull_child_partition_columns directly inside inheritance_planner
whenever we realize that the RTE has some updateCols and we want to
check the overlap?

+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+  Relation *parent);
+

I don't see these functions being used anywhere?

+typedef struct PartitionWalker
+{
+ List   *rels_list;
+ ListCell   *cur_cell;
+} PartitionWalker;
+

Same as above



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Please find few more comments.
>
> + * in which they appear in the PartitionDesc. Also, extract the
> + * partition key columns of the root partitioned table. Those of the
> + * child partitions would be collected during recursive expansion.
> */
> + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
>   lockmode, &root->append_rel_list,
> +   &all_part_cols,
>
> pcinfo->all_part_cols is only used in case of update, I think we can
> call pull_child_partition_columns
> only if rte has updateCols?
>
> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
>
> Index parent_relid;
> List   *child_rels;
> + Bitmapset  *all_part_cols;
> } PartitionedChildRelInfo;
>
> I might be missing something, but do we really need to store
> all_part_cols inside the
> PartitionedChildRelInfo,  can't we call pull_child_partition_columns
> directly inside
> inheritance_planner whenever we realize that RTE has some updateCols
> and we want to
> check the overlap?

One extra thing we will have to do is open and close the partitioned
rels again. The idea was that we collect the bitmap *while* we are
already expanding through the tree and the rel is open. Will check if
this is feasible.

>
> +extern void partition_walker_init(PartitionWalker *walker, Relation rel);
> +extern Relation partition_walker_next(PartitionWalker *walker,
> +  Relation *parent);
> +
>
> I don't see these functions are used anywhere?
>
> +typedef struct PartitionWalker
> +{
> + List   *rels_list;
> + ListCell   *cur_cell;
> +} PartitionWalker;
> +
>
> Same as above

Yes, this was left out from the earlier implementation. Will have this
removed in the next updated patch.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Dilip Kumar
Date:
On Tue, Sep 19, 2017 at 1:15 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Please find few more comments.
>>
>> + * in which they appear in the PartitionDesc. Also, extract the
>> + * partition key columns of the root partitioned table. Those of the
>> + * child partitions would be collected during recursive expansion.
>> */
>> + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
>> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
>>   lockmode, &root->append_rel_list,
>> +   &all_part_cols,
>>
>> pcinfo->all_part_cols is only used in case of update, I think we can
>> call pull_child_partition_columns
>> only if rte has updateCols?
>>
>> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
>>
>> Index parent_relid;
>> List   *child_rels;
>> + Bitmapset  *all_part_cols;
>> } PartitionedChildRelInfo;
>>
>> I might be missing something, but do we really need to store
>> all_part_cols inside the
>> PartitionedChildRelInfo,  can't we call pull_child_partition_columns
>> directly inside
>> inheritance_planner whenever we realize that RTE has some updateCols
>> and we want to
>> check the overlap?
>
> One thing  we will have to do extra is : Open and close the
> partitioned rels again. The idea was that we collect the bitmap
> *while* we are already expanding through the tree and the rel is open.
> Will check if this is feasible.

Oh, I see.
>
>>
>> +extern void partition_walker_init(PartitionWalker *walker, Relation rel);
>> +extern Relation partition_walker_next(PartitionWalker *walker,
>> +  Relation *parent);
>> +
>>
>> I don't see these functions are used anywhere?
>>
>> +typedef struct PartitionWalker
>> +{
>> + List   *rels_list;
>> + ListCell   *cur_cell;
>> +} PartitionWalker;
>> +
>>
>> Same as above
>
> Yes, this was left out from the earlier implementation. Will have this
> removed in the next updated patch.
Ok. I will continue my review. Thanks.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> [ new patch ]

This already fails to apply again.  In general, I think it would be a
good idea to break this up into a patch series rather than have it as
a single patch.  That would allow some bits to be applied earlier.
The main patch will probably still be pretty big, but at least we can
make things a little easier by getting some of the cleanup out of the
way first.  Specific suggestions on what to break out below.

If the changes to rewriteManip.c are a marginal efficiency hack and
nothing more, then let's commit this part separately before the main
patch.  If they're necessary for correctness, then please add a
comment explaining why they are necessary.

There appears to be no reason why the definitions of
GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
header file as a result of this patch.  GetUpdatedColumns() was
previously defined in trigger.c and execMain.c and, post-patch, is
still called from only those files.  GetInsertedColumns() was, and
remains, called only from execMain.c.  If this were needed I'd suggest
doing it as a preparatory patch before the main patch, but it seems we
don't need it at all.

If I understand correctly, the reason for changing mt_partitions from
ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
ResultRelInfos for a partitioning hierarchy are allocated as a single
chunk, but we can't do that and also reuse the ResultRelInfos created
during InitPlan.  I suggest that we do this as a preparatory patch.
Someone could argue that this is going the wrong way and that we ought
to instead make InitPlan() create all of the necessary
ResultRelInfos, but it seems to me that eventually we probably want to
allow setting up ResultRelInfos on the fly for only those partitions
for which we end up needing them.  The code already has some provision
for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
I don't think it's this patch's job to try to apply that kind of thing
to tuple routing, but it seems like in the long run if we're inserting
1 tuple into a table with 1000 partitions, or performing 1 update that
touches the partition key, it would be best not to create
ResultRelInfos for all 1000 partitions just for fun.  But this sort of
thing seems much easier if mt_partitions is ResultRelInfo ** rather
than ResultRelInfo *, so I think what you have is going in the right
direction.

+         * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+         * does not belong to subplans, then it already matches the root tuple
+         * descriptor; although there is no such known scenario where this
+         * could happen.
+         */
+        if (rootResultRelInfo != resultRelInfo &&
+            mtstate->mt_persubplan_childparent_maps != NULL &&
+            resultRelInfo >= mtstate->resultRelInfo &&
+            resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+        {
+            int         map_index = resultRelInfo - mtstate->resultRelInfo;

I think you should Assert() that it doesn't happen instead of assuming
that it doesn't happen.   IOW, remove the last two branches of the
if-condition, and then add an Assert() that map_index is sane.

It is not clear to me why we need both mt_perleaf_childparent_maps and
mt_persubplan_childparent_maps.

+         * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+         * update-partition-key operation, then this function is also called
+         * separately for DELETE and INSERT to capture transition table rows.
+         * In such case, either old tuple or new tuple can be NULL.

That seems pretty strange.  I don't quite see how that's going to work
correctly.  I'm skeptical about the idea that the old tuple capture
and new tuple capture can safely happen at different times.

I wonder if we should have a reloption controlling whether
update-tuple routing is enabled.  I wonder how much more expensive it
is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
1000 subpartitions with this patch than without, assuming the update
succeeds in both cases.
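
Just to make that comparison concrete, a benchmark setup along these
lines could be used (the table definition and bounds below are only a
sketch, not something taken from the patch):

CREATE TABLE root (a int, b int) PARTITION BY RANGE (a);
DO $$
BEGIN
  FOR i IN 0..999 LOOP
    EXECUTE format(
      'CREATE TABLE root_%s PARTITION OF root FOR VALUES FROM (%s) TO (%s)',
      i, i * 1000 + 1, (i + 1) * 1000 + 1);
  END LOOP;
END $$;
INSERT INTO root SELECT g, g FROM generate_series(1, 1000000) g;

\timing on
-- The partition key is touched but the row stays in the same partition,
-- so this succeeds both with and without the patch.
UPDATE root SET a = a + 1 WHERE a = 1;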

I also wonder how efficient this implementation is in general.  For
example, suppose you make a table with 1000 partitions each containing
10,000 tuples and update them all, and consider three scenarios: (1)
partition key not updated but all tuples subject to non-HOT updates
because the updated column is indexed, (2) partition key updated but
no tuple movement required as a result, (3) partition key updated and
all tuples move to a different partition.  It would be useful to
compare the times, and also to look at perf profiles and see if there
are any obvious sources of inefficiency that can be squeezed out.  It
wouldn't surprise me if tuple movement is a bit slower than the other
scenarios, but it would be nice to know how much slower and whether
the bottlenecks are anything that we can easily fix.  I don't feel
that the performance constraints for this patch should be too tight,
because we're talking about being able to do something vs. not being
able to do it at all, but we should try to have it not stink.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> [ new patch ]
>
> This already fails to apply again.  In general, I think it would be a
> good idea to break this up into a patch series rather than have it as
> a single patch.  That would allow some bits to be applied earlier.
> The main patch will probably still be pretty big, but at least we can
> make things a little easier by getting some of the cleanup out of the
> way first.  Specific suggestions on what to break out below.
>
> If the changes to rewriteManip.c are a marginal efficiency hack and
> nothing more, then let's commit this part separately before the main
> patch.  If they're necessary for correctness, then please add a
> comment explaining why they are necessary.

Ok. Yes, I just wanted to avoid two ConvertRowtypeExpr nodes one over
the other. But that was not causing any correctness issue. Will
extract these changes into a separate patch.

>
> There appears to be no reason why the definitions of
> GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
> header file as a result of this patch.  GetUpdatedColumns() was
> previously defined in trigger.c and execMain.c and, post-patch, is
> still called from only those files.  GetInsertedColumns() was, and
> remains, called only from execMain.c.  If this were needed I'd suggest
> doing it as a preparatory patch before the main patch, but it seems we
> don't need it at all.

In earlier versions of the patch, these functions were used in
nodeModifyTable.c as well. Now that those calls are not there in this
file, I will revert the changes that moved the definitions into the
header file.

>
> If I understand correctly, the reason for changing mt_partitions from
> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
> ResultRelInfos for a partitioning hierarchy are allocated as a single
> chunk, but we can't do that and also reuse the ResultRelInfos created
> during InitPlan.  I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that
patch the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of per-subplan update result rels?

> Someone could argue that this is going the wrong way and that we ought
> to instead make InitPlan() create all of the necessarily
> ResultRelInfos, but it seems to me that eventually we probably want to
> allow setting up ResultRelInfos on the fly for only those partitions
> for which we end up needing them.  The code already has some provision
> for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
> I don't think it's this patch's job to try to apply that kind of thing
> to tuple routing, but it seems like in the long run if we're inserting
> 1 tuple into a table with 1000 partitions, or performing 1 update that
> touches the partition key, it would be best not to create
> ResultRelInfos for all 1000 partitions just for fun.

Yes makes sense.

>  But this sort of
> thing seems much easier of mt_partitions is ResultRelInfo ** rather
> than ResultRelInfo *, so I think what you have is going in the right
> direction.

Ok.

>
> +         * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
> +         * does not belong to subplans, then it already matches the root tuple
> +         * descriptor; although there is no such known scenario where this
> +         * could happen.
> +         */
> +        if (rootResultRelInfo != resultRelInfo &&
> +            mtstate->mt_persubplan_childparent_maps != NULL &&
> +            resultRelInfo >= mtstate->resultRelInfo &&
> +            resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
> +        {
> +            int         map_index = resultRelInfo - mtstate->resultRelInfo;
>
> I think you should Assert() that it doesn't happen instead of assuming
> that it doesn't happen.   IOW, remove the last two branches of the
> if-condition, and then add an Assert() that map_index is sane.

Ok.

>
> It is not clear to me why we need both mt_perleaf_childparent_maps and
> mt_persubplan_childparent_maps.

mt_perleaf_childparent_maps :
This is used for converting transition-captured
inserted/modified/deleted tuples from leaf to root partition, because
we need to have all the ROWS in the root partition attribute order.
This map is used only for tuples that are routed from root to leaf
partition during INSERT, or when tuples are routed from one leaf
partition to another leaf partition during update row movement. For
both of these operations, we need per-leaf maps, because during tuple
conversion, the source relation is among the mtstate->mt_partitions.

mt_persubplan_childparent_maps :
This is used at two places :

1. After an ExecUpdate() updates a row of a per-subplan update result
rel, we need to capture the tuple, so again we need to convert to the
root partition. Here, the source table is a per-subplan update result
rel; so we need to have per-subplan conversion map array. So after
UPDATE finishes with one update result rel,
node->mt_transition_capture->tcs_map shifts to the next element in the
mt_persubplan_childparent_maps array:

ExecModifyTable()
{
    ....
    node->mt_transition_capture->tcs_map =
        node->mt_persubplan_childparent_maps[node->mt_whichplan];
    ....
}

2. In ExecInsert(), if it is part of update tuple routing, we need to
convert the tuple from the update result rel to the root partition. So
it re-uses this same conversion map.

Now, instead of these two maps having separate allocations, I have
arranged for the per-leaf map array to re-use the mapping allocations
made by per-subplan array elements, similar to how we are doing for
re-using the ResultRelInfos. But still the arrays themselves need to
be separate.


>
> +         * Note: if the UPDATE is converted into a DELETE+INSERT as part of
> +         * update-partition-key operation, then this function is also called
> +         * separately for DELETE and INSERT to capture transition table rows.
> +         * In such case, either old tuple or new tuple can be NULL.
>
> That seems pretty strange.  I don't quite see how that's going to work
> correctly.  I'm skeptical about the idea that the old tuple capture
> and new tuple capture can safely happen at different times.

Actually the tuple capture involves just adding the tuple into the
correct tuplestore for a particular event. There is no trigger event
added for tuple capture. Calling ExecARUpdateTriggers() with either
newtuple NULL or tupleid Invalid makes sure that it does not do
anything other than transition capture :

@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
        /* If transition tables are the only reason we're here, return. */
        if (trigdesc == NULL ||
            (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
            (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-           (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+           (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+           (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
            return;

Even if we imagine a single place or a single function that we could
call to do the OLD and NEW row capture, still the end result is going
to be the same : OLD row would go into
mtstate->mt_transition_capture->tcs_old_tuplestore, and NEW row would
end up in mtstate->mt_transition_capture->tcs_update_tuplestore. Note
that these are common tuple stores for all the partitions of the
partition tree.

(Actually I am still rebasing my patch over the recent changes where
tcs_update_tuplestore no longer exists; instead we need to use
transition_capture->tcs_private->new_tuplestore.)

When we access the OLD and NEW tables from an UPDATE trigger, there is
no longer a correlation as to which row of the OLD TABLE corresponds to
which row of the NEW TABLE for a given updated row. So, at exactly which
point the OLD row and NEW row get captured into their respective
tuplestores, and in which order, is not important.

Whereas, for the usual per ROW triggers, it is critical that the
trigger event has both the OLD and NEW row together in the same
trigger event, since they need to be both accessible in the same
trigger function.

Doing the OLD and NEW table row capture separately is essential,
because the DELETE and INSERT happen on different tables, so we are
not even sure whether the insert is going to happen (thanks to
triggers on partitions, if any). If the insert is skipped, we should
not capture that tuple.
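
As an illustration of that last point (invented names; just a sketch of
the behaviour being described):

CREATE TABLE pt (a int, b text) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
INSERT INTO pt VALUES (1, 'x');

-- A BEFORE INSERT row trigger on the destination partition cancels the
-- insert half of the row movement.
CREATE FUNCTION skip_insert() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RETURN NULL; END $$;
CREATE TRIGGER pt2_skip BEFORE INSERT ON pt2
FOR EACH ROW EXECUTE PROCEDURE skip_insert();

-- The row is deleted from pt1 but never lands in pt2, so no NEW row
-- should be captured for the UPDATE's NEW TABLE in this case.
UPDATE pt SET a = 2 WHERE a = 1;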


>
> I wonder if we should have a reloption controlling whether
> update-tuple routing is enabled.  I wonder how much more expensive it
> is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
> 1000 subpartitions with this patch than without, assuming the update
> succeeds in both cases.

You mean to check how much the patch slows things down for existing
updates involving no row movement? And accordingly have a reloption to
disable the logic that slows things down?

>
> I also wonder how efficient this implementation is in general.  For
> example, suppose you make a table with 1000 partitions each containing
> 10,000 tuples and update them all, and consider three scenarios: (1)
> partition key not updated but all tuples subject to non-HOT updates
> because the updated column is indexed, (2) partition key updated but
> no tuple movement required as a result, (3) partition key updated and
> all tuples move to a different partition.  It would be useful to
> compare the times, and also to look at perf profiles and see if there
> are any obvious sources of inefficiency that can be squeezed out.  It
> wouldn't surprise me if tuple movement is a bit slower than the other
> scenarios, but it would be nice to know how much slower and whether
> the bottlenecks are anything that we can easily fix.

Ok yeah that would be helpful to remove any unnecessary slowness that
may have been caused due to the patch; will do.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
amul sul
Date:
On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >> [ new patch ]

  86 -           (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
  87 +           (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
  88 +           (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
  89             return;
  90     }


Only one of oldtup and newtup will be valid at a time.  Can we improve
this check accordingly?

For e.g.:
(event == TRIGGER_EVENT_UPDATE && (HeapTupleIsValid(oldtup) ^ ItemPointerIsValid(newtup)))


 247 
 248 +   /*
 249 +    * EDB: In case this is part of update tuple routing, put this row into the
 250 +    * transition NEW TABLE if we are capturing transition tables. We need to
 251 +    * do this separately for DELETE and INSERT because they happen on
 252 +    * different tables.
 253 +    */
 254 +   if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
 255 +       ExecARUpdateTriggers(estate, resultRelInfo, NULL,
 256 +                    NULL,
 257 +                    tuple,
 258 +                    NULL,
 259 +                    mtstate->mt_transition_capture);
 260 +
 261     list_free(recheckIndexes);

 267 
 268 +   /*
 269 +    * EDB: In case this is part of update tuple routing, put this row into the
 270 +    * transition OLD TABLE if we are capturing transition tables. We need to
 271 +    * do this separately for DELETE and INSERT because they happen on
 272 +    * different tables.
 273 +    */
 274 +   if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
 275 +       ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
 276 +                    oldtuple,
 277 +                    NULL,
 278 +                    NULL,
 279 +                    mtstate->mt_transition_capture);
 280 +

Initially, I wondered why we can't have the above code right after
ExecInsert() & ExecDelete() in ExecUpdate(), respectively.

We can do that for ExecDelete() but not easily in the ExecInsert() case,
because ExecInsert() internally searches for the correct partition's
resultRelInfo for the insert, and before returning to ExecUpdate() the
resultRelInfo is restored to the old one.  That's why the current logic
seems reasonable for now.  Is there anything that we can do?

Regards,
Amul


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
I have extracted a couple of changes into preparatory patches, as
explained below :

On 20 September 2017 at 21:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> [ new patch ]
>>
>> This already fails to apply again.  In general, I think it would be a
>> good idea to break this up into a patch series rather than have it as
>> a single patch.  That would allow some bits to be applied earlier.
>> The main patch will probably still be pretty big, but at least we can
>> make things a little easier by getting some of the cleanup out of the
>> way first.  Specific suggestions on what to break out below.
>>
>> If the changes to rewriteManip.c are a marginal efficiency hack and
>> nothing more, then let's commit this part separately before the main
>> patch.  If they're necessary for correctness, then please add a
>> comment explaining why they are necessary.
>
> Ok. Yes, just wanted to avoid two ConvertRowtypeExpr nodes one over
> the other. But that was not causing any correctness issue. Will
> extract these changes into separate patch.

The patch for the above change is :
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch

>
>>
>> There appears to be no reason why the definitions of
>> GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
>> header file as a result of this patch.  GetUpdatedColumns() was
>> previously defined in trigger.c and execMain.c and, post-patch, is
>> still called from only those files.  GetInsertedColumns() was, and
>> remains, called only from execMain.c.  If this were needed I'd suggest
>> doing it as a preparatory patch before the main patch, but it seems we
>> don't need it at all.
>
> In earlier versions of the patch, these functions were used in
> nodeModifyTable.c as well. Now that those calls are not there in this
> file, I will revert back the changes done for moving the definitions
> into header file.

Did the above, and included it in the attached revised patch
update-partition-key_v19.patch.


>
>>
>> If I understand correctly, the reason for changing mt_partitions from
>> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
>> ResultRelInfos for a partitioning hierarchy are allocated as a single
>> chunk, but we can't do that and also reuse the ResultRelInfos created
>> during InitPlan.  I suggest that we do this as a preparatory patch.
>
> Ok, will prepare a separate patch. Do you mean to include in that
> patch the changes I did in ExecSetupPartitionTupleRouting() that
> re-use the ResultRelInfo structures of per-subplan update result rels
> ?

Above changes are in attached
0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch.


Patches are to be applied in this order :

0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
update-partition-key_v19.patch

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 21 September 2017 at 19:52, amul sul <sulamul@gmail.com> wrote:
> On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
> wrote:
>>
>> On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
>> > wrote:
>> >> [ new patch ]
>
>
>   86 -           (event == TRIGGER_EVENT_UPDATE &&
> !trigdesc->trig_update_after_row))
>   87 +           (event == TRIGGER_EVENT_UPDATE &&
> !trigdesc->trig_update_after_row) ||
>   88 +           (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup
> == NULL)))
>   89             return;
>   90     }
>
>
> Either of oldtup or newtup will be valid at a time & vice versa.  Can we
> improve
> this check accordingly?
>
> For e.g.:
> (event == TRIGGER_EVENT_UPDATE && )(HeapTupleIsValid(oldtup) ^
> ItemPointerIsValid(newtup)))))

Ok, I will be doing this as below :
-  (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

At other places in the function, oldtup and newtup are checked for
NULL, so to be consistent, I haven't used HeapTupleIsValid.

Actually, it won't happen that both oldtup and newtup are NULL ... in
either of delete, insert, or update, but I haven't added an Assert for
this, because that has been true even on HEAD.

Will include the above minor change in the next patch when more changes come in.

>
>
>  247
>  248 +   /*
>  249 +    * EDB: In case this is part of update tuple routing, put this row
> into the
>  250 +    * transition NEW TABLE if we are capturing transition tables. We
> need to
>  251 +    * do this separately for DELETE and INSERT because they happen on
>  252 +    * different tables.
>  253 +    */
>  254 +   if (mtstate->operation == CMD_UPDATE &&
> mtstate->mt_transition_capture)
>  255 +       ExecARUpdateTriggers(estate, resultRelInfo, NULL,
>  256 +                    NULL,
>  257 +                    tuple,
>  258 +                    NULL,
>  259 +                    mtstate->mt_transition_capture);
>  260 +
>  261     list_free(recheckIndexes);
>
>  267
>  268 +   /*
>  269 +    * EDB: In case this is part of update tuple routing, put this row
> into the
>  270 +    * transition OLD TABLE if we are capturing transition tables. We
> need to
>  271 +    * do this separately for DELETE and INSERT because they happen on
>  272 +    * different tables.
>  273 +    */
>  274 +   if (mtstate->operation == CMD_UPDATE &&
> mtstate->mt_transition_capture)
>  275 +       ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
>  276 +                    oldtuple,
>  277 +                    NULL,
>  278 +                    NULL,
>  279 +                    mtstate->mt_transition_capture);
>  280 +
>
> Initially, I wondered that why can't we have above code right after
> ExecInsert() & ExecIDelete() in ExecUpdate respectively?
>
> We can do that for ExecIDelete() but not easily in the ExecInsert() case,
> because ExecInsert() internally searches the correct partition's
> resultRelInfo
> for an insert and before returning to ExecUpdate resultRelInfo is restored
> to the old one.  That's why current logic seems to be reasonable for now.
> Is there anything that we can do?

Yes, resultRelInfo is different when we return from ExecInsert().
Also, I think the trigger and transition capture should be done
immediately after the rows are inserted. This is true for the existing
code also. Furthermore, there is a dependency between
ExecARUpdateTriggers() and ExecARInsertTriggers(): transition_capture
is passed as NULL if we already captured the tuple in
ExecARUpdateTriggers(). It looks simpler to do all this at a single
place.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
Below are some performance figures. Overall, there does not appear to
be a noticeable difference between partition-key updates with and
without row movement (which is surprising), or between
non-partition-key updates with and without the patch.

All the values are in milliseconds.

Configuration :

shared_buffers = 8GB
maintenance_work_mem = 4GB
synchronous_commit = off
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '
max_wal_size = 5GB
max_connections = 200

The attached files were used to create a partition tree made up of 16
partitioned tables, each containing 125 partitions. First half of the
2000 partitions are filled with 10 million rows. Update row movement
moves the data to the other half of the partitions.

gen.sql : Creates the partitions.
insert.data : This data file is uploaded here [1]. Used "COPY ptab
from '$PWD/insert.data' "
index.sql : Optionally, Create index on column d.

The schema looks like this :

CREATE TABLE ptab (a date, b int, c int, d int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 7501)
PARTITION BY range (c);
    CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
    for values from (1) to (81);
    CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
    for values from (81) to (161);
..........
..........
CREATE TABLE ptab_1_2 PARTITION OF ptab
for values from ('1900-01-01', 7501) to ('1900-01-01', 15001)
PARTITION BY range (c);
..........
..........

On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:
> I wonder how much more expensive it
> is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
> 1000 subpartitions with this patch than without, assuming the update
> succeeds in both cases.

UPDATE query used : UPDATE ptab set d = d + 1 where d = 1; -- where d
is not a partition key of any of the partitions.
This query updates 8 rows out of 10 million rows.
With HEAD  : 2953.691 , 2862.298 , 2855.286 , 2835.879 (avg : 2876)
With Patch : 2933.719 , 2832.463 , 2749.979 , 2820.416 (avg : 2834)
(All the values are in milliseconds.)

> suppose you make a table with 1000 partitions each containing
> 10,000 tuples and update them all, and consider three scenarios: (1)
> partition key not updated but all tuples subject to non-HOT updates
> because the updated column is indexed, (2) partition key updated but
> no tuple movement required as a result, (3) partition key updated and
> all tuples move to a different partition.

Note that the following figures are not perfectly repeatable; they keep
varying from run to run. For example, even though the partition-key
update without row movement appears to have taken a bit more time with
the patch than with HEAD, a new set of test runs might even end up the
other way round.

NPK  : 42089 (patch)
NPKI : 81593 (patch)
PK   : 45250 (patch) , 44944 (HEAD)
PKR  : 46701 (patch)

The above figures are in milliseconds. The explanations of the above
short-forms :

NPK :
Update of column that is not a partition-key.
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

NPKI :
Update of column that is not a partition-key. And this column is
indexed (Used attached file index.sql).
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

PK :
Update of partition key, but row movement does not occur. There are no
indexed columns.
UPDATE query used : UPDATE ptab set a = a + '1 hour'::interval ;

PKR :
Update of partition key, with all rows moved to other partitions.
There are no indexed columns.
UPDATE query used : UPDATE ptab set a = a + '2 years'::interval ;


[1] https://drive.google.com/open?id=0B_YJCqIAxKjeN3hMXzdDejlNYmlpWVJpaU9mWUhFRVhXTG5Z

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] UPDATE of partition key

From
amul sul
Date:
On Wed, Sep 13, 2017 at 4:24 PM, amul sul <sulamul@gmail.com> wrote:
>
>
> On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote:
>> > On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >>
>> >>  On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com>
>> >> wrote:
>> >> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila
>> >> > <amit.kapila16@gmail.com>
>> >> > wrote:
>> >> >> I think we can do this even without using an additional infomask
>> >> >> bit.
>> >> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> >> >> indicate such an update.
>> >> >
>> >> > Hmm.  How would that work?
>> >> >
>> >>
>> >> We can pass a flag say row_moved (or require_row_movement) to
>> >> heap_delete which will in turn set InvalidBlockId in ctid instead of
>> >> setting it to self. Then the ExecUpdate needs to check for the same
>> >> and return an error when heap_update is not successful (result !=
>> >> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
>> >> envisioning?
>> >>
>> >
>> > Attaching WIP patch incorporates the above logic, although I am yet to
>> > check
>> > all the code for places which might be using ip_blkid.  I have got a
>> > small
>> > query here,
>> > do we need an error on HeapTupleSelfUpdated case as well?
>> >
>>
>> No, because that case is anyway a no-op (or error depending on whether
>> it is updated/deleted by the same command or a later command).  Basically, even
>> if the row wouldn't have been moved to another partition, we would not
>> have allowed the command to proceed with the update.  This handling is
>> to make commands fail rather than a no-op where otherwise (when the
>> tuple is not moved to another partition) the command would have
>> succeeded.
>>
> Thank you.
>
> I've rebased patch against  Amit Khandekar's latest patch (v17_rebased_2).
> Also, added ip_blkid validation check in heap_get_latest_tid(), rewrite_heap_tuple()
> & rewrite_heap_tuple() function, because only ItemPointerEquals() check is no
> longer sufficient after this patch.

FYI, I have posted this patch in a separate thread :
https://postgr.es/m/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com

Regards,
Amul


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> The patch for the above change is :
> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch

Thinking about this a little more, I'm wondering about how this case
arises.  I think that for this patch to avoid multiple conversions,
we'd have to be calling map_variable_attnos on an expression and then
calling map_variable_attnos on that expression again.

>>> If I understand correctly, the reason for changing mt_partitions from
>>> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
>>> ResultRelInfos for a partitioning hierarchy are allocated as a single
>>> chunk, but we can't do that and also reuse the ResultRelInfos created
>>> during InitPlan.  I suggest that we do this as a preparatory patch.
>>
>> Ok, will prepare a separate patch. Do you mean to include in that
>> patch the changes I did in ExecSetupPartitionTupleRouting() that
>> re-use the ResultRelInfo structures of per-subplan update result rels
>> ?
>
> Above changes are in attached
> 0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch.

No, not all of those changes.  Just the adjustments to make
ModifyTableState's mt_partitions be of type ResultRelInfo ** rather
than ResultRelInfo *, and anything closely related to that.  Not, for
example, the num_update_rri stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 30 September 2017 at 01:26, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 29, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> The patch for the above change is :
>>> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
>>
>> Thinking about this a little more, I'm wondering about how this case
>> arises.  I think that for this patch to avoid multiple conversions,
>> we'd have to be calling map_variable_attnos on an expression and then
>> calling map_variable_attnos on that expression again.

We are not calling map_variable_attnos() twice. The first time it is
called, the ConvertRowtypeExpr node is already present if the expression
is a whole-row var; that node was already added by
adjust_appendrel_attrs(). So the two conversions are done by two
different functions.

For ConvertRowtypeExpr, map_variable_attnos_mutator() recursively
calls map_variable_attnos_mutator() for ConvertRowtypeExpr->arg with
coerced_var=true.

>
> I guess I didn't quite finish this thought, sorry.  Maybe it's
> obvious, but the point I was going for is: why would we do that, vs.
> just converting once?

The first time a ConvertRowtypeExpr node gets added to the expression is
when adjust_appendrel_attrs() is called for each of the child tables.
Here, for each child table, when the parent parse tree is converted into
the child parse tree, the whole-row var (in the RETURNING or WITH CHECK
OPTIONS expr) is wrapped with a ConvertRowtypeExpr, so the child parse
tree (or the child WCO expr) has this ConvertRowtypeExpr node.

The second time this node is added is during update-tuple-routing in
ExecInitModifyTable(), when map_partition_varattnos() is called for
each of the partitions to convert from the first per-subplan
RETURNING/WCO expression to the RETURNING/WCO expression belonging to
the leaf partition. This second conversion happens for the leaf
partitions which are not already present in per-subplan UPDATE result
rels.

So the first conversion is from parent to child while building
per-subplan plans, and the second is from first per-subplan child to
another child for building expressions of the leaf partitions.

So suppose the root partitioned table RETURNING expression is a whole
row var wr(r) where r is its composite type representing the root
table type.
Then, one of its UPDATE child tables will have its RETURNING
expression converted like this :
wr(r)  ===>  CRE(r) -> wr(c1)
where CRE(r) represents ConvertRowtypeExpr of result type r, which has
its arg pointing to wr(c1) which is a whole row var of composite type
c1 for the child table c1. So this node converts from composite type
of child table to composite type of root table.

Now, when the second conversion occurs for the leaf partition (i.e.
during update-tuple-routing), the conversion looks like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> wr(c2)
But without the 0002*ConvertRowtypeExpr*.patch, the conversion would have
looked like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> CRE(c1) -> wr(c2)
In short, we omit the intermediate CRE(c1) node.
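
(Just to make the notation concrete : the whole-row var being discussed
arises from something like the query below, using the ptab example from
earlier in this thread purely as an illustration. "RETURNING ptab.*"
produces a whole-row Var of ptab's rowtype, and each child's RETURNING
list has its own whole-row Var wrapped in a ConvertRowtypeExpr so that
the result is still of the root rowtype.)

UPDATE ptab SET b = b + 1 RETURNING ptab.*;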


While writing this down, I observed that after multi-level partition
tree expansion was introduced, the child table expressions are not
converted directly from the root. Instead, they are converted from
their immediate parent. So there is a chain of conversions : to leaf
from its parent, to that parent from its parent, and so on from the
root. Effectively, during the first conversion, there are that many
ConvertRowtypeExpr nodes one above the other already present in the
UPDATE result rel expressions. But my patch handles the optimization
only for the leaf partition conversions.

If the expression already has a CRE : CRE(rr) -> wr(r)
Parent-to-child conversion : CRE(rr) -> wr(r)  ===>  CRE(rr) -> CRE(r) -> wr(c1)
With the patch : CRE(rr) -> CRE(r) -> wr(c1)  ===>  CRE(rr) -> CRE(r) -> wr(c2)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Tue, Oct 3, 2017 at 8:16 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> While writing this down, I observed that after multi-level partition
> tree expansion was introduced, the child table expressions are not
> converted directly from the root. Instead, they are converted from
> their immediate parent. So there is a chain of conversions : to leaf
> from its parent, to that parent from its parent, and so on from the
> root. Effectively, during the first conversion, there are that many
> ConvertRowtypeExpr nodes one above the other already present in the
> UPDATE result rel expressions. But my patch handles the optimization
> only for the leaf partition conversions.
>
> If the expression already has a CRE : CRE(rr) -> wr(r)
> Parent-to-child conversion : CRE(rr) -> wr(r)  ===>  CRE(rr) -> CRE(r) -> wr(c1)
> With the patch : CRE(rr) -> CRE(r) -> wr(c1)  ===>  CRE(rr) -> CRE(r) -> wr(c2)

Maybe adjust_appendrel_attrs() should have a similar provision for
avoiding extra ConvertRowTypeExpr nodes?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 30 September 2017 at 01:23, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> If I understand correctly, the reason for changing mt_partitions from
>>>> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
>>>> ResultRelInfos for a partitioning hierarchy are allocated as a single
>>>> chunk, but we can't do that and also reuse the ResultRelInfos created
>>>> during InitPlan.  I suggest that we do this as a preparatory patch.
>>>
>>> Ok, will prepare a separate patch. Do you mean to include in that
>>> patch the changes I did in ExecSetupPartitionTupleRouting() that
>>> re-use the ResultRelInfo structures of per-subplan update result rels
>>> ?
>>
>> Above changes are in attached
>> 0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch.
>
> No, not all of those changes.  Just the adjustments to make
> ModifyTableState's mt_partitions be of type ResultRelInfo ** rather
> than ResultRelInfo *, and anything closely related to that.  Not, for
> example, the num_update_rri stuff.

Ok. Attached is the patch modified to have changes only to handle
array of ResultRelInfo * instead of array of ResultRelInfo.

-------

On 4 October 2017 at 01:08, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 3, 2017 at 8:16 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> While writing this down, I observed that after multi-level partition
>> tree expansion was introduced, the child table expressions are not
>> converted directly from the root. Instead, they are converted from
>> their immediate parent. So there is a chain of conversions : to leaf
>> from its parent, to that parent from its parent, and so on from the
>> root. Effectively, during the first conversion, there are that many
>> ConvertRowtypeExpr nodes one above the other already present in the
>> UPDATE result rel expressions. But my patch handles the optimization
>> only for the leaf partition conversions.
>
> Maybe adjust_appendrel_attrs() should have a similar provision for
> avoiding extra ConvertRowTypeExpr nodes?

Yeah, I think we should be able to do that. Will check.

------

On 19 September 2017 at 13:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Please find few more comments.
>>
>> + * in which they appear in the PartitionDesc. Also, extract the
>> + * partition key columns of the root partitioned table. Those of the
>> + * child partitions would be collected during recursive expansion.
>> */
>> + pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
>> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
>>   lockmode, &root->append_rel_list,
>> +   &all_part_cols,
>>
>> pcinfo->all_part_cols is only used in case of update, I think we can
>> call pull_child_partition_columns
>> only if rte has updateCols?
>>
>> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
>>
>> Index parent_relid;
>> List   *child_rels;
>> + Bitmapset  *all_part_cols;
>> } PartitionedChildRelInfo;
>>
>> I might be missing something, but do we really need to store
>> all_part_cols inside the
>> PartitionedChildRelInfo,  can't we call pull_child_partition_columns
>> directly inside
>> inheritance_planner whenever we realize that RTE has some updateCols
>> and we want to
>> check the overlap?
>
> One thing  we will have to do extra is : Open and close the
> partitioned rels again. The idea was that we collect the bitmap
> *while* we are already expanding through the tree and the rel is open.
> Will check if this is feasible.

While giving more thought to this suggestion of Dilip's, I found out
that pull_child_partition_columns() is getting called with child_rel
and its immediate parent. That means it maps the child rel attributes
to its immediate parent. If that immediate parent is not the root
partrel, then the conversion is not sufficient: we need to map child
rel attnos to root partrel attnos. So for a partition tree with 3 or
more levels, with the bottom partitioned rel having a different att
ordering than the root, this will not work.

Before the commit that enabled recursive multi-level partition tree
expansion, pull_child_partition_columns() was always getting called
with child_rel and the root rel. So this issue crept in when I rebased
over that commit, overlooking the fact that the parent rel is now the
immediate parent, not the root parent.

Anyway, I think Dilip's suggestion makes sense : we can do the
finding-all-part-cols work separately in inheritance_planner() using
the partitioned_rels handle. Re-opening the partitioned tables should
be cheap, because they have already been opened earlier, so they are
available in the relcache. So I did this as he suggested, using a new
function get_all_partition_cols(). While doing that, I have ensured that
we use the root rel to map all the child rel attnos. So the above issue
is fixed now.
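
Roughly, the shape of the new function is like below. (This is only a
sketch of the approach described above; the locking details and the exact
helper signature here are my shorthand, the real code is in the attached
patch.)

static void
get_all_partition_cols(List *rtables, Index root_rti,
                       List *partitioned_rels, Bitmapset **all_part_cols)
{
    Oid         root_relid = getrelid(root_rti, rtables);
    Relation    root_rel = heap_open(root_relid, NoLock);
    ListCell   *lc;

    foreach(lc, partitioned_rels)
    {
        Index       rti = lfirst_int(lc);
        Relation    part_rel = heap_open(getrelid(rti, rtables), NoLock);

        /* Collect this table's key columns, mapped to root-table attnos. */
        pull_child_partition_columns(part_rel, root_rel, all_part_cols);
        heap_close(part_rel, NoLock);
    }
    heap_close(root_rel, NoLock);
}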

Also added test scenarios that test the above issue. Namely, made the
partition tree 3 levels deep, and added some specific scenarios where it
used to wrongly error out without trying to move the tuple, because it
determined that the partition key was not updated.
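
For illustration, the kind of scenario covered looks like below (a sketch
only, not the exact regression test added) : the mid-level partitioned
table is created with a different column order and then attached, so its
attnos differ from the root's, and an update of its key column through
the root must still be detected as a partition-key update :

CREATE TABLE root_tab (a int, b int, c int) PARTITION BY RANGE (a);

-- Mid-level table created with a different column order and then attached,
-- so its attribute numbers do not line up with root_tab's.
CREATE TABLE mid_tab (c int, b int, a int) PARTITION BY RANGE (b);
ALTER TABLE root_tab ATTACH PARTITION mid_tab FOR VALUES FROM (1) TO (100);

CREATE TABLE leaf1 PARTITION OF mid_tab FOR VALUES FROM (1) TO (50);
CREATE TABLE leaf2 PARTITION OF mid_tab FOR VALUES FROM (50) TO (100);

INSERT INTO root_tab VALUES (1, 10, 0);

-- b is mid_tab's partition key; mapping b's attno back to the root must
-- recognize this as a partition-key update, so the row moves from leaf1
-- to leaf2 instead of erroring out.
UPDATE root_tab SET b = 60 WHERE b = 10;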


---------

Though we re-use the update result rels, the WCO and Returning
expressions were not getting re-used from those update result rels.
This check was missing :
@@ -2059,7 +2380,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
for (i = 0; i < mtstate->mt_num_partitions; i++)
{
   Relation        partrel;
   List       *rlist;

   resultRelInfo = mtstate->mt_partitions[i];
+
+ /*
+ * If we are referring to a resultRelInfo from one of the update
+ * result rels, that result rel would already have a returningList
+ * built.
+ */
+ if (resultRelInfo->ri_projectReturning)
+    continue;
+
  partrel = resultRelInfo->ri_RelationDesc;

Added this check in the patch.

----------

On 22 September 2017 at 16:13, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 21 September 2017 at 19:52, amul sul <sulamul@gmail.com> wrote:
>>
>>   86 -           (event == TRIGGER_EVENT_UPDATE &&
>> !trigdesc->trig_update_after_row))
>>   87 +           (event == TRIGGER_EVENT_UPDATE &&
>> !trigdesc->trig_update_after_row) ||
>>   88 +           (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup
>> == NULL)))
>>   89             return;
>>   90     }
>>
>>
>> Either of oldtup or newtup will be valid at a time & vice versa.  Can we
>> improve
>> this check accordingly?
>>
>> For e.g.:
>> (event == TRIGGER_EVENT_UPDATE && (HeapTupleIsValid(oldtup) ^
>> ItemPointerIsValid(newtup))))
>
>Ok, I will be doing this as below :
>-  (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
>+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

Have done this in the attached patch.

--------

Attached are these patches :

Preparatory patches :
0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
Main patch :
update-partition-key_v20.patch

Thanks
-Amit Khandekar

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Oct 4, 2017 at 9:51 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Preparatory patches :
> 0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch
> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
> Main patch :
> update-partition-key_v20.patch

Committed 0001 with a few tweaks and 0002 unchanged.  Please check
whether everything looks OK.

Is anybody still reviewing the main patch here?  (It would be good if
the answer is "yes".)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/10/13 6:18, Robert Haas wrote:
> Is anybody still reviewing the main patch here?  (It would be good if
> the answer is "yes".)

I am going to try to look at the latest version over the weekend and early
next week.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Hi Amit.

On 2017/10/04 22:51, Amit Khandekar wrote:
> Main patch :
> update-partition-key_v20.patch

Guess you're already working on it but the patch needs a rebase.  A couple
of hunks in the patch to execMain.c and nodeModifyTable.c fail.

Meanwhile a few comments:

+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+                            Relation rel,
+                            Relation parent)

Nitpick: don't we normally list the output argument(s) at the end?  Also,
"bitmapset" could be renamed to something that conveys what it contains?

+       if (partattno != 0)
+           child_keycols =
+               bms_add_member(child_keycols,
+                              partattno -
FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this, although I'm not saying it is bad, just different
from other places.

+ * the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition
+ *  due to partition-key change, then this function is called once when the
+ *  row is deleted (to capture OLD row), and once when the row is inserted
+ *  to another partition (to capture NEW row). This is done separately
+ *  because DELETE and INSERT happen on different tables.

Extra space at the beginning from the 2nd line onwards.

+           (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
== NULL))))

Is there some reason why a bitwise operator is used here?

+ * 'update_rri' has the UPDATE per-subplan result rels.

Could you explain why they are being received as input here?

+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *     with one entry for every leaf partition (required to convert input
+ *     tuple based on the root table's rowtype to a leaf partition's rowtype
+ *     after tuple routing is done)

Could this be named leaf_tupconv_maps, maybe?  It perhaps makes clear that
they are maps needed for "tuple conversion".  And the other field holding
the reverse map as leaf_rev_tupconv_maps.  Either that or use underscores
to separate words, but then it gets too long I guess.


+       tuple = ConvertPartitionTupleSlot(mtstate,
+                                         mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+                                         ...);

The 2nd line here seems to have gone over 80 characters.

ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
interface.  I guess it could simply have the following interface:

static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
                                       HeapTuple tuple, bool is_update);
 

And figure out, based on the value of is_update, which map to use and
which slot to set *p_new_slot to (what is now "new_slot" argument).
You're getting mtstate here anyway, which contains all the information you
need here.  It seems better to make that (selecting which map and which
slot) part of the function's implementation if we're having this function
at all, imho.  Maybe I'm missing some details there, but my point still
remains that we should try to put more logic in that function instead of
it just do the mechanical tuple conversion.

+         * We have already checked partition constraints above, so skip them
+         * below.

How about: ", so skip checking here."?


ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
try to reuse the per-subplan child-to-parent map as per-leaf
child-to-parent map could be simplified a bit.  I mean the following code:

+    /*
+     * But for Updates, we can share the per-subplan maps with the per-leaf
+     * maps.
+     */
+    update_rri_index = 0;
+    update_rri = mtstate->resultRelInfo;
+    if (mtstate->mt_nplans > 0)
+        cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);

-        /* Choose the right set of partitions */
-        if (mtstate->mt_partition_dispatch_info != NULL)
+    for (i = 0; i < numResultRelInfos; ++i)
+    {
<snip>

How about (pseudo-code):
 j = 0;
 for (i = 0; i < n_leaf_parts; i++)
 {
     if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
     {
         leaf_childparent_map[i] = subplan_childparent_map[j];
         j++;
     }
     else
     {
         leaf_childparent_map[i] = new map
     }
 }
 

I think the above would also be useful in ExecSetupPartitionTupleRouting()
where you've added similar code to try to reuse per-subplan ResultRelInfos.


In ExecInitModifyTable(), can we try to minimize the number of places
where update_tuple_routing_needed is being set.  Currently, it's being set
in 3 places:

+    bool        update_tuple_routing_needed = node->part_cols_updated;

&

+        /*
+         * If this is an UPDATE and a BEFORE UPDATE trigger is present,
we may
+         * need to do update tuple routing.
+         */
+        if (resultRelInfo->ri_TrigDesc &&
+            resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+            operation == CMD_UPDATE)
+            update_tuple_routing_needed = true;

&

+    /* Decide whether we need to perform update tuple routing. */
+    if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+        update_tuple_routing_needed = false;


In the following:
        ExecSetupPartitionTupleRouting(rel,
+                                       (operation == CMD_UPDATE ?
+                                        mtstate->resultRelInfo : NULL),
+                                       (operation == CMD_UPDATE ? nplans
: 0),

Can the second parameter be made to not span two lines?  It was a bit hard
for me to see that there two new parameters.

+     * Construct mapping from each of the resultRelInfo attnos to the root

Maybe it's odd to say "resultRelInfo attno", because it's really the
underlying partition whose attnos we're talking about as being possibly
different from the root table's attnos.

+     * descriptor. In such case we need to convert tuples to the root

s/In such case/In such a case,/

By the way, I've seen in a number of places that the patch calls "root
table" a partition.  Not just in comments, but also a variable appears to
be given a name which contains rootpartition.  I can see only one instance
where root is called a partition in the existing source code, but it seems
to have been introduced only recently:

allpaths.c:1333:         * A root partition will already have a

+         * qual for each partition. Note that, if there are SubPlans in
there,
+         * they all end up attached to the one parent Plan node.

The sentence starting with "Note that, " is a bit unclear.

+        Assert(update_tuple_routing_needed ||
+               (operation == CMD_INSERT &&
+                list_length(node->withCheckOptionLists) == 1 &&
+                mtstate->mt_nplans == 1));

The comment I complained about above is perhaps about this Assert.

-            List       *mapped_wcoList;
+            List       *mappedWco;

Not sure why this rename.  After this rename, it's now inconsistent with
the code above which handles non-partitioned case, which still calls it
wcoList.  Maybe, because you introduced firstWco and then this line:

+        firstWco = linitial(node->withCheckOptionLists);

but note that each member of node->withCheckOptionLists is also a list, so
the original naming.   Also, further below, you're assigning mappedWco to
a List * field.

+            resultRelInfo->ri_WithCheckOptions = mappedWco;


Comments on the optimizer changes:

+get_all_partition_cols(List *rtables,

Did you mean rtable?

get_all_partition_cols() seems to go over the rtable as many times as
there are partitioned tables in the tree.  Is there a way to do this work
somewhere else?  Maybe when the partitioned_rels list is built in the
first place.  But that would require us to make changes to extract
partition columns in some place (prepunion.c) where it's hard to justify
why it's being done there at all.

+        get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+                             partitioned_rels, &all_part_cols);

Two more spaces needed on the 2nd line.

+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendent.
+ *

Dead comment?  Aha, so here's where all_part_cols was being set before...

+    TupleTableSlot *mt_rootpartition_tuple_slot;

I guess I was complaining about this field where you call root a
partition.  Maybe, mt_root_tuple_slot would suffice.


Thanks again for working on this.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Amit.
>
> On 2017/10/04 22:51, Amit Khandekar wrote:
>> Main patch :
>> update-partition-key_v20.patch
>
> Guess you're already working on it but the patch needs a rebase.  A couple
> of hunks in the patch to execMain.c and nodeModifyTable.c fail.

Thanks for taking up this review, Amit. Attached is the rebased
version. Will get back on your review comments with an updated patch soon.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

>
> + * the transition tuplestores can be built. Furthermore, if the transition
> + *  capture is happening for UPDATEd rows being moved to another partition
> + *  due to partition-key change, then this function is called once when the
> + *  row is deleted (to capture OLD row), and once when the row is inserted
> + *  to another partition (to capture NEW row). This is done separately
> + *  because DELETE and INSERT happen on different tables.
>
> Extra space at the beginning from the 2nd line onwards.

Just observed that the existing comment lines use tabs instead of
spaces. So I have now used tabs for the new comments too, instead of
multiple spaces.

>
> +           (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
> == NULL))))
>
> Is there some reason why a bitwise operator is used here?

That exact condition means that the function is called for transition
capture for updated rows being moved to another partition. For this
scenario, either the oldtup or the newtup is NULL. I wanted to exactly
capture that condition there. I think the bitwise operator is more
user-friendly in emphasizing the point that it is indeed an "either a
or b, not both" condition.
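
(As a standalone illustration -- this is not PostgreSQL code -- the
condition is true exactly when one of the two tuple pointers is NULL,
i.e. when the call is for one half of a row movement :)

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { int dummy; } Tuple;   /* stand-in for HeapTuple */

/* True only when exactly one of oldtup/newtup is NULL. */
static bool
exactly_one_null(const Tuple *oldtup, const Tuple *newtup)
{
    return (oldtup == NULL) ^ (newtup == NULL);
}

int
main(void)
{
    Tuple   t = {0};

    printf("%d\n", exactly_one_null(&t, NULL));   /* 1: DELETE half of a row movement */
    printf("%d\n", exactly_one_null(NULL, &t));   /* 1: INSERT half of a row movement */
    printf("%d\n", exactly_one_null(&t, &t));     /* 0: ordinary UPDATE */
    printf("%d\n", exactly_one_null(NULL, NULL)); /* 0: neither tuple supplied */
    return 0;
}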

>
> + * 'update_rri' has the UPDATE per-subplan result rels.
>
> Could you explain why they are being received as input here?

Added the explanation in the comments.

>
> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
> + *     with one entry for every leaf partition (required to convert input
> + *     tuple based on the root table's rowtype to a leaf partition's rowtype
> + *     after tuple routing is done)
>
> Could this be named leaf_tupconv_maps, maybe?  It perhaps makes clear that
> they are maps needed for "tuple conversion".  And the other field holding
> the reverse map as leaf_rev_tupconv_maps.  Either that or use underscores
> to separate words, but then it gets too long I guess.

In the master branch, this param is now already there with the name
"tup_conv_maps". In the rebased version in the earlier mail, I haven't
changed it again. I think "tup_conv_maps" looks clear enough.

>
>
> +       tuple = ConvertPartitionTupleSlot(mtstate,
> +                                         mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
> +                                         ...);
>
> The 2nd line here seems to have gone over 80 characters.
>
> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
> interface.  I guess it could simply have the following interface:
>
> static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
>                                        HeapTuple tuple, bool is_update);
>
> And figure out, based on the value of is_update, which map to use and
> which slot to set *p_new_slot to (what is now "new_slot" argument).
> You're getting mtstate here anyway, which contains all the information you
> need here.  It seems better to make that (selecting which map and which
> slot) part of the function's implementation if we're having this function
> at all, imho.  Maybe I'm missing some details there, but my point still
> remains that we should try to put more logic in that function instead of
> it just do the mechanical tuple conversion.

I tried to see how the interface would look if we do it that way. Here is
how the code looks :

static TupleTableSlot *
ConvertPartitionTupleSlot(ModifyTableState *mtstate,
                    bool for_update_tuple_routing,
                    int map_index,
                    HeapTuple *tuple,
                    TupleTableSlot *slot)
{
   TupleConversionMap   *map;
   TupleTableSlot *new_slot;

   if (for_update_tuple_routing)
   {
      map = mtstate->mt_persubplan_childparent_maps[map_index];
      new_slot = mtstate->mt_rootpartition_tuple_slot;
   }
   else
   {
      map = mtstate->mt_perleaf_parentchild_maps[map_index];
      new_slot = mtstate->mt_partition_tuple_slot;
   }

   if (!map)
      return slot;

   *tuple = do_convert_tuple(*tuple, map);

   /*
    * Change the partition tuple slot descriptor, as per converted tuple.
    */
   ExecSetSlotDescriptor(new_slot, map->outdesc);
   ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);

   return new_slot;
}

It looks like the interface does not simplify much, and on top of that,
we end up with more lines in that function. Also, the caller anyway has
to be aware of whether map_index is an index into the leaf partitions or
into the update subplans. So it is not as if the caller can avoid knowing
whether the mapping should be mt_persubplan_childparent_maps or
mt_perleaf_parentchild_maps.

>
> +         * We have already checked partition constraints above, so skip them
> +         * below.
>
> How about: ", so skip checking here."?

Ok I have made it this way :
* We have already checked partition constraints above, so skip
* checking them here.


>
>
> ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
> try to reuse the per-subplan child-to-parent map as per-leaf
> child-to-parent map could be simplified a bit.  I mean the following code:
>
> +    /*
> +     * But for Updates, we can share the per-subplan maps with the per-leaf
> +     * maps.
> +     */
> +    update_rri_index = 0;
> +    update_rri = mtstate->resultRelInfo;
> +    if (mtstate->mt_nplans > 0)
> +        cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
>
> -        /* Choose the right set of partitions */
> -        if (mtstate->mt_partition_dispatch_info != NULL)
> +    for (i = 0; i < numResultRelInfos; ++i)
> +    {
> <snip>
>
> How about (pseudo-code):
>
>  j = 0;
>  for (i = 0; i < n_leaf_parts; i++)
>  {
>      if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
>      {
>          leaf_childparent_map[i] = subplan_childparent_map[j];
>          j++;
>      }
>      else
>      {
>          leaf_childparent_map[i] = new map
>      }
>  }
>
> I think the above would also be useful in ExecSetupPartitionTupleRouting()
> where you've added similar code to try to reuse per-subplan ResultRelInfos.

Did something like that in the attached patch. Please have a look.
After we conclude on that, will do the same for
ExecSetupPartitionTupleRouting() as well.

>
>
> In ExecInitModifyTable(), can we try to minimize the number of places
> where update_tuple_routing_needed is being set.  Currently, it's being set
> in 3 places:

Will see if we can skip some checks (TODO).


> In the following:
>
>          ExecSetupPartitionTupleRouting(rel,
> +                                       (operation == CMD_UPDATE ?
> +                                        mtstate->resultRelInfo : NULL),
> +                                       (operation == CMD_UPDATE ? nplans
> : 0),
>
> Can the second parameter be made to not span two lines?  It was a bit hard
> for me to see that there two new parameters.

I think it is safe to just pass mtstate->resultRelInfo. Inside
ExecSetupPartitionTupleRouting() we should anyway check only the
nplans param (and not update_rri) to decide whether it is for insert
or update. So I did that.

>
> +     * Construct mapping from each of the resultRelInfo attnos to the root
>
> Maybe it's odd to say "resultRelInfo attno", because it's really the
> underlying partition whose attnos we're talking about as being possibly
> different from the root table's attnos.

Changed : resultRelInfo => partition

>
> +     * descriptor. In such case we need to convert tuples to the root
>
> s/In such case/In such a case,/

Done.

>
> By the way, I've seen in a number of places that the patch calls "root
> table" a partition.  Not just in comments, but also a variable appears to
> be given a name which contains rootpartition.  I can see only one instance
> where root is called a partition in the existing source code, but it seems
> to have been introduced only recently:
>
> allpaths.c:1333:                 * A root partition will already have a

Changed to either this :
root partition => root partitioned table
or this if we have to refer to it too often :
root partition => root

>
> +         * qual for each partition. Note that, if there are SubPlans in
> there,
> +         * they all end up attached to the one parent Plan node.
>
> The sentence starting with "Note that, " is a bit unclear.
>
> +        Assert(update_tuple_routing_needed ||
> +               (operation == CMD_INSERT &&
> +                list_length(node->withCheckOptionLists) == 1 &&
> +                mtstate->mt_nplans == 1));
>
> The comment I complained about above is perhaps about this Assert.
>
> -            List       *mapped_wcoList;
> +            List       *mappedWco;
>
> Not sure why this rename.  After this rename, it's now inconsistent with
> the code above which handles non-partitioned case, which still calls it
> wcoList.  Maybe, because you introduced firstWco and then this line:
>
> +        firstWco = linitial(node->withCheckOptionLists);
>
> but note that each member of node->withCheckOptionLists is also a list, so
> the original naming.   Also, further below, you're assigning mappedWco to
> a List * field.
>
> +            resultRelInfo->ri_WithCheckOptions = mappedWco;
>
>
> Comments on the optimizer changes:
>
> +get_all_partition_cols(List *rtables,
>
> Did you mean rtable?
>
>
> +        get_all_partition_cols(root->parse->rtable, top_parentRTindex,
> +                             partitioned_rels, &all_part_cols);
>
> Two more spaces needed on the 2nd line.
>
>
>
> +void
> +pull_child_partition_columns(Bitmapset **bitmapset,
> +                            Relation rel,
> +                            Relation parent)
>
> Nitpick: don't we normally list the output argument(s) at the end?  Also,
> "bitmapset" could be renamed to something that conveys what it contains?
>
> +       if (partattno != 0)
> +           child_keycols =
> +               bms_add_member(child_keycols,
> +                              partattno -
> FirstLowInvalidHeapAttributeNumber);
> +   }
> +   foreach(lc, partexprs)
> +   {
>
> Elsewhere (in quite a few places), we don't iterate over partexprs
> separately like this, although I'm not saying it is bad, just different
> from other places.
>
> get_all_partition_cols() seems to go over the rtable as many times as
> there are partitioned tables in the tree.  Is there a way to do this work
> somewhere else?  Maybe when the partitioned_rels list is built in the
> first place.  But that would require us to make changes to extract
> partition columns in some place (prepunion.c) where it's hard to justify
> why it's being done there at all.
>
>
> + * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
> + * of all partitioning columns used by the partitioned table or any
> + * descendent.
> + *
>
> Dead comment?  Aha, so here's where all_part_cols was being set before...
>
> +    TupleTableSlot *mt_rootpartition_tuple_slot;
>
> I guess I was complaining about this field where you call root a
> partition.  Maybe, mt_root_tuple_slot would suffice.

Will get back with the above comments (TODO)


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
Below I have addressed the remaining review comments :

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>
> In ExecInitModifyTable(), can we try to minimize the number of places
> where update_tuple_routing_needed is being set.  Currently, it's being set
> in 3 places:

I think the way it's done seems ok. For each resultRelInfo,
update_tuple_routing_needed is updated when that result rel has any of
its partition columns updated. And at that point, we don't have the rel
opened, so we can't check whether that rel is partitioned. So another
check is required outside the loop.

>
> +         * qual for each partition. Note that, if there are SubPlans in
> there,
> +         * they all end up attached to the one parent Plan node.
>
> The sentence starting with "Note that, " is a bit unclear.
>
> +        Assert(update_tuple_routing_needed ||
> +               (operation == CMD_INSERT &&
> +                list_length(node->withCheckOptionLists) == 1 &&
> +                mtstate->mt_nplans == 1));
>
> The comment I complained about above is perhaps about this Assert.

That is an existing comment. On HEAD, the "parent Plan" refers to
mtstate->mt_plans[0]. Now in the patch, for the parent node in
ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
parent plan refers to this mtstate node.

BTW, the reason I had changed the parent node to mtstate->ps is :
Other places in that code use mtstate->ps while initializing
expressions :

/*
* Build a projection for each result rel.
*/
   resultRelInfo->ri_projectReturning =
      ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
                              resultRelInfo->ri_RelationDesc->rd_att);

...........

/* build DO UPDATE WHERE clause expression */
if (node->onConflictWhere)
{
   ExprState  *qualexpr;

   qualexpr = ExecInitQual((List *) node->onConflictWhere,
    &mtstate->ps);
....
}

I think wherever we initialize expressions belonging to a plan, we
should use that plan as the parent. WithCheckOptions are fields of
ModifyTableState.

>
> -            List       *mapped_wcoList;
> +            List       *mappedWco;
>
> Not sure why this rename.  After this rename, it's now inconsistent with
> the code above which handles non-partitioned case, which still calls it
> wcoList.  Maybe, because you introduced firstWco and then this line:
>
> +        firstWco = linitial(node->withCheckOptionLists);
>
> but note that each member of node->withCheckOptionLists is also a list, so
> the original naming.   Also, further below, you're assigning mappedWco to
> a List * field.
>
> +            resultRelInfo->ri_WithCheckOptions = mappedWco;

Done. Reverted mappedWco to mapped_wcoList. And firstWco to first_wcoList.

>
>
> Comments on the optimizer changes:
>
> +get_all_partition_cols(List *rtables,
>
> Did you mean rtable?

I did mean rtables. It's a list of rtables.

>
>
> +        get_all_partition_cols(root->parse->rtable, top_parentRTindex,
> +                             partitioned_rels, &all_part_cols);
>
> Two more spaces needed on the 2nd line.

Done.

>
>
>
> +void
> +pull_child_partition_columns(Bitmapset **bitmapset,
> +                            Relation rel,
> +                            Relation parent)
>
> Nitpick: don't we normally list the output argument(s) at the end?

Agreed. Done.

> Also, "bitmapset" could be renamed to something that conveys what it contains?

Renamed it to partcols

>
> +       if (partattno != 0)
> +           child_keycols =
> +               bms_add_member(child_keycols,
> +                              partattno -
> FirstLowInvalidHeapAttributeNumber);
> +   }
> +   foreach(lc, partexprs)
> +   {
>
> Elsewhere (in quite a few places), we don't iterate over partexprs
> separately like this, although I'm not saying it is bad, just different
> from other places.

I think you are suggesting we do it the way it's done in
is_partition_attr(). Can you please point me to the other places that do
it this same way ? I couldn't find them.

>
> get_all_partition_cols() seems to go over the rtable as many times as
> there are partitioned tables in the tree.  Is there a way to do this work
> somewhere else?  Maybe when the partitioned_rels list is built in the
> first place.  But that would require us to make changes to extract
> partition columns in some place (prepunion.c) where it's hard to justify
> why it's being done there at all.

See below ...

>
>
> + * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
> + * of all partitioning columns used by the partitioned table or any
> + * descendent.
> + *
>
> Dead comment?

Removed.

> Aha, so here's where all_part_cols was being set before...

Yes, and we used to have a PartitionedChildRelInfo.all_part_cols field
for that. We used to populate it while traversing the partition tree in
expand_inherited_rtentry(). I agreed with Dilip's opinion that this would
unnecessarily add some processing even when the query is not a DML. And
also, we don't have to have PartitionedChildRelInfo.all_part_cols. For
the earlier implementation, check the v18 patch or earlier versions.

>
> +    TupleTableSlot *mt_rootpartition_tuple_slot;
>
> I guess I was complaining about this field where you call root a
> partition.  Maybe, mt_root_tuple_slot would suffice.

Done.

Attached v22 patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Hi Amit.

Thanks a lot for the updated patches, and sorry that I couldn't get to
looking at your emails sooner.  Note that I'm replying here to both of your
emails, but looking at only the latest v22 patch.

On 2017/10/24 0:15, Amit Khandekar wrote:
> On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> +           (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
>> == NULL))))
>>
>> Is there some reason why a bitwise operator is used here?
> 
> That exact condition means that the function is called for transition
> capture for updated rows being moved to another partition. For this
> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
> capture that condition there. I think the bitwise operator is more
> user-friendly in emphasizing the point that it is indeed an "either a
> or b, not both" condition.

I see.  In that case, since this patch adds the new condition, a note
about it in the comment just above would be good, because the situation
you describe here seems to arise only during update-tuple-routing, IIUC.

>> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
>> + *     with one entry for every leaf partition (required to convert input
>> + *     tuple based on the root table's rowtype to a leaf partition's rowtype
>> + *     after tuple routing is done)
>>
>> Could this be named leaf_tupconv_maps, maybe?  It perhaps makes clear that
>> they are maps needed for "tuple conversion".  And the other field holding
>> the reverse map as leaf_rev_tupconv_maps.  Either that or use underscores
>> to separate words, but then it gets too long I guess.
> 
> In master branch, now this param is already there with the name
> "tup_conv_maps". In the rebased version in the earlier mail, I haven't
> again changed it. I think "tup_conv_maps" looks clear enough.

OK.

In the latest patch:

+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ *      instead of allocating new ones while generating the array of all leaf
+ *      partition result rels.

Instead of:

"These are re-used instead of allocating new ones while generating the
array of all leaf partition result rels."

how about:

"There is no need to allocate a new ResultRellInfo entry for leaf
partitions for which one already exists in this array"

>> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
>> interface.  I guess it could simply have the following interface:
>>
>> static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
>>                                        HeapTuple tuple, bool is_update);
>>
>> And figure out, based on the value of is_update, which map to use and
>> which slot to set *p_new_slot to (what is now "new_slot" argument).
>> You're getting mtstate here anyway, which contains all the information you
>> need here.  It seems better to make that (selecting which map and which
>> slot) part of the function's implementation if we're having this function
>> at all, imho.  Maybe I'm missing some details there, but my point still
>> remains that we should try to put more logic in that function instead of
>> it just do the mechanical tuple conversion.
> 
> I tried to see how the interface would look if we do that way. Here is
> how the code looks :
> 
> static TupleTableSlot *
> ConvertPartitionTupleSlot(ModifyTableState *mtstate,
>                     bool for_update_tuple_routing,
>                     int map_index,
>                     HeapTuple *tuple,
>                     TupleTableSlot *slot)
> {
>    TupleConversionMap   *map;
>    TupleTableSlot *new_slot;
> 
>    if (for_update_tuple_routing)
>    {
>       map = mtstate->mt_persubplan_childparent_maps[map_index];
>       new_slot = mtstate->mt_rootpartition_tuple_slot;
>    }
>    else
>    {
>       map = mtstate->mt_perleaf_parentchild_maps[map_index];
>       new_slot = mtstate->mt_partition_tuple_slot;
>    }
> 
>    if (!map)
>       return slot;
> 
>    *tuple = do_convert_tuple(*tuple, map);
> 
>    /*
>     * Change the partition tuple slot descriptor, as per converted tuple.
>     */
>    ExecSetSlotDescriptor(new_slot, map->outdesc);
>    ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);
> 
>    return new_slot;
> }
> 
> It looks like the interface does not much simplify, and above that, we
> have more number of lines in that function. Also, the caller anyway
> has to be aware whether the map_index is the index into the leaf
> partitions or the update subplans. So it is not like the caller does
> not have to be aware about whether the mapping should be
> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.

Hmm, I think we should try to make it so that the caller doesn't have to
be aware of that.  And by caller I guess you mean ExecInsert(), which,
IMHO, is not a place where we should try to introduce a lot of new logic
specific to update tuple routing.  ISTM, ModifyTableState now has one too
many TupleConversionMap pointer arrays after the patch, creating the need
to choose from them in the first place.  AIUI -

* mt_perleaf_parentchild_maps:
  - each entry is a map to convert root parent's tuples to a given leaf
    partition's format
  - used to be called mt_partition_tupconv_maps and is needed when
    tuple-routing is in use; for both INSERT and UPDATE with tuple-routing
  - as many entries in the array as there are leaf partitions and stored
    in the partition bound order

* mt_perleaf_childparent_maps:
  - each entry is a map to convert a leaf partition's tuples to the root
    parent's format
  - newly added by this patch and seems to be needed for UPDATE with
    tuple-routing for two needs: 1. tuple-routing should start with a
    tuple in root parent format whereas the tuple received is in leaf
    partition format when ExecInsert() is called for update-tuple-routing
    (by ExecUpdate), 2. after tuple-routing, we must capture the tuple
    inserted into the partition in the transition tuplestore which accepts
    tuples in root parent's format
  - as many entries in the array as there are leaf partitions and stored
    in the partition bound order

* mt_persubplan_childparent_maps:
  - each entry is a map to convert a child table's tuples to the root
    parent's format
  - used to be called mt_transition_tupconv_maps and needed for converting
    child tuples to the root parent's format when storing them in the
    transition tuplestore which accepts tuples in root parent's format
  - as many entries in the array as there are sub-plans in mt_plans and
    stored in either the partition bound order or unknown order (the
    latter in the regular inheritance case)

I think we could combine the last two into one.  The only apparent reason
for them to be separate seems to be that the subplan array might contain
fewer entries than the per-leaf array, and ExecInsert() has only enough
information to calculate the offset of a map in the per-subplan array.
That is, the resultRelInfo of the leaf partition that ExecInsert starts
with in the update-tuple-routing case comes from the mtstate->resultRelInfo
array, which contains only mt_nplans entries.  So, if we only have the
array with entries for *all* partitions, it's hard to get the offset of
the map to use in that array.

I suggest we don't add a new map array and a significant amount of new
code to initialize the same and to implement the logic to choose the
correct array to get the map from.  Instead, we could simply add an array
of integers with mt_nplans entries.  Each entry is an offset of a given
sub-plan in the array containing entries of something for *all*
partitions.  Since we are teaching ExecSetupPartitionTupleRouting() to
reuse ResultRelInfos from mtstate->resultRelInfo, there is a suitable
place to construct such an array.  Let's say the array is called
mt_subplan_partition_offsets[].  Let ExecSetupPartitionTupleRouting() also
initialize the parent-to-partition maps for *all* partitions, in the
update-tuple-routing case.  Then add a quick-return check in
ExecSetupTransitionCaptureState() to see if the map has already been set
by ExecSetupPartitionTupleRouting().  Since we're using the same map for
two purposes, we could rename mt_transition_tupconv_maps to something that
doesn't bind it to its use only for transition tuple capture.
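
To sketch what I mean (names as used in this discussion; this is only a
fragment, not actual patch code), the offsets array could be filled in the
same loop that walks the leaf partitions in bound order :

    int     j = 0;

    mtstate->mt_subplan_partition_offsets =
        palloc(mtstate->mt_nplans * sizeof(int));

    for (i = 0; i < num_leaf_parts; i++)
    {
        if (j < mtstate->mt_nplans &&
            leaf_part_oids[i] ==
            RelationGetRelid(mtstate->resultRelInfo[j].ri_RelationDesc))
        {
            /* Reuse the UPDATE result rel created in InitPlan. */
            leaf_part_rri[i] = &mtstate->resultRelInfo[j];
            mtstate->mt_subplan_partition_offsets[j] = i;
            j++;
        }
        else
        {
            /* ... otherwise initialize a fresh ResultRelInfo for this leaf ... */
        }
    }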

With that, now there are no persubplan and perleaf arrays for ExecInsert()
to pick from to select a map to pass to ConvertPartitionTupleSlot(), or
maybe even no need for the separate function.  The tuple-routing code
block in ExecInsert would look like below (writing resultRelInfo as just Rel):
 rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel

 if (rootRel != Rel)     /* update tuple-routing active */
 {
     int   subplan_off = Rel - mtstate->Rel[0];
     int   leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off];

     if (mt_transition_tupconv_maps[leaf_off])
     {
         /*
          * Convert to root format using
          * mt_transition_tupconv_maps[leaf_off]
          */
         slot = mt_root_tuple_slot;  /* for tuple-routing */

         /* Store the converted tuple into slot */
     }
 }

 /* Existing tuple-routing flow follows */
 new_leaf = ExecFindPartition(rootRel, slot, ...)

 if (mtstate->transition_capture)
 {
     transition_capture_map = mt_transition_tupconv_maps[new_leaf]
 }

 if (mt_partition_tupconv_maps[new_leaf])
 {
     /*
      * Convert to leaf format using mt_partition_tupconv_maps[new_leaf]
      */
     slot = mt_partition_tuple_slot;

     /* Store the converted tuple into slot */
 }

>> ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
>> try to reuse the per-subplan child-to-parent map as per-leaf
>> child-to-parent map could be simplified a bit.  I mean the following code:
>>
>> +    /*
>> +     * But for Updates, we can share the per-subplan maps with the per-leaf
>> +     * maps.
>> +     */
>> +    update_rri_index = 0;
>> +    update_rri = mtstate->resultRelInfo;
>> +    if (mtstate->mt_nplans > 0)
>> +        cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
>>
>> -        /* Choose the right set of partitions */
>> -        if (mtstate->mt_partition_dispatch_info != NULL)
>> +    for (i = 0; i < numResultRelInfos; ++i)
>> +    {
>> <snip>
>>
>> How about (pseudo-code):
>>
>>  j = 0;
>>  for (i = 0; i < n_leaf_parts; i++)
>>  {
>>      if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
>>      {
>>          leaf_childparent_map[i] = subplan_childparent_map[j];
>>          j++;
>>      }
>>      else
>>      {
>>          leaf_childparent_map[i] = new map
>>      }
>>  }
>>
>> I think the above would also be useful in ExecSetupPartitionTupleRouting()
>> where you've added similar code to try to reuse per-subplan ResultRelInfos.
> 
> Did something like that in the attached patch. Please have a look.
> After we conclude on that, will do the same for
> ExecSetupPartitionTupleRouting() as well.

Yeah, ExecSetupTransitionCaptureState() looks better in v22, but as I
explained above, we may not need to change the function so much.  The
approach, OTOH, should be adopted for ExecSetupPartitionTupleRouting().

>> In the following:
>>
>>          ExecSetupPartitionTupleRouting(rel,
>> +                                       (operation == CMD_UPDATE ?
>> +                                        mtstate->resultRelInfo : NULL),
>> +                                       (operation == CMD_UPDATE ? nplans
>> : 0),
>>
>> Can the second parameter be made to not span two lines?  It was a bit hard
>> for me to see that there are two new parameters.
> 
> I think it is safe to just pass mtstate->resultRelInfo. Inside
> ExecSetupPartitionTupleRouting() we should anyways check only the
> nplans param (and not update_rri) to decide whether it is for insert
> or update. So did the same.

OK.

>> By the way, I've seen in a number of places that the patch calls "root
>> table" a partition.  Not just in comments, but also a variable appears to
>> be given a name which contains rootpartition.  I can see only one instance
>> where root is called a partition in the existing source code, but it seems
>> to have been introduced only recently:
>>
>> allpaths.c:1333:                 * A root partition will already have a
> 
> Changed to either this :
> root partition => root partitioned table
> or this if we have to refer to it too often :
> root partition => root

That seems fine, thanks.

On 2017/10/25 15:10, Amit Khandekar wrote:
> On 16 October 2017 at 08:28, Amit Langote wrote:
>> In ExecInitModifyTable(), can we try to minimize the number of places
>> where update_tuple_routing_needed is being set.  Currently, it's being set
>> in 3 places:
>
> I think the way it's done seems ok. For each resultRelInfo,
> update_tuple_routing_needed is updated in case that resultRel has
> partition cols changed. And at that point, we don't have rel opened,
> so we can't check if that rel is partitioned. So another check is
> required outside of the loop.

I understood why now.

>> +         * qual for each partition. Note that, if there are SubPlans in
>> there,
>> +         * they all end up attached to the one parent Plan node.
>>
>> The sentence starting with "Note that, " is a bit unclear.
>>
>> +        Assert(update_tuple_routing_needed ||
>> +               (operation == CMD_INSERT &&
>> +                list_length(node->withCheckOptionLists) == 1 &&
>> +                mtstate->mt_nplans == 1));
>>
>> The comment I complained about above is perhaps about this Assert.
>
> That is an existing comment.

Sorry, my bad.

> On HEAD, the "parent Plan" refers to
> mtstate->mt_plans[0]. Now in the patch, for the parent node in
> ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
> parent plan refers to this mtstate node.

Hmm, I'm not really sure if doing that (passing mtstate->ps) would be
accurate.  In the update tuple routing case, it seems that it's better to
pass the correct parent PlanState pointer to ExecInitQual(), that is, one
corresponding to the partition's sub-plan.  At least I get that feeling by
looking at how parent is used downstream to that ExecInitQual() call, but
there *may* not be anything to worry about there after all.  I'm unsure.

> BTW, the reason I had changed the parent node to mtstate->ps is :
> Other places in that code use mtstate->ps while initializing
> expressions :
>
> /*
> * Build a projection for each result rel.
> */
>    resultRelInfo->ri_projectReturning =
>       ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
>                               resultRelInfo->ri_RelationDesc->rd_att);
>
> ...........
>
> /* build DO UPDATE WHERE clause expression */
> if (node->onConflictWhere)
> {
>    ExprState  *qualexpr;
>
>    qualexpr = ExecInitQual((List *) node->onConflictWhere,
>     &mtstate->ps);
> ....
> }
>
> I think wherever we initialize expressions belonging to a plan, we
> should use that plan as the parent. WithCheckOptions are fields of
> ModifyTableState.

You may be right, but I see for WithCheckOptions initialization
specifically that the non-tuple-routing code passes the actual sub-plan
when initializing the WCO for a given result rel.

>> Comments on the optimizer changes:
>>
>> +get_all_partition_cols(List *rtables,
>>
>> Did you mean rtable?
>
> I did mean rtables. It's a list of rtables.

It's not, AFAIK.  rtable (range table) is a list of range table entries,
which is also what seems to get passed to get_all_partition_cols for that
argument (root->parse->rtable, which is not a list of lists).

Moreover, there are no existing instances of this naming within the
planner other than those that this patch introduces:

$ grep rtables src/backend/optimizer/
planner.c:114: static void get_all_partition_cols(List *rtables,
planner.c:1063: get_all_partition_cols(List *rtables,
planner.c:1069:    Oid    root_relid = getrelid(root_rti, rtables);
planner.c:1078:    Oid            relid = getrelid(rti, rtables);

OTOH, dependency.c does have rtables, but it's actually a list of range
tables.  For example:

dependency.c:1360:    context.rtables = list_make1(rtable);

>> +       if (partattno != 0)
>> +           child_keycols =
>> +               bms_add_member(child_keycols,
>> +                              partattno -
>> FirstLowInvalidHeapAttributeNumber);
>> +   }
>> +   foreach(lc, partexprs)
>> +   {
>>
>> Elsewhere (in quite a few places), we don't iterate over partexprs
>> separately like this, although I'm not saying it is bad, just different
>> from other places.
>
> I think you are suggesting we do it like how it's done in
> is_partition_attr(). Can you please let me know other places we do
> this same way ? I couldn't find.

OK, not as many as I thought there would be, but there are the following
besides is_partition_attrs():

partition.c: get_range_nulltest()
partition.c: get_qual_for_range()
relcache.c: RelationBuildPartitionKey()

>> Aha, so here's where all_part_cols was being set before...
>
> Yes, and we used to have PartitionedChildRelInfo.all_part_cols field
> for that. We used to populate that while traversing through the
> partition tree in expand_inherited_rtentry(). I agreed with Dilip's
> opinion that this would unnecessarily add up some processing even when
> the query is not a DML. And also, we don't have to have
> PartitionedChildRelInfo.all_part_cols. For the earlier implementation,
> check v18 patch or earlier versions.

Hmm, I think I have to agree with both you and Dilip that that would add
some redundant processing to other paths.

> Attached v22 patch.

Thanks again.

Regards,
Amit




Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Oct 25, 2017 at 11:40 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Below I have addressed the remaining review comments :

The changes to trigger.c still make me super-nervous.  Hey THOMAS
MUNRO, any chance you could review that part?

+       /* The caller must have already locked all the partitioned tables. */
+       root_rel = heap_open(root_relid, NoLock);
+       *all_part_cols = NULL;
+       foreach(lc, partitioned_rels)
+       {
+               Index           rti = lfirst_int(lc);
+               Oid                     relid = getrelid(rti, rtables);
+               Relation        part_rel = heap_open(relid, NoLock);
+
+               pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+               heap_close(part_rel, NoLock);

I don't like the fact that we're opening and closing the relation here
just to get information on the partitioning columns.  I think it would
be better to do this someplace that already has the relation open and
store the details in the RelOptInfo.  set_relation_partition_info()
looks like the right spot.

+void
+pull_child_partition_columns(Relation rel,
+                                                        Relation parent,
+                                                        Bitmapset **partcols)

This code has a lot in common with is_partition_attr().  I'm not sure
it's worth trying to unify them, but it could be done.
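
(For readers following along, the common work in question, collecting a
table's partition key columns into a Bitmapset, looks roughly like the
sketch below.  It mirrors the hunks quoted elsewhere in this thread rather
than either function verbatim; the helper name is made up and the usual
backend headers are assumed.)

/* Hypothetical helper: add rel's partition key columns to *keycols. */
static void
collect_partition_key_cols(Relation rel, Bitmapset **keycols)
{
    PartitionKey key = RelationGetPartitionKey(rel);
    int          partnatts = get_partition_natts(key);
    List        *partexprs = get_partition_exprs(key);
    ListCell    *lc;
    int          i;

    for (i = 0; i < partnatts; i++)
    {
        AttrNumber  partattno = get_partition_col_attnum(key, i);

        /* Simple column references; zero means an expression column. */
        if (partattno != 0)
            *keycols = bms_add_member(*keycols,
                                      partattno - FirstLowInvalidHeapAttributeNumber);
    }

    /* Expression columns: collect the Vars they reference. */
    foreach(lc, partexprs)
        pull_varattnos((Node *) lfirst(lc), 1, keycols);
}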

+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,

Instead of " : ", you could just write "is the".

+                * For Updates, if the leaf partition is already present in the
+                * per-subplan result rels, we re-use that rather than initialize a
+                * new result rel. The per-subplan resultrels and the resultrels of
+                * the leaf partitions are both in the same canonical order. So while

It would be good to explain the reason.  Also, Updates shouldn't be
capitalized here.

+                               Assert(cur_update_rri <= update_rri + num_update_rri - 1);

Maybe just cur_update_rri < update_rri + num_update_rri, or even
current_update_rri - update_rri < num_update_rri.

Also, +1 for Amit Langote's idea of trying to merge
mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote:

> Also, +1 for Amit Langote's idea of trying to merge
> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

Currently I am trying to see if it simplifies things if we do that. We
will be merging these arrays into one, but we are adding a new int[]
array that maps subplans to leaf partitions. Will get back with how it
looks finally.

Robert, Amit , I will get back with your other review comments.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/11/07 14:40, Amit Khandekar wrote:
> On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote:
> 
>> Also, +1 for Amit Langote's idea of trying to merge
>> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.
> 
> Currently I am trying to see if it simplifies things if we do that. We
> will be merging these arrays into one, but we are adding a new int[]
> array that maps subplans to leaf partitions. Will get back with how it
> looks finally.

One thing to note is that the int[] array I mentioned will be much faster
to compute than going to convert_tuples_by_name() to build the additional
maps array.

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The changes to trigger.c still make me super-nervous.  Hey THOMAS
> MUNRO, any chance you could review that part?

Looking, but here's one silly thing that jumped out at me while
getting started with this patch.  I cannot seem to convince my macOS
system to agree with the expected sort order from :show_data, where
underscores precede numbers:
   part_a_10_a_20 | a | 10 | 200 |  1 |
   part_a_1_a_10  | a |  1 |   1 |  1 |
 - part_d_1_15    | b | 15 | 146 |  1 |
 - part_d_1_15    | b | 16 | 147 |  2 |
   part_d_15_20   | b | 17 | 155 | 16 |
   part_d_15_20   | b | 19 | 155 | 19 |
 + part_d_1_15    | b | 15 | 146 |  1 |
 + part_d_1_15    | b | 16 | 147 |  2 |

It seems that macOS (like older BSDs) just doesn't know how to sort
Unicode and falls back to sorting the bits.  I expect that means that
the test will also fail on any other OS with "make check
LC_COLLATE=C".  I believe our regression tests are supposed to pass
with a wide range of collations including C, so I wonder if this means
we should stick a leading zero on those single digit numbers, or
something, to stabilise the output.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>> MUNRO, any chance you could review that part?
>
> Looking, but here's one silly thing that jumped out at me while
> getting started with this patch.  I cannot seem to convince my macOS
> system to agree with the expected sort order from :show_data, where
> underscores precede numbers:
>
>   part_a_10_a_20 | a | 10 | 200 |  1 |
>   part_a_1_a_10  | a |  1 |   1 |  1 |
> - part_d_1_15    | b | 15 | 146 |  1 |
> - part_d_1_15    | b | 16 | 147 |  2 |
>   part_d_15_20   | b | 17 | 155 | 16 |
>   part_d_15_20   | b | 19 | 155 | 19 |
> + part_d_1_15    | b | 15 | 146 |  1 |
> + part_d_1_15    | b | 16 | 147 |  2 |
>
> It seems that macOS (like older BSDs) just doesn't know how to sort
> Unicode and falls back to sorting the bits.  I expect that means that
> the test will also fail on any other OS with "make check
> LC_COLLATE=C".  I believe our regression tests are supposed to pass
> with a wide range of collations including C, so I wonder if this means
> we should stick a leading zero on those single digit numbers, or
> something, to stabilise the output.

I'd prefer to retain the partition names, so I have now added a
COLLATE "C" for partname like this :

-\set show_data 'select tableoid::regclass::text partname, * from
range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname,
* from range_parted order by 1, 2, 3, 4, 5, 6'

Thomas, can you please try the attached incremental patch
regress_locale_changes.patch and check if the test passes ? The patch
is to be applied on the main v22 patch. If the test passes, I will
include these changes (also for list_parted) in the upcoming v23
patch.

Thanks
-Amit Khandekar



Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Thomas, can you please try the attached incremental patch
> regress_locale_changes.patch and check if the test passes ? The patch
> is to be applied on the main v22 patch. If the test passes, I will
> include these changes (also for list_parted) in the upcoming v23
> patch.

That looks good.  Thanks.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>>> MUNRO, any chance you could review that part?

At first, it seemed quite strange to me that row triggers and
statement triggers fire different events for the same modification.
Row triggers see DELETE +  INSERT (necessarily because different
tables are involved), but this fact is hidden from the target table's
statement triggers.

The alternative would be for all triggers to see consistent events and
transitions.  Instead of having your special case code in ExecInsert
and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
the transition tables, you'd just let the existing ExecInsert and
ExecDelete code do its thing, and you'd need a flag to record that you
should also fire INSERT/DELETE after statement triggers if any rows
moved.

After sleeping on this question, I am coming around to the view that
the way you have it is right.  The distinction isn't really between
row triggers and statement triggers, it's between triggers at
different levels in the hierarchy.  It just so happens that we
currently only fire target table statement triggers and leaf table row
triggers.  Future development ideas that seem consistent with your
choice:

1.  If we ever allow row triggers with transition tables on child
tables, then I think *their* transition tables should certainly see
the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
inconsistent with the OLD and NEW variables in a single trigger
invocation.  (These were prohibited mainly due to lack of time and
(AFAIK) limited usefulness; I think they would probably need their own
separate tuplestores, or possibly some kind of filtering.)

2.  If we ever allow row triggers on partitioned tables (ie that fire
when its children are modified), then I think their UPDATE trigger
should probably fire when a row moves between any two (grand-)*child
tables, just as you have it for target table statement triggers.  It
doesn't matter that the view from parent tables' triggers is
inconsistent with the view from leaf table triggers: it's a feature
that we 'hide' partitioning from the user to the extent we can so that
you can treat the partitioned table just like a table.

Any other views?

As for the code, I haven't figured out how to break it yet, and I'm
wondering if there is some way to refactor so that ExecInsert and
ExecDelete don't have to record pseudo-UPDATE trigger events.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> ISTM, ModifyTableState now has one too
> many TupleConversionMap pointer arrays after the patch, creating the need
> to choose from in the first place.  AIUI -
>
> * mt_perleaf_parentchild_maps:
>
>   - each entry is a map to convert root parent's tuples to a given leaf
>     partition's format
>
>   - used to be called mt_partition_tupconv_maps and is needed when tuple-
>     routing is in use; for both INSERT and UPDATE with tuple-routing
>
>   - as many entries in the array as there are leaf partitions and stored
>     in the partition bound order
>
> * mt_perleaf_childparent_maps:
>
>   - each entry is a map to convert a leaf partition's tuples to the root
>     parent's format
>
>   - newly added by this patch and seems to be needed for UPDATE with
>     tuple-routing for two needs: 1. tuple-routing should start with a
>     tuple in root parent format whereas the tuple received is in leaf
>     partition format when ExecInsert() called for update-tuple-routing (by
>     ExecUpdate), 2. after tuple-routing, we must capture the tuple
>     inserted into the partition in the transition tuplestore which accepts
>     tuples in root parent's format
>
>   - as many entries in the array as there are leaf partitions and stored
>     in the partition bound order
>
> * mt_persubplan_childparent_maps:
>
>   - each entry is a map to convert a child table's tuples to the root
>     parent's format
>
>   - used to be called mt_transition_tupconv_maps and needed for converting
>     child tuples to the root parent's format when storing them in the
>     transition tuplestore which accepts tuples in root parent's format
>
>   - as many entries in the array as there are sub-plans in mt_plans and
>     stored in either the partition bound order or unknown order (the
>     latter in the regular inheritance case)

Thanks for the detailed description. Yes, that's correct.

>
> I think we could combine the last two into one.  The only apparent reason
> for them to be separate seems to be that the subplan array might contain
> less entries than perleaf array and ExecInsert() has only enough
> information to calculate the offset of a map in the persubplan array.
> That is, resultRelInfo of leaf partition that ExecInsert starts with in
> the update-tuple-routing case comes from mtstate->resultRelInfo array
> which contains only mt_nplans entries.  So, if we only have the array with
> entries for *all* partitions, it's hard to get the offset of the map to
> use in that array.
>
> I suggest we don't add a new map array and a significant amount of new
> code to initialize the same and to implement the logic to choose the
> correct array to get the map from.  Instead, we could simply add an array
> of integers with mt_nplans entries.  Each entry is an offset of a given
> sub-plan in the array containing entries of something for *all*
> partitions.  Since, we are teaching ExecSetupPartitionTupleRouting() to
> reuse ResultRelInfos from mtstate->resultRelInfos, there is a suitable
> place to construct such array.  Let's say the array is called
> mt_subplan_partition_offsets[].  Let ExecSetupPartitionTupleRouting() also
> initialize the parent-to-partition maps for *all* partitions, in the
> update-tuple-routing case.  Then add a quick-return check in
> ExecSetupTransitionCaptureState() to see if the map has already been set
> by ExecSetupPartitionTupleRouting().  Since we're using the same map for
> two purposes, we could rename mt_transition_tupconv_maps to something that
> doesn't bind it to its use only for transition tuple capture.

I was trying hard to verify whether this is really going to simplify
the code. We are removing one array and adding one. In my approach,
the map structures are shared anyway; they are not duplicated. Because
I have separate arrays to access the tuple conversion map either
partition-wise or subplan-wise, there is no need for extra logic to
index into the per-partition array. But on the other hand, with your
approach we would not need as many of the changes I have made in
ExecSetupTransitionCaptureState(), although my patch hasn't increased
the number of lines in that function; it has just changed the logic.

Also, each time we access the map, we need to know whether it is
per-plan or per-partition, according to a set of factors like whether
transition tables are there and whether tuple routing is there.

But I realized that one plus point of your approach is that it is
going to be extensible if we later need to have some more per-subplan
information that is already there in a partition-wise array. In that
case, we just need to re-use the int[] map; we don't have to create
two new separate arrays; just create one per-leaf array, and use the
map to get into one of its elements, given a per-subplan index.

So I went ahead and did the changes :

New mtstate maps :

TupleConversionMap **mt_parentchild_tupconv_maps;
/* Per partition map for tuple conversion from root to leaf */
TupleConversionMap **mt_childparent_tupconv_maps;
/* Per plan/partition map for tuple conversion from child to root */
int *mt_subplan_partition_offsets;
/* Stores position of update result rels in leaf partitions */

We need to know whether mt_childparent_tupconv_maps is per-plan or
per-partition. Each time this map is accessed, it is tedious to go
through conditions that determine whether that map is per-partition or
not. Here are the conditions :

For transition tables :
   per-leaf map needed : in presence of tuple routing (insert or update, whichever).
   per-plan map needed : in presence of simple update (i.e. routing not involved)
For update tuple routing :
   per-plan map needed : always

So instead, added a new bool mtstate->mt_is_tupconv_perpart field that
is set to true only while setting up transition tables and that too
only when tuple routing is to be done.

Since both transition tables and update tuple routing need a
child-parent map, extracted the code to build the  map into a common
function ExecSetupChildParentMap(). (I think I could have done this
earlier also)

Each time we need to access this map, we not only have to use the
int[] maps, we also need to first check if it's a per-leaf map. So put
this logic in tupconv_map_for_subplan() and used this everywhere we
need the map.
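
As a rough illustration (not the exact code in v23), tupconv_map_for_subplan()
boils down to something like this, using the field names described above:

/*
 * Sketch of tupconv_map_for_subplan(): return the child-to-parent
 * conversion map for the given subplan, regardless of whether
 * mt_childparent_tupconv_maps is indexed per subplan or per leaf
 * partition.
 */
static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
    /*
     * When the maps were built per leaf partition (transition capture set
     * up while tuple routing is in use), translate the subplan index into
     * the corresponding leaf-partition index first.
     */
    if (mtstate->mt_is_tupconv_perpart)
    {
        int     leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];

        return mtstate->mt_childparent_tupconv_maps[leaf_index];
    }

    /* Otherwise the array is indexed by subplan directly. */
    return mtstate->mt_childparent_tupconv_maps[whichplan];
}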

Attached is v23 patch that has just the above changes (and also
rebased on hash-partitioning changes, like update.sql). I am still
doing some sanity testing on this, although regression passes.

I am yet to respond to the other review comments; will do that with a v24 patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 9 November 2017 at 09:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>>>> MUNRO, any chance you could review that part?
>
> At first, it seemed quite strange to me that row triggers and
> statement triggers fire different events for the same modification.
> Row triggers see DELETE +  INSERT (necessarily because different
> tables are involved), but this fact is hidden from the target table's
> statement triggers.
>
> The alternative would be for all triggers to see consistent events and
> transitions.  Instead of having your special case code in ExecInsert
> and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
> the transition tables, you'd just let the existing ExecInsert and
> ExecDelete code do its thing, and you'd need a flag to record that you
> should also fire INSERT/DELETE after statement triggers if any rows
> moved.

Yeah, I had also thought about that, but felt that change was too
invasive; for example, letting ExecARInsertTriggers() do the transition
capture even when transition_capture->tcs_update_new_table is set.

I was also thinking of having a separate function to *only* add the
transition table rows, and in ExecInsert, calling this one instead of
ExecARUpdateTriggers(). But I realized that the existing
ExecARUpdateTriggers() looks like a better, more robust interface with
all its checks. It's just that calling ExecARUpdateTriggers() sounds
like we are also firing a trigger; we are not firing any trigger or
saving any event, we are just adding the transition row.

>
> After sleeping on this question, I am coming around to the view that
> the way you have it is right.  The distinction isn't really between
> row triggers and statement triggers, it's between triggers at
> different levels in the hierarchy.  It just so happens that we
> currently only fire target table statement triggers and leaf table row
> triggers.

Yes. And rows are there only in leaf partitions. So we have to
simulate as though the target table has these rows. Like you
mentioned, the user has to get the impression of a normal table. So we
have to do something extra to capture the rows.

> Future development ideas that seem consistent with your choice:
>
> 1.  If we ever allow row triggers with transition tables on child
> tables, then I think *their* transition tables should certainly see
> the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
> inconsistent with the OLD and NEW variables in a single trigger
> invocation.  (These were prohibited mainly due to lack of time and
> (AFAIK) limited usefulness; I think they would probably need their own
> separate tuplestores, or possibly some kind of filtering.)

As we know, for row triggers on leaf partitions, we treat them as
normal tables, so a trigger written on a leaf partition sees only the
local changes. The trigger is unaware whether the insert is part of an
UPDATE row movement. Similarly, the transition table referenced by
that row trigger function should see only the NEW table, not the old
table.

>
> 2.  If we ever allow row triggers on partitioned tables (ie that fire
> when its children are modified), then I think their UPDATE trigger
> should probably fire when a row moves between any two (grand-)*child
> tables, just as you have it for target table statement triggers.

Yes I agree.

> It doesn't matter that the view from parent tables' triggers is
> inconsistent with the view from leaf table triggers: it's a feature
> that we 'hide' partitioning from the user to the extent we can so that
> you can treat the partitioned table just like a table.
>
> Any other views?

I think that because there is no provision for a row trigger on a
partitioned table, users who want to have a common trigger on a
partition subtree have no choice but to create the same trigger
individually on the leaf partitions. And that's the reason we cannot
handle an update row movement with triggers without anomalies.

Thanks
-Amit Khandekar



Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
[ update-partition-key_v23.patch ]

Hi Amit,

Thanks for working on this. I'm looking forward to seeing this go in.

So... I've signed myself up to review the patch, and I've just had a
look at it, (after first reading this entire email thread!).

Overall the patch looks like it's in quite a good shape. I think I do
agree with Robert about the UPDATE anomaly that's been discussed. I
don't think we're painting ourselves into any corner by not having
this working correctly right away. Anyone who's using some trigger
workarounds for the current lack of support for updating the partition
key is already going to have the same issues, so at least this will
save them some troubles implementing triggers and give them much
better performance. I see you've documented this fact too, which is
good.

I'm writing this email now as I've just run out of review time for today.

Here's what I noted down during my first pass:

1. Closing command tags in docs should not be abbreviated
   triggers are concerned, <literal>AFTER</> <command>DELETE</command> and

This changed in c29c5789. I think Peter will be happy if you don't
abbreviate the closing tags.

2. "about to do" would read better as "about to perform"
concurrent session, and it is about to do an <command>UPDATE</command>

I think this paragraph could be more clear if we identified the
sessions with a number.

Perhaps:

      Suppose, session 1 is performing an <command>UPDATE</command> on a
      partition key, meanwhile, session 2 tries to perform an <command>UPDATE
      </command> or <command>DELETE</command> operation on the same row.
      Session 2 can silently miss the row due to session 1's activity.  In
      such a case, session 2's <command>UPDATE</command>/<command>DELETE
      </command>, being unaware of the row's movement, interprets this that the
      row has just been deleted, so there is nothing to be done for this row.
      Whereas, in the usual case where the table is not partitioned, or where
      there is no row movement, the second session would have identified the
      newly updated row and carried <command>UPDATE</command>/<command>DELETE
      </command> on this new row version.


3. Integer width. get_partition_natts returns int but we assign to int16.

int16 partnatts = get_partition_natts(key);

Confusingly, get_partition_col_attnum() returns int16 instead of AttrNumber,
but that's an existing problem rather than one introduced by this patch.

4. The following code could just pull_varattnos(partexprs, 1, &child_keycols);

foreach(lc, partexprs)
{
Node    *expr = (Node *) lfirst(lc);

pull_varattnos(expr, 1, &child_keycols);
}

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
do something special when the DELETE/INSERT is a partition move? I have
audit tables in mind here: it may appear as though a user performed a
DELETE when they actually performed an UPDATE. Giving visibility of
this to the trigger function will allow the application to work around
this.

6. change "row" to "a row" and "old" to "the old"

* depending on whether the event is for row being deleted from old

But to be honest, I'm having trouble parsing the comment. I think it
would be better to
say explicitly when the row will be NULL rather than "depending on
whether the event"

7. I'm confused with how this change came about. If the old comment
was correct here then the comment you're referring to here should
remain in ExecPartitionCheck(), but you're saying it's in
ExecConstraints().

/* See the comments in ExecConstraints. */

If the comment really is in ExecConstraints(), then you might want to
give an overview of what you mean, then reference ExecConstraints() if
more details are required.

8. I'm having trouble parsing this comment:
* 'update_rri' has the UPDATE per-subplan result rels.

I think "has" should be "contains" ?

9. Also, this should likely be reworded:
 * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
 *      this is 0.

'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

10. There should be no space before the '?'

/* Is this leaf partition present in the update resultrel ? */

11. I'm struggling to understand this comment:

* This is required when converting tuple as per root
* partition tuple descriptor.

"tuple" should probably be "the tuple", but not quite sure what you
mean by "as per root".

I may have misunderstood, but maybe it should read:

* This is required when we convert the partition's tuple to
* be compatible with the partitioned table's tuple descriptor.

12. I think "as well" would be better written as "either".

* If we didn't open the partition rel, it means we haven't
* initialized the result rel as well.

13. I'm unsure what is meant by the following comment:

* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.

I'm not quite sure where UPDATE comes in here as we're only checking for INSERT?

14. Use of underscores instead of camelCase.

COPY_SCALAR_FIELD(part_cols_updated);

I know you're not the first one to break this as "partitioned_rels"
does not follow it either, but that's probably not a good enough
reason to break away from camelCase any further.

I'd suggest "partColsUpdated". But after a re-think, maybe cols is
incorrect. All columns are partitioned, it's the key columns that we
care about, so how about "partKeyUpdate"

15. Are you sure that you mean "root" here?
 * All the child partition attribute numbers are converted to the root
 * partitioned table.

Surely this is just the target relation. "parent" maybe? A
sub-partitioned table might be the target of an UPDATE too.

15. I see get_all_partition_cols() is just used once to check if
parent_rte->updatedCols contains any partition keys.

Would it not be better to reform that function and pass
parent_rte->updatedCols in and abort as soon as you see a single
match?

Maybe the function could return bool and be named
partitioned_key_overlaps(), that way your assignment in
inheritance_planner() would just become:

part_cols_updated = partitioned_key_overlaps(root->parse->rtable,
top_parentRTindex, partitioned_rels, parent_rte->updatedCols);

or something like that anyway.
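
As a rough sketch of that reshaped function (illustrative only, reusing the
patch's pull_child_partition_columns() and assuming the caller already holds
locks on all the partitioned tables, as the current code does):

static bool
partitioned_key_overlaps(List *rtable, Index root_rti,
                         List *partitioned_rels, Bitmapset *updatedCols)
{
    Oid         root_relid = getrelid(root_rti, rtable);
    Relation    root_rel = heap_open(root_relid, NoLock);
    bool        overlap = false;
    ListCell   *lc;

    foreach(lc, partitioned_rels)
    {
        Index       rti = lfirst_int(lc);
        Oid         relid = getrelid(rti, rtable);
        Relation    part_rel = heap_open(relid, NoLock);
        Bitmapset  *keycols = NULL;

        /* Key columns of this table, in the root table's attribute numbers. */
        pull_child_partition_columns(part_rel, root_rel, &keycols);
        heap_close(part_rel, NoLock);

        /* Stop as soon as one updated column is part of a partition key. */
        if (bms_overlap(keycols, updatedCols))
        {
            overlap = true;
            break;
        }
    }

    heap_close(root_rel, NoLock);
    return overlap;
}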

16. Typo in comment
 * 'part_cols_updated' if any partitioning columns are being updated, either
 *      from the named relation or a descendent partitione table.
 

"partitione" should be "partitioned". Also, normally for bool
parameters, we might word things like "True if ..." rather than just
"if"

You probably should follow camelCase I mentioned in 14 here too.

17. Comment needs a few changes:
 * ConvertPartitionTupleSlot -- convenience function for converting tuple and
 * storing it into a tuple slot provided through 'new_slot', which typically
 * should be one of the dedicated partition tuple slot. Passes the partition
 * tuple slot back into output param p_old_slot. If no mapping present, keeps
 * p_old_slot unchanged.
 *
 * Returns the converted tuple.

There are a few typos here. For example, "tuple" should be "a tuple",
but maybe the comment should just be worded like:
 * ConvertPartitionTupleSlot -- convenience function for tuple conversion
 * using 'map'. The tuple, if converted, is stored in 'new_slot' and
 * 'p_old_slot' is set to the original partition tuple slot. If map is NULL,
 * then the original tuple is returned unmodified, otherwise the converted
 * tuple is returned.
 

18. Line goes over 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

Better just to split the declaration and assignment.

19. Confusing comment:

/*
* If the original operation is UPDATE, the root partitioned table
* needs to be fetched from mtstate->rootResultRelInfo.
*/

It's not that clear here how you determine this is an UPDATE of a
partitioned key.

20. This code looks convoluted:

rootResultRelInfo = (mtstate->rootResultRelInfo ?
mtstate->rootResultRelInfo : resultRelInfo);

/*
* If the resultRelInfo is not the root partitioned table (which
* happens for UPDATE), we should convert the tuple into root's tuple
* descriptor, since ExecFindPartition() starts the search from root.
* The tuple conversion map list is in the order of
* mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
* we need to know the position of the resultRel in
* mtstate->resultRelInfo[].
*/
if (rootResultRelInfo != resultRelInfo)
{

rootResultRelInfo is assigned via a ternary expression which makes the
subsequent if test seem a little strange.

Would it not be better to test:

if (mtstate->rootResultRelInfo)
{
rootResultRelInfo = mtstate->rootResultRelInfo
... other stuff ...
}
else
rootResultRelInfo = resultRelInfo;

Then above the if test you can explain that rootResultRelInfo is only
set during UPDATE of partition keys, as per #19.

21. How come you renamed mt_partition_tupconv_maps[] to
mt_parentchild_tupconv_maps[]?

22. Comment in ExecInsert() could be worded better.

/*
* In case this is part of update tuple routing, put this row into the
* transition NEW TABLE if we are capturing transition tables. We need to
* do this separately for DELETE and INSERT because they happen on
* different tables.
*/

/*
* This INSERT may be the result of a partition-key-UPDATE. If so,
* and we're required to capture transition tables then we'd better
* record this as a statement level UPDATE on the target relation.
* We're not interested in the statement level DELETE or INSERT as
* these occur on the individual partitions, none of which are the
* target of this the UPDATE statement.
*/

A similar comment could use a similar improvement in ExecDelete()

23. Line is longer than 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

24. I know from reading the thread this name has changed before, but I
think delete_skipped seems like the wrong name for this variable in:

if (delete_skipped)
*delete_skipped = true;

Skipped is the wrong word here as it indicates we had some sort of
choice and decided not to. However, that's not the case when the tuple
was concurrently deleted. Would it not be better to call it
"tuple_deleted" or even "success" and reverse the logic? It's just a
bit confusing that you're setting this to skipped before anything
happens. It would be nicer if there was a better way to do this whole
thing as it's a bit of a wart in the code. I understand why the code
exists though.

Also, I wonder if it's better to always pass a boolean here to save
having to test for NULL before setting it; that way you might consider
putting the success = false just before the return NULL, then do
success = true after the tuple is gone.
Failing that, putting something like success = false; /* not yet! */
where you're doing the if (delete_skipped) test might also be better.
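
To illustrate the suggested shape, the row-movement path in ExecUpdate()
might then read roughly as below; the ExecDelete()/ExecInsert() argument
lists are abbreviated and purely illustrative, as is the out-parameter name:

/* Illustrative fragment only. */
bool        tuple_deleted = false;

/* First half of the row movement: delete the row from its old partition. */
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
           &tuple_deleted, false /* canSetTag */ );

/*
 * If the row was not actually deleted (a trigger suppressed it, or it was
 * already deleted by this or a concurrent transaction), inserting the new
 * version would effectively create an extra row, so bail out.
 */
if (!tuple_deleted)
    return NULL;

/* Second half: route the new row version to the right partition. */
return ExecInsert(mtstate, slot, planSlot, estate, canSetTag);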

25. Comment "we should" should be "we must".

/*
* For some reason if DELETE didn't happen (for e.g. trigger
* prevented it, or it was already deleted by self, or it was
* concurrently deleted by another transaction), then we should
* skip INSERT as well, otherwise, there will be effectively one
* new row inserted.

Maybe just:
/* If the DELETE operation was unsuccessful, then we must not
* perform the INSERT into the new partition.

"for e.g." is not really correct in English. "For example, ..." or
just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've
written "For exempli gratia", which translates to "For for example".

26. You're not really explaining what's going on here:

if (mtstate->mt_transition_capture)
saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

You have a comment later to say you're about to "Revert back to the
transition capture map", but I missed the part that explained about
modifying it in the first place.

27. Comment does not explain how we're skipping checking the partition
constraint check in:

* We have already checked partition constraints above, so skip
* checking them here.

Maybe something like:

* We've already checked the partition constraint above, however, we
* must still ensure the tuple passes all other constraints, so we'll
* call ExecConstraints() and have it validate all remaining checks.

28. For table WITH OIDs, the OID should probably follow the new tuple
for partition-key-UPDATEs.

CREATE TABLE p (a BOOL NOT NULL, b INT NOT NULL) PARTITION BY LIST (a)
WITH OIDS;
CREATE TABLE P_true PARTITION OF p FOR VALUES IN('t');
CREATE TABLE P_false PARTITION OF p FOR VALUES IN('f');
INSERT INTO p VALUES('t', 10);
SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_true   | 16792 | t
(1 row)

UPDATE p SET a = 'f'; -- partition-key-UPDATE (oid has changed (it
probably shouldn't have))
SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_false  | 16793 | f
(1 row)

UPDATE p SET b = 20; -- non-partition-key-UPDATE (oid remains the same)

SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_false  | 16793 | f
(1 row)

I'll try to continue with the review tomorrow, but I think some other
reviews are also looming too.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Fri, Nov 10, 2017 at 4:42 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is v23 patch that has just the above changes (and also
> rebased on hash-partitioning changes, like update.sql). I am still
> doing some sanity testing on this, although regression passes.

The test coverage[1] is 96.62%.  Nice work.  Here are the bits that
aren't covered:

In partition.c's pull_child_partition_columns(), the following loop is
never run:

+       foreach(lc, partexprs)
+       {
+               Node       *expr = (Node *) lfirst(lc);
+
+               pull_varattnos(expr, 1, &child_keycols);
+       }

In nodeModifyTable.c, the following conditional branches are never run:
                if (mtstate->mt_oc_transition_capture != NULL)
+               {
+                       Assert(mtstate->mt_is_tupconv_perpart == true);
                        mtstate->mt_oc_transition_capture->tcs_map =
-                               mtstate->mt_transition_tupconv_maps[leaf_part_index];
+                               mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+               }

                                if (node->mt_oc_transition_capture != NULL)
                                {
-                                       Assert(node->mt_transition_tupconv_maps != NULL);
                                        node->mt_oc_transition_capture->tcs_map =
-                                               node->mt_transition_tupconv_maps[node->mt_whichplan];
+                                               tupconv_map_for_subplan(node, node->mt_whichplan);
                                }

Is there any reason we shouldn't be able to test these paths?

[1] https://codecov.io/gh/postgresql-cfbot/postgresql/commit/a3beb8d8f598a64d75aa4b3afc143a5d3e3f7826

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
On 14 November 2017 at 01:55, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'll try to continue with the review tomorrow, but I think some other
> reviews are also looming too.

I started looking at this again today. Here's the remainder of my review.

29. ExecSetupChildParentMap gets called here for non-partitioned relations.
Maybe that's not the best function name? The function only seems to do
that when perleaf is True.

Is a leaf a partition of a partitioned table? The meaning is not that
clear here.

/*
* If we found that we need to collect transition tuples then we may also
* need tuple conversion maps for any children that have TupleDescs that
* aren't compatible with the tuplestores.  (We can share these maps
* between the regular and ON CONFLICT cases.)
*/
if (mtstate->mt_transition_capture != NULL ||
mtstate->mt_oc_transition_capture != NULL)
{
int numResultRelInfos;

numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
mtstate->mt_num_partitions :
mtstate->mt_nplans);

ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
(mtstate->mt_partition_dispatch_info != NULL));


30. The following chunk of code is giving me a headache trying to
verify which arrays are which size:

ExecSetupPartitionTupleRouting(rel,
                               mtstate->resultRelInfo,
                               (operation == CMD_UPDATE ? nplans : 0),
                               node->nominalRelation,
                               estate,
                               &partition_dispatch_info,
                               &partitions,
                               &partition_tupconv_maps,
                               &subplan_leaf_map,
                               &partition_tuple_slot,
                               &num_parted,
                               &num_partitions);
 
mtstate->mt_partition_dispatch_info = partition_dispatch_info;
mtstate->mt_num_dispatch = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
mtstate->mt_partition_tuple_slot = partition_tuple_slot;
mtstate->mt_root_tuple_slot = MakeTupleTableSlot();

I know this patch is not completely responsible for it, but you're not
making things any better.

Would it not be better to invent some PartitionTupleRouting struct and
make that struct a member of ModifyTableState and CopyState, then just
pass the pointer to that struct to ExecSetupPartitionTupleRouting()
and have it fill in the required details? I think the complexity of
this is already on the high end, I think you really need to do the
refactor before this gets any worse.

The signature of the function is a bit scary!

extern void ExecSetupPartitionTupleRouting(Relation rel,
                               ResultRelInfo *update_rri,
                               int num_update_rri,
                               Index resultRTindex,
                               EState *estate,
                               PartitionDispatch **pd,
                               ResultRelInfo ***partitions,
                               TupleConversionMap ***tup_conv_maps,
                               int **subplan_leaf_map,
                               TupleTableSlot **partition_tuple_slot,
                               int *num_parted, int *num_partitions);

What do you think?
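
Something along these lines is what a container struct could look like; the
field names below simply mirror the mtstate members listed above, and this
is a sketch rather than code from any posted patch:

/* Sketch of the suggested container struct; purely illustrative. */
typedef struct PartitionTupleRouting
{
    PartitionDispatch *partition_dispatch_info; /* tuple-routing hierarchy */
    int         num_dispatch;           /* # of partitioned tables in it */
    ResultRelInfo **partitions;         /* per-leaf-partition result rels */
    int         num_partitions;
    TupleConversionMap **parentchild_tupconv_maps;  /* root-to-leaf maps */
    int        *subplan_partition_offsets;  /* subplan -> leaf index */
    TupleTableSlot *partition_tuple_slot;   /* dedicated leaf-format slot */
    TupleTableSlot *root_tuple_slot;        /* dedicated root-format slot */
} PartitionTupleRouting;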

31. The following code seems incorrect:

/*
* If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
* need to do update tuple routing.
*/
if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

Shouldn't this be setting update_tuple_routing_needed to false if
there are no before row update triggers? Otherwise, you're setting it
to true regardless of if there are any partition key columns being
UPDATEd. That would make the work you're doing in
inheritance_planner() to set part_cols_updated a waste of time.

Also, this bit of code is a bit confused.

/* Decide whether we need to perform update tuple routing. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
update_tuple_routing_needed = false;

/*
* Build state for tuple routing if it's an INSERT or if it's an UPDATE of
* partition key.
*/
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))


The first if test would not be required if you fixed the code where
you set update_tuple_routing_needed = true regardless if its a
partitioned table or not.

So basically, you need to take the node->part_cols_updated from the
planner, if that's true then perform your test for before row update
triggers, set a bool to false if there are none, then proceed to setup
the partition tuple routing for partition table inserts or if your
bool is still true. Right?

32. "WCO" abbreviation is not that common and might need to be expanded.

* Below are required as reference objects for mapping partition
* attno's in expressions such as WCO and RETURNING.

Searching for other comments which mention "WCO" they're all around
places that is easy to understand they mean "With Check Option", e.g.
next to a variable with a more descriptive name. That's not the case
here.

33. "are anyway newly allocated", should "anyway" be "always"?
Otherwise, it does not make sense.

* If this result rel is one of the subplan result rels, let
* ExecEndPlan() close it. For INSERTs, this does not apply because
* all leaf partition result rels are anyway newly allocated.

34. Comment added which mentions a member that does not exist.
 * all_part_cols contains all attribute numbers from the parent that are
 * used as partitioning columns by the parent or some descendent which is
 * itself partitioned.
 *
 

I've not looked at the test coverage as I see Thomas has been looking
at that in some detail.

I'm going to set this patch as waiting for author now.

Thanks again for working on this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Alvaro Herrera
Date:
David Rowley wrote:

> 5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
> do something special when the DELETE/INSERT is a partition move? I
> have audit tables in mind here it may appear as though a user
> performed a DELETE when they actually performed an UPDATE giving
> visibility of this to the trigger function will allow the application
> to work around this.

+1  I think we do need a flag that can be inspected from the user
trigger function.

> 9. Also, this should likely be reworded:
> 
>  * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
>  *      this is 0.
> 
>  'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

Also:

/pgsql/source/master/src/backend/executor/execMain.c: In function 'ExecSetupPartitionTupleRouting':
/pgsql/source/master/src/backend/executor/execMain.c:3401:18: warning: 'leaf_part_arr' may be used uninitialized in this function [-Wmaybe-uninitialized]
   leaf_part_rri = leaf_part_arr + i;
   ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
 

I think using num_update_rri==0 as a flag to indicate INSERT is strange.
I suggest passing an additional boolean -- or maybe just split the whole
function in two, one for updates and another for inserts, say
ExecSetupPartitionTupleRoutingForInsert() and
ExecSetupPartitionTupleRoutingForUpdate().  They seem to
share almost no code, and the current flow is hard to read; maybe just
add a common subroutine for the bottom part of the loop.
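
Spelling the split out, the two prototypes could look roughly like this
(parameter lists simply copied from the current prototype and trimmed per
caller; illustrative only, not from any posted patch):

extern void ExecSetupPartitionTupleRoutingForInsert(Relation rel,
                               Index resultRTindex,
                               EState *estate,
                               PartitionDispatch **pd,
                               ResultRelInfo ***partitions,
                               TupleConversionMap ***tup_conv_maps,
                               TupleTableSlot **partition_tuple_slot,
                               int *num_parted, int *num_partitions);

extern void ExecSetupPartitionTupleRoutingForUpdate(Relation rel,
                               ResultRelInfo *update_rri,
                               int num_update_rri,
                               Index resultRTindex,
                               EState *estate,
                               PartitionDispatch **pd,
                               ResultRelInfo ***partitions,
                               TupleConversionMap ***tup_conv_maps,
                               int **subplan_leaf_map,
                               TupleTableSlot **partition_tuple_slot,
                               int *num_parted, int *num_partitions);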

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
Thanks David Rowley, Alvaro Herrera and Thomas Munro for stepping in
for the reviews !

In the attached patch v24, I have addressed Amit Langote's remaining
review points, and David Rowley's comments upto point #26.

Yet to address :
Robert's few suggestions.
All of Alvaro's comments.
David's points from #27 to #34.
Thomas's point about adding remaining test coverage on transition tables.

Below are the responses to both Amit's and David's comments, starting
with Amit's ....

===============

On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/10/24 0:15, Amit Khandekar wrote:
>> On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>
>>> +           (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
>>> == NULL))))
>>>
>>> Is there some reason why a bitwise operator is used here?
>>
>> That exact condition means that the function is called for transition
>> capture for updated rows being moved to another partition. For this
>> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
>> capture that condition there. I think the bitwise operator is more
>> user-friendly in emphasizing the point that it is indeed an "either a
>> or b, not both" condition.
>
> I see.  In that case, since this patch adds the new condition, a note
> about it in the comment just above would be good, because the situation
> you describe here seems to arise only during update-tuple-routing, IIUC.

Done. Please check.

> + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
> + *      instead of allocating new ones while generating the array of all leaf
> + *      partition result rels.
>
> Instead of:
>
> "These are re-used instead of allocating new ones while generating the
> array of all leaf partition result rels."
>
> how about:
>
> "There is no need to allocate a new ResultRellInfo entry for leaf
> partitions for which one already exists in this array"

Ok. I have made it like this :

+ * 'update_rri' contains the UPDATE per-subplan result rels. For the output param
+ *             'partitions', we don't allocate new ResultRelInfo objects for
+ *             leaf partitions for which they are already available in 'update_rri'.

>
>>> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
>>> interface.  I guess it could simply have the following interface:
>>>
>>> static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate,
>>>                                        HeapTuple tuple, bool is_update);
>>>
>>> And figure out, based on the value of is_update, which map to use and
>>> which slot to set *p_new_slot to (what is now "new_slot" argument).
>>> You're getting mtstate here anyway, which contains all the information you
>>> need here.  It seems better to make that (selecting which map and which
>>> slot) part of the function's implementation if we're having this function
>>> at all, imho.  Maybe I'm missing some details there, but my point still
>>> remains that we should try to put more logic in that function instead of
>>> it just do the mechanical tuple conversion.
>>
>> I tried to see how the interface would look if we do that way. Here is
>> how the code looks :
>>
>> static TupleTableSlot *
>> ConvertPartitionTupleSlot(ModifyTableState *mtstate,
>>                     bool for_update_tuple_routing,
>>                     int map_index,
>>                     HeapTuple *tuple,
>>                     TupleTableSlot *slot)
>> {
>>    TupleConversionMap   *map;
>>    TupleTableSlot *new_slot;
>>
>>    if (for_update_tuple_routing)
>>    {
>>       map = mtstate->mt_persubplan_childparent_maps[map_index];
>>       new_slot = mtstate->mt_rootpartition_tuple_slot;
>>    }
>>    else
>>    {
>>       map = mtstate->mt_perleaf_parentchild_maps[map_index];
>>       new_slot = mtstate->mt_partition_tuple_slot;
>>    }
>>
>>    if (!map)
>>       return slot;
>>
>>    *tuple = do_convert_tuple(*tuple, map);
>>
>>    /*
>>     * Change the partition tuple slot descriptor, as per converted tuple.
>>     */
>>    ExecSetSlotDescriptor(new_slot, map->outdesc);
>>    ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);
>>
>>    return new_slot;
>> }
>>
>> It looks like the interface does not much simplify, and above that, we
>> have more number of lines in that function. Also, the caller anyway
>> has to be aware whether the map_index is the index into the leaf
>> partitions or the update subplans. So it is not like the caller does
>> not have to be aware about whether the mapping should be
>> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.
>
> Hmm, I think we should try to make it so that the caller doesn't have to
> be aware of that.  And by caller I guess you mean ExecInsert(), which
> should not be a place, IMHO, where to try to introduce a lot of new logic
> specific to update tuple routing.

I think that, since we have already given ExecInsert() the job of
routing the tuple from the root partitioned table to a partition, it
makes sense to give the function the additional job of routing the
tuple from any partition to any partition. ExecInsert() can be looked
at as doing this job : "insert a tuple into the right partition; the
original tuple can belong to any partition"

> With that, now there are no persubplan and perleaf arrays for ExecInsert()
> to pick from to select a map to pass to ConvertPartitionTupleSlot(), or
> maybe even no need for the separate function.  The tuple-routing code
> block in ExecInsert would look like below (writing resultRelInfo as just Rel):
>
>   rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel
>
>   if (rootRel != Rel)    /* update tuple-routing active */
>   {
>       int  subplan_off = Rel - mtstate->Rel[0];
>       int  leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off];
>
>       if (mt_transition_tupconv_maps[leaf_off])
>       {
>          /*
>           * Convert to root format using
>           * mt_transition_tupconv_maps[leaf_off]
>           */
>
>           slot = mt_root_tuple_slot;  /* for tuple-routing */
>
>           /* Store the converted tuple into slot */
>       }
>   }
>
>   /* Existing tuple-routing flow follows */
>   new_leaf = ExecFindPartition(rootRel, slot, ...)
>
>   if (mtstate->transition_capture)
>   {
>      transition_capture_map = mt_transition_tupconv_maps[new_leaf]
>   }
>
>   if (mt_partition_tupconv_maps[new_leaf])
>   {
>      /*
>       * Convert to leaf format using mt_partition_tupconv_maps[new_leaf]
>       */
>
>      slot = mt_partition_tuple_slot;
>
>      /* Store the converted tuple into slot */
>   }
>

After doing the changes for the int[] array map in the previous patch
version, I still feel that ConvertPartitionTupleSlot() should be
retained. It saves some repeated lines of code.

>> On HEAD, the "parent Plan" refers to
>> mtstate->mt_plans[0]. Now in the patch, for the parent node in
>> ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
>> parent plan refers to this mtstate node.
>
> Hmm, I'm not really sure if doing that (passing mtstate->ps) would be
> accurate.  In the update tuple routing case, it seems that it's better to
> pass the correct parent PlanState pointer to ExecInitQual(), that is, one
> corresponding to the partition's sub-plan.  At least I get that feeling by
> looking at how parent is used downstream to that ExecInitQual() call, but
> there *may* not be anything to worry about there after all.  I'm unsure.
>
>> BTW, the reason I had changed the parent node to mtstate->ps is :
>> Other places in that code use mtstate->ps while initializing
>> expressions :
>>
>> /*
>> * Build a projection for each result rel.
>> */
>>    resultRelInfo->ri_projectReturning =
>>       ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
>>                               resultRelInfo->ri_RelationDesc->rd_att);
>>
>> ...........
>>
>> /* build DO UPDATE WHERE clause expression */
>> if (node->onConflictWhere)
>> {
>>    ExprState  *qualexpr;
>>
>>    qualexpr = ExecInitQual((List *) node->onConflictWhere,
>>     &mtstate->ps);
>> ....
>> }
>>
>> I think wherever we initialize expressions belonging to a plan, we
>> should use that plan as the parent. WithCheckOptions are fields of
>> ModifyTableState.
>
> You may be right, but I see for WithCheckOptions initialization
> specifically that the non-tuple-routing code passes the actual sub-plan
> when initializing the WCO for a given result rel.

Yes that's true. The problem with WithCheckOptions for newly allocated
partition result rels is : we can't use a subplan for the parent
parameter because there is no subplan for it. But I will still think
on it a bit more (TODO).

>
>>> Comments on the optimizer changes:
>>>
>>> +get_all_partition_cols(List *rtables,
>>>
>>> Did you mean rtable?
>>
>> I did mean rtables. It's a list of rtables.
>
> It's not, AFAIK.  rtable (range table) is a list of range table entries,
> which is also what seems to get passed to get_all_partition_cols for that
> argument (root->parse->rtable, which is not a list of lists).
>
> Moreover, there are no existing instances of this naming within the
> planner other than those that this patch introduces:
>
> $ grep rtables src/backend/optimizer/
> planner.c:114: static void get_all_partition_cols(List *rtables,
> planner.c:1063: get_all_partition_cols(List *rtables,
> planner.c:1069: Oid     root_relid = getrelid(root_rti, rtables);
> planner.c:1078: Oid                     relid = getrelid(rti, rtables);
>
> OTOH, dependency.c does have rtables, but it's actually a list of range
> tables.  For example:
>
> dependency.c:1360:      context.rtables = list_make1(rtable);

Yes, Ok. To be consistent with the naming convention used elsewhere, I
have changed it to rtable.

>
>>> +       if (partattno != 0)
>>> +           child_keycols =
>>> +               bms_add_member(child_keycols,
>>> +                              partattno -
>>> FirstLowInvalidHeapAttributeNumber);
>>> +   }
>>> +   foreach(lc, partexprs)
>>> +   {
>>>
>>> Elsewhere (in quite a few places), we don't iterate over partexprs
>>> separately like this, although I'm not saying it is bad, just different
>>> from other places.
>>
>> I think you are suggesting we do it like how it's done in
>> is_partition_attr(). Can you please let me know other places we do
>> this same way ? I couldn't find.
>
> OK, not as many as I thought there would be, but there are following
> beside is_partition_attrs():
>
> partition.c: get_range_nulltest()
> partition.c: get_qual_for_range()
> relcache.c: RelationBuildPartitionKey()
>

Ok, I think I will first address Robert's suggestion of re-using
is_partition_attrs() for pull_child_partition_columns(). If I do that,
this discussion won't be applicable, so I am deferring this one.
(TODO)

=============


Below are my responses to David's comments up to point #26:


On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
>  On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> [ update-partition-key_v23.patch ]
>
> Hi Amit,
>
> Thanks for working on this. I'm looking forward to seeing this go in.
>
> So... I've signed myself up to review the patch, and I've just had a
> look at it, (after first reading this entire email thread!).

Thanks a lot for your extensive review.

>
> Overall the patch looks like it's in quite a good shape.

Nice to hear that.

> I think I do agree with Robert about the UPDATE anomaly that's been discussed.
> I don't think we're painting ourselves into any corner by not having
> this working correctly right away. Anyone who's using some trigger
> workarounds for the current lack of support for updating the partition
> key is already going to have the same issues, so at least this will
> save them some troubles implementing triggers and give them much
> better performance.

I believe you are referring to the concurrency anomaly. Yes, I agree on
that. By the way (you may already be aware), there is a separate mail
thread going on to address this anomaly, so that we don't silently
proceed with the UPDATE without an error:

https://www.postgresql.org/message-id/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw%40mail.gmail.com

> 1. Closing command tags in docs should not be abbreviated
>
>     triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
>
> This changed in c29c5789. I think Peter will be happy if you don't
> abbreviate the closing tags.

Added the tag. I had done most of the corrections after I rebased over
this commit, but I think I missed some of those with the <literal> tag.

>
> 2. "about to do" would read better as "about to perform"
>
>  concurrent session, and it is about to do an <command>UPDATE</command>
>
> I think this paragraph could be more clear if we identified the
> sessions with a number.
>
> Perhaps:
>        Suppose, session 1 is performing an <command>UPDATE</command> on a
>        partition key, meanwhile, session 2 tries to perform an <command>UPDATE
>        </command> or <command>DELETE</command> operation on the same row.
>        Session 2 can silently miss the row due to session 1's activity.  In
>        such a case, session 2's <command>UPDATE</command>/<command>DELETE
>        </command>, being unaware of the row's movement, interprets this that the
>        row has just been deleted, so there is nothing to be done for this row.
>        Whereas, in the usual case where the table is not partitioned, or where
>        there is no row movement, the second session would have identified the
>        newly updated row and carried <command>UPDATE</command>/<command>DELETE
>        </command> on this new row version.

Done like above, with slight changes.

>
>
> 3. Integer width. get_partition_natts returns int but we assign to int16.
>
> int16 partnatts = get_partition_natts(key);
>
> Confusingly get_partition_col_attnum() returns int16 instead of AttrNumber
> but that's existingly not correct.
>
> 4. The following code could just pull_varattnos(partexprs, 1, &child_keycols);
>
> foreach(lc, partexprs)
> {
> Node    *expr = (Node *) lfirst(lc);
>
> pull_varattnos(expr, 1, &child_keycols);
> }

I will defer this till I address Robert's request to try and see if we
can have a common code for pull_child_partition_columns() and
is_partition_attr(). (TODO)

>
> 5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
> do something
> special when the DELETE/INSERT is a partition move? I have audit
> tables in mind here
> it may appear as though a user performed a DELETE when they actually
> performed an UPDATE
> giving visibility of this to the trigger function will allow the
> application to work around this.

I feel it's too early to add a user-visible variable for such a
purpose. Currently we don't support triggers on partitioned tables, so
a user who wants a common trigger for a partition subtree has no choice
but to install the same trigger on all the leaf partitions under it.
So we have to live with the not-very-obvious behaviour of firing
triggers even for the delete/insert part of the update row movement.

>
> 6. change "row" to "a row" and "old" to "the old"
>
> * depending on whether the event is for row being deleted from old
>
> But to be honest, I'm having trouble parsing the comment. I think it
> would be better to
> say explicitly when the row will be NULL rather than "depending on
> whether the event"

I have put it this way now :

 * For INSERT events newtup should be non-NULL, for DELETE events
 * oldtup should be non-NULL, whereas for UPDATE events normally both
 * oldtup and newtup are non-NULL.  But for an UPDATE event fired for
 * capturing transition tuples during UPDATE partition-key row
 * movement, oldtup is NULL when the event is for the row being
 * inserted, whereas newtup is NULL when the event is for the row
 * being deleted.

>
> 7. I'm confused with how this change came about. If the old comment
> was correct here then the comment you're referring to here should
> remain in ExecPartitionCheck(), but you're saying it's in
> ExecConstraints().
>
> /* See the comments in ExecConstraints. */
>
> If the comment really is in ExecConstraints(), then you might want to
> give an overview of what you mean, then reference ExecConstraints() if
> more details are required.

I have put it this way :
 * Need to first convert the tuple to the root partitioned table's row
 * type. For details, check similar comments in ExecConstraints().

Basically, the comment to be referred to in ExecConstraints() is this:
 * If the tuple has been routed, it's been converted to the
 * partition's rowtype, which might differ from the root
 * table's.  We must convert it back to the root table's
 * rowtype so that val_desc shown error message matches the
 * input tuple.

>
> 8. I'm having trouble parsing this comment:
>
>  * 'update_rri' has the UPDATE per-subplan result rels.
>
> I think "has" should be "contains" ?

Ok, changed it to 'contains'.

>
> 9. Also, this should likely be reworded:
>
>  * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
>  *      this is 0.
>
>  'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

Done.

>
> 10. There should be no space before the '?'
>
> /* Is this leaf partition present in the update resultrel ? */

Done.

>
> 11. I'm struggling to understand this comment:
>
> * This is required when converting tuple as per root
> * partition tuple descriptor.
>
> "tuple" should probably be "the tuple", but not quite sure what you
> mean by "as per root".
>
> I may have misunderstood, but maybe it should read:
>
> * This is required when we convert the partition's tuple to
> * be compatible with the partitioned table's tuple descriptor.

ri_PartitionRoot is set to NULL while creating the result rels for
each of the UPDATE subplans; it needs to be set to the root table for
the leaf partitions created for tuple routing, so that the error
message displays the row using the root tuple descriptor. Because we
re-use the same result rels in the per-partition array, we need to set
it for them here.

I have reworded the comment this way :

* This is required when we convert the partition's tuple to be
* compatible with the root partitioned table's tuple
* descriptor.  When generating the per-subplan UPDATE result
* rels, this was not set.

Let me know if this is clear enough.

>
> 12. I think "as well" would be better written as "either".
>
> * If we didn't open the partition rel, it means we haven't
> * initialized the result rel as well.

Done.

>
> 13. I'm unsure what is meant by the following comment:
>
> * Verify result relation is a valid target for insert operation. Even
> * for updates, we are doing this for tuple-routing, so again, we need
> * to check the validity for insert operation.
>
> I'm not quite sure where UPDATE comes in here as we're only checking for INSERT?

Here, "Even for update" means "Even when
ExecSetupPartitionTupleRouting() is called for an UPDATE operation".

>
> 14. Use of underscores instead of camelCase.
>
> COPY_SCALAR_FIELD(part_cols_updated);

>
> I know you're not the first one to break this as "partitioned_rels"
> does not follow it either, but that's probably not a good enough
> reason to break away from camelCase any further.
>
> I'd suggest "partColsUpdated". But after a re-think, maybe cols is
> incorrect. All columns are partitioned, it's the key columns that we
> care about, so how about "partKeyUpdate"

Sure. I have used partKeyUpdated as against partKeyUpdate.

>
> 15. Are you sure that you mean "root" here?
>
>  * All the child partition attribute numbers are converted to the root
>  * partitioned table.
>
> Surely this is just the target relation. "parent" maybe? A
> sub-partitioned table might be the target of an UPDATE too.

Here the root means the root of the partition subtree, which is also
the UPDATE target relation. I think in other places we call it the
root even though it may also have ancestors. It is the root of the
subtree in question. This is similar to how we have named the
ModifyTableState->rootResultRelInfo field.

Note that Robert has requested that we collect the partition cols at
some other place where we already have the table open. So this
function itself may change.

>
> 15. I see get_all_partition_cols() is just used once to check if
> parent_rte->updatedCols contains and partition keys.
>
> Would it not be better to reform that function and pass
> parent_rte->updatedCols in and abort as soon as you see a single
> match?
>
> Maybe the function could return bool and be named
> partitioned_key_overlaps(), that way your assignment in
> inheritance_planner() would just become:
>
> part_cols_updated = partitioned_key_overlaps(root->parse->rtable,
> top_parentRTindex, partitioned_rels, parent_rte->updatedCols);
>
> or something like that anyway.

I am going to think on all of this when I start checking if we can
have some common code for pull_child_partition_columns() and
is_partition_attr(). (TODO)

One thing to note: usually the user is not going to modify the
partition columns, so typically we would have to scan through all the
partitioned tables to check whether the partition key is modified. To
make this scan more efficient, we avoid doing a bitmap-overlap
operation for each partitioned table separately; instead, we first
collect the key columns from all partitioned tables, and then do a
single overlap operation. This makes normal updates a tiny bit faster,
at the expense of slightly slower partition-key updates, because we
don't abort the scan as soon as we find that the partition key is
updated.
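
To make this concrete, below is a minimal sketch of that idea (the
function name and the surrounding plumbing are my own invention, not
the patch's code, and it glosses over translating child attribute
numbers back to the root's, which is discussed above):

static bool
partition_key_overlaps_updated_cols(List *rtable, List *partitioned_rels,
                                    Bitmapset *updated_cols)
{
    Bitmapset  *all_key_cols = NULL;
    ListCell   *lc;

    foreach(lc, partitioned_rels)
    {
        Index        rti = lfirst_int(lc);
        Oid          relid = getrelid(rti, rtable);
        /* The caller is assumed to have locked all partitioned tables. */
        Relation     rel = heap_open(relid, NoLock);
        PartitionKey key = RelationGetPartitionKey(rel);
        int          i;

        /* Plain column references used in the partition key */
        for (i = 0; i < get_partition_natts(key); i++)
        {
            AttrNumber  attno = get_partition_col_attnum(key, i);

            if (attno != 0)
                all_key_cols = bms_add_member(all_key_cols,
                                attno - FirstLowInvalidHeapAttributeNumber);
        }

        /* Columns referenced inside partition key expressions */
        pull_varattnos((Node *) key->partexprs, 1, &all_key_cols);

        heap_close(rel, NoLock);
    }

    /* A single overlap test, instead of one per partitioned table */
    return bms_overlap(all_key_cols, updated_cols);
}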

>
> 16. Typo in comment
>
>  * 'part_cols_updated' if any partitioning columns are being updated, either
>  * from the named relation or a descendent partitione table.
>
> "partitione" should be "partitioned". Also, normally for bool
> parameters, we might word things like "True if ..." rather than just "if"
>
> You probably should follow camelCase I mentioned in 14 here too.

Done. Similar to the other bool param canSetTag, made it :
"'partColsUpdated' is true if any ..."

>
> 17. Comment needs a few changes:
>
>  * ConvertPartitionTupleSlot -- convenience function for converting tuple and
>  * storing it into a tuple slot provided through 'new_slot', which typically
>  * should be one of the dedicated partition tuple slot. Passes the partition
>  * tuple slot back into output param p_old_slot. If no mapping present, keeps
>  * p_old_slot unchanged.
>  *
>  * Returns the converted tuple.
>
> There are a few typos here. For example, "tuple" should be "a tuple",
> but maybe the comment should just be worded like:
>
>  * ConvertPartitionTupleSlot -- convenience function for tuple conversion
>  * using 'map'. The tuple, if converted, is stored in 'new_slot' and
>  * 'p_old_slot' is set to the original partition tuple slot. If map is NULL,
>  * then the original tuple is returned unmodified, otherwise the converted
>  * tuple is returned.

Modified, with some changes. The p_old_slot name is a bit confusing, so
I have renamed it to p_my_slot.
Here is how it looks now :

 * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
 * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
 * updated with the 'new_slot'. 'new_slot' typically should be one of the
 * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
 *
 * Returns the converted tuple, unless map is NULL, in which case original
 * tuple is returned unmodified.

>
> 18. Line goes over 80 chars.
>
> TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
>
> Better just to split the declaration and assignment.

Done.

>
> 19. Confusing comment:
>
> /*
> * If the original operation is UPDATE, the root partitioned table
> * needs to be fetched from mtstate->rootResultRelInfo.
> */
>
> It's not that clear here how you determine this is an UPDATE of a
> partitioned key.
>
> 20. This code looks convoluted:
>
> rootResultRelInfo = (mtstate->rootResultRelInfo ?
> mtstate->rootResultRelInfo : resultRelInfo);
>
> /*
> * If the resultRelInfo is not the root partitioned table (which
> * happens for UPDATE), we should convert the tuple into root's tuple
> * descriptor, since ExecFindPartition() starts the search from root.
> * The tuple conversion map list is in the order of
> * mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
> * we need to know the position of the resultRel in
> * mtstate->resultRelInfo[].
> */
> if (rootResultRelInfo != resultRelInfo)
> {
>
> rootResultRelInfo is assigned via a ternary expression which makes the
> subsequent if test seem a little strange.
>
> Would it not be better to test:
>
> if (mtstate->rootResultRelInfo)
> {
> rootResultRelInfo = mtstate->rootResultRelInfo
> ... other stuff ...
> }
> else
> rootResultRelInfo = resultRelInfo;
>
> Then above the if test you can explain that rootResultRelInfo is only
> set during UPDATE of partition keys, as per #19.

Giving more thought on this, I think to avoid confusion to the reader,
we better have an explicit (operation == CMD_UPDATE) condition, and in
that block, assert that mtstate->rootResultRelInfo is non-NULL. I have
accordingly shuffled the if conditions. I think this is simple and
clear. Please check.

>
> 21. How come you renamed mt_partition_tupconv_maps[] to
> mt_parentchild_tupconv_maps[]?

mt_transition_tupconv_maps must be renamed to a more general map name
because it is not only used for transition capture but also for update
tuple routing. And we have mt_partition_tupconv_maps which is already
a general name. So to distinguish between the two tupconv maps, I
prepended "parent-child" or "child-parent" to "tupconv_maps".

>
> 22. Comment in ExecInsert() could be worded better.
>
> /*
> * In case this is part of update tuple routing, put this row into the
> * transition NEW TABLE if we are capturing transition tables. We need to
> * do this separately for DELETE and INSERT because they happen on
> * different tables.
> */
>
> /*
> * This INSERT may be the result of a partition-key-UPDATE. If so,
> * and we're required to capture transition tables then we'd better
> * record this as a statement level UPDATE on the target relation.
> * We're not interested in the statement level DELETE or INSERT as
> * these occur on the individual partitions, none of which are the
> * target of this the UPDATE statement.
> */
>
> A similar comment could use a similar improvement in ExecDelete()

I want to emphasize the fact that we need to handle the OLD and NEW
rows separately for DELETE and INSERT. Also, I think we need not
mention statement triggers, even though transition table capture with
partitions is currently supported only for statement triggers. We
should only worry about capturing the row if
mtstate->mt_transition_capture != NULL, without having to know whether
it is for a statement trigger or not.

Below is how the comment looks now after I did some changes as per
your suggestion about wording :

 * If this INSERT is part of a partition-key-UPDATE and we are capturing
 * transition tables, put this row into the transition NEW TABLE.
 * (Similarly we need to add the deleted row in OLD TABLE).  We need to do
 * this separately for DELETE and INSERT because they happen on different
 * tables.

>
> 23. Line is longer than 80 chars.
>
> TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

Done.

>
> 24. I know from reading the thread this name has changed before, but I
> think delete_skipped seems like the wrong name for this variable in:
>
> if (delete_skipped)
> *delete_skipped = true;
>
> Skipped is the wrong word here as that indicates like we had some sort
> of choice and that we decided not to. However, that's not the case
> when the tuple was concurrently deleted. Would it not be better to
> call it "tuple_deleted" or even "success" and reverse the logic? It's
> just a bit confusing that you're setting this to skipped before
> anything happens. It would be nicer if there was a better way to do
> this whole thing as it's a bit of a wart in the code. I understand why
> the code exists though.

I think "success" sounds like : if it is false, ExecDelete has failed.
So I have chosen "tuple_deleted". "tuple_actually_deleted" might sound
still better, but it is too long.

> Also, I wonder if it's better to always pass a boolean here to save
> having to test for NULL before setting it, that way you might consider
> putting the success = false just before the return NULL, then do
> success = true after the tuple is gone.
> Failing that, putting: something like: success = false; /* not yet! */
> where you're doing the if (deleted_skipped) test is might also be
> better.

I didn't really understand this.

>
> 25. Comment "we should" should be "we must".
>
> /*
> * For some reason if DELETE didn't happen (for e.g. trigger
> * prevented it, or it was already deleted by self, or it was
> * concurrently deleted by another transaction), then we should
> * skip INSERT as well, otherwise, there will be effectively one
> * new row inserted.
>
> Maybe just:
> /* If the DELETE operation was unsuccessful, then we must not
> * perform the INSERT into the new partition.

I think we'd better mention some scenarios of why this can happen;
otherwise it's confusing to the reader why the delete might not happen,
or why we shouldn't error out in that case.
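
For context, the row-movement branch in ExecUpdate() works roughly as
below (a simplified sketch only; argument lists are abbreviated and are
not the patch's exact signatures -- the point is the 'tuple_deleted'
handshake):

    bool        tuple_deleted;

    /*
     * Row movement: first delete the row from the old partition.
     * ExecDelete() reports through 'tuple_deleted' whether the row was
     * really removed.
     */
    ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
               &tuple_deleted);

    /*
     * If the DELETE did not actually happen -- for example, a BEFORE
     * DELETE trigger suppressed it, or the row was already deleted by
     * this command or concurrently by another transaction -- skip the
     * INSERT as well, otherwise we would effectively add one new row.
     */
    if (!tuple_deleted)
        return NULL;

    /* Route the new version of the row into the right partition. */
    slot = ExecInsert(mtstate, slot, planSlot, estate, canSetTag);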

>
> "for e.g." is not really correct in English. "For example, ..." or
> just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've
> written "For exempli gratia", which translates to "For for example".

I see. Good to know that. Done.

>
> 26. You're not really explaining what's going on here:
>
> if (mtstate->mt_transition_capture)
> saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> You have a comment later to say you're about to "Revert back to the
> transition capture map", but I missed the part that explained about
> modifying it in the first place.

I have now added the main comment where the map is saved, and I refer
to that comment where the map is reverted.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
The following contains replies to David's remaining comments, i.e.
from #27 onwards, followed by replies to Alvaro's review comments.

Attached is the revised patch v25.

=====================

On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> 27. Comment does not explain how we're skipping checking the partition
> constraint check in:
>
> * We have already checked partition constraints above, so skip
> * checking them here.
>
> Maybe something like:
>
> * We've already checked the partition constraint above, however, we
> * must still ensure the tuple passes all other constraints, so we'll
> * call ExecConstraints() and have it validate all remaining checks.

Done.

>
> 28. For table WITH OIDs, the OID should probably follow the new tuple
> for partition-key-UPDATEs.
>

I understand that as far as possible we want to simulate the UPDATE as
if it were a normal table update. But for system columns, I think we
should avoid that and instead let the system handle it the way it
already does (i.e. the new row in a table should always get a new OID).

> 29. ExecSetupChildParentMap gets called here for non-partitioned relations.
> Maybe that's not the best function name? The function only seems to do
> that when perleaf is True.

I didn't clearly understand this, particularly which task you were
referring to when you said "the function only seems to do that". The
function does set up the child-parent map even when perleaf=false. The
function name is chosen that way because the map is always a
child-to-root map; the map array elements may be arranged either in the
order of the per-partition array 'mtstate->mt_partitions[]', or in the
order of the per-subplan result rels 'mtstate->resultRelInfo[]'.

>
> Is a leaf a partition of a partitioned table? It's not that clear the
> meaning here.

A leaf partition is a partition (at any level of the tree) that is not
itself a partitioned table.

I have added more comments for the function ExecSetupChildParentMap()
(both, at the function header and inside). Please check and let me
know if you still have questions.
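
To illustrate the two orderings, a lookup of the child-to-root map for
a given subplan could look roughly like this (only my sketch based on
the field names mentioned in this thread, not necessarily the patch's
exact code):

static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
    if (mtstate->mt_is_tupconv_perpart)
    {
        /*
         * Maps are stored in per-leaf-partition order; translate the
         * subplan index into the corresponding leaf-partition index.
         */
        int     leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];

        return mtstate->mt_childparent_tupconv_maps[leaf_index];
    }

    /* Maps are stored in per-subplan (resultRelInfo[]) order. */
    return mtstate->mt_childparent_tupconv_maps[whichplan];
}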

>
> 30. The following chunk of code is giving me a headache trying to
> verify which arrays are which size:
>
> ExecSetupPartitionTupleRouting(rel,
>    mtstate->resultRelInfo,
>    (operation == CMD_UPDATE ? nplans : 0),
>    node->nominalRelation,
>    estate,
>    &partition_dispatch_info,
>    &partitions,
>    &partition_tupconv_maps,
>    &subplan_leaf_map,
>    &partition_tuple_slot,
>    &num_parted, &num_partitions);
> mtstate->mt_partition_dispatch_info = partition_dispatch_info;
> mtstate->mt_num_dispatch = num_parted;
> mtstate->mt_partitions = partitions;
> mtstate->mt_num_partitions = num_partitions;
> mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
> mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
> mtstate->mt_partition_tuple_slot = partition_tuple_slot;
> mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
>
> I know this patch is not completely responsible for it, but you're not
> making things any better.
>
> Would it not be better to invent some PartitionTupleRouting struct and
> make that struct a member of ModifyTableState and CopyState, then just
> pass the pointer to that struct to ExecSetupPartitionTupleRouting()
> and have it fill in the required details? I think the complexity of
> this is already on the high end, I think you really need to do the
> refactor before this gets any worse.
>

Ok. I am currently working on doing this change. So not yet included
in the attached patch. Will send yet another revised patch for this
change. (TODO)

>
> 31. The following code seems incorrect:
>
> /*
> * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
> * need to do update tuple routing.
> */
> if (resultRelInfo->ri_TrigDesc &&
> resultRelInfo->ri_TrigDesc->trig_update_before_row &&
> operation == CMD_UPDATE)
> update_tuple_routing_needed = true;
>
> Shouldn't this be setting update_tuple_routing_needed to false if
> there are no before row update triggers? Otherwise, you're setting it
> to true regardless of if there are any partition key columns being
> UPDATEd. That would make the work you're doing in
> inheritance_planner() to set part_cols_updated a waste of time.

The point of setting it to true regardless of whether the partition
key is updated is: even if the partition key is not explicitly modified
by the UPDATE, a before-row trigger can still modify it, and we can
never know in advance whether it actually will. So if there are BR
UPDATE triggers on the result rels of any of the subplans, we *always*
set up the tuple routing. This approach was agreed upon in the earlier
discussions about trigger handling.

>
> Also, this bit of code is a bit confused.
>
> /* Decide whether we need to perform update tuple routing. */
> if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
> update_tuple_routing_needed = false;
>
> /*
> * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
> * partition key.
> */
> if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
> (operation == CMD_INSERT || update_tuple_routing_needed))
>
>
> The first if test would not be required if you fixed the code where
> you set update_tuple_routing_needed = true regardless if its a
> partitioned table or not.

At the place where I set update_tuple_routing_needed to true
unconditionally, we don't have the relation open, so we don't know
whether it is a partitioned table. Hence, we set it anyway, and then
revert it to false if it turns out not to be a partitioned table after
all.

>
> So basically, you need to take the node->part_cols_updated from the
> planner, if that's true then perform your test for before row update
> triggers, set a bool to false if there are none, then proceed to setup
> the partition tuple routing for partition table inserts or if your
> bool is still true. Right?

I think if we look at "update_tuple_routing_needed" as meaning that
update tuple routing *may be* required, then the logic as-is makes
sense: set the variable if we see that we may need to do update
routing. The conditions for that are: either node->partKeyUpdated is
true, or there is a BR UPDATE trigger and the operation is UPDATE. So
we set the variable under those conditions, and revert it to false
later if the target turns out not to be a partitioned table.

So I have retained the existing logic in the patch, but with some
additional comments to make this logic clear to the reader.
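
Putting those conditions together, the intended decision sequence is
roughly the following (simplified; 'node->partKeyUpdated' is the flag
computed by the planner, as discussed earlier):

    /* Routing *may* be needed if the planner saw a partition key change. */
    update_tuple_routing_needed = node->partKeyUpdated;

    /*
     * While initializing each subplan's result rel: a BEFORE ROW UPDATE
     * trigger can change the partition key even when the statement itself
     * does not, so assume routing may be needed.
     */
    if (operation == CMD_UPDATE &&
        resultRelInfo->ri_TrigDesc &&
        resultRelInfo->ri_TrigDesc->trig_update_before_row)
        update_tuple_routing_needed = true;

    /*
     * Only now is the target relation open; a non-partitioned target
     * never needs routing.
     */
    if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
        update_tuple_routing_needed = false;

    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
        (operation == CMD_INSERT || update_tuple_routing_needed))
    {
        /* ... set up partition tuple routing ... */
    }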

>
> 32. "WCO" abbreviation is not that common and might need to be expanded.
>
> * Below are required as reference objects for mapping partition
> * attno's in expressions such as WCO and RETURNING.
>
> Searching for other comments which mention "WCO" they're all around
> places that is easy to understand they mean "With Check Option", e.g.
> next to a variable with a more descriptive name. That's not the case
> here.

Ok. Changed WCO to WithCheckOptions.

>
> 33. "are anyway newly allocated", should "anyway" be "always"?
> Otherwise, it does not make sense.
>

OK. Changed this :
* because all leaf partition result rels are anyway newly allocated.
to this (also removed 'all') :
* because leaf partition result rels are always newly allocated.

>
> 34. Comment added which mentions a member that does not exist.
>
>  * all_part_cols contains all attribute numbers from the parent that are
>  * used as partitioning columns by the parent or some descendent which is
>  * itself partitioned.
>  *

Oops. Leftovers from an earlier patch. Removed the comment.


=====================

Below are Alvaro's review comments :


On 14 November 2017 at 22:22, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> David Rowley wrote:
>
>> 5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
>> do something special when the DELETE/INSERT is a partition move? I
>> have audit tables in mind here it may appear as though a user
>> performed a DELETE when they actually performed an UPDATE giving
>> visibility of this to the trigger function will allow the application
>> to work around this.
>
> +1  I think we do need a flag that can be inspected from the user
> trigger function.

What I feel is: it's too early to make such changes. I think we should
first get the core patch in, and then consider this request and any
further enhancements.

>
>> 9. Also, this should likely be reworded:
>>
>>  * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
>>  *      this is 0.
>>
>>  'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.
>
> Also:
>
> /pgsql/source/master/src/backend/executor/execMain.c: In function 'ExecSetupPartitionTupleRouting':
> /pgsql/source/master/src/backend/executor/execMain.c:3401:18: warning:
> 'leaf_part_arr' may be used uninitialized in this function
> [-Wmaybe-uninitialized]
>     leaf_part_rri = leaf_part_arr + i;
>     ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
>

Right. I have now initialized leaf_part_arr to NULL at its declaration.
Actually, leaf_part_arr is used only for inserts, but we should add
this initialization for the compiler's sake.


> I think using num_update_rri==0 as a flag to indicate INSERT is strange.
> I suggest passing an additional boolean --

I think adding another param looks redundant. To make the condition
more readable, I have introduced a new local variable :
bool is_update = (num_update_rri > 0);

> or maybe just split the whole
> function in two, one for updates and another for inserts, say
> ExecSetupPartitionTupleRoutingForInsert() and
> ExecSetupPartitionTupleRoutingForUpdate().  They seem to
> share almost no code, and the current flow is hard to read; maybe just
> add a common subroutine for the lower bottom of the loop.

So there are two common code sections. One is the initial code that
initializes various arrays and output params. The other is the latter
half of the for-loop block, which includes calls to heap_open(),
InitResultRelInfo(), convert_tuples_by_name(), CheckValidResultRel()
and others. So it does look like there is a lot of common code. We
would need two functions, one for the initialization code and the other
to run the latter half of the loop. Also, heap_open() and
InitResultRelInfo() need to be called only if partrel (which would need
to be passed as a function param) is NULL. Rather than that, I think
this condition is better placed in-line in
ExecSetupPartitionTupleRouting() for clarity. I feel it's not worth
doing the shuffling; we would be extracting the code into two functions
only to avoid the "if num_update_rri" conditions.

That's why I feel having an "is_update" variable serves the purpose.
The hard-to-understand code, I presume, is the update part where it
tries to re-use already-existing result rels, and that part would
remain anyway, although in a separate function.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Thanks Amit.

Looking at the latest v25 patch.

On 2017/11/16 23:50, Amit Khandekar wrote:
> Below has the responses for both Amit's and David's comments, starting
> with Amit's ....
> On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/10/24 0:15, Amit Khandekar wrote:
>>> On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>>
>>>> +           (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
>>>> == NULL))))
>>>>
>>>> Is there some reason why a bitwise operator is used here?
>>>
>>> That exact condition means that the function is called for transition
>>> capture for updated rows being moved to another partition. For this
>>> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
>>> capture that condition there. I think the bitwise operator is more
>>> user-friendly in emphasizing the point that it is indeed an "either a
>>> or b, not both" condition.
>>
>> I see.  In that case, since this patch adds the new condition, a note
>> about it in the comment just above would be good, because the situation
>> you describe here seems to arise only during update-tuple-routing, IIUC.
> 
> Done. Please check.

Looks fine.

>> + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
>> + *      instead of allocating new ones while generating the array of all leaf
>> + *      partition result rels.
>>
>> Instead of:
>>
>> "These are re-used instead of allocating new ones while generating the
>> array of all leaf partition result rels."
>>
>> how about:
>>
>> "There is no need to allocate a new ResultRellInfo entry for leaf
>> partitions for which one already exists in this array"
> 
> Ok. I have made it like this :
> 
> + * 'update_rri' contains the UPDATE per-subplan result rels. For the output
> + *             param 'partitions', we don't allocate new ResultRelInfo
> + *             objects for leaf partitions for which they are already
> + *             available in 'update_rri'.

Sure.

>>> It looks like the interface does not much simplify, and above that, we
>>> have more number of lines in that function. Also, the caller anyway
>>> has to be aware whether the map_index is the index into the leaf
>>> partitions or the update subplans. So it is not like the caller does
>>> not have to be aware about whether the mapping should be
>>> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.
>>
>> Hmm, I think we should try to make it so that the caller doesn't have to
>> be aware of that.  And by caller I guess you mean ExecInsert(), which
>> should not be a place, IMHO, where to try to introduce a lot of new logic
>> specific to update tuple routing.
> 
> I think, for ExecInsert() since we have already given the job of
> routing the tuple from root partitioned table to a partition, it makes
> sense to give the function the additional job of routing the tuple
> from any partition to any partition. ExecInsert() can be looked at as
> doing this job : "insert a tuple into the right partition; the
> original tuple can belong to any partition"

Yeah, that's one way of looking at that.  But I think ExecInsert() as it
is today thinks it's got a *new* tuple and that's it.  I think the newly
introduced code in it to find out that it is not so (that the tuple
actually comes from some other partition), that it's really the
update-turned-into-delete-plus-insert, and then switch to the root
partitioned table's ResultRelInfo, etc. really belongs outside of it.
Maybe in its caller, which is ExecUpdate().  I mean why not add this code
to the block in ExecUpdate() that handles update-row-movement.

Just before calling ExecInsert() to do the re-routing seems like a good
place to do all that.  For example, try the attached incremental patch
that applies on top of yours.  I can see after applying it that diffs to
ExecInsert() are now just some refactoring ones and there are no
significant additions making it look like supporting update-row-movement
required substantial changes to how ExecInsert() itself works.

> After doing the changes for the int[] array map in the previous patch
> version, I still feel that ConvertPartitionTupleSlot() should be
> retained. We save some repeated lines of code saved.

OK.

>> You may be right, but I see for WithCheckOptions initialization
>> specifically that the non-tuple-routing code passes the actual sub-plan
>> when initializing the WCO for a given result rel.
> 
> Yes that's true. The problem with WithCheckOptions for newly allocated
> partition result rels is : we can't use a subplan for the parent
> parameter because there is no subplan for it. But I will still think
> on it a bit more (TODO).

Alright.

>>> I think you are suggesting we do it like how it's done in
>>> is_partition_attr(). Can you please let me know other places we do
>>> this same way ? I couldn't find.
>>
>> OK, not as many as I thought there would be, but there are following
>> beside is_partition_attrs():
>>
>> partition.c: get_range_nulltest()
>> partition.c: get_qual_for_range()
>> relcache.c: RelationBuildPartitionKey()
>>
> 
> Ok, I think I will first address Robert's suggestion of re-using
> is_partition_attrs() for pull_child_partition_columns(). If I do that,
> this discussion won't be applicable, so I am deferring this one.
> (TODO)

Sure, no problem.

Thanks,
Amit

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 21 November 2017 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>
>> 30. The following chunk of code is giving me a headache trying to
>> verify which arrays are which size:
>>
>> ExecSetupPartitionTupleRouting(rel,
>>    mtstate->resultRelInfo,
>>    (operation == CMD_UPDATE ? nplans : 0),
>>    node->nominalRelation,
>>    estate,
>>    &partition_dispatch_info,
>>    &partitions,
>>    &partition_tupconv_maps,
>>    &subplan_leaf_map,
>>    &partition_tuple_slot,
>>    &num_parted, &num_partitions);
>> mtstate->mt_partition_dispatch_info = partition_dispatch_info;
>> mtstate->mt_num_dispatch = num_parted;
>> mtstate->mt_partitions = partitions;
>> mtstate->mt_num_partitions = num_partitions;
>> mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
>> mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
>> mtstate->mt_partition_tuple_slot = partition_tuple_slot;
>> mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
>>
>> I know this patch is not completely responsible for it, but you're not
>> making things any better.
>>
>> Would it not be better to invent some PartitionTupleRouting struct and
>> make that struct a member of ModifyTableState and CopyState, then just
>> pass the pointer to that struct to ExecSetupPartitionTupleRouting()
>> and have it fill in the required details? I think the complexity of
>> this is already on the high end, I think you really need to do the
>> refactor before this gets any worse.
>>
>
> Ok. I am currently working on doing this change. So not yet included
> in the attached patch. Will send yet another revised patch for this
> change.

The attached incremental patch encapsulate_partinfo.patch (to be
applied over the latest v25 patch) moves all the partition-related
information into a new structure, PartitionTupleRouting. Further to
that, I also moved the partition dispatch info into this structure. So
it looks like this:

typedef struct PartitionTupleRouting
{
PartitionDispatch *partition_dispatch_info;
int num_dispatch;
ResultRelInfo **partitions;
int num_partitions;
TupleConversionMap **parentchild_tupconv_maps;
int    *subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
} PartitionTupleRouting;

This structure now encapsulates *all* of the
partition-tuple-routing-related information, so ModifyTableState now
has only one tuple-routing-related field, 'mt_partition_tuple_routing'.
It is changed like this:

@@ -976,24 +976,14 @@ typedef struct ModifyTableState
        TupleTableSlot *mt_existing;    /* slot to store existing target tuple in */
        List       *mt_excludedtlist;   /* the excluded pseudo relation's tlist  */
        TupleTableSlot *mt_conflproj;   /* CONFLICT ... SET ... projection target */
-       struct PartitionDispatchData **mt_partition_dispatch_info;
-       /* Tuple-routing support info */
-       int                     mt_num_dispatch;        /* Number of entries in the above array */
-       int                     mt_num_partitions;      /* Number of members in the following
-                                                        * arrays */
-       ResultRelInfo **mt_partitions;  /* Per partition result relation pointers */
-       TupleTableSlot *mt_partition_tuple_slot;
-       TupleTableSlot *mt_root_tuple_slot;
+       struct PartitionTupleRouting *mt_partition_tuple_routing; /* Tuple-routing support info */
        struct TransitionCaptureState *mt_transition_capture;
        /* controls transition table population for specified operation */
        struct TransitionCaptureState *mt_oc_transition_capture;
        /* controls transition table population for INSERT...ON CONFLICT UPDATE */
-       TupleConversionMap **mt_parentchild_tupconv_maps;
-       /* Per partition map for tuple conversion from root to leaf */
        TupleConversionMap **mt_childparent_tupconv_maps;
        /* Per plan/partition map for tuple conversion from child to root */
        bool            mt_is_tupconv_perpart;  /* Is the above map per-partition ? */
-       int             *mt_subplan_partition_offsets;
        /* Stores position of update result rels in leaf partitions */
 } ModifyTableState;

So the code in nodeModifyTable.c (and similar code in copy.c) is
accordingly adjusted to use mtstate->mt_partition_tuple_routing.

In the places where we used the (mtstate->mt_partition_dispatch_info
!= NULL) condition to run tuple-routing code, I have replaced it with
(mtstate->mt_partition_tuple_routing != NULL).

If you are ok with the incremental patch, I can extract this change
into a separate preparatory patch to be applied on PG master.

Thanks
-Amit Khandekar

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote:
> +       /* The caller must have already locked all the partitioned tables. */
> +       root_rel = heap_open(root_relid, NoLock);
> +       *all_part_cols = NULL;
> +       foreach(lc, partitioned_rels)
> +       {
> +               Index           rti = lfirst_int(lc);
> +               Oid                     relid = getrelid(rti, rtables);
> +               Relation        part_rel = heap_open(relid, NoLock);
> +
> +               pull_child_partition_columns(part_rel, root_rel, all_part_cols);
> +               heap_close(part_rel, NoLock);
>
> I don't like the fact that we're opening and closing the relation here
> just to get information on the partitioning columns.  I think it would
> be better to do this someplace that already has the relation open and
> store the details in the RelOptInfo.  set_relation_partition_info()
> looks like the right spot.

It seems, for UPDATE, baserel RelOptInfos are created only for the
subplan relations, not for the partitioned tables. I verified that
build_simple_rel() does not get called for partitioned tables for
UPDATE.

In earlier versions of the patch, we used to collect the partition
keys while expanding the partition tree, so that we could get them
while the relations were open. After some reviews, I was inclined to
think that the collection logic had better be moved into
inheritance_planner(), because it involved pulling the attributes from
partition key expressions, plus the bitmap operation, which would be
done unnecessarily for SELECTs as well.

On the other hand, if we collect the partition keys separately in
inheritance_planner(), then as you say, we need to open the relations.
Opening a relation that is already in the relcache and already locked
involves only a hash lookup. Do you think this is expensive? I am open
to either approach.

If we collect the partition keys in expand_partitioned_rtentry(), we
need to pass the root relation as well, so that we can convert the
partition key attributes to the root rel's descriptor. The other thing
is, maybe we can check beforehand (in expand_inherited_rtentry)
whether the root RTE's updatedCols is empty, which I think implies that
it's not an UPDATE operation. If so, we can just skip collecting the
partition keys.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
On 2017/11/23 21:57, Amit Khandekar wrote:
> If we collect the partition keys in expand_partitioned_rtentry(), we
> need to pass the root relation also, so that we can convert the
> partition key attributes to root rel descriptor. And the other thing
> is, may be, we can check beforehand (in expand_inherited_rtentry)
> whether the rootrte's updatedCols is empty, which I think implies that
> it's not an UPDATE operation. If yes, we can just skip collecting the
> partition keys.

Yeah, it seems like a good idea after all to check in
expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
and if so check if any of the updatedCols are partition keys.  If we find
some, then it will suffice to just set a simple flag in the
PartitionedChildRelInfo that will be created for the root table.  That
should be done *after* we have visited all the tables in the partition
tree including some that might be partitioned and hence will provide their
partition keys.  The following block in expand_inherited_rtentry() looks
like a good spot:
       if (rte->inh && partitioned_child_rels != NIL)
       {
           PartitionedChildRelInfo *pcinfo;

           pcinfo = makeNode(PartitionedChildRelInfo);

Thanks,
Amit



Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 24 November 2017 at 10:52, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/11/23 21:57, Amit Khandekar wrote:
>> If we collect the partition keys in expand_partitioned_rtentry(), we
>> need to pass the root relation also, so that we can convert the
>> partition key attributes to root rel descriptor. And the other thing
>> is, may be, we can check beforehand (in expand_inherited_rtentry)
>> whether the rootrte's updatedCols is empty, which I think implies that
>> it's not an UPDATE operation. If yes, we can just skip collecting the
>> partition keys.
>
> Yeah, it seems like a good idea after all to check in
> expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
> and if so check if any of the updatedCols are partition keys.  If we find
> some, then it will suffice to just set a simple flag in the
> PartitionedChildRelInfo that will be created for the root table.  That
> should be done *after* we have visited all the tables in the partition
> tree including some that might be partitioned and hence will provide their
> partition keys.  The following block in expand_inherited_rtentry() looks
> like a good spot:
>
>         if (rte->inh && partitioned_child_rels != NIL)
>         {
>             PartitionedChildRelInfo *pcinfo;
>
>             pcinfo = makeNode(PartitionedChildRelInfo);

Yes, I am thinking about something like that. Thanks.

I am also working on your suggestion of moving the
convert-to-root-descriptor logic from ExecInsert() to ExecUpdate().

So, in the upcoming patch version, I am intending to include the above
two, and if possible, Robert's idea of re-using is_partition_attr()
for pull_child_partition_columns().


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Michael Paquier
Date:
On Mon, Nov 27, 2017 at 5:28 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> So, in the upcoming patch version, I am intending to include the above
> two, and if possible, Robert's idea of re-using is_partition_attr()
> for pull_child_partition_columns().

Discussions are still going on, so moved to next CF.
-- 
Michael


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 27 November 2017 at 13:58, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 24 November 2017 at 10:52, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/11/23 21:57, Amit Khandekar wrote:
>>> If we collect the partition keys in expand_partitioned_rtentry(), we
>>> need to pass the root relation also, so that we can convert the
>>> partition key attributes to root rel descriptor. And the other thing
>>> is, may be, we can check beforehand (in expand_inherited_rtentry)
>>> whether the rootrte's updatedCols is empty, which I think implies that
>>> it's not an UPDATE operation. If yes, we can just skip collecting the
>>> partition keys.
>>
>> Yeah, it seems like a good idea after all to check in
>> expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
>> and if so check if any of the updatedCols are partition keys.  If we find
>> some, then it will suffice to just set a simple flag in the
>> PartitionedChildRelInfo that will be created for the root table.  That
>> should be done *after* we have visited all the tables in the partition
>> tree including some that might be partitioned and hence will provide their
>> partition keys.  The following block in expand_inherited_rtentry() looks
>> like a good spot:
>>
>>         if (rte->inh && partitioned_child_rels != NIL)
>>         {
>>             PartitionedChildRelInfo *pcinfo;
>>
>>             pcinfo = makeNode(PartitionedChildRelInfo);
>
> Yes, I am thinking about something like that. Thanks.

In expand_partitioned_rtentry(), rather than collecting the partition
key attributes of all partitioned tables, I am now checking whether
parentrte->updatedCols contains any partition key attributes. If an
earlier parentrte's updatedCols was already found to contain partition
keys, we don't continue checking further.

Also, rather than converting all the partition key attributes to be
compatible with the root's tuple descriptor, it is better to compare
against each partitioned table's own updatedCols while we have its
handle handy. Each parentrte's updatedCols has exactly the same
attributes as the root's, just with the ordering possibly changed, so
it is safe to compare using the updatedCols of the intermediate
partitioned rels rather than those of the root rel. The advantage is
that we now get rid of the conversion mapping from each of the
partitioned tables to the root, which was earlier done in
pull_child_partition_columns() in the previous patches.

PartitionedChildRelInfo now has an is_partition_key_update field. This
is then retrieved using get_partitioned_child_rels().

> I am also working on your suggestion of moving the
> convert-to-root-descriptor logic from ExecInsert() to ExecUpdate().

Done.

>
> So, in the upcoming patch version, I am intending to include the above
> two, and if possible, Robert's idea of re-using is_partition_attr()
> for pull_child_partition_columns().

Done. Now, is_partition_attr() is renamed to has_partition_attrs().
This function now accepts a bitmapset of attnums instead of a single
attnum. Moved this function from tablecmds.c to partition.c. This is
now re-used, and the earlier pull_child_partition_columns() is
removed.
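
For reference, the check this enables could look roughly like the
following (a sketch only; the exact placement inside
expand_partitioned_rtentry() and the flag variable name are my
assumptions, and I am assuming has_partition_attrs() keeps the
used_in_expr output parameter that is_partition_attr() had):

    /*
     * While expanding each partitioned table in the tree: once any
     * partitioned rel's key is known to be updated, no further checks
     * are needed.
     */
    if (!is_partition_key_update &&
        parentrte->updatedCols != NULL &&
        has_partition_attrs(parentrel, parentrte->updatedCols, NULL))
        is_partition_key_update = true;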

Attached is v26, which covers all of the above points. This patch also
contains the incremental changes that were attached in
encapsulate_partinfo.patch in [1]. In the next version, I will extract
them out again and keep them as a separate preparatory patch.

[1] https://www.postgresql.org/message-id/CAJ3gD9f86H64e4OCjFFszWW7f4EeyriSaFL8SvJs2yOUbc8VEw%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 29 November 2017 at 17:25, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Also, this
> patch contains the incremental changes that were attached in the patch
> encapsulate_partinfo.patch attached in [1]. In the next version, I
> will extract them out again and keep them as a separate preparatory
> patch.

As mentioned above, attached is encapsulate_partinfo_preparatory.patch.
This addresses David Rowley's request to move all the partition-related
information into a new structure, PartitionTupleRouting, so that we can
pass ExecSetupPartitionTupleRouting() a pointer to this structure
instead of the many parameters that we currently pass: [1]

The main update-partition-key patch is to be applied over the above
preparatory patch. Attached is its v27 version. This version addresses
Thomas Munro's comments :

On 14 November 2017 at 01:32, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Nov 10, 2017 at 4:42 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Attached is v23 patch that has just the above changes (and also
>> rebased on hash-partitioning changes, like update.sql). I am still
>> doing some sanity testing on this, although regression passes.
>
> The test coverage[1] is 96.62%.  Nice work.  Here are the bits that
> aren't covered:
>
> In partition.c's pull_child_partition_columns(), the following loop is
> never run:
>
> +       foreach(lc, partexprs)
> +       {
> +               Node       *expr = (Node *) lfirst(lc);
> +
> +               pull_varattnos(expr, 1, &child_keycols);
> +       }

In update.sql, part_c_100_200 is now partitioned by range(abs(d)). So
now the new function has_partition_attrs() (in recent patch versions,
this function has replaced pull_child_partition_columns) goes through
the above code segment. This was indeed an important part left
uncovered. Thanks.

>
> In nodeModifyTable.c, the following conditional branches are never run:
>
>                 if (mtstate->mt_oc_transition_capture != NULL)
> +               {
> +                       Assert(mtstate->mt_is_tupconv_perpart == true);
>                         mtstate->mt_oc_transition_capture->tcs_map =
> -
> mtstate->mt_transition_tupconv_maps[leaf_part_index];
> +
> mtstate->mt_childparent_tupconv_maps[leaf_part_index];
> +               }

I think this code segment is never hit even without the patch. ON
CONFLICT is not supported for partitions, and this code segment runs
only for partitions.

>
>
>                                 if (node->mt_oc_transition_capture != NULL)
>                                 {
> -
> Assert(node->mt_transition_tupconv_maps != NULL);
>
> node->mt_oc_transition_capture->tcs_map =
> -
> node->mt_transition_tupconv_maps[node->mt_whichplan];
> +
> tupconv_map_for_subplan(node, node->mt_whichplan);
>                                 }

Here also, I verified that none of the regression tests hits this
segment. The reason might be that this segment runs when an UPDATE moves
on to the next subplan, and mtstate->mt_oc_transition_capture is never
allocated for UPDATEs.


[1] : https://www.postgresql.org/message-id/CAJ3gD9f86H64e4OCjFFszWW7f4EeyriSaFL8SvJs2yOUbc8VEw%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
While addressing Thomas's point about test scenarios not yet covered,
I observed the following ...

Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on
the target table. Now, in ExecUpdate(), the corresponding WCO qual gets
executed *before* the partition constraint check, as per existing
behaviour, and the qual succeeds. Then, because the partition key was
updated, the row is moved to another partition. Here, suppose there is
a BR INSERT trigger on that partition which modifies the row, and the
resultant row actually would *not* pass the UPDATE RLS policy. But for
this partition, since it is an INSERT, only INSERT RLS WCO quals are
executed.

So effectively, from the user's perspective, a row that an RLS WITH CHECK
policy was defined to reject gets updated successfully. This is because
the policy is not checked *after* a row trigger in the new partition is
executed.

Attached is a test case that reproduces this issue.

I think, in case of row-movement, we should defer calling
ExecWithCheckOptions() until the row is being inserted using
ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should
be called using WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK
(I recall Amit Langote was of this opinion) as below :

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate,
  * we are looking for at this point.
  */
  if (resultRelInfo->ri_WithCheckOptions != NIL)
-     ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
+        ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ?
+                             WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK),
                              resultRelInfo, slot, estate);


It can be argued that since, in the case of triggers, we always execute
INSERT row triggers for rows inserted as part of update-row-movement,
we should be consistent and execute INSERT WCOs and not UPDATE WCOs
for such rows. But note that the row triggers we execute are defined
on the leaf partitions, whereas the RLS policies being executed are
defined for the target partitioned table, not the leaf partition.
Hence it makes sense to execute them as per the original operation on
the target table. This is similar to why we execute UPDATE statement
triggers even when the row is eventually inserted into another
partition: the UPDATE statement trigger was defined for the target
table, not the leaf partition.

Barring any objections, I am going to send a revised patch that fixes
the above issue as described.

Thanks
-Amit Khandekar

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 30 November 2017 at 18:56, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> While addressing Thomas's point about test scenarios not yet covered,
> I observed the following ...
>
> Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on
> the target table. Now In ExecUpdate(), the corresponding WCO qual gets
> executed *before* the partition constraint check, as per existing
> behaviour. And the qual succeeds. And then because of partition-key
> updated, the row is moved to another partition. Here, suppose there is
> a BR INSERT trigger which modifies the row, and the resultant row
> actually would *not* pass the UPDATE RLS policy. But for this
> partition, since it is an INSERT, only INSERT RLS WCO quals are
> executed.
>
> So effectively, with a user-perspective, an RLS WITH CHECK policy that
> was defined to reject an updated row, is getting updated successfully.
> This is because the policy is not checked *after* a row trigger in the
> new partition is executed.
>
> Attached is a test case that reproduces this issue.
>
> I think, in case of row-movement, we should defer calling
> ExecWithCheckOptions() until the row is being inserted using
> ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should
> be called using WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK
> (I recall Amit Langote was of this opinion) as below :
>
> --- a/src/backend/executor/nodeModifyTable.c
> +++ b/src/backend/executor/nodeModifyTable.c
> @@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate,
>   * we are looking for at this point.
>   */
>   if (resultRelInfo->ri_WithCheckOptions != NIL)
> -     ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
> +        ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ?
> +                             WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK),
>                               resultRelInfo, slot, estate);

Attached is the v28 patch, which has the fix for this issue as described
above. In ExecUpdate(), if the partition constraint fails, we skip
ExecWithCheckOptions(), and later in ExecInsert() it gets called with
WCO_RLS_UPDATE_CHECK.
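
To be concrete, the ExecUpdate() side of the fix looks roughly like this
(just a sketch, assuming a bool-returning ExecPartitionCheck(); not the
exact patch code):

/*
 * Check the partition constraint first; run the UPDATE RLS WITH CHECK
 * quals here only if the row stays in this partition.  If it does not,
 * the checks are deferred to ExecInsert(), which will then use
 * WCO_RLS_UPDATE_CHECK.
 */
partition_constraint_failed =
    resultRelInfo->ri_PartitionCheck &&
    !ExecPartitionCheck(resultRelInfo, slot, estate);

if (!partition_constraint_failed &&
    resultRelInfo->ri_WithCheckOptions != NIL)
    ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
                         resultRelInfo, slot, estate);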

Added a few test scenarios for the same, in regress/sql/update.sql.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 1 December 2017 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is v28 patch which has the fix for this issue as described
> above. In ExecUpdate(), if partition constraint fails, we skip
> ExecWithCheckOptions (), and later in ExecInsert() it gets called with
> WCO_RLS_UPDATE_CHECK.

Amit Langote informed me off-list - along with suggestions for
changes - that my patch needs a rebase. Attached is the rebased
version. I have also bumped the patch version number (now v29),
because this has additional changes, again suggested by Amit L:
because ExecSetupPartitionTupleRouting() now has an mtstate parameter,
there is no need to pass update_rri and num_update_rri, since they can
be retrieved from mtstate.

The preparatory patch is also rebased.

Thanks Amit Langote.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Langote
Date:
Thanks for the updated patches, Amit.

Some review comments.

Forgot to remove the description of update_rri and num_update_rri in the
header comment of ExecSetupPartitionTupleRouting().

-
+extern void pull_child_partition_columns(Relation rel,
+                             Relation parent,
+                             Bitmapset **partcols);

It seems you forgot to remove this declaration in partition.h, because I
don't find it defined or used anywhere.

I think some of the changes that are currently part of the main patch are
better taken out into their own patches, because having those diffs appear
in the main patch is kind of distracting.  Just like you now have a patch
that introduces a PartitionTupleRouting structure.  I know that leads to
too many patches, but it helps to easily tell less substantial changes
from the substantial ones.

1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.

2. Patch that introduces has_partition_attrs() in place of
   is_partition_attr()

3. Patch to change the names of map_partition_varattnos() arguments

4. Patch that does the refactoring involving ExecConstrains(),
   ExecPartitionCheck(), and the introduction of
   ExecPartitionCheckEmitError()


Regarding ExecSetupChildParentMap(), it seems to me that it could simply
be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places from where it's called, it seems that you're just
extracting information from mtstate and passing the same for the rest of
its arguments.

mt_is_tupconv_perpart seems like it's unnecessary.  Its function could be
fulfilled by inspecting the state of some other fields of
ModifyTableState.  For example, in the case of an update (operation ==
CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always
assume that mt_childparent_tupconv_maps has entries for all partitions.
If it's NULL, then there would be only entries for partitions that have
sub-plans.
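
For instance, something along these lines could replace the flag (just a
sketch):

/* The maps are per-leaf iff tuple routing is set up for this UPDATE. */
#define tupconv_maps_are_per_leaf(mtstate) \
    ((mtstate)->operation == CMD_UPDATE && \
     (mtstate)->mt_partition_tuple_routing != NULL)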

tupconv_map_for_subplan() looks like it could be done as a macro.

Thanks,
Amit



Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Dec 13, 2017 at 5:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Amit Langote informed me off-list, - along with suggestions for
> changes - that my patch needs a rebase. Attached is the rebased
> version. I have also bumped the patch version number (now v29),
> because this as additional changes, again, suggested by Amit L :
> Because  ExecSetupPartitionTupleRouting() has mtstate parameter now,
> no need to pass update_rri and num_update_rri, since they can be
> retrieved from mtstate.
>
> Also, the preparatory patch is also rebased.

Reviewing the preparatory patch:

+ PartitionTupleRouting *partition_tuple_routing;
+ /* Tuple-routing support info */

Something's wrong with the formatting here.

-    PartitionDispatch **pd,
-    ResultRelInfo ***partitions,
-    TupleConversionMap ***tup_conv_maps,
-    TupleTableSlot **partition_tuple_slot,
-    int *num_parted, int *num_partitions)
+    PartitionTupleRouting **partition_tuple_routing)

Since we're consolidating all of ExecSetupPartitionTupleRouting's
output parameters into a single structure, I think it might make more
sense to have it just return that value.  I think it's only done with
output parameter today because there are so many different things
being produced, and we can't return them all.

+ PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;

This is just nitpicking, but I don't find "ptr" to be the greatest
variable name; it looks too much like "pointer".  Maybe we could use
"routing" or "proute" or something.

It seems to me that we could improve things here by adding a function
ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the
various heap_close(), ExecDropSingleTupleTableSlot(), and
ExecCloseIndices() operations which are currently performed in
CopyFrom() and, by separate code, in ExecEndModifyTable().
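
Something like the following, perhaps (untested sketch; the field names
are placeholders for whatever PartitionTupleRouting ends up containing):

static void
ExecCleanupTupleRouting(PartitionTupleRouting *proute)
{
    int         i;

    /* Close the partitioned tables used only for routing decisions. */
    for (i = 0; i < proute->num_dispatch; i++)
        heap_close(proute->partition_dispatch_info[i]->reldesc, NoLock);

    /* Close the leaf partitions and the indices opened for them. */
    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions + i;

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }

    /* Release the extra slot used while routing tuples. */
    ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
}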

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Dec 15, 2017 at 7:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Reviewing the preparatory patch:

I started another review pass over the main patch, so here are some
comments about that.  This is unfortunately not a complete review,
however.

- map = ptr->partition_tupconv_maps[leaf_part_index];
+ map = ptr->parentchild_tupconv_maps[leaf_part_index];

I don't think there's any reason to rename this.  In previous patch
versions, you had multiple arrays of tuple conversion maps in this
structure, but the refactoring eliminated that.

Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
-> mt_childparent_tupconv_maps.  That seems like it could similarly be
left alone.

+ /*
+ * If transition tables are the only reason we're here, return. As
+ * mentioned above, we can also be here during update tuple routing in
+ * presence of transition tables, in which case this function is called
+ * separately for oldtup and newtup, so either can be NULL, not both.
+ */
  if (trigdesc == NULL ||
  (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
  (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
- (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+ (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

I guess this is correct, but it seems awfully fragile.  Can't we have
some more explicit signaling about whether we're only here for
transition tables, rather than deducing it based on exactly one of
oldtup and newtup being NULL?

+ /* Initialization specific to update */
+ if (mtstate && mtstate->operation == CMD_UPDATE)
+ {
+ ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+ is_update = true;
+ update_rri = mtstate->resultRelInfo;
+ num_update_rri = list_length(node->plans);
+ }

I guess I don't see why we need a separate "if" block for this.
Neither is_update nor update_rri nor num_update_rri are used until we
get to the block that begins with "if (is_update)".  Why not just
change that block to test if (mtstate && mtstate->operation ==
CMD_UPDATE)" and put the rest of these initializations inside that
block?

+ int num_update_rri = 0,
+ update_rri_index = 0;
...
+ update_rri_index = 0;

It's already 0.

+ leaf_part_rri = &update_rri[update_rri_index];
...
+ leaf_part_rri = leaf_part_arr + i;

These are doing the same kind of thing, but using different styles.  I
prefer the former style, so I'd change the second one to
&leaf_part_arr[i]. Alternatively, you could change the first one to
update_rri + update_rri_indx.  But it's strange to see the same
variable initialized in two different ways just a few lines apart.

+ if (!partrel)
+ {
+ /*
+ * We locked all the partitions above including the leaf
+ * partitions. Note that each of the newly opened relations in
+ * *partitions are eventually closed by the caller.
+ */
+ partrel = heap_open(leaf_oid, NoLock);
+ InitResultRelInfo(leaf_part_rri,
+   partrel,
+   resultRTindex,
+   rel,
+   estate->es_instrument);
+ }

Hmm, isn't there a problem here?  Before, we opened all the relations
here and the caller closed them all.  But now, we're only opening some
of them.  If the caller closes them all, then they will be closing
some that we opened and some that we didn't.  That seems quite bad,
because the reference counts that are incremented and decremented by
opening and closing should all end up at 0.  Maybe I'm confused
because it seems like this would break in any scenario where even 1
relation was already opened and surely you must have tested that
case... but if there's some reason this works, I don't know what it
is, and the comment doesn't tell me.

+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+   TupleConversionMap *map,
+   HeapTuple tuple,
+   TupleTableSlot *new_slot,
+   TupleTableSlot **p_my_slot)

This function doesn't use the mtstate argument at all.

+ * (Similarly we need to add the deleted row in OLD TABLE).  We need to do

The period should be before, not after, the closing parenthesis.

+ * Now that we have already captured NEW TABLE row, any AR INSERT
+ * trigger should not again capture it below. Arrange for the same.

A more American style would be something like "We've already captured
the NEW TABLE row, so make sure any AR INSERT trigger fired below
doesn't capture it again."  (Similarly for the other case.)

+ /* The delete has actually happened, so inform that to the caller */
+ if (tuple_deleted)
+ *tuple_deleted = true;

In the US, we inform the caller, not inform that to the caller.  In
other words, here the direct object of "inform" is the person or thing
getting the information (in this case, "the caller"), not the
information being conveyed (in this case, "that").  I realize your
usage is probably typical for your country...

+ Assert(mtstate->mt_is_tupconv_perpart == true);

We usually just Assert(thing_that_should_be_true), not
Assert(thing_that_should_be_true == true).

+ * In case this is part of update tuple routing, put this row into the
+ * transition OLD TABLE if we are capturing transition tables. We need to
+ * do this separately for DELETE and INSERT because they happen on
+ * different tables.

Maybe "...OLD table, but only if we are..."

Should it be capturing transition tables or capturing transition
tuples?  I'm not sure.

+ * partition, in which case, we should check the RLS CHECK policy just

In the US, the second comma in this sentence is incorrect and should be removed.

+ * When an UPDATE is run with a leaf partition, we would not have
+ * partition tuple routing setup. In that case, fail with

run with -> run on
would not -> will not
setup -> set up

+ * deleted by another transaction), then we should skip INSERT as
+ * well, otherwise, there will be effectively one new row inserted.

skip INSERT -> skip the insert
well, otherwise -> well; otherwise

I would also change "there will be effectively one new row inserted"
to "an UPDATE could cause an increase in the total number of rows
across all partitions, which is clearly wrong".

+ /*
+ * UPDATEs set the transition capture map only when a new subplan
+ * is chosen.  But for INSERTs, it is set for each row. So after
+ * INSERT, we need to revert back to the map created for UPDATE;
+ * otherwise the next UPDATE will incorrectly use the one created
+ * for INESRT.  So first save the one created for UPDATE.
+ */
+ if (mtstate->mt_transition_capture)
+ saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

UPDATEs -> Updates
INESRT -> INSERT

I wonder if there is some more elegant way to handle this problem.
Basically, the issue is that ExecInsert() is stomping on
mtstate->mt_transition_capture, and your solution is to save and
restore the value you want to have there.  But maybe we could instead
find a way to get ExecInsert() not to stomp on that state in the first
place.  It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture.  By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go.  It seems a bit unsatisfying, but so does
what you have now.

+ * 2. For capturing transition tables that are partitions. For UPDATEs, we need

This isn't worded well.  A transition table is never a partition;
transition tables and partitions are two different kinds of things.

+ * If per-leaf map is required and the map is already created, that map
+ * has to be per-leaf. If that map is per-subplan, we won't be able to
+ * access the maps leaf-partition-wise. But if the map is per-leaf, we
+ * will be able to access the maps subplan-wise using the
+ * subplan_partition_offsets map using function
+ * tupconv_map_for_subplan().  So if the callers might need to access
+ * the map both leaf-partition-wise and subplan-wise, they should make
+ * sure that the first time this function is called, it should be
+ * called with perleaf=true so that the map created is per-leaf, not
+ * per-subplan.

This sounds complicated and fragile.  It ends up meaning that
mt_childparent_tupconv_maps is sometimes indexed by subplan number and
sometimes by partition leaf index, which is extremely confusing and
likely to lead to coding errors, either in this patch or in future
ones.  Would it be reasonable to just always do this by partition leaf
index, even if we don't strictly need that set of mappings?

That's all I've got for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 14 December 2017 at 08:11, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Forgot to remove the description of update_rri and num_update_rri in the
> header comment of ExecSetupPartitionTupleRouting().
>
> -
> +extern void pull_child_partition_columns(Relation rel,
> +                             Relation parent,
> +                             Bitmapset **partcols);
>
> It seems you forgot to remove this declaration in partition.h, because I
> don't find it defined or used anywhere.

Done both of the above. Attached v30 patch has the above changes.

>
> I think some of the changes that are currently part of the main patch are
> better taken out into their own patches, because having those diffs appear
> in the main patch is kind of distracting.  Just like you now have a patch
> that introduces a PartitionTupleRouting structure.  I know that leads to
> too many patches, but it helps to easily tell less substantial changes
> from the substantial ones.

Done. Created patches as shown below :

>
> 1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.

As per Robert's suggestion, reverted back the renaming of this field.

>
> 2. Patch that introduces has_partition_attrs() in place of
>    is_partition_attr()

0002-Changed-is_partition_attr-to-has_partition_attrs.patch

>
> 3. Patch to change the names of map_partition_varattnos() arguments

0003-Renaming-parameters-of-map_partition_var_attnos.patch

>
> 4. Patch that does the refactoring involving ExecConstrains(),
>    ExecPartitionCheck(), and the introduction of
>    ExecPartitionCheckEmitError()

0004-Refactor-CheckConstraint-related-code.patch


The preparatory patches are to be applied in order of the patch
numbers, followed by the main patch update-partition-key_v30.patch.

>
>
> Regarding ExecSetupChildParentMap(), it seems to me that it could simply
> be declared as
>
> static void ExecSetupChildParentMap(ModifyTableState *mtstate);
>
> Looking at the places from where it's called, it seems that you're just
> extracting information from mtstate and passing the same for the rest of
> its arguments.
>

Agreed. But the last parameter per_leaf might be necessary. I will
defer this until I address Robert's concern about the complexity of
the related code.

> mt_is_tupconv_perpart seems like it's unnecessary.  Its function could be
> fulfilled by inspecting the state of some other fields of
> ModifyTableState.  For example, in the case of an update (operation ==
> CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always
> assume that mt_childparent_tupconv_maps has entries for all partitions.
> If it's NULL, then there would be only entries for partitions that have
> sub-plans.

I think we had better have this field separately for code clarity, to
avoid repeated evaluation of multiple conditions, and to allow some
significant Asserts() that use this field.

>
> tupconv_map_for_subplan() looks like it could be done as a macro.

Or maybe an inline function. I will again defer this for a similar reason
as the above deferred item about the ExecSetupChildParentMap() parameters.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote:
> Reviewing the preparatory patch:
>
> + PartitionTupleRouting *partition_tuple_routing;
> + /* Tuple-routing support info */
>
> Something's wrong with the formatting here.

Moved the comment above the declaration.

>
> -    PartitionDispatch **pd,
> -    ResultRelInfo ***partitions,
> -    TupleConversionMap ***tup_conv_maps,
> -    TupleTableSlot **partition_tuple_slot,
> -    int *num_parted, int *num_partitions)
> +    PartitionTupleRouting **partition_tuple_routing)
>
> Since we're consolidating all of ExecSetupPartitionTupleRouting's
> output parameters into a single structure, I think it might make more
> sense to have it just return that value.  I think it's only done with
> output parameter today because there are so many different things
> being produced, and we can't return them all.

You mean ExecSetupPartitionTupleRouting() will return the structure
(not a pointer to the structure), and the caller will get a copy of the
structure, like this? :

mtstate->mt_partition_tuple_routing =
ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);

I am ok with that, but just wanted to confirm if that is what you are
saying. I don't recall seeing a structure return value in PG code, so I
am not sure whether it is conventional in PG to do that. Hence, I am
somewhat inclined to keep it as an output param. It also avoids a
structure copy.

Another way is for ExecSetupPartitionTupleRouting() to palloc this
structure and return a pointer to it, but then the caller would anyway
have to do a structure copy, so that's not convenient, and I don't
think you are suggesting that either.

>
> + PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
>
> This is just nitpicking, but I don't find "ptr" to be the greatest
> variable name; it looks too much like "pointer".  Maybe we could use
> "routing" or "proute" or something.

Done. Renamed it to "proute".

>
> It seems to me that we could improve things here by adding a function
> ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the
> various heap_close(), ExecDropSingleTupleTableSlot(), and
> ExecCloseIndices() operations which are currently performed in
> CopyFrom() and, by separate code, in ExecEndModifyTable().
>

Done. Changes are kept in a new preparatory patch
0005-Organize-cleanup-done-for-partition-tuple-routing.patch

Yet to address your other review comments.

Attached is patch v31. (Preparatory patches to be applied in order of
patch numbers, followed by the main patch)

Thanks
-Amit

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote:
> started another review pass over the main patch, so here are
> some comments about that.

I have yet to address all the comments, but meanwhile, below are some
specific points ...

> + if (!partrel)
> + {
> + /*
> + * We locked all the partitions above including the leaf
> + * partitions. Note that each of the newly opened relations in
> + * *partitions are eventually closed by the caller.
> + */
> + partrel = heap_open(leaf_oid, NoLock);
> + InitResultRelInfo(leaf_part_rri,
> +   partrel,
> +   resultRTindex,
> +   rel,
> +   estate->es_instrument);
> + }
>
> Hmm, isn't there a problem here?  Before, we opened all the relations
> here and the caller closed them all.  But now, we're only opening some
> of them.  If the caller closes them all, then they will be closing
> some that we opened and some that we didn't.  That seems quite bad,
> because the reference counts that are incremented and decremented by
> opening and closing should all end up at 0.  Maybe I'm confused
> because it seems like this would break in any scenario where even 1
> relation was already opened and surely you must have tested that
> case... but if there's some reason this works, I don't know what it
> is, and the comment doesn't tell me.

In ExecCleanupTupleRouting(), we are closing only those newly opened
partitions. We skip those which are actually part of the update result
rels.

> + /*
> + * UPDATEs set the transition capture map only when a new subplan
> + * is chosen.  But for INSERTs, it is set for each row. So after
> + * INSERT, we need to revert back to the map created for UPDATE;
> + * otherwise the next UPDATE will incorrectly use the one created
> + * for INESRT.  So first save the one created for UPDATE.
> + */
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> I wonder if there is some more elegant way to handle this problem.
> Basically, the issue is that ExecInsert() is stomping on
> mtstate->mt_transition_capture, and your solution is to save and
> restore the value you want to have there.  But maybe we could instead
> find a way to get ExecInsert() not to stomp on that state in the first
> place.  It seems like the ON CONFLICT stuff handled that by adding a
> second TransitionCaptureState pointer to ModifyTable, thus
> mt_transition_capture and mt_oc_transition_capture.  By that
> precedent, we could add mt_utr_transition_capture or similar, and
> maybe that's the way to go.  It seems a bit unsatisfying, but so does
> what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states : mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan().  So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile.  It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, we use subplan_partition_offsets to
convert it to the leaf partition index. So the function
tupconv_map_for_subplan() is needed anyway for access using a subplan
index. The only thing that will change is that
tupconv_map_for_subplan() will not have to check whether the map is
indexed by leaf partition or not. But that complexity is hidden in
this function alone; the outside code need not worry about it.
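
Roughly, the function stays like this (sketch only; field names are
approximate):

static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
    /*
     * If the map array is per-leaf, first translate the subplan index
     * into the corresponding leaf partition index.
     */
    if (mtstate->mt_is_tupconv_perpart)
    {
        int     leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];

        return mtstate->mt_childparent_tupconv_maps[leaf_index];
    }

    /* Otherwise the array is indexed directly by subplan number. */
    return mtstate->mt_childparent_tupconv_maps[whichplan];
}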

If the access is by leaf partition number, I think your worry here is
that the map might have been incorrectly indexed by subplan, and the
code might access it partition-wise. Currently we access the map by
leaf-partition index only when setting up
mtstate->mt_*transition_capture->tcs_map during inserts. At that place,
there is an Assert(mtstate->mt_is_tupconv_perpart == true). Maybe we can
have another function tupconv_map_for_partition() rather than directly
accessing mt_childparent_tupconv_maps[], and have this Assert() in that
function. What do you say?

I am more inclined towards avoiding an always-leaf-partition-indexed
map for additional reasons mentioned below ...

> Would it be reasonable to just always do this by partition leaf
> index, even if we don't strictly need that set of mappings?

If there are no transition tables in the picture, we don't require
per-leaf child-parent conversion. So this would mean that the tuple
conversion maps would be set up for all (say, 100) leaf partitions even
if there are only, say, a couple of update plans. I feel this would
unnecessarily increase the startup cost of the update-partition-key
operation.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
On 23 December 2017 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote:
>> -    PartitionDispatch **pd,
>> -    ResultRelInfo ***partitions,
>> -    TupleConversionMap ***tup_conv_maps,
>> -    TupleTableSlot **partition_tuple_slot,
>> -    int *num_parted, int *num_partitions)
>> +    PartitionTupleRouting **partition_tuple_routing)
>>
>> Since we're consolidating all of ExecSetupPartitionTupleRouting's
>> output parameters into a single structure, I think it might make more
>> sense to have it just return that value.  I think it's only done with
>> output parameter today because there are so many different things
>> being produced, and we can't return them all.
>
> You mean ExecSetupPartitionTupleRouting() will return the structure
> (not pointer to structure), and the caller will get the copy of the
> structure like this ? :
>
> mtstate->mt_partition_tuple_routing =
> ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);
>
> I am ok with that, but just wanted to confirm if that is what you are
> saying. I don't recall seeing a structure return value in PG code, so
> not sure if it is conventional in PG to do that. Hence, I am somewhat
> inclined to keep it as output param. It also avoids a structure copy.
>
> Another way is for ExecSetupPartitionTupleRouting() to palloc this
> structure, and return its pointer, but then caller would have to
> anyway do a structure copy, so that's not convenient, and I don't
> think you are suggesting this way either.

I'm pretty sure Robert is suggesting that
ExecSetupPartitionTupleRouting pallocs the memory for the structure,
sets it up then returns a pointer to the new struct. That's not very
unusual. It seems unusual for a function to return void and modify a
single parameter pointer to get the value to the caller rather than
just to return that value.
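
That is, roughly (sketch only):

PartitionTupleRouting *
ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel,
                               Index resultRTindex, EState *estate)
{
    PartitionTupleRouting *proute;

    proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));

    /* ... fill in the dispatch info, per-partition ResultRelInfos, maps ... */

    return proute;
}

and the caller just does:

mtstate->mt_partition_tuple_routing =
    ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);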


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote:
>
> - map = ptr->partition_tupconv_maps[leaf_part_index];
> + map = ptr->parentchild_tupconv_maps[leaf_part_index];
>
> I don't think there's any reason to rename this.  In previous patch
> versions, you had multiple arrays of tuple conversion maps in this
> structure, but the refactoring eliminated that.

Done in an earlier version of the patch.

>
> Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
> -> mt_childparent_tupconv_maps.  That seems like it could similarly be
> left alone.

We need to change its name because this map is now used not only for
transition capture but also for update-tuple-routing. Is it ok with you
if, for readability, we keep the childparent tag? Otherwise, we can just
make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps" looks more
informative.

>
> + /*
> + * If transition tables are the only reason we're here, return. As
> + * mentioned above, we can also be here during update tuple routing in
> + * presence of transition tables, in which case this function is called
> + * separately for oldtup and newtup, so either can be NULL, not both.
> + */
>   if (trigdesc == NULL ||
>   (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
>   (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
> - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
> + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
>
> I guess this is correct, but it seems awfully fragile.  Can't we have
> some more explicit signaling about whether we're only here for
> transition tables, rather than deducing it based on exactly one of
> oldtup and newtup being NULL?

I had given this some thought earlier. I felt that even the pre-existing
conditions like "!trigdesc->trig_update_after_row" are all indirect
ways to determine that this function is called only to capture
transition tables, and thought that it may have been better to have a
separate parameter transition_table_only.

But then I decided that I can continue on similar lines and add another
such condition to indicate that we are only capturing update-routed
tuples.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: move the transition-tuple-capture
logic of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(), and in ExecInsert() and ExecDelete(), instead
of calling ExecARUpdateTriggers(), call CaptureTransitionTables(). I then
dropped that idea and thought it better to call ExecARUpdateTriggers(),
which neatly does the required checks and other things like locking the
old tuple via GetTupleForTrigger(). So if we go with
CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

>
> + /* Initialization specific to update */
> + if (mtstate && mtstate->operation == CMD_UPDATE)
> + {
> + ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
> +
> + is_update = true;
> + update_rri = mtstate->resultRelInfo;
> + num_update_rri = list_length(node->plans);
> + }
>
> I guess I don't see why we need a separate "if" block for this.
> Neither is_update nor update_rri nor num_update_rri are used until we
> get to the block that begins with "if (is_update)".  Why not just
> change that block to test if (mtstate && mtstate->operation ==
> CMD_UPDATE)" and put the rest of these initializations inside that
> block?

Done.

>
> + int num_update_rri = 0,
> + update_rri_index = 0;
> ...
> + update_rri_index = 0;
>
> It's already 0.

Done. Retained the comment that mentions why we need to set it to 0,
and added a note at the end that we have already done this during
initialization.

>
> + leaf_part_rri = &update_rri[update_rri_index];
> ...
> + leaf_part_rri = leaf_part_arr + i;
>
> These are doing the same kind of thing, but using different styles.  I
> prefer the former style, so I'd change the second one to
> &leaf_part_arr[i]. Alternatively, you could change the first one to
> update_rri + update_rri_indx.  But it's strange to see the same
> variable initialized in two different ways just a few lines apart.
>

Done. Used the first style.

>
> +static HeapTuple
> +ConvertPartitionTupleSlot(ModifyTableState *mtstate,
> +   TupleConversionMap *map,
> +   HeapTuple tuple,
> +   TupleTableSlot *new_slot,
> +   TupleTableSlot **p_my_slot)
>
> This function doesn't use the mtstate argument at all.

Removed mtstate.

>
> + * (Similarly we need to add the deleted row in OLD TABLE).  We need to do
>
> The period should be before, not after, the closing parenthesis.

Done.

>
> + * Now that we have already captured NEW TABLE row, any AR INSERT
> + * trigger should not again capture it below. Arrange for the same.
>
> A more American style would be something like "We've already captured
> the NEW TABLE row, so make sure any AR INSERT trigger fired below
> doesn't capture it again."  (Similarly for the other case.)

Done.

>
> + /* The delete has actually happened, so inform that to the caller */
> + if (tuple_deleted)
> + *tuple_deleted = true;
>
> In the US, we inform the caller, not inform that to the caller.  In
> other words, here the direct object of "inform" is the person or thing
> getting the information (in this case, "the caller"), not the
> information being conveyed (in this case, "that").  I realize your
> usage is probably typical for your country...

Changed it to "inform the caller about the same"

>
> + Assert(mtstate->mt_is_tupconv_perpart == true);
>
> We usually just Assert(thing_that_should_be_true), not
> Assert(thing_that_should_be_true == true).

Ok. Changed it to Assert(mtstate->mt_is_tupconv_perpart)

>
> + * In case this is part of update tuple routing, put this row into the
> + * transition OLD TABLE if we are capturing transition tables. We need to
> + * do this separately for DELETE and INSERT because they happen on
> + * different tables.
>
> Maybe "...OLD table, but only if we are..."
>
> Should it be capturing transition tables or capturing transition
> tuples?  I'm not sure.

Changed it to "capturing transition tuples". In trigger.c, I see this
short form notation as well as a long-form notation like "capturing
tuples in transition tables". But not seen anywhere "capturing
transition tables", and it does seem odd.

>
> + * partition, in which case, we should check the RLS CHECK policy just
>
> In the US, the second comma in this sentence is incorrect and should be removed.

Done.

>
> + * When an UPDATE is run with a leaf partition, we would not have
> + * partition tuple routing setup. In that case, fail with
>
> run with -> run on
> would not -> will not
> setup -> set up

Done.

>
> + * deleted by another transaction), then we should skip INSERT as
> + * well, otherwise, there will be effectively one new row inserted.
>
> skip INSERT -> skip the insert
> well, otherwise -> well; otherwise
>
> I would also change "there will be effectively one new row inserted"
> to "an UPDATE could cause an increase in the total number of rows
> across all partitions, which is clearly wrong".

Done both.

>
> + /*
> + * UPDATEs set the transition capture map only when a new subplan
> + * is chosen.  But for INSERTs, it is set for each row. So after
> + * INSERT, we need to revert back to the map created for UPDATE;
> + * otherwise the next UPDATE will incorrectly use the one created
> + * for INESRT.  So first save the one created for UPDATE.
> + */
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> UPDATEs -> Updates
Done. I believe you want to do this only when it's plural? In the
same para, I also changed "INSERTs" to "inserts".
> INESRT -> INSERT
Done.

>
> + * 2. For capturing transition tables that are partitions. For UPDATEs, we need
>
> This isn't worded well.  A transition table is never a partition;
> transition tables and partitions are two different kinds of things.

Yeah. Changed it to :
"For capturing transition tuples when the target table is a partitioned table."

Attached v32 patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 2 January 2018 at 10:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 23 December 2017 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote:
>>> -    PartitionDispatch **pd,
>>> -    ResultRelInfo ***partitions,
>>> -    TupleConversionMap ***tup_conv_maps,
>>> -    TupleTableSlot **partition_tuple_slot,
>>> -    int *num_parted, int *num_partitions)
>>> +    PartitionTupleRouting **partition_tuple_routing)
>>>
>>> Since we're consolidating all of ExecSetupPartitionTupleRouting's
>>> output parameters into a single structure, I think it might make more
>>> sense to have it just return that value.  I think it's only done with
>>> output parameter today because there are so many different things
>>> being produced, and we can't return them all.
>>
>> You mean ExecSetupPartitionTupleRouting() will return the structure
>> (not pointer to structure), and the caller will get the copy of the
>> structure like this ? :
>>
>> mtstate->mt_partition_tuple_routing =
>> ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);
>>
>> I am ok with that, but just wanted to confirm if that is what you are
>> saying. I don't recall seeing a structure return value in PG code, so
>> not sure if it is conventional in PG to do that. Hence, I am somewhat
>> inclined to keep it as output param. It also avoids a structure copy.
>>
>> Another way is for ExecSetupPartitionTupleRouting() to palloc this
>> structure, and return its pointer, but then caller would have to
>> anyway do a structure copy, so that's not convenient, and I don't
>> think you are suggesting this way either.
>
> I'm pretty sure Robert is suggesting that
> ExecSetupPartitionTupleRouting pallocs the memory for the structure,
> sets it up then returns a pointer to the new struct. That's not very
> unusual. It seems unusual for a function to return void and modify a
> single parameter pointer to get the value to the caller rather than
> just to return that value.

Sorry, my mistake. Earlier I was somehow under the impression that the
callers of ExecSetupPartitionTupleRouting() already have this structure
palloc'ed, and that they pass the address of this structure. I can now
see that both CopyStateData->partition_tuple_routing and
ModifyTableState->mt_partition_tuple_routing are pointers, not
structures. So it makes perfect sense for
ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
for the noise. I will share the change in an upcoming patch version.
Thanks!

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 1 January 2018 at 21:43, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote:
>> + /*
>> + * UPDATEs set the transition capture map only when a new subplan
>> + * is chosen.  But for INSERTs, it is set for each row. So after
>> + * INSERT, we need to revert back to the map created for UPDATE;
>> + * otherwise the next UPDATE will incorrectly use the one created
>> + * for INESRT.  So first save the one created for UPDATE.
>> + */
>> + if (mtstate->mt_transition_capture)
>> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>>
>> I wonder if there is some more elegant way to handle this problem.
>> Basically, the issue is that ExecInsert() is stomping on
>> mtstate->mt_transition_capture, and your solution is to save and
>> restore the value you want to have there.  But maybe we could instead
>> find a way to get ExecInsert() not to stomp on that state in the first
>> place.  It seems like the ON CONFLICT stuff handled that by adding a
>> second TransitionCaptureState pointer to ModifyTable, thus
>> mt_transition_capture and mt_oc_transition_capture.  By that
>> precedent, we could add mt_utr_transition_capture or similar, and
>> maybe that's the way to go.  It seems a bit unsatisfying, but so does
>> what you have now.
>
> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
> triggers referencing transition tables, both of the triggers need to
> independently populate their own transition tables, and hence the need
> for two separate transition states : mt_transition_capture and
> mt_oc_transition_capture. But in case of update-tuple-routing, the
> INSERT statement trigger won't come into picture. So the same
> mt_transition_capture can serve the purpose of populating the
> transition table with OLD and NEW rows. So I think it would be too
> redundant, if not incorrect, to have a whole new transition state for
> update tuple routing.
>
> I will see if it turns out better to have two tcs_maps in
> TransitionCaptureState, one for update and one for insert. But this,
> on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps, for UPDATE/DELETE and INSERT events respectively. So
upd_del_tcs_maps would be updated only when we start with the next
UPDATE subplan, whereas insert_tcs_maps would keep getting updated for
each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be used
when the event is TRIGGER_EVENT_[UPDATE/DELETE], and insert_tcs_maps
would be used when event == TRIGGER_EVENT_INSERT. But the issue is: even
if the event is TRIGGER_EVENT_UPDATE, we don't know whether it was
caused by a normal update or by an insert into a new partition during a
partition-key update. So blindly using upd_del_tcs_maps is incorrect. If
the event is caused by the latter, we should use insert_tcs_maps rather
than upd_del_tcs_maps. But in trigger.c we do not have the information
as to what caused the event.

So, overall, it would not work, and even if we made it work by passing
or storing some more information somewhere, the AfterTriggerSaveEvent()
logic would become too complicated.

So I can't think of anything better than to keep it the way I did, i.e.
reverting back the tcs_map once the insert finishes. We do a similar
thing for reverting back estate->es_result_relation_info.
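
To be concrete, the flow in ExecUpdate() for the row-movement case stays
roughly like this (just a sketch, not the exact patch code):

/* Save the map created for the UPDATE subplan ... */
if (mtstate->mt_transition_capture)
    saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

/* ... let ExecInsert() overwrite tcs_map while routing the new tuple ... */
slot = ExecInsert(mtstate, slot, planSlot, NULL, ONCONFLICT_NONE,
                  estate, canSetTag);

/* ... and restore it afterwards, as we do for es_result_relation_info. */
if (mtstate->mt_transition_capture)
    mtstate->mt_transition_capture->tcs_map = saved_tcs_map;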

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 20 December 2017 at 11:52, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 14 December 2017 at 08:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> Regarding ExecSetupChildParentMap(), it seems to me that it could simply
>> be declared as
>>
>> static void ExecSetupChildParentMap(ModifyTableState *mtstate);
>>
>> Looking at the places from where it's called, it seems that you're just
>> extracting information from mtstate and passing the same for the rest of
>> its arguments.
>
> Agreed. But the last parameter per_leaf might be necessary. I will
> defer this until I address Robert's concern about the complexity of
> the related code.

Removed those parameters, but kept perleaf. The map required for
update-tuple-routing is a per-subplan one despite the presence of
partition tuple routing, and we cannot deduce from mtstate whether
update tuple routing is in use. So for this case, the caller has to
explicitly specify that a per-subplan map has to be created.

>>
>> tupconv_map_for_subplan() looks like it could be done as a macro.
>
> Or may be inline function. I will again defer this for similar reason
> as the above deferred item about ExecSetupChildParentMap parameters.
>

Made it inline.

The above changes are in the attached update-partition-key_v33.patch.

On 3 January 2018 at 11:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 2 January 2018 at 10:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I'm pretty sure Robert is suggesting that
>> ExecSetupPartitionTupleRouting pallocs the memory for the structure,
>> sets it up then returns a pointer to the new struct. That's not very
>> unusual. It seems unusual for a function to return void and modify a
>> single parameter pointer to get the value to the caller rather than
>> just to return that value.
>
> Sorry, my mistake. Earlier I somehow was under the impression that the
> callers of ExecSetupPartitionTupleRouting() already have this
> structure palloc'ed, and that they pass address of this structure. I
> now can see that both CopyStateData->partition_tuple_routing and
> ModifyTableState->mt_partition_tuple_routing are pointers, not
> structures. So it make perfect sense for
> ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
> for the noise. Will share the change in an upcoming patch version.
> Thanks !

ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

This change is in the v3 version of
0001-Encapsulate-partition-related-info-in-a-structure.patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
> On 3 January 2018 at 11:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> [...] So it make perfect sense for
>> ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
>> for the noise. Will share the change in an upcoming patch version.
>> Thanks !
>
> ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

Thanks for changing. I've just done almost a complete review of v32.
(v33 came along a bit sooner than I thought).

I've not finished looking at the regression tests yet, but here are a
few things, some may have been changed in v33, I've not looked yet.
Also apologies in advance if anything seems nitpicky.

1. "by INSERT" -> "by an INSERT" in:

    from the original partition followed by <command>INSERT</command> into the

2. "and INSERT" -> "and an INSERT" in:

    a <command>DELETE</command> and <command>INSERT</command>. As far as

3. "due partition-key change" -> "due to the partition-key being changed" in:

 * capture is happening for UPDATEd rows being moved to another partition due
 * partition-key change, then this function is called once when the row is

4. "inserted to another" -> "inserted into another" in:

 * deleted (to capture OLD row), and once when the row is inserted to another

5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for
UPDATE events" (plural)

* oldtup and newtup are non-NULL.  But for UPDATE event fired for

I'm unsure if you need singular or plural. It perhaps does not matter.

6. "for row" -> "for a row" in:

* movement, oldtup is NULL when the event is for row being inserted,

Likewise in:

* whereas newtup is NULL when the event is for row being deleted.

7. In the following fragment the code does not do what the comment says:

/*
* If transition tables are the only reason we're here, return. As
* mentioned above, we can also be here during update tuple routing in
* presence of transition tables, in which case this function is called
* separately for oldtup and newtup, so either can be NULL, not both.
*/
if (trigdesc == NULL ||
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
return;

With the comment; "so either can be NULL, not both.", I'd expect a
boolean OR not an XOR.

maybe the comment is better written as:

"so we expect exactly one of them to be non-NULL"

(I know you've been discussing with Robert, so I've not checked v33 to
see if this still exists)

8. I'm struggling to make sense of this:

/*
* Save a tuple conversion map to convert a tuple routed to this
* partition from the parent's type to the partition's.
*/

Maybe it's better to write this as:

/*
* Generate a tuple conversion map to convert tuples of the parent's
* type into the partition's type.
*/

9. insert should be capitalised here and should be prefixed with "an":

/*
* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.
*/
CheckValidResultRel(leaf_part_rri, CMD_INSERT);

Maybe it's better to write:

/*
* Verify result relation is a valid target for an INSERT.  An UPDATE of
* a partition-key becomes a DELETE/INSERT operation, so this check is
* still required when the operation is CMD_UPDATE.
*/

10. The following code would be more clear if you replaced
mtstate->mt_transition_capture with transition_capture.

if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
&& mtstate->mt_transition_capture->tcs_update_new_table)
{
ExecARUpdateTriggers(estate, resultRelInfo, NULL,
NULL,
tuple,
NULL,
mtstate->mt_transition_capture);

/*
* Now that we have already captured NEW TABLE row, any AR INSERT
* trigger should not again capture it below. Arrange for the same.
*/
transition_capture = NULL;
}

You are, after all, doing:

transition_capture = mtstate->mt_transition_capture;

at the top of the function. There are a few other places you're also
accessing mtstate->mt_transition_capture.

11. Should tuple_deleted and process_returning be camelCase like the
other params?:

static TupleTableSlot *
ExecDelete(ModifyTableState *mtstate,
   ItemPointer tupleid,
   HeapTuple oldtuple,
   TupleTableSlot *planSlot,
   EPQState *epqstate,
   EState *estate,
   bool *tuple_deleted,
   bool process_returning,
   bool canSetTag)

12. The following comment talks about "target table descriptor", which
I think is a good term. In several other places, you mention "root",
which I take it to mean "target table".

 * This map array is required for two purposes :
 * 1. For update-tuple-routing. We need to convert the tuple from the subplan
 * result rel to the root partitioned table descriptor.
 * 2. For capturing transition tuples when the target table is a partitioned
 * table. For updates, we need to convert the tuple from subplan result rel to
 * target table descriptor, and for inserts, we need to convert the inserted
 * tuple from leaf partition to the target table descriptor.

I'd personally rather we always talked about "target" rather than
"root". I understand there's probably many places in the code
where we talk about the target table as "root", but I really think we
need to fix that, so I'd rather not see the problem get any worse
before it gets better.

The comment block might also look better if you tab indent after the
1. and 2. then on each line below it.
Also the space before the ':' is not correct.

13. Does the following code really need to palloc0 rather than just palloc?

/*
* Build array of conversion maps from each child's TupleDesc to the
* one used in the tuplestore.  The map pointers may be NULL when no
* conversion is necessary, which is hopefully a common case for
* partitions.
*/
mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);

I don't see any case in the initialization of the array where any of
the elements are not assigned a value, so I think palloc() is fine.

14. I don't really like the way tupconv_map_for_subplan() works. It
would be nice to have two separate functions for this, but looking a
bit more at it, it seems the caller won't just need to always call
exactly one of those functions. I don't have any ideas to improve it,
so this is just a note.

15. I still don't really like the way ExecInitModifyTable() sets and
unsets update_tuple_routing_needed. I know we talked about this
before, but couldn't you just change:

if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

To:

if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
node->partitioned_rels != NIL &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

and get rid of:

/*
* If it's not a partitioned table after all, UPDATE tuple routing should
* not be attempted.
*/
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
update_tuple_routing_needed = false;

looking at inheritance_planner(), partitioned_rels is only set to a
non-NIL value if parent_rte->relkind == RELKIND_PARTITIONED_TABLE.

16. "named" -> "target" in:

 * 'partKeyUpdated' is true if any partitioning columns are being updated,
 * either from the named relation or a descendent partitioned table.

I guess we're calling this one of; root, named, target :-(

17. You still have the following comment in ModifyTableState but
you've moved all those fields out to PartitionTupleRouting:

/* Tuple-routing support info */

18. Should the following not be just called partKeyUpdate (without the 'd')?

bool partKeyUpdated; /* some part key in hierarchy updated */

This occurs in the planner, where the part key is certainly being updated.

19. In pathnode.h you've named a parameter partColsUpdated, but the
function in the .c file calls it partKeyUpdated.

I'll try to look at the tests tomorrow and also do some testing. So
far I've only read the code and the docs.

Overall, the patch appears to look quite good. Good to see the various
cleanups going in like the new PartitionTupleRouting struct.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
Robert, for tracking purposes, below I have consolidated your review
items on which we are yet to conclude. Let me know if you have more
comments on the points I have made.

------------------
1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
------------------

>> + /*
>> + * UPDATEs set the transition capture map only when a new subplan
>> + * is chosen.  But for INSERTs, it is set for each row. So after
>> + * INSERT, we need to revert back to the map created for UPDATE;
>> + * otherwise the next UPDATE will incorrectly use the one created
>> + * for INESRT.  So first save the one created for UPDATE.
>> + */
>> + if (mtstate->mt_transition_capture)
>> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>>
>> I wonder if there is some more elegant way to handle this problem.
>> Basically, the issue is that ExecInsert() is stomping on
>> mtstate->mt_transition_capture, and your solution is to save and
>> restore the value you want to have there.  But maybe we could instead
>> find a way to get ExecInsert() not to stomp on that state in the first
>> place.  It seems like the ON CONFLICT stuff handled that by adding a
>> second TransitionCaptureState pointer to ModifyTable, thus
>> mt_transition_capture and mt_oc_transition_capture.  By that
>> precedent, we could add mt_utr_transition_capture or similar, and
>> maybe that's the way to go.  It seems a bit unsatisfying, but so does
>> what you have now.
>
> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
> triggers referencing transition tables, both of the triggers need to
> independently populate their own transition tables, and hence the need
> for two separate transition states : mt_transition_capture and
> mt_oc_transition_capture. But in case of update-tuple-routing, the
> INSERT statement trigger won't come into picture. So the same
> mt_transition_capture can serve the purpose of populating the
> transition table with OLD and NEW rows. So I think it would be too
> redundant, if not incorrect, to have a whole new transition state for
> update tuple routing.
>
> I will see if it turns out better to have two tcs_maps in
> TransitionCaptureState, one for update and one for insert. But this,
> on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So
upd_del_tcs_maps will be updated only after we start with the next
UPDATE subplan, whereas insert_tcs_maps will keep on getting updated
for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be
used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and
insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But
the issue is: even if the event is TRIGGER_EVENT_UPDATE, we don't
know whether it was caused by a normal update or by an insert into a
new partition during a partition-key update. So blindly using
upd_del_tcs_maps is incorrect. If the event is caused by the latter, we
should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not
have the information in trigger.c as to what caused this event.

So, overall, it would not work, and even if we make it work by passing
or storing some more information somewhere, the
AfterTriggerSaveEvent() logic will become too complicated.

So I can't think of anything else but to keep it the way I did, i.e.
revert the tcs_map once the insert finishes. We do a similar thing
when reverting estate->es_result_relation_info.
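
For reference, a minimal sketch of the save/restore approach I have
kept (simplified, not the exact patch hunk):

/* In ExecUpdate(), around the row-movement ExecInsert() call */
TupleConversionMap *saved_tcs_map = NULL;

if (mtstate->mt_transition_capture)
    saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

/* ... ExecInsert() routes the new tuple; it may set tcs_map per row ... */

if (mtstate->mt_transition_capture)
{
    /* Restore the map that was set up for the UPDATE subplan. */
    mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
}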

------------------
2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
------------------

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan().  So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile.  It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, use subplan_partition_offsets to
convert to the leaf partition index. So the function
tupconv_map_for_subplan() is needed anyway for access by subplan
index. The only thing that would change is that
tupconv_map_for_subplan() would not have to check whether the map is
indexed by leaf partition or not. But that complexity is hidden in
this function alone; the outside code need not worry about it.

If the access is by leaf partition number, I think you are worried
here that the map might have been incorrectly indexed by subplan, and
the code might access it partition-wise. Currently we access the map
by leaf-partition-index only when setting up
mtstate->mt_*transition_capture->tcs_map during inserts. At that
place, there is an Assert(mtstate->mt_is_tupconv_perpart == true).
Maybe we can have another function tupconv_map_for_partition() rather
than directly accessing mt_childparent_tupconv_maps[], and have this
Assert() in that function. What do you say?
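
To illustrate, a rough sketch of the two accessors (where exactly
these fields live is indicative only):

static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
    int         index = whichplan;

    /* If the array is indexed per leaf partition, translate the subplan index. */
    if (mtstate->mt_is_tupconv_perpart)
        index = mtstate->mt_partition_tuple_routing->subplan_partition_offsets[whichplan];

    return mtstate->mt_childparent_tupconv_maps[index];
}

static TupleConversionMap *
tupconv_map_for_partition(ModifyTableState *mtstate, int leaf_index)
{
    /* Valid only when the map array was built per leaf partition. */
    Assert(mtstate->mt_is_tupconv_perpart);
    return mtstate->mt_childparent_tupconv_maps[leaf_index];
}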

I am more inclined towards avoiding an always-leaf-partition-indexed
map for additional reasons mentioned below ...

> Would it be reasonable to just always do this by partition leaf
> index, even if we don't strictly need that set of mappings?

If there are no transition tables in the picture, we don't require
per-leaf child-parent conversion. So this would mean that the tuple
conversion maps would be set up for all (say, 100) leaf partitions even
if there are only, say, a couple of update plans. I feel this would
unnecessarily increase the startup cost of the update-partition-key
operation.

------------------
3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
------------------

>
> Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
> -> mt_childparent_tupconv_maps.  That seems like it could similarly be
> left alone.

We need to change its name because this map is now used not only for
transition capture, but also for update-tuple-routing. Does it look OK
to you if, for readability, we keep the childparent tag? Otherwise, we
can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
looks more informative.

-------------------
4. Explicit signaling for "we are only here for transition tables"
-------------------

>
> + /*
> + * If transition tables are the only reason we're here, return. As
> + * mentioned above, we can also be here during update tuple routing in
> + * presence of transition tables, in which case this function is called
> + * separately for oldtup and newtup, so either can be NULL, not both.
> + */
>   if (trigdesc == NULL ||
>   (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
>   (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
> - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
> + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
>
> I guess this is correct, but it seems awfully fragile.  Can't we have
> some more explicit signaling about whether we're only here for
> transition tables, rather than deducing it based on exactly one of
> oldtup and newtup being NULL?

I had given this some thought earlier. I felt that even the
pre-existing conditions like "!trigdesc->trig_update_after_row" are
indirect ways to determine that this function is called only to capture
transition tables, and thought it may have been better to have a
separate parameter transition_table_only.

But then I decided I could continue along similar lines and add another
such condition to indicate that we are only capturing update-routed
tuples.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: move the transition-tuple-capture
logic of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(), and in ExecInsert() and ExecDelete(), call
CaptureTransitionTables() instead of ExecARUpdateTriggers(). I then
dropped this idea, preferring to call ExecARUpdateTriggers(), which
neatly does the required checks and other things like locking the old
tuple via GetTupleForTrigger().
So if we go by CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.


Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'll try to look at the tests tomorrow and also do some testing. So
> far I've only read the code and the docs.

There are a few more things I noticed on another pass I made today:

20. "carried" -> "carried out the"

+       would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row

21. Extra new line

+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>

22. In copy.c CopyFrom() you have the following code:

/*
 * We might need to convert from the parent rowtype to the
 * partition rowtype.
 */
map = proute->partition_tupconv_maps[leaf_part_index];
if (map)
{
    Relation partrel = resultRelInfo->ri_RelationDesc;

    tuple = do_convert_tuple(tuple, map);

    /*
    * We must use the partition's tuple descriptor from this
    * point on.  Use a dedicated slot from this point on until
    * we're finished dealing with the partition.
    */
    slot = proute->partition_tuple_slot;
    Assert(slot != NULL);
    ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
   ExecStoreTuple(tuple, slot, InvalidBuffer, true);
}

Should this use ConvertPartitionTupleSlot() instead?

23. Why write;

last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;

when you can write;

last_resultRelInfo = mtstate->resultRelInfo[mtstate->mt_nplans];?


24. In ExecCleanupTupleRouting(), do you think that you could just
have a special case loop for (mtstate && mtstate->operation ==
CMD_UPDATE)?

/*
* If this result rel is one of the UPDATE subplan result rels, let
* ExecEndPlan() close it. For INSERT or COPY, this does not apply
* because leaf partition result rels are always newly allocated.
*/
if (is_update &&
    resultRelInfo >= first_resultRelInfo &&
    resultRelInfo < last_resultRelInfo)
    continue;

Something like:

if (mtstate && mtstate->operation == CMD_UPDATE)
{
    ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo;
    ResultRelInfo *last_resultRelInfo =
mtstate->resultRelInfo[mtstate->mt_nplans];

    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        /*
         * Leave any resultRelInfos that belong to the UPDATE's subplan
         * list.  These will be closed during executor shutdown.
         */
        if (resultRelInfo >= first_resultRelInfo &&
            resultRelInfo < last_resultRelInfo)
            continue;

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}
else
{
    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.
>
> Did this change in v3 version of
> 0001-Encapsulate-partition-related-info-in-a-structure.patch

I'll have to come back to some of the other open issues, but 0001 and
0005 look good to me now, so I pushed those as a single commit after
fixing a few things that pgindent didn't like.  I also think 0002 and
0003 look basically good, so I pushed those two as a single commit
also.  But the comment changes in 0003 didn't seem extensive enough to
me so I made a few more changes there along the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 5 January 2018 at 03:04, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.
>>
>> Did this change in v3 version of
>> 0001-Encapsulate-partition-related-info-in-a-structure.patch
>
> I'll have to come back to some of the other open issues, but 0001 and
> 0005 look good to me now, so I pushed those as a single commit after
> fixing a few things that pgindent didn't like.  I also think 0002 and
> 0003 look basically good, so I pushed those two as a single commit
> also.  But the comment changes in 0003 didn't seem extensive enough to
> me so I made a few more changes there along the way.

Thanks. Attached is a rebased update-partition-key_v34.patch, which
also has the changes as per David Rowley's review comments as
explained below.

The above patch is to be applied over the last remaining preparatory
patch, now named (and attached):
0001-Refactor-CheckConstraint-related-code.patch

On 3 January 2018 at 19:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've not finished looking at the regression tests yet, but here are a
> few things, some may have been changed in v33, I've not looked yet.
> Also apologies in advance if anything seems nitpicky.
No worries. In fact, it's good to do this right now; otherwise it's
difficult to notice and fix at a later point in time. Thanks.

>
> 1. "by INSERT" -> "by an INSERT" in:
>
>     from the original partition followed by <command>INSERT</command> into the
>
> 2. "and INSERT" -> "and an INSERT" in:
>
>     a <command>DELETE</command> and <command>INSERT</command>. As far as
>
> 3. "due partition-key change" -> "due to the partition-key being changed" in:
>
>  * capture is happening for UPDATEd rows being moved to another partition due
>  * partition-key change, then this function is called once when the row is
>
> 4. "inserted to another" -> "inserted into another" in:
>
>  * deleted (to capture OLD row), and once when the row is inserted to another
>
> 5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for
> UPDATE events" (plural)
>
> * oldtup and newtup are non-NULL.  But for UPDATE event fired for
>
> I'm unsure if you need singular or plural. It perhaps does not matter.
>
> 6. "for row" -> "for a row" in:
>
> * movement, oldtup is NULL when the event is for row being inserted,
>
> Likewise in:
>
> * whereas newtup is NULL when the event is for row being deleted.

Done all of the above.

>
> 7. In the following fragment the code does not do what the comment says:
>
> /*
> * If transition tables are the only reason we're here, return. As
> * mentioned above, we can also be here during update tuple routing in
> * presence of transition tables, in which case this function is called
> * separately for oldtup and newtup, so either can be NULL, not both.
> */
> if (trigdesc == NULL ||
> (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
> (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
> (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
> (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
> return;
>
> With the comment; "so either can be NULL, not both.", I'd expect a
> boolean OR not an XOR.
>
> maybe the comment is better written as:
>
> "so we expect exactly one of them to be non-NULL"

OK. Made it: "so we expect exactly one of them to be NULL"

>
> (I know you've been discussing with Robert, so I've not checked v33 to
> see if this still exists)

Yes, it's not yet concluded.

>
> 8. I'm struggling to make sense of this:
>
> /*
> * Save a tuple conversion map to convert a tuple routed to this
> * partition from the parent's type to the partition's.
> */
>
> Maybe it's better to write this as:
>
> /*
> * Generate a tuple conversion map to convert tuples of the parent's
> * type into the partition's type.
> */

This is existing code; not from my patch.

>
> 9. insert should be capitalised here and should be prefixed with "an":
>
> /*
> * Verify result relation is a valid target for insert operation. Even
> * for updates, we are doing this for tuple-routing, so again, we need
> * to check the validity for insert operation.
> */
> CheckValidResultRel(leaf_part_rri, CMD_INSERT);
>
> Maybe it's better to write:
>
> /*
> * Verify result relation is a valid target for an INSERT.  An UPDATE of
> * a partition-key becomes a DELETE/INSERT operation, so this check is
> * still required when the operation is CMD_UPDATE.
> */

Done. Instead of DELETE/INSERT, used DELETE+INSERT.

>
> 10. The following code would be more clear if you replaced
> mtstate->mt_transition_capture with transition_capture.
>
> if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
> && mtstate->mt_transition_capture->tcs_update_new_table)
> {
> ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> NULL,
> tuple,
> NULL,
> mtstate->mt_transition_capture);
>
> /*
> * Now that we have already captured NEW TABLE row, any AR INSERT
> * trigger should not again capture it below. Arrange for the same.
> */
> transition_capture = NULL;
> }
>
> You are, after all, doing:
>
> transition_capture = mtstate->mt_transition_capture;
>
> at the top of the function. There are a few other places you're also
> accessing mtstate->mt_transition_capture.

Actually I wanted a temporary variable whose scope is limited to
ExecARInsertTriggers(). But because that wasn't possible, I had to
declare it at the top. I feel that if we use transition_capture all
over, and some future code below the NULL assignment starts using
transition_capture, it will wrongly get the left-over NULL value.

Instead, what I have done is use a variable dedicated to this purpose:
ar_insert_trig_tcs, so that, going by its name, other code won't use
this variable. I also moved its assignment down to where it is first
used.

Similarly for ExecDelete(), I used ar_delete_trig_tcs.
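
Roughly, the relevant part of ExecInsert() now looks something like
this (a simplified sketch, not the exact hunk):

TransitionCaptureState *ar_insert_trig_tcs;
...
ar_insert_trig_tcs = mtstate->mt_transition_capture;
if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture &&
    mtstate->mt_transition_capture->tcs_update_new_table)
{
    /* The AR UPDATE trigger captures the NEW TABLE row here ... */
    ExecARUpdateTriggers(estate, resultRelInfo, NULL, NULL, tuple,
                         NULL, mtstate->mt_transition_capture);

    /* ... so the AR INSERT trigger below must not capture it again. */
    ar_insert_trig_tcs = NULL;
}

ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
                     ar_insert_trig_tcs);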

>
> 11. Should tuple_deleted and process_returning be camelCase like the
> other params?:
>
> static TupleTableSlot *
> ExecDelete(ModifyTableState *mtstate,
>    ItemPointer tupleid,
>    HeapTuple oldtuple,
>    TupleTableSlot *planSlot,
>    EPQState *epqstate,
>    EState *estate,
>    bool *tuple_deleted,
>    bool process_returning,
>    bool canSetTag)

Done.

>
> 12. The following comment talks about "target table descriptor", which
> I think is a good term. In several other places, you mention "root",
> which I take it to mean "target table".
>
>  * This map array is required for two purposes :
>  * 1. For update-tuple-routing. We need to convert the tuple from the subplan
>  * result rel to the root partitioned table descriptor.
>  * 2. For capturing transition tuples when the target table is a partitioned
>  * table. For updates, we need to convert the tuple from subplan result rel to
>  * target table descriptor, and for inserts, we need to convert the inserted
>  * tuple from leaf partition to the target table descriptor.
>
> I'd personally rather we always talked about "target" rather than
> "root". I understand there's probably many places in the code
> where we talk about the target table as "root", but I really think we
> need to fix that, so I'd rather not see the problem get any worse
> before it gets better.

I'm not very sure that's true in all places. In some contexts it makes
sense to use "root" to emphasize that it is the root partitioned table,
e.g.:

+ * For ExecInsert(), make it look like we are inserting into the
+ * root.
+ */
+ Assert(mtstate->rootResultRelInfo != NULL);
+ estate->es_result_relation_info = mtstate->rootResultRelInfo;

+ * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+ * should convert the tuple into root's tuple descriptor, since
+ * ExecInsert() starts the search from root.  The tuple conversion

>
> The comment block might also look better if you tab indent after the
> 1. and 2. then on each line below it.

Used spaces instead of tabs, because a tab pushed the text too far
away from the numbers, which looked odd.

> Also the space before the ':' is not correct.
Done

>
> 13. Does the following code really need to palloc0 rather than just palloc?
>
> /*
> * Build array of conversion maps from each child's TupleDesc to the
> * one used in the tuplestore.  The map pointers may be NULL when no
> * conversion is necessary, which is hopefully a common case for
> * partitions.
> */
> mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
> palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
>
> I don't see any case in the initialization of the array where any of
> the elements are not assigned a value, so I think palloc() is fine.

Right. Used palloc().

>
> 14. I don't really like the way tupconv_map_for_subplan() works. It
> would be nice to have two separate functions for this, but looking a
> bit more at it, it seems the caller won't just need to always call
> exactly one of those functions. I don't have any ideas to improve it,
> so this is just a note.

I am assuming you mean one function for the case where
mt_is_tupconv_perpart is true, and the other for when it is not. The
idea is that the caller should not have to worry about whether the map
is per-subplan or not.

>
> 15. I still don't really like the way ExecInitModifyTable() sets and
> unsets update_tuple_routing_needed. I know we talked about this
> before, but couldn't you just change:
>
> if (resultRelInfo->ri_TrigDesc &&
> resultRelInfo->ri_TrigDesc->trig_update_before_row &&
> operation == CMD_UPDATE)
> update_tuple_routing_needed = true;
>
> To:
>
> if (resultRelInfo->ri_TrigDesc &&
> resultRelInfo->ri_TrigDesc->trig_update_before_row &&
> node->partitioned_rels != NIL &&
> operation == CMD_UPDATE)
> update_tuple_routing_needed = true;
>
> and get rid of:
> .....
> if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
> update_tuple_routing_needed = false;
>
> looking at inheritance_planner(), partitioned_rels is only set to a
> non-NIL value if parent_rte->relkind == RELKIND_PARTITIONED_TABLE.
>

Initially, update_tuple_routing_needed can already be true because of:
bool update_tuple_routing_needed = node->partKeyUpdated;

So if it's not a partitioned table and update_tuple_routing_needed is
set to true due to the above declaration, the variable will remain
true if we don't check the relkind at the end, which means the final
conclusion will be that update-tuple-routing is required when it really
is not. Now, I understand that node->partKeyUpdated should not be true
if it's not a partitioned table, but I think we'd better play it safe
here. partKeyUpdated, as per its name, implies that some of the
partition key columns are updated; it does not imply whether the
target table is a partitioned table or just a partition.

> 16. "named" -> "target" in:
>
>  * 'partKeyUpdated' is true if any partitioning columns are being updated,
>  * either from the named relation or a descendent partitioned table.
>
> I guess we're calling this one of; root, named, target :-(

Changed it to:
* either from the target relation or a descendent partitioned table.

>
> 17. You still have the following comment in ModifyTableState but
> you've moved all those fields out to PartitionTupleRouting:
>
> /* Tuple-routing support info */

This comment applies to the mt_partition_tuple_routing field.

>
> 18. Should the following not be just called partKeyUpdate (without the 'd')?
>
> bool partKeyUpdated; /* some part key in hierarchy updated */
>
> This occurs in the planner, where the part key is certainly being updated.
>

Actually, the way it is named, it can mean the partition key "is
updated", "has been updated", or "is being updated", all of which make
sense. This is consistent with the name
RangeTblEntry->updatedCols, which means "which of the columns are being
updated".

> 19. In pathnode.h you've named a parameter partColsUpdated, but the
> function in the .c file calls it partKeyUpdated.

Renamed partColsUpdated to partKeyUpdated.

>
> I'll try to look at the tests tomorrow and also do some testing. So
> far I've only read the code and the docs.

Thanks David. Your review is valuable.


> 20. "carried" -> "carried out the"
>
> +       would have identified the newly updated row and carried
> +       <command>UPDATE</command>/<command>DELETE</command> on this new row

Done.

>
> 21. Extra new line
>
> +   <xref linkend="ddl-partitioning-declarative-limitations">.
> +
>    </para>

Done.

I am not sure when exactly, but this line has started giving compile
errors, seemingly because > should be />. Fixed it.

>
> 22. In copy.c CopyFrom() you have the following code:
>
> /*
>  * We might need to convert from the parent rowtype to the
>  * partition rowtype.
>  */
> map = proute->partition_tupconv_maps[leaf_part_index];
> if (map)
> {
>     Relation partrel = resultRelInfo->ri_RelationDesc;
>
>     tuple = do_convert_tuple(tuple, map);
>
>     /*
>     * We must use the partition's tuple descriptor from this
>     * point on.  Use a dedicated slot from this point on until
>     * we're finished dealing with the partition.
>     */
>     slot = proute->partition_tuple_slot;
>     Assert(slot != NULL);
>     ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
>    ExecStoreTuple(tuple, slot, InvalidBuffer, true);
> }
>
> Should this use ConvertPartitionTupleSlot() instead?

I will have a look at it to see if we can use
ConvertPartitionTupleSlot() without any changes.
(TODO)

>
> 23. Why write;
>
> last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;
>
> when you can write;
>
> last_resultRelInfo = mtstate->resultRelInfo[mtstate->mt_nplans];?

You meant (with &):
> last_resultRelInfo = &mtstate->resultRelInfo[mtstate->mt_nplans];?

I think both are equally good, and equally readable. In this case, we
don't even want the array element, so why not just advance the pointer
by that offset.

>
>
> 24. In ExecCleanupTupleRouting(), do you think that you could just
> have a special case loop for (mtstate && mtstate->operation ==
> CMD_UPDATE)?
>
> /*
> * If this result rel is one of the UPDATE subplan result rels, let
> * ExecEndPlan() close it. For INSERT or COPY, this does not apply
> * because leaf partition result rels are always newly allocated.
> */
> if (is_update &&
>     resultRelInfo >= first_resultRelInfo &&
>     resultRelInfo < last_resultRelInfo)
>     continue;
>
> Something like:
>
> if (mtstate && mtstate->operation == CMD_UPDATE)
> {
>     ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo;
>     ResultRelInfo *last_resultRelInfo =
> mtstate->resultRelInfo[mtstate->mt_nplans];
>
>     for (i = 0; i < proute->num_partitions; i++)
>     {
>         ResultRelInfo *resultRelInfo = proute->partitions[i];
>
>         /*
>          * Leave any resultRelInfos that belong to the UPDATE's subplan
>          * list.  These will be closed during executor shutdown.
>          */
>         if (resultRelInfo >= first_resultRelInfo &&
>             resultRelInfo < last_resultRelInfo)
>             continue;
>
>         ExecCloseIndices(resultRelInfo);
>         heap_close(resultRelInfo->ri_RelationDesc, NoLock);
>     }
> }
> else
> {
>     for (i = 0; i < proute->num_partitions; i++)
>     {
>         ResultRelInfo *resultRelInfo = proute->partitions[i];
>
>         ExecCloseIndices(resultRelInfo);
>         heap_close(resultRelInfo->ri_RelationDesc, NoLock);
>     }
> }

I thought it wasn't worth having two separate loops just to avoid one
if (is_update) check in the insert case. Although we would have one
less is_update check per partition, this code does not run per-row.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> The above patch is to be applied over the last remaining preparatory
> patch, now named (and attached) :
> 0001-Refactor-CheckConstraint-related-code.patch

Committed that one, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'll try to look at the tests tomorrow and also do some testing.

I've made a pass over the tests. Again, sometimes I'm probably a bit
pedantic. The reason for that is that the tests are not that easy to
follow. Moving creation and cleanup of objects closer to where they're
used and no longer needed makes it easier to read through and verify
the tests. There are some genuine mistakes in there too.

1.

   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa

This seems to be worded as if there'd only ever be one number. I think
it should be plural and read "Make even numbers odd, and vice versa"

2. The following comment does not make a huge amount of sense.

-- UPDATE with
-- partition key or non-partition columns, with different column ordering,
-- triggers.

Should "or" be "on"? Does ", triggers" mean "with triggers"?

3. The following test tries to test a BEFORE DELETE trigger stopping a
DELETE on sub_part1, but going by the SELECT, there are no rows in
that table to stop being DELETEd.

select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part2  | 1 |  2 | 10
 sub_part2  | 1 |  2 | 70
(4 rows)

drop trigger parted_mod_b ON sub_part1 ;
-- If BR DELETE trigger prevented DELETE from happening, we should also skip
-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
create or replace function func_parted_mod_b() returns trigger as $$
begin return NULL; end $$ language plpgsql;
create trigger trig_skip_delete before delete on sub_part1
   for each row execute procedure func_parted_mod_b();
update list_parted set b = 1 where c = 70;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part1  | 1 |  1 | 70
 sub_part2  | 1 |  2 | 10
(4 rows)

You've added the BEFORE DELETE trigger to sub_part1, but you can see
the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so
the test is not working as you've commented.

It's probably a good idea to RAISE NOTICE 'something useful here'; in
the trigger function to verify they're actually being called in the
test.

4. I think the final drop function in the following should be before
the UPDATE FROM test. You've already done some cleanup for that test
by doing "drop trigger trig_skip_delete ON sub_part1 ;"

drop trigger trig_skip_delete ON sub_part1 ;
-- UPDATE partition-key with FROM clause. If join produces multiple output
-- rows for the same row to be modified, we should tuple-route the row
only once.
-- There should not be any rows inserted.
create table non_parted (id int);
insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 |  1 | 70
 list_part1 | 2 |  2 | 10
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
(4 rows)

drop table non_parted;
drop function func_parted_mod_b();

Also, there's a space before the ; in the drop trigger above. Can that
be removed?

5. The following comment:

-- update to a partition should check partition bound constraint for
the new tuple.
-- If partition key is updated, the row should be moved to the appropriate
-- partition. updatable views using partitions should enforce the check options
-- for the rows that have been moved.

Can this be changed a bit? I think it's not accurate to say that an
update to a partition key causes the row to move. The row movement
only occurs when the new tuple does not match the partition bound and
another partition exists that does have a partition bound that matches
the tuple. How about:

-- When a partitioned table receives an UPDATE to the partitioned key and the
-- new values no longer meet the partition's bound, the row must be moved to
-- the correct partition for the new partition key (if one exists). We must
-- also ensure that updatable views on partitioned tables properly enforce any
-- WITH CHECK OPTION that is defined. The situation with triggers in this case
-- also requires thorough testing as partition key updates causing row
-- movement convert UPDATEs into DELETE+INSERT.

6. What does the following actually test?

-- This tests partition-key UPDATE on a partitioned table that does
not have any child partitions
update part_b_10_b_20 set b = b - 6;

There are no records in that partition, or anywhere in the hierarchy.
Are you just testing that there's no error? If so then the comment
should say so.

7. I think the following comment:

-- As mentioned above, the partition creation is intentionally kept in
descending bound order.

should instead say:

-- Create some more partitions following the above pattern of descending bound
-- order, but let's make the situation a bit more complex by having the
-- attribute numbers of the columns vary from their parent partition.

8. Just to make the tests a bit easier to follow, can you move the
following down to where you're first using it:

create table mintab(c1 int);
insert into mintab values (120);

and

CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1
from mintab) WITH CHECK OPTION;

9. It seems that the existing part of update.sql capitalises SQL
keywords, but you mostly don't. I understand we're not always
consistent, but can you keep it the same as the existing part of the
file?

10. Stray space before trailing ':'

-- fail (row movement happens only within the partition subtree) :

11. Can the following become:

-- succeeds, row movement , check option passes

-- success, update with row movement, check option passes:

Seems there's also quite a mix of comment formats in your tests.

You're using either one of; ok, success, succeeds followed by
sometimes a comma, and sometimes a reason in parentheses. The existing
part of the file seems to use:

-- fail, <reason>:

and just

-- <reason>:

for non-failures.

Would be great to stick to what's there.

12. The following comment seems to indicate that you're installing
triggers on all leaf partitions, but that's not the case:

-- Install BR triggers on child partition, so that transition tuple
conversion takes place.

maybe you should write "on some child partitions"? Or did you mean to
define a trigger on them all?

13. Stray space at the end of the case statement:

update range_parted set c = (case when c = 96 then 110 else c + 1 end
) where a = 'b' and b > 10 and c >= 96;

14. Stray space in the USING clause:

create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

15. we -> we're

-- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.

16. The comment probably should be before the "update range_parted",
not the "set session authorization":

-- This should fail with RLS violation error while moving row from
-- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.
set session authorization regress_range_parted_user;
update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;

17. trigger -> the trigger function

-- part_d_1_15, because trigger makes 'c' value an even number.

likewise in:

-- This should fail with RLS violation error because trigger makes 'c' value
-- an odd number.

18. Why two RESET SESSION AUTHORIZATIONs?

reset session authorization;
drop trigger trig_d_1_15 ON part_d_1_15;
drop function func_d_1_15();
-- Policy expression contains SubPlan
reset session authorization;

19. The following should be cleaned up in the final test that it's used
in, rather than after some unrelated later test:

drop table mintab;

20. Comment is not worded very well:

-- UPDATE which does not modify partition key of partitions that are
chosen for update.

Does "partitions that are chosen for update" mean "the UPDATE target"?

I'm also not quite sure what the test is testing. In the past I've
written tests that have a header comment as -- Ensure that <what the
test is testing>. Perhaps if you can't think of what you're ensuring
with the test, then the test might not be that worthwhile.

21. The following comment could be improved:

-- Triggers can cause UPDATE row movement if it modified partition key.

Might be better to write:

-- Tests for BR UPDATE triggers changing the partition key.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Jan 4, 2018 at 1:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> ------------------
> 1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
> ------------------
>
>>> It seems like the ON CONFLICT stuff handled that by adding a
>>> second TransitionCaptureState pointer to ModifyTable, thus
>>> mt_transition_capture and mt_oc_transition_capture.  By that
>>> precedent, we could add mt_utr_transition_capture or similar, and
>>> maybe that's the way to go.  It seems a bit unsatisfying, but so does
>>> what you have now.
>>
>> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
>> triggers referencing transition tables, both of the triggers need to
>> independently populate their own transition tables, and hence the need
>> for two separate transition states : mt_transition_capture and
>> mt_oc_transition_capture. But in case of update-tuple-routing, the
>> INSERT statement trigger won't come into picture. So the same
>> mt_transition_capture can serve the purpose of populating the
>> transition table with OLD and NEW rows. So I think it would be too
>> redundant, if not incorrect, to have a whole new transition state for
>> update tuple routing.
>>
>> I will see if it turns out better to have two tcs_maps in
>> TransitionCaptureState, one for update and one for insert. But this,
>> on first look, does not look good.
>
> Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
> insert_tcs_maps for UPDATE/DELETE and INSERT events respectively.

That's not what I suggested.  If you look at what I wrote, I floated
the idea of having two TransitionCaptureStates, not two separate maps
within the same TransitionCaptureState.

> ------------------
> 2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
> ------------------
> ------------------
> 3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
> ------------------
>
> We need to change it's name because now this map is not only used for
> transition capture, but also for update-tuple-routing. Does it look ok
> for you if, for readability, we keep the childparent tag ? Or else, we
> can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
> looks more informative.

I see your point: the array is being renamed because it now has more
than one purpose.  But that's also what I'm complaining about with
regard to point #2: the same array is being used for more than one
purpose.  That's generally bad style.  If you have two loops in a
function, it's best to declare two separate loop variables rather than
reusing the same variable.  This lets the compiler detect, for
example, an error where the second loop variable is used before it's
initialized, which would be undetectable if you reused the same
variable in both places.  Although that particular benefit doesn't
pertain in this case, I maintain that having a single structure member
that is indexed one of two different ways is a bad idea.

If I understand correctly, the way we got here is that, in earlier
patch versions, you had two arrays of maps, but it wasn't clear why we
needed both of them, and David suggested replacing one of them with an
array of indexes instead, in the hopes of reducing confusion.
However, it looks to me like that didn't really work out.  If we
always needed both maps, or even if we always needed the per-leaf map,
it would have been a good idea, but it seems here that we can need
either the per-leaf map or the per-subplan map or both or neither, and
we want to avoid computing all of the per-leaf conversion maps if we
only need per-subplan access.

I think one way to fix this might be to build the per-leaf maps on
demand.  Just because we're doing UPDATE tuple routing doesn't
necessarily mean we'll actually need a TupleConversionMap for every
child.  So we could allocate an array with one byte per leaf, where 0
means we don't know whether tuple conversion is necessary, 1 means it
is not, and 2 means it is, or something like that.  Then we have a
second array with conversion maps.  We provide a function
tupconv_map_for_leaf() or similar that checks the array; if it finds
1, it returns NULL; if it finds 2, it returns the conversion map
previously calculated. If it finds 0, it calls convert_tuples_by_name,
caches the result for later, updates the one-byte-per-leaf array with
the appropriate value, and returns the just-computed conversion map.
(The reason I'm suggesting 0/1/2 instead of just true/false is to
reduce cache misses; if we find a 1 in the first array we don't need
to access the second array at all.)
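
In code, that could look roughly like this (the field names here are
just placeholders):

TupleConversionMap *
tupconv_map_for_leaf(PartitionTupleRouting *proute, Relation rootrel,
                     int leaf_index)
{
    Relation    partrel;
    TupleConversionMap *map;

    switch (proute->child_parent_map_status[leaf_index])
    {
        case 1:                 /* known: no conversion needed */
            return NULL;
        case 2:                 /* known: conversion needed, map already built */
            return proute->child_parent_tupconv_maps[leaf_index];
        default:                /* 0: not yet determined */
            partrel = proute->partitions[leaf_index]->ri_RelationDesc;
            map = convert_tuples_by_name(RelationGetDescr(partrel),
                                         RelationGetDescr(rootrel),
                                         gettext_noop("could not convert row type"));
            proute->child_parent_tupconv_maps[leaf_index] = map;
            proute->child_parent_map_status[leaf_index] = (map == NULL) ? 1 : 2;
            return map;
    }
}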

If that doesn't seem like a good idea for some reason, then my second
choice would be to leave mt_transition_tupconv_maps named the way it
is currently and have a separate mt_update_tupconv_maps, with the two
pointing, if both are initialized and as far as possible, to the same
TupleConversionMap objects.

> -------------------
> 4. Explicit signaling for "we are only here for transition tables"
> -------------------
>
> I had given a thought on this earlier. I felt, even the pre-existing
> conditions like "!trigdesc->trig_update_after_row" are all indirect
> ways to determine that this function is called only to capture
> transition tables, and thought that it may have been better to have
> separate parameter transition_table_only.

I see your point. I guess it's not really this patch's job to solve
this problem, although I think this is going to need some refactoring
in the not-too-distant future.  So I think the way you did it is
probably OK.

> Instead of adding another parameter to AfterTriggerSaveEvent(), I had
> also considered another approach: Put the transition-tuples-capture
> logic part of AfterTriggerSaveEvent() into a helper function
> CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
> of calling ExecARUpdateTriggers(), call this function
> CaptureTransitionTables(). I then dropped this idea and thought rather
> to call ExecARUpdateTriggers() which neatly does the required checks
> and other things like locking the old tuple via GetTupleForTrigger().
> So if we go by CaptureTransitionTables(), we would need to do what
> ExecARUpdateTriggers() does before calling CaptureTransitionTables().
> This is doable. If you think this is worth doing so as to get rid of
> the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

Duplicating logic elsewhere to avoid this problem here doesn't seem
like a good plan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> The above patch is to be applied over the last remaining preparatory
>> patch, now named (and attached) :
>> 0001-Refactor-CheckConstraint-related-code.patch
>
> Committed that one, too.

Some more comments on the main patch:

I don't really like the fact that ExecCleanupTupleRouting() now takes
a ModifyTableState as an argument, particularly because of the way
that is using that argument.  To figure out whether a ResultRelInfo
was pre-existing or one it created, it checks whether the pointer
address of the ResultRelInfo is >= mtstate->resultRelInfo and <
mtstate->resultRelInfo + mtstate->mt_nplans.  However, that means that
ExecCleanupTupleRouting() ends up knowing about the memory allocation
pattern used by ExecInitModifyTable(), which seems like a slightly
dangerous amount of action at a distance.  I think it would be better
for the PartitionTupleRouting structure to explicitly indicate which
ResultRelInfos should be closed, for example by storing a Bitmapset
*input_partitions.  (Here, by "input", I mean "provided from the
mtstate rather than created by the PartitionTupleRouting structure;
other naming suggestions welcome.)  When
ExecSetupPartitionTupleRouting latches onto a partition, it can do
proute->input_partitions = bms_add_member(proute->input_partitons, i).
In ExecCleanupTupleRouting, it can do if
(bms_is_member(proute->input_partitions, i)) continue.
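
For illustration, the cleanup loop under that scheme could look roughly
like this (assuming the suggested input_partitions field):

void
ExecCleanupTupleRouting(PartitionTupleRouting *proute)
{
    int         i;

    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        /*
         * ResultRelInfos borrowed from the mtstate's subplan list are
         * closed by ExecEndPlan(), not here.
         */
        if (bms_is_member(i, proute->input_partitions))
            continue;

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}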

We have a test, in the regression test suite for file_fdw, which
generates the message "cannot route inserted tuples to a foreign
table".  I think we should have a similar test for the case where an
UPDATE tries to move a tuple from a regular partition to a foreign
table partition.  I'm not sure if it should fail with the same error
or a different one, but I think we should have a test that it fails
cleanly and with a nice error message of some sort.

The comment for get_partitioned_child_rels() claims that it sets
is_partition_key_update, but it really sets *is_partition_key_update.
And I think instead of "is a partition key" it should say "is used in
the partition key either of the relation whose RTI is specified or of
any child relation."  I propose "used in" instead of "is" because
there can be partition expressions, and the rest is to clarify that
child partition keys matter.

create_modifytable_path uses partColsUpdated rather than
partKeyUpdated, which actually seems like better terminology.  I
propose partKeyUpdated -> partColsUpdated everywhere.  Also, why use
is_partition_key_update for basically the same thing in some other
places?  I propose changing that to partColsUpdated as well.

The capitalization of the first comment hunk in execPartition.h is strange.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> 1.
>
>    NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
>
> This seems to be worded as if there'd only ever be one number. I think
> it should be plural and read "Make even numbers odd, and vice versa"

Done.

>
> 2. The following comment does not make a huge amount of sense.
>
> -- UPDATE with
> -- partition key or non-partition columns, with different column ordering,
> -- triggers.
>
> Should "or" be "on"? Does ", triggers" mean "with triggers"?

Actually I was trying to summarize what kinds of scenarios are going
to be tested. Now I think we don't have to give this summary. Rather,
we should describe each of the scenarios individually. But I did want
to use list partitions in at least a subset of the update-partition-key
scenarios. So I have removed this comment and replaced it with:

-- Some more update-partition-key test scenarios below. This time use list
-- partitions.

>
> 3. The follow test tries to test a BEFORE DELETE trigger stopping a
> DELETE on sub_part1, but going by the SELECT, there are no rows in
> that table to stop being DELETEd.
>
> select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
>   tableoid  | a | b  | c
> ------------+---+----+----
>  list_part1 | 2 | 52 | 50
>  list_part1 | 3 |  6 | 60
>  sub_part2  | 1 |  2 | 10
>  sub_part2  | 1 |  2 | 70
> (4 rows)
>
> drop trigger parted_mod_b ON sub_part1 ;
> -- If BR DELETE trigger prevented DELETE from happening, we should also skip
> -- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
> create or replace function func_parted_mod_b() returns trigger as $$
> begin return NULL; end $$ language plpgsql;
> create trigger trig_skip_delete before delete on sub_part1
>    for each row execute procedure func_parted_mod_b();
> update list_parted set b = 1 where c = 70;
> select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
>   tableoid  | a | b  | c
> ------------+---+----+----
>  list_part1 | 2 | 52 | 50
>  list_part1 | 3 |  6 | 60
>  sub_part1  | 1 |  1 | 70
>  sub_part2  | 1 |  2 | 10
> (4 rows)
>
> You've added the BEFORE DELETE trigger to sub_part1, but you can see
> the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so
> the test is not working as you've commented.
>
> It's probably a good idea to RAISE NOTICE 'something useful here'; in
> the trigger function to verify they're actually being called in the
> test.

Done. The trigger should have been for sub_part2, not sub_part1. Corrected that.
Also, dropped the trigger and again tested the UPDATE.

>
> 4. I think the final drop function in the following should be before
> the UPDATE FROM test. You've already done some cleanup for that test
> by doing "drop trigger trig_skip_delete ON sub_part1 ;"
>
> drop trigger trig_skip_delete ON sub_part1 ;
> -- UPDATE partition-key with FROM clause. If join produces multiple output
> -- rows for the same row to be modified, we should tuple-route the row
> only once.
> -- There should not be any rows inserted.
> create table non_parted (id int);
> insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
> update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
> select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
>   tableoid  | a | b  | c
> ------------+---+----+----
>  list_part1 | 2 |  1 | 70
>  list_part1 | 2 |  2 | 10
>  list_part1 | 2 | 52 | 50
>  list_part1 | 3 |  6 | 60
> (4 rows)
>
> drop table non_parted;
> drop function func_parted_mod_b();

Done. Moved it to the relevant place.

>
> Also, there's a space before the ; in the drop trigger above. Can that
> be removed?
Removed.

>
> 5. The following comment:
>
> -- update to a partition should check partition bound constraint for
> the new tuple.
> -- If partition key is updated, the row should be moved to the appropriate
> -- partition. updatable views using partitions should enforce the check options
> -- for the rows that have been moved.
>
> Can this be changed a bit? I think it's not accurate to say that an
> update to a partition key causes the row to move. The row movement
> only occurs when the new tuple does not match the partition bound and
> another partition exists that does have a partition bound that matches
> the tuple. How about:
>
> -- When a partitioned table receives an UPDATE to the partitioned key and the
> -- new values no longer meet the partition's bound, the row must be moved to
> -- the correct partition for the new partition key (if one exists). We must
> -- also ensure that updatable views on partitioned tables properly enforce any
> -- WITH CHECK OPTION that is defined. The situation with triggers in this case
> -- also requires thorough testing as partition key updates causing row
> -- movement convert UPDATEs into DELETE+INSERT.

Done.

>
> 6. What does the following actually test?
>
> -- This tests partition-key UPDATE on a partitioned table that does
> not have any child partitions
> update part_b_10_b_20 set b = b - 6;
>
> There are no records in that partition, or anywhere in the hierarchy.
> Are you just testing that there's no error? If so then the comment
> should say so.

Yes, I understand that there won't be any update scan plans. But, with
the modifications done in ExecInitModifyTable(), I wanted to run that
code with this scenario where there are no partitions, to make sure it
does not behave weirdly or crash. Any suggestions for comments, given
this perspective? For now, I have made the comment this way:

-- Check that partition-key UPDATE works sanely on a partitioned table
-- that does not have any child partitions.


>
> 7. I think the following comment:
>
> -- As mentioned above, the partition creation is intentionally kept in
> descending bound order.
>
> should instead say:
>
> -- Create some more partitions following the above pattern of descending bound
> -- order, but let's make the situation a bit more complex by having the
> -- attribute numbers of the columns vary from their parent partition.

Done.

>
> 8. Just to make the tests a bit easier to follow, can you move the
> following down to where you're first using it:
>
> create table mintab(c1 int);
> insert into mintab values (120);
>
> and
>
> CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1
> from mintab) WITH CHECK OPTION;

Done.

>
> 9. It seems that the existing part of update.sql capitalises SQL
> keywords, but you mostly don't. I understand we're not always
> consistent, but can you keep it the same as the existing part of the
> file?

Done.

>
> 10. Stray space before trailing ':'
>
> -- fail (row movement happens only within the partition subtree) :

Done, at other applicable places also.

>
> 11. Can the following become:
>
> -- succeeds, row movement , check option passes
>
> -- success, update with row movement, check option passes:
>
> Seems there's also quite a mix of comment formats in your tests.
>
> You're using either one of; ok, success, succeeds followed by
> sometimes a comma, and sometimes a reason in parentheses. The existing
> part of the file seems to use:
>
> -- fail, <reason>:
>
> and just
>
> -- <reason>:
>
> for non-failures.
>
> Would be great to stick to what's there.

There were existing lines where "ok, " was used.
So I have now used this style everywhere:
ok, ...
fail, ...

>
> 12. The following comment seems to indicate that you're installing
> triggers on all leaf partitions, but that's not the case:
>
> -- Install BR triggers on child partition, so that transition tuple
> conversion takes place.
>
> maybe you should write "on some child partitions"? Or did you mean to
> define a trigger on them all?

The trigger should be installed at least on the partitions onto which
rows are moved. I have corrected the comment accordingly.

Actually, testing transition tuple conversion with update row movement
requires a statement-level trigger that references transition tables,
and trans_updatetrig had already been dropped. So transition tuple
conversion for rows being inserted did not get tested (I had tested it
manually, though). So I have moved the drop statement down.

>
> 13. Stray space at the end of the case statement:
>
> update range_parted set c = (case when c = 96 then 110 else c + 1 end
> ) where a = 'b' and b > 10 and c >= 96;

Done.

>
> 14. Stray space in the USING clause:
>
> create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
Done

>
> 15. we -> we're
> -- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.
Changed it to "we are"

>
> 16. The comment probably should be before the "update range_parted",
> not the "set session authorization":
> -- This should fail with RLS violation error while moving row from
> -- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.
> set session authorization regress_range_parted_user;
> update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;

Moved "set session authorization" statement above the comment.

>
> 17. trigger -> the trigger function
>
> -- part_d_1_15, because trigger makes 'c' value an even number.
>
> likewise in:
>
> -- This should fail with RLS violation error because trigger makes 'c' value
> -- an odd number.

I have made changes to the comment to make it clearer. The comment now
contains the phrase "trigger at the destination partition again makes it
an even number". With this phrase, "trigger function at destination
partition" would look odd, so I think "trigger at destination partition
makes ..." reads OK. It is implied that it is the trigger function
that actually changes the value.

>
> 18. Why two RESET SESSION AUTHORIZATIONs?
>
> reset session authorization;
> drop trigger trig_d_1_15 ON part_d_1_15;
> drop function func_d_1_15();
> -- Policy expression contains SubPlan
> reset session authorization;

The second reset is actually in a different paragraph. The reason it's
there is to ensure we have reset it regardless of the earlier cleanup.

>
> 19. The following should be cleaned up in the final test that its used
> on rather than randomly after the next test after it:
>
> drop table mintab;

Done.

>
> 20. Comment is not worded very well:
>
> -- UPDATE which does not modify partition key of partitions that are
> chosen for update.
>
> Does "partitions that are chosen for update" mean "the UPDATE target"?

Actually it means the partitions participating in the update subplans,
i.e. the unpruned ones.

I have modified the comment to:
-- Test update-partition-key, where the unpruned partitions do not have their
-- partition keys updated.

>
> I'm also not quite sure what the test is testing. In the past I've
> written tests that have a header comment as -- Ensure that <what the
> test is testing>. Perhaps if you can't think of what you're ensuring
> with the test, then the test might not be that worthwhile.

I am just testing that the update behaves sanely in the particular scenario.

BTW, it was a conscious decision that in this particular scenario, we
still conclude internally that update-tuple-routing is needed, and do
the tuple-routing setup.

>
> 21. The following comment could be improved:
>
> -- Triggers can cause UPDATE row movement if it modified partition key.
>
> Might be better to write:
>
> -- Tests for BR UPDATE triggers changing the partition key.

Done

I have also addressed the following suggestion of yours:

>
> 22. In copy.c CopyFrom() you have the following code:
>
> /*
>  * We might need to convert from the parent rowtype to the
>  * partition rowtype.
>  */
> map = proute->partition_tupconv_maps[leaf_part_index];
> if (map)
> {
>     Relation partrel = resultRelInfo->ri_RelationDesc;
>
>     tuple = do_convert_tuple(tuple, map);
>
>     /*
>     * We must use the partition's tuple descriptor from this
>     * point on.  Use a dedicated slot from this point on until
>     * we're finished dealing with the partition.
>     */
>     slot = proute->partition_tuple_slot;
>     Assert(slot != NULL);
>     ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
>    ExecStoreTuple(tuple, slot, InvalidBuffer, true);
> }
>
> Should this use ConvertPartitionTupleSlot() instead?


Attached v35 patch. Thanks.

Attachment

Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
Thanks for making those changes.

On 11 January 2018 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Yes, I understand that there won't be any update scan plans. But, with
> the modifications done in ExecInitModifyTable(), I wanted to run that
> code with this scenario where there are no partitions, to make sure it
> does not behave weirdly or crash. Any suggestions for comments, given
> this perspective ? For now, I have made the comment this way:
>
> -- Check that partition-key UPDATE works sanely on a partitioned table
> that does not have any child partitions.

Sounds good.

>> 18. Why two RESET SESSION AUTHORIZATIONs?
>>
>> reset session authorization;
>> drop trigger trig_d_1_15 ON part_d_1_15;
>> drop function func_d_1_15();
>> -- Policy expression contains SubPlan
>> reset session authorization;
>
> The second reset is actually in a different paragraph. The reason it's
> there is to ensure we have reset it regardless of the earlier cleanup.

hmm, I was reviewing the .out file, which does not have the empty
lines. Still seems a bit surplus.

> Attached v35 patch. Thanks.

Thanks. I'll try to look at it soon.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 11 January 2018 at 10:44, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> 18. Why two RESET SESSION AUTHORIZATIONs?
>>>
>>> reset session authorization;
>>> drop trigger trig_d_1_15 ON part_d_1_15;
>>> drop function func_d_1_15();
>>> -- Policy expression contains SubPlan
>>> reset session authorization;
>>
>> The second reset is actually in a different paragraph. The reason it's
>> there is to ensure we have reset it regardless of the earlier cleanup.
>
> hmm, I was reviewing the .out file, which does not have the empty
> lines. Still seems a bit surplus.

I believe the output file does not have the blank lines present in the
.sql file. I was referring to the paragraph in the *.sql* file.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 9 January 2018 at 23:07, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 4, 2018 at 1:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> ------------------
>> 1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
>> ------------------
>>
>>>> It seems like the ON CONFLICT stuff handled that by adding a
>>>> second TransitionCaptureState pointer to ModifyTable, thus
>>>> mt_transition_capture and mt_oc_transition_capture.  By that
>>>> precedent, we could add mt_utr_transition_capture or similar, and
>>>> maybe that's the way to go.  It seems a bit unsatisfying, but so does
>>>> what you have now.
>>>
>>> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
>>> triggers referencing transition tables, both of the triggers need to
>>> independently populate their own transition tables, and hence the need
>>> for two separate transition states : mt_transition_capture and
>>> mt_oc_transition_capture. But in case of update-tuple-routing, the
>>> INSERT statement trigger won't come into picture. So the same
>>> mt_transition_capture can serve the purpose of populating the
>>> transition table with OLD and NEW rows. So I think it would be too
>>> redundant, if not incorrect, to have a whole new transition state for
>>> update tuple routing.
>>>
>>> I will see if it turns out better to have two tcs_maps in
>>> TransitionCaptureState, one for update and one for insert. But this,
>>> on first look, does not look good.
>>
>> Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
>> insert_tcs_maps for UPDATE/DELETE and INSERT events respectively.
>
> That's not what I suggested.  If you look at what I wrote, I floated
> the idea of having two TransitionCaptureStates, not two separate maps
> within the same TransitionCaptureState.

In the first paragraph of my explanation, I was explaining why two
transition capture states do not look like a good idea to me:

>>> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
>>> triggers referencing transition tables, both of the triggers need to
>>> independently populate their own transition tables, and hence the need
>>> for two separate transition states : mt_transition_capture and
>>> mt_oc_transition_capture. But in case of update-tuple-routing, the
>>> INSERT statement trigger won't come into picture. So the same
>>> mt_transition_capture can serve the purpose of populating the
>>> transition table with OLD and NEW rows. So I think it would be too
>>> redundant, if not incorrect, to have a whole new transition state for
>>> update tuple routing.

And in the next paragraph, I explained the other alternative of having
two separate maps rather than two transition states.

>
>> ------------------
>> 2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
>> ------------------
>> ------------------
>> 3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
>> ------------------
>>
>> We need to change it's name because now this map is not only used for
>> transition capture, but also for update-tuple-routing. Does it look ok
>> for you if, for readability, we keep the childparent tag ? Or else, we
>> can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
>> looks more informative.
>
> I see your point: the array is being renamed because it now has more
> than one purpose.  But that's also what I'm complaining about with
> regard to point #2: the same array is being used for more than one
> purpose.  That's generally bad style.  If you have two loops in a
> function, it's best to declare two separate loop variables rather than
> reusing the same variable.  This lets the compiler detect, for
> example, an error where the second loop variable is used before it's
> initialized, which would be undetectable if you reused the same
> variable in both places.  Although that particular benefit doesn't
> pertain in this case, I maintain that having a single structure member
> that is indexed one of two different ways is a bad idea.
>
> If I understand correctly, the way we got here is that, in earlier
> patch versions, you had two arrays of maps, but it wasn't clear why we
> needed both of them, and David suggested replacing one of them with an
> array of indexes instead, in the hopes of reducing confusion.

Slight correction: it was suggested by Amit Langote, not by David.

> However, it looks to me like that didn't really work out.  If we
> always needed both maps, or even if we always needed the per-leaf map,
> it would have been a good idea, but it seems here that we can need
> either the per-leaf map or the per-subplan map or both or neither, and
> we want to avoid computing all of the per-leaf conversion maps if we
> only need per-subplan access.

I was OK with either my approach or Amit Langote's. His approach uses
an array of offsets into the leaf-partition array, which sounded to me
like it might be reusable for some similar purpose later.

>
> I think one way to fix this might be to build the per-leaf maps on
> demand.  Just because we're doing UPDATE tuple routing doesn't
> necessarily mean we'll actually need a TupleConversionMap for every
> child.  So we could allocate an array with one byte per leaf, where 0
> means we don't know whether tuple conversion is necessary, 1 means it
> is not, and 2 means it is, or something like that.  Then we have a
> second array with conversion maps.  We provide a function
> tupconv_map_for_leaf() or similar that checks the array; if it finds
> 1, it returns NULL; if it finds 2, it returns the conversion map
> previously calculated. If it finds 0, it calls convert_tuples_by_name,
> caches the result for later, updates the one-byte-per-leaf array with
> the appropriate value, and returns the just-computed conversion map.
> (The reason I'm suggesting 0/1/2 instead of just true/false is to
> reduce cache misses; if we find a 1 in the first array we don't need
> to access the second array at all.)
>
> If that doesn't seem like a good idea for some reason, then my second
> choice would be to leave mt_transition_tupconv_maps named the way it
> is currently and have a separate mt_update_tupconv_maps, with the two
> pointing, if both are initialized and as far as possible, to the same
> TupleConversionMap objects.

So there are two independent optimizations we are talking about:

1. Create the map only when needed. We may not require a map for a
leaf partition if there is no insert happening to that partition (the
insert may be part of update-tuple-routing or of plain INSERT
tuple-routing). Also, we may not require a map for *every* subplan. It
may happen that many of the update subplans do not return any tuples,
in which case we don't require the maps for the partitions
corresponding to those subplans. This optimization was also suggested
by Thomas Munro initially.

2. In case of UPDATE, for partitions that take part in update scans,
there should be a single map; there should not be two separate maps,
one for accessing per-subplan and the other for accessing per-leaf. My
approach for this was to have a per-leaf array and a per-subplan
array, but they should share the maps wherever possible. I think this
is what you are suggesting in your second choice. The other approach
is as suggested by Amit Langote (which is present in the latest
versions of the patch), where we have an array of maps, and a
subplan-offsets array.


So your preference is for #1. But I think this optimization is not
specific to update-tuple-routing; it has been applicable to inserts
also, from the beginning. And we can do this on-demand stuff for
subplan maps as well.

Both optimizations are good, and they are independently required. But
I think optimization #2 is purely relevant to update-tuple-routing, so
we should do it now. We can do optimization #1 as a general
optimization, over and above optimization #2. So my opinion is that we
do #1 separately, not as part of the update-tuple-routing patch.

For optimization #2 (i.e. your second choice), I can revert to the way
I had it earlier: two different arrays, with the per-leaf array
re-using the per-subplan maps.

Let me know if you are ok with this plan.

Then later, once we do optimization #1, the maps will not just be
shared between the per-subplan and per-leaf arrays; they will also be
created only when required.


Regarding the array names ...

Regardless of the approach, we are going to require two map arrays:
one per-subplan, and the other per-leaf. Now, for transition capture,
we would require both of these maps: per-subplan for capturing updated
rows, and per-leaf for routed rows. And during update-tuple-routing,
for converting the tuple from the source partition to the root
partition, we require only the per-subplan map.

So if we name the per-subplan map mt_transition_tupconv_maps, it
implies the per-leaf map is not used for transition capture, which is
incorrect. The same issue arises if we name the per-leaf map
mt_transition_tupconv_maps.

Update-tuple-routing uses only the per-subplan map, so the per-subplan
map can be named mt_update_tupconv_maps. But again, how can we name the
per-leaf map?

Noting all this, I feel we can go with names according to the
structure of maps. Something like : mt_perleaf_tupconv_maps, and
mt_persubplan_tupconv_maps. Other suggestions welcome.


>
>> -------------------
>> 4. Explicit signaling for "we are only here for transition tables"
>> -------------------
>>
>> I had given a thought on this earlier. I felt, even the pre-existing
>> conditions like "!trigdesc->trig_update_after_row" are all indirect
>> ways to determine that this function is called only to capture
>> transition tables, and thought that it may have been better to have
>> separate parameter transition_table_only.
>
> I see your point. I guess it's not really this patch's job to solve
> this problem, although I think this is going to need some refactoring
> in the not-too-distant future.  So I think the way you did it is
> probably OK.
>
>> Instead of adding another parameter to AfterTriggerSaveEvent(), I had
>> also considered another approach: Put the transition-tuples-capture
>> logic part of AfterTriggerSaveEvent() into a helper function
>> CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
>> of calling ExecARUpdateTriggers(), call this function
>> CaptureTransitionTables(). I then dropped this idea and thought rather
>> to call ExecARUpdateTriggers() which neatly does the required checks
>> and other things like locking the old tuple via GetTupleForTrigger().
>> So if we go by CaptureTransitionTables(), we would need to do what
>> ExecARUpdateTriggers() does before calling CaptureTransitionTables().
>> This is doable. If you think this is worth doing so as to get rid of
>> the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.
>
> Duplicating logic elsewhere to avoid this problem here doesn't seem
> like a good plan.

Yeah, ok.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> In the first paragraph of my explanation, I was explaining why two
> Transition capture states does not look like a good idea to me :

Oh, sorry.  I didn't read what you wrote carefully enough, I guess.

I see your points.  I think that there is probably a general need for
some refactoring here.  AfterTriggerSaveEvent() got significantly more
complicated and harder to understand with the arrival of transition
tables, and this patch is adding more complexity still.  It's also
adding complexity in other places to make ExecInsert() and
ExecDelete() usable for the semi-internal DELETE/INSERT operations
being produced when we split a partition key update into a DELETE and
INSERT pair.  It would be awfully nice to have some better way to
separate out each of the different things we might or might not want
to do depending on the situation: capture old tuple, capture new
tuple, fire before triggers, fire after triggers, count processed
rows, set command tag, perform actual heap operation, update indexes,
etc.  However, I don't have a specific idea how to do it better, so
maybe we should just get this committed for now and perhaps, with more
eyes on the code, someone will have a good idea.

> Slight correction; it was suggested by Amit Langote; not by David.

Oh, OK, sorry.

> So there are two independent optimizations we are talking about :
>
> 1. Create the map only when needed.
> 2. In case of UPDATE, for partitions that take part in update scans,
> there should be a single map; there should not be two separate maps,
> one for accessing per-subplan and the other for accessing per-leaf.

These optimizations aren't completely independent.   Optimization #2
can be implemented in several different ways.  The way you've chosen
to do it is to index the same array in two different ways depending on
whether per-leaf indexing is not needed, which I think is
unacceptable.  Another approach, which I proposed upthread, is to
always build the per-leaf mapping, but you pointed out that this could
involve doing a lot of unnecessary work in the case where most leaves
were pruned.  However, if you also implement #1, then that problem
goes away.  In other words, depending on the design you choose for #2,
you may or may not need to also implement optimization #1 to get good
performance.

To put that another way, I think Amit's idea of keeping a
subplan-offsets array is a pretty good one.  From your comments, you
do too.  But if we want to keep that, then we need a way to avoid the
expense of populating it for leaves that got pruned, except when we
are doing update row movement.  Otherwise, I don't see much choice but
to jettison the subplan-offsets array and just maintain two separate
arrays of mappings.

> Regarding the array names ...
>
> Noting all this, I feel we can go with names according to the
> structure of maps. Something like : mt_perleaf_tupconv_maps, and
> mt_persubplan_tupconv_maps. Other suggestions welcome.

I'd probably do mt_per_leaf_tupconv_maps, since inserting an
underscore between some but not all words seems strange.  But OK
otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 January 2018 at 01:18, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> In the first paragraph of my explanation, I was explaining why two
>> Transition capture states does not look like a good idea to me :
>
> Oh, sorry.  I didn't read what you wrote carefully enough, I guess.
>
> I see your points.  I think that there is probably a general need for
> some refactoring here.  AfterTriggerSaveEvent() got significantly more
> complicated and harder to understand with the arrival of transition
> tables, and this patch is adding more complexity still.  It's also
> adding complexity in other places to make ExecInsert() and
> ExecDelete() usable for the semi-internal DELETE/INSERT operations
> being produced when we split a partition key update into a DELETE and
> INSERT pair.  It would be awfully nice to have some better way to
> separate out each of the different things we might or might not want
> to do depending on the situation: capture old tuple, capture new
> tuple, fire before triggers, fire after triggers, count processed
> rows, set command tag, perform actual heap operation, update indexes,
> etc.  However, I don't have a specific idea how to do it better, so
> maybe we should just get this committed for now and perhaps, with more
> eyes on the code, someone will have a good idea.
>
>> Slight correction; it was suggested by Amit Langote; not by David.
>
> Oh, OK, sorry.
>
>> So there are two independent optimizations we are talking about :
>>
>> 1. Create the map only when needed.
>> 2. In case of UPDATE, for partitions that take part in update scans,
>> there should be a single map; there should not be two separate maps,
>> one for accessing per-subplan and the other for accessing per-leaf.
>
> These optimizations aren't completely independent.   Optimization #2
> can be implemented in several different ways.  The way you've chosen
> to do it is to index the same array in two different ways depending on
> whether per-leaf indexing is not needed, which I think is
> unacceptable.  Another approach, which I proposed upthread, is to
> always built the per-leaf mapping, but you pointed out that this could
> involve doing a lot of unnecessary work in the case where most leaves
> were pruned.  However, if you also implement #1, then that problem
> goes away.  In other words, depending on the design you choose for #2,
> you may or may not need to also implement optimization #1 to get good
> performance.
>
> To put that another way, I think Amit's idea of keeping a
> subplan-offsets array is a pretty good one.  From your comments, you
> do too.  But if we want to keep that, then we need a way to avoid the
> expense of populating it for leaves that got pruned, except when we
> are doing update row movement.  Otherwise, I don't see much choice but
> to jettison the subplan-offsets array and just maintain two separate
> arrays of mappings.


OK. So after giving more thought to both our points, here's what I
feel we can do ...

With the two arrays mt_per_leaf_tupconv_maps and
mt_per_subplan_tupconv_maps, we want the following things :
1. Create the map on-demand.
2. If possible, try to share the maps between the per-subplan and
per-leaf arrays.

For this, option 1 is:

-------

Both arrays' elements are of this structure:

typedef struct TupleConversionMapInfo
{
    uint8   map_required;   /* 0: not known whether map is required,
                             * 1: map is created/required,
                             * 2: map is not necessary */
    TupleConversionMap *map;
} TupleConversionMapInfo;

The arrays look like this:
TupleConversionMapInfo mt_per_subplan_tupconv_maps[];
TupleConversionMapInfo mt_per_leaf_tupconv_maps[];

When a per-subplan array is to be accessed at index i, a macro
get_tupconv_map(mt_per_subplan_tupconv_maps, i, forleaf=false) will be
called. This will create a new map if necessary, populate the array
element fields, and it will also copy this info into a corresponding
array element in the per-leaf array. To get to the per-leaf array
element, we need a subplan-offsets array. Whereas, if the per-leaf
array element is already populated, this info will be copied into the
subplan element in the opposite direction.

When the per-leaf array is to be accessed at index i,
get_tupconv_map(mt_per_leaf_tupconv_maps, i, forleaf=true) will be
called. Here, it will similarly update the per-leaf array element, but
it will not try to access the corresponding per-subplan array, because
we don't have an offsets array in that direction.

This is how the macro would look:

#define get_tupconv_map(mapinfo, i, perleaf) \
    ((mapinfo[i].map_required == 2) ? NULL : \
     ((mapinfo[i].map_required == 1) ? mapinfo[i].map : \
      create_new_map(mapinfo, i, perleaf)))

where create_new_map() will take care of populating the array element
in both arrays, and then return the map if created, or NULL if not
required.
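
As an illustration only (the argument list, the descriptor arguments and
the way the per-leaf element is located are assumptions made here, not
the actual patch code), create_new_map() could look roughly like this:

static TupleConversionMap *
create_new_map(TupleConversionMapInfo *mapinfo, int i, bool perleaf,
               TupleConversionMapInfo *leaf_mapinfo,
               int *subplan_partition_offsets,
               TupleDesc childdesc, TupleDesc rootdesc)
{
    TupleConversionMap *map;

    /* Build the child-to-root conversion map; NULL means no conversion needed. */
    map = convert_tuples_by_name(childdesc, rootdesc,
                                 gettext_noop("could not convert row type"));

    mapinfo[i].map = map;
    mapinfo[i].map_required = (map != NULL) ? 1 : 2;

    /*
     * If called through the per-subplan array, mirror the result into the
     * corresponding per-leaf element (found via the subplan-offsets array),
     * so the same map is never computed twice.
     */
    if (!perleaf && leaf_mapinfo != NULL)
        leaf_mapinfo[subplan_partition_offsets[i]] = mapinfo[i];

    return map;
}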


-------

Option 2:

Elements of both arrays are pointers to a TupleConversionMapInfo structure.
The arrays look like this:
TupleConversionMapInfo *mt_per_subplan_tupconv_maps[];
TupleConversionMapInfo *mt_per_leaf_tupconv_maps[];

typedef struct TupleConversionMapInfo
{
    uint8   map_required;   /* 0: map is not required, 1: ... */
    TupleConversionMap *map;
} TupleConversionMapInfo;

So in ExecInitModifyTable(), for each of the array elements of both
arrays, we palloc a TupleConversionMapInfo structure, and wherever
applicable, a common palloc'ed structure is shared between the two
arrays. This way, the subplan-offsets array is not required.

In this case, the macro get_tupconv_map() similarly populates the
structure, but it does not have to access the other map array, because
the structures are already shared between the two arrays.

The problem with this option is: since we have to share some of the
structures allocated for the array elements, we have to build the two
arrays together, but in the code the arrays are allocated on demand at
different points, e.g. when update tuple routing is required and when
transition tables are required. Also, we have to individually palloc a
TupleConversionMapInfo for every array element beforehand, as against
allocating the whole array in a single palloc as in option 1.

As of this writing, I am working on the code for the on-demand logic,
and I anticipate option 1 will turn out better than option 2. But I
would like to know if you are OK with both of these options.


------------

The reason why I am having map_required field inside a structure along
with the map, as against a separate array, is so that we can do the
on-demand allocation for both per-leaf array and per-subplan array.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> The reason why I am having map_required field inside a structure along
> with the map, as against a separate array, is so that we can do the
> on-demand allocation for both per-leaf array and per-subplan array.

Putting the map_required field inside the structure with the map makes
it completely silly to do the 0/1/2 thing, because the whole structure
is going to be on the same cache line anyway.  It won't save anything
to access the flag instead of a pointer in the same struct.   Also,
the uint8 will be followed by 7 bytes of padding, because the pointer
that follows will need to begin on an 8-byte boundary (at least, on
64-bit machines), so this will use more memory.
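
(Concretely, on a typical 64-bit machine the struct would be laid out
something like this; the exact padding is of course compiler- and
ABI-dependent:)

typedef struct TupleConversionMapInfo
{
    uint8               map_required;   /* 1 byte */
    /* 7 bytes of padding so the pointer starts on an 8-byte boundary */
    TupleConversionMap *map;            /* 8 bytes */
} TupleConversionMapInfo;               /* sizeof() is 16 rather than 9 */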

What I suggest is:

#define MT_CONVERSION_REQUIRED_UNKNOWN        0
#define MT_CONVERSION_REQUIRED_YES                    1
#define MT_CONVERSION_REQUIRED_NO                      2

In ModifyTableState:

uint8 *mt_per_leaf_tupconv_required;
TupleConversionMap **mt_per_leaf_tupconv_maps;

In PartitionTupleRouting:

int *subplan_partition_offsets;

When you initialize the ModifyTableState, do this:

mtstate->mt_per_leaf_tupconv_required =
    palloc0(sizeof(uint8) * numResultRelInfos);
mtstate->mt_per_leaf_tupconv_maps =
    palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);

When somebody needs a map, then

(1) if they need it by subplan index, first use
subplan_partition_offsets to convert it to a per-leaf index

(2) then write a function that takes the per-leaf index and does this:

switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
{
    case MT_CONVERSION_REQUIRED_UNKNOWN:
        map = convert_tuples_by_name(...);
        if (map == NULL)
            mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
                MT_CONVERSION_REQUIRED_NO;
        else
        {
            mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
                MT_CONVERSION_REQUIRED_YES;
            mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
        }
        return map;
    case MT_CONVERSION_REQUIRED_YES:
        return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
    case MT_CONVERSION_REQUIRED_NO:
        return NULL;
}

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 12 January 2018 at 20:24, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> The reason why I am having map_required field inside a structure along
>> with the map, as against a separate array, is so that we can do the
>> on-demand allocation for both per-leaf array and per-subplan array.
>
> Putting the map_required field inside the structure with the map makes
> it completely silly to do the 0/1/2 thing, because the whole structure
> is going to be on the same cache line anyway.  It won't save anything
> to access the flag instead of a pointer in the same struct.

I see. Got it.

>  Also,
> the uint8 will be followed by 7 bytes of padding, because the pointer
> that follows will need to begin on an 8-byte boundary (at least, on
> 64-bit machines), so this will use more memory.
>
> What I suggest is:
>
> #define MT_CONVERSION_REQUIRED_UNKNOWN        0
> #define MT_CONVERSION_REQUIRED_YES                    1
> #define MT_CONVERSION_REQUIRED_NO                      2
>
> In ModifyTableState:
>
> uint8 *mt_per_leaf_tupconv_required;
> TupleConversionMap **mt_per_leaf_tupconv_maps;
>
> In PartitionTupleRouting:
>
> int *subplan_partition_offsets;
>
> When you initialize the ModifyTableState, do this:
>
> mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) *
> numResultRelInfos);
> mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap
> *) * numResultRelInfos);
>

A few points below where I wanted to confirm that we are on the same page ...

> When somebody needs a map, then
>
> (1) if they need it by subplan index, first use
> subplan_partition_offsets to convert it to a per-leaf index

Before that, we need to check if there *is* an offsets array. If there
are no partitions, there is only going to be a per-subplan array;
there won't be an offsets array. But I guess you are saying: "do the
on-demand allocation only for leaf partitions; if there are no
partitions, the per-subplan maps will always be allocated for each of
the subplans from the beginning". So if there is no offsets array,
just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
without any further checks.
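
(For illustration only, a sketch of what that subplan-index entry point
could then look like; tupconv_map_for_subplan() is an assumed name, and
tupconv_map_for_leaf() stands for the per-leaf function you describe in
(2):)

static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate,
                        int *subplan_partition_offsets, int subplan_index)
{
    /* Plain inheritance case: no offsets array, only per-subplan maps. */
    if (subplan_partition_offsets == NULL)
        return mtstate->mt_per_subplan_tupconv_maps[subplan_index];

    /* Otherwise translate to a per-leaf index and build the map on demand. */
    return tupconv_map_for_leaf(mtstate,
                                subplan_partition_offsets[subplan_index]);
}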

>
> (2) then write a function that takes the per-leaf index and does this:
>
> switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
> {
>     case MT_CONVERSION_REQUIRED_UNKNOWN:
>         map = convert_tuples_by_name(...);
>         if (map == NULL)
>             mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
> MT_CONVERSION_REQUIRED_NO;
>         else
>         {
>             mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
> MT_CONVERSION_REQUIRED_YES;
>             mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
>         }
>         return map;
>     case MT_CONVERSION_REQUIRED_YES:
>         return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
>     case MT_CONVERSION_REQUIRED_NO:
>         return NULL;
> }

Yeah, right.

But after that, I am not sure why the mt_per_sub_plan_maps[] array is
needed. We are always going to convert the subplan index into a leaf
index, so the per-subplan map array will not come into the picture. Or
are you saying it will be allocated and used only when there are no
partitions?  From one of your earlier replies, you did mention trying
to share the maps between the two arrays, which means you were
considering both arrays being used at the same time.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> (1) if they need it by subplan index, first use
>> subplan_partition_offsets to convert it to a per-leaf index
>
> Before that, we need to check if there *is* an offset array. If there
> are no partitions, there is only going to be a per-subplan array,
> there won't be an offsets array. But I guess, you are saying : "do the
> on-demand allocation only for leaf partitions; if there are no
> partitions, the per-subplan maps will always be allocated for each of
> the subplans from the beginning" . So if there is no offset array,
> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
> without any further checks.

Oops.  I forgot that there might not be partitions.  I was assuming
that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
always use subplan_partition_offsets.  But that won't work in the
inheritance case.

> But after that, I am not sure then why is mt_per_sub_plan_maps[] array
> needed ? We are always going to convert the subplan index into leaf
> index, so per-subplan map array will not come into picture. Or are you
> saying, it will be allocated and used only when there are no
> partitions ?  From one of your earlier replies, you did mention about
> trying to share the maps between the two arrays, that means you were
> considering both arrays being used at the same time.

We'd use them both at the same time if we didn't have, or didn't use,
subplan_partition_offsets, but if we have subplan_partition_offsets
and can use it then we don't need mt_per_sub_plan_maps.

I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
there are no partitions, but not use it when partitions are present.
What do you think about that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 13 January 2018 at 02:56, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> (1) if they need it by subplan index, first use
>>> subplan_partition_offsets to convert it to a per-leaf index
>>
>> Before that, we need to check if there *is* an offset array. If there
>> are no partitions, there is only going to be a per-subplan array,
>> there won't be an offsets array. But I guess, you are saying : "do the
>> on-demand allocation only for leaf partitions; if there are no
>> partitions, the per-subplan maps will always be allocated for each of
>> the subplans from the beginning" . So if there is no offset array,
>> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
>> without any further checks.
>
> Oops.  I forgot that there might not be partitions.  I was assuming
> that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
> always use subplan_partition_offsets.  Both that won't work in the
> inheritance case.
>
>> But after that, I am not sure then why is mt_per_sub_plan_maps[] array
>> needed ? We are always going to convert the subplan index into leaf
>> index, so per-subplan map array will not come into picture. Or are you
>> saying, it will be allocated and used only when there are no
>> partitions ?  From one of your earlier replies, you did mention about
>> trying to share the maps between the two arrays, that means you were
>> considering both arrays being used at the same time.
>
> We'd use them both at the same time if we didn't have, or didn't use,
> subplan_partition_offsets, but if we have subplan_partition_offsets
> and can use it then we don't need mt_per_sub_plan_maps.
>
> I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
> there are no partitions, but not use it when partitions are present.
> What do you think about that?

Even where partitions are present, in the usual case where there are
no transition tables we won't require per-leaf map at all [1]. So I
think we should keep mt_per_sub_plan_maps only for the case where
per-leaf map is not allocated. And we will not allocate
mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words,
exactly one of the two maps will be allocated.

This is turning out to be close to what's already there in the last
patch versions: use a single map array, and an offsets array. The
difference is: in the patch I am using the *same* variable for the
two maps. Whereas now we are talking about two different array
variables for maps, but only allocating one of them.

Are you OK with this? I think the thing you were against was to have
a common *variable* for two purposes. But above, I am saying we have
two variables but assign a map array to only *one* of them and leave
the other unused.

---------

Regarding the on-demand map allocation ....
Where mt_per_sub_plan_maps is allocated, we won't have the on-demand
allocation: all the maps will be allocated initially. The reason is
that the map_is_required array is only per-leaf; otherwise, we would
again need to keep another map_is_required array for per-subplan.
Maybe we can support the on-demand stuff for subplan maps also, but
only as a separate change after we are done with update-partition-key.


---------

Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a
bool array, and name it mt_per_leaf_map_not_required. When it is true
for a given index, it means we have already called
convert_tuples_by_name() and it returned NULL; i.e. we are sure that
the map is not required. A false value means we need to call
convert_tuples_by_name() if the map is still NULL, and then set
mt_per_leaf_map_not_required to (map == NULL).

Instead of a bool array, we can even make it a Bitmapset. But I think
access would become slower as compared to array, particularly because
it is going to be a heavily used function.

---------

[1] - For update-tuple-routing, only per-subplan access is required;
    - For transition tables, per-subplan access is required,
      and additionally per-leaf access is required when tuples are
      update-routed
    - So if both update-tuple-routing and transition tables are
      required, both of the maps are needed.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 10 January 2018 at 02:30, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> The above patch is to be applied over the last remaining preparatory
>>> patch, now named (and attached) :
>>> 0001-Refactor-CheckConstraint-related-code.patch
>>
>> Committed that one, too.
>
> Some more comments on the main patch:
>
> I don't really like the fact that ExecCleanupTupleRouting() now takes
> a ModifyTableState as an argument, particularly because of the way
> that is using that argument.  To figure out whether a ResultRelInfo
> was pre-existing or one it created, it checks whether the pointer
> address of the ResultRelInfo is >= mtstate->resultRelInfo and <
> mtstate->resultRelInfo + mtstate->mt_nplans.  However, that means that
> ExecCleanupTupleRouting() ends up knowing about the memory allocation
> pattern used by ExecInitModifyTable(), which seems like a slightly
> dangerous amount of action at a distance.  I think it would be better
> for the PartitionTupleRouting structure to explicitly indicate which
> ResultRelInfos should be closed, for example by storing a Bitmapset
> *input_partitions.  (Here, by "input", I mean "provided from the
> mtstate rather than created by the PartitionTupleRouting structure;
> other naming suggestions welcome.)  When
> ExecSetupPartitionTupleRouting latches onto a partition, it can do
> proute->input_partitions = bms_add_member(proute->input_partitons, i).
> In ExecCleanupTupleRouting, it can do if
> (bms_is_member(proute->input_partitions, i)) continue.

Did the changes. But instead of a new Bitmapset, I used the offsets
array for the purpose. As per our parallel discussion on
tuple-conversion maps, it is almost finalized that the subplan-partition
offsets array is good to have. So I have used that offsets array to
determine whether a partition is also one of the subplan result rels. I
relied on the assumption that the subplan and partition arrays have
their partitions in the same order.
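
A simplified sketch of the resulting skip logic in
ExecCleanupTupleRouting() is below; apart from subplan_partition_offsets,
the proute field names here are assumptions for illustration, not
necessarily the exact ones in the patch:

int     i;
int     subplan_index = 0;

for (i = 0; i < proute->num_partitions; i++)
{
    ResultRelInfo *resultRelInfo = proute->partitions[i];

    /*
     * If this leaf partition is also one of the UPDATE subplan result
     * rels, it was opened by ExecInitModifyTable() rather than by the
     * tuple-routing setup, so leave it for ExecEndPlan() to close.  Both
     * arrays are ordered the same way, so one cursor into the offsets
     * array is enough.
     */
    if (proute->subplan_partition_offsets &&
        subplan_index < proute->num_subplan_partition_offsets &&
        proute->subplan_partition_offsets[subplan_index] == i)
    {
        subplan_index++;
        continue;
    }

    ExecCloseIndices(resultRelInfo);
    heap_close(resultRelInfo->ri_RelationDesc, NoLock);
}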

>
> We have a test, in the regression test suite for file_fdw, which
> generates the message "cannot route inserted tuples to a foreign
> table".  I think we should have a similar test for the case where an
> UPDATE tries to move a tuple from a regular partition to a foreign
> table partition.

Added an UPDATE scenario in contrib/file_fdw/input/file_fdw.source.

> I'm not sure if it should fail with the same error
> or a different one, but I think we should have a test that it fails
> cleanly and with a nice error message of some sort.

The update-tuple-routing goes through the same ExecInsert() code, so
it fails at the same place with the same error message.

>
> The comment for get_partitioned_child_rels() claims that it sets
> is_partition_key_update, but it really sets *is_partition_key_update.
> And I think instead of "is a partition key" it should say "is used in
> the partition key either of the relation whose RTI is specified or of
> any child relation."  I propose "used in" instead of "is" because
> there can be partition expressions, and the rest is to clarify that
> child partition keys matter.

Fixed.

>
> create_modifytable_path uses partColsUpdated rather than
> partKeyUpdated, which actually seems like better terminology.  I
> propose partKeyUpdated -> partColsUpdated everywhere.  Also, why use
> is_partition_key_update for basically the same thing in some other
> places?  I propose changing that to partColsUpdated as well.

Done.

>
> The capitalization of the first comment hunk in execPartition.h is strange.

I think you are referring to:
 * subplan_partition_offsets int Array ordered by UPDATE subplans. Each
Changed "Array" to "array". Didn't change "UPDATE".

Attached v36 patch.

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 14 January 2018 at 17:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 13 January 2018 at 02:56, Robert Haas <robertmhaas@gmail.com> wrote:
> > I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
> > there are no partitions, but not use it when partitions are present.
> > What do you think about that?
>
> Even where partitions are present, in the usual case where there are
> no transition tables we won't require per-leaf map at all [1]. So I
> think we should keep mt_per_sub_plan_maps only for the case where
> per-leaf map is not allocated. And we will not allocate
> mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words,
> exactly one of the two maps will be allocated.
>
> This is turning out to be close to what's already there in the last
> patch versions: use a single map array, and an offsets array. The
> difference is: in the patch I am using the *same* variable for the
> two maps. Whereas now we are talking about two different array
> variables for maps, but only allocating one of them.
>
> Are you OK with this? I think the thing you were against was to have
> a common *variable* for two purposes. But above, I am saying we have
> two variables but assign a map array to only *one* of them and leave
> the other unused.
>
> ---------
>
> Regarding the on-demand map allocation ....
> Where mt_per_sub_plan_maps is allocated, we won't have the on-demand
> allocation: all the maps will be allocated initially. The reason is
> that the map_is_required array is only per-leaf; otherwise, we would
> again need to keep another map_is_required array for per-subplan.
> Maybe we can support the on-demand stuff for subplan maps also, but
> only as a separate change after we are done with update-partition-key.
>
>
> ---------
>
> Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a
> bool array, and name it mt_per_leaf_map_not_required. When it is true
> for a given index, it means we have already called
> convert_tuples_by_name() and it returned NULL; i.e. we are sure that
> the map is not required. A false value means we need to call
> convert_tuples_by_name() if the map is still NULL, and then set
> mt_per_leaf_map_not_required to (map == NULL).
>
> Instead of a bool array, we can instead make it a Bitmapset. But I
> think access would become slower as compared to an array, particularly
> because it is going to be a heavily used function.

I went ahead and did the above changes. I haven't yet merged these
changes into the main patch. Instead, I have attached them as an
incremental patch to be applied on the main v36 patch. The incremental
patch is not yet quite polished, and quite a few cosmetic changes
might be required, plus testing. But I am posting it in case I can get
some early feedback. Details:

The per-subplan map array variable is kept in ModifyTableState:
-       TupleConversionMap **mt_childparent_tupconv_maps;
-       /* Per plan/partition map for tuple conversion from child to root */
-       bool            mt_is_tupconv_perpart;  /* Is the above map
per-partition ? */
+       TupleConversionMap **mt_per_subplan_tupconv_maps;
+       /* Per plan map for tuple conversion from child to root */
 } ModifyTableState;

The per-leaf array variable and the not_required array are kept in
PartitionTupleRouting:
-       TupleConversionMap **partition_tupconv_maps;
+       TupleConversionMap **parent_child_tupconv_maps;
+       TupleConversionMap **child_parent_tupconv_maps;
+       bool       *child_parent_tupconv_map_not_reqd;
As you can see above, all the arrays are per-partition, so I removed
the per-leaf tag from these array names. Instead, I renamed the
existing partition_tupconv_maps to parent_child_tupconv_maps, and named
the new per-leaf array child_parent_tupconv_maps.

There are now two separate functions, ExecSetupChildParentMapForLeaf()
and ExecSetupChildParentMapForSubplan(), since most of their code is
different. And because of this, we can re-use
ExecSetupChildParentMapForLeaf() in both copy.c and nodeModifyTable.c.

Even inserts/copy will benefit from the on-demand map allocation. This
is because now there is a function TupConvMapForLeaf() that is called
in both copy.c and ExecInsert(). This is the function that does
on-demand allocation.
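
For example (a sketch only; the exact call sites in copy.c and
ExecInsert(), and the variable names around them, may differ), the
routing code can do something like:

/* Point transition capture at the map for the partition we routed to. */
if (mtstate->mt_transition_capture != NULL)
    mtstate->mt_transition_capture->tcs_map =
        TupConvMapForLeaf(proute, targetRelInfo, leaf_part_index);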

Attached the incremental patch conversion_map_changes.patch that has
the above changes. It is to be applied over the latest main patch
(update-partition-key_v36.patch).

Attachment

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Even where partitions are present, in the usual case where there are
> no transition tables we won't require per-leaf map at all [1]. So I
> think we should keep mt_per_sub_plan_maps only for the case where
> per-leaf map is not allocated. And we will not allocate
> mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words,
> exactly one of the two maps will be allocated.
>
> This is turning out to be close to what's already there in the last
> patch versions: use a single map array, and an offsets array. The
> difference is : in the patch I am using the *same* variable for the
> two maps. Where as, now we are talking about two different array
> variables for maps, but only allocating one of them.
>
> Are you ok with this ? I think the thing you were against was to have
> a common *variable* for two purposes. But above, I am saying we have
> two variables but assign a map array to only *one* of them and leave
> the other unused.

Yes, I'm OK with that.

> Regarding the on-demand map allocation ....
> Where mt_per_sub_plan_maps is allocated, we won't have the on-demand
> allocation: all the maps will be allocated initially. The reason is
> becaues the map_is_required array is only per-leaf. Or else, again, we
> need to keep another map_is_required array for per-subplan. May be we
> can support the on-demand stuff for subplan maps also, but only as a
> separate change after we are done with update-partition-key.

Sure.

> Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a
> bool array, and name it : mt_per_leaf_map_not_required. When it is
> true for a given index, it means, we have already called
> convert_tuples_by_name() and it returned NULL; i.e. it means we are
> sure that map is not required. A false value means we need to call
> convert_tuples_by_name() if it is NULL, and then set
> mt_per_leaf_map_not_required to (map == NULL).

OK.

> Instead of a bool array, we can even make it a Bitmapset. But I think
> access would become slower as compared to array, particularly because
> it is going to be a heavily used function.

It probably makes little difference -- the Bitmapset will be more
compact (which saves time) but involve function calls (which cost
time).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
David Rowley
Date:
On 16 January 2018 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Even where partitions are present, in the usual case where there are
>> Instead of a bool array, we can even make it a Bitmapset. But I think
>> access would become slower as compared to array, particularly because
>> it is going to be a heavily used function.
>
> It probably makes little difference -- the Bitmapset will be more
> compact (which saves time) but involve function calls (which cost
> time).

I'm not arguing in either direction, but you'd also want to factor in
how Bitmapsets only allocate words for the maximum stored member,
which might mean multiple realloc() calls resulting in palloc/memcpy
calls. The array would just be allocated in a single chunk, although
it would be more memory and would require a memset too, however,
that's likely much cheaper than the palloc() anyway.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 15 January 2018 at 16:11, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I went ahead and did the above changes. I haven't yet merged these
> changes in the main patch. Instead, I have attached it as an
> incremental patch to be applied on the main v36 patch. The incremental
> patch is not yet quite polished, and quite a bit of cosmetic changes
> might be required, plus testing. But am posting it in case I have some
> early feedback.

I have now embedded the above incremental patch changes into the main
patch (v37), which is attached.

Because it is used heavily in the case of transition tables with
partitions, I have made TupConvMapForLeaf() a macro, and the actual
creation of the map is in a separate function CreateTupConvMapForLeaf(),
so as to reduce the macro size.
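
Roughly, the split looks like the sketch below (an illustration only;
the macro arguments and the exact v37 code may differ):

/* Fast path: return the cached answer; only build the map when unknown. */
#define TupConvMapForLeaf(proute, rootRelInfo, leaf_index) \
    ((proute)->child_parent_tupconv_map_not_reqd[(leaf_index)] ? NULL : \
     ((proute)->child_parent_tupconv_maps[(leaf_index)] != NULL ? \
      (proute)->child_parent_tupconv_maps[(leaf_index)] : \
      CreateTupConvMapForLeaf((proute), (rootRelInfo), (leaf_index))))

where CreateTupConvMapForLeaf() calls convert_tuples_by_name(), stores the
result in child_parent_tupconv_maps[leaf_index], and, if the result is NULL,
records that fact in child_parent_tupconv_map_not_reqd[leaf_index].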

Retained child_parent_map_not_required as a bool array, as against a bitmap.

To include one scenario related to on-demand map allocation that was
not getting covered by the update.sql test, I added one more scenario
in that file:
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 January 2018 at 09:17, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 16 January 2018 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> Instead of a bool array, we can even make it a Bitmapset. But I think
>>> access would become slower as compared to array, particularly because
>>> it is going to be a heavily used function.
>>
>> It probably makes little difference -- the Bitmapset will be more
>> compact (which saves time) but involve function calls (which cost
>> time).
>
> I'm not arguing in either direction, but you'd also want to factor in
> that Bitmapsets only allocate words up to the maximum stored member,
> which might mean multiple realloc() calls resulting in palloc/memcpy
> calls. The array would just be allocated in a single chunk; it would
> use more memory and would require a memset too, but that's likely much
> cheaper than the palloc() anyway.

Right, I agree. With a Bitmapset, there would also be a function call
just to find out whether a map is required or not. Overall, especially
because the data structure will be used heavily whenever it is set up,
I think it's better to make it an array. In the latest patch, I have
retained it as an array.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 16 January 2018 at 16:09, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I have now embedded the above incremental patch changes into the main
> patch (v37) , which is attached.

The patch had to be rebased over commit dca48d145e0e:
    Remove useless lookup of root partitioned rel in ExecInitModifyTable().

In ExecInitModifyTable(), the "rel" variable was needed only for
INSERT, and node->partitioned_rels is only set in UPDATE/DELETE cases,
so the extra logic of getting the root partitioned rel from
node->partitioned_rels was removed as part of that commit.

But now, for update-tuple-routing, we require rel for UPDATE also, so
we need to get the root partitioned rel. Rather than opening the root
table from node->partitioned_rels, we can re-use the already-opened
mtstate->rootResultRelInfo, which corresponds to the head of
partitioned_rels. I have renamed getASTriggerResultRelInfo() to
getTargetResultRelInfo(), and used it to get the root partitioned
table. The rename makes sense because it has become a function for
more general use, rather than being specific to trigger-related
functionality.
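
In other words, the renamed helper is essentially a sketch like this:
return the root partitioned table's ResultRelInfo if there is one, else
the ordinary target relation.

    static ResultRelInfo *
    getTargetResultRelInfo(ModifyTableState *node)
    {
        /* For a partitioned target, resultRelInfo points to a leaf. */
        if (node->rootResultRelInfo != NULL)
            return node->rootResultRelInfo;
        else
            return node->resultRelInfo;
    }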

Attached rebased patch.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jan 19, 2018 at 4:37 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached rebased patch.

Committed with a bunch of mostly-cosmetic revisions.  I removed the
macro you added, which has a multiple evaluation hazard, and just put
that logic back into the function.  I don't think it's likely to
matter for performance, and this way is safer.  I removed an inline
keyword from another static function as well; better to let the
compiler decide what to do.  I rearranged a few things to shorten some
long lines, too.  Aside from that I think all of the changes I made
were to comments and documentation.
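
(For the record, the hazard is the usual one with macros that mention
an argument more than once; a purely hypothetical call, with made-up
names, shows the problem:

    /* If the argument has side effects, it can be evaluated more than once. */
    map = TupConvMapForLeaf(mtstate, leaf_index++);

A plain function, or a static inline one, evaluates its arguments
exactly once.)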

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Committed with a bunch of mostly-cosmetic revisions.

Buildfarm member skink has been unhappy since this patch went in.
Running the regression tests under valgrind easily reproduces the
failure.  Now, I might be wrong about which of the patches committed
on Friday caused the unhappiness, but the valgrind backtrace sure
looks like it's to do with partition routing:

==00:00:05:49.683 17549== Invalid read of size 4
==00:00:05:49.683 17549==    at 0x62A8BA: ExecCleanupTupleRouting (execPartition.c:483)
==00:00:05:49.683 17549==    by 0x6483AA: ExecEndModifyTable (nodeModifyTable.c:2682)
==00:00:05:49.683 17549==    by 0x627139: standard_ExecutorEnd (execMain.c:1604)
==00:00:05:49.683 17549==    by 0x7780AF: ProcessQuery (pquery.c:206)
==00:00:05:49.683 17549==    by 0x7782E4: PortalRunMulti (pquery.c:1286)
==00:00:05:49.683 17549==    by 0x778AAF: PortalRun (pquery.c:799)
==00:00:05:49.683 17549==    by 0x774E4C: exec_simple_query (postgres.c:1120)
==00:00:05:49.683 17549==    by 0x776C17: PostgresMain (postgres.c:4143)
==00:00:05:49.683 17549==    by 0x6FA419: PostmasterMain (postmaster.c:4412)
==00:00:05:49.683 17549==    by 0x66E51F: main (main.c:228)
==00:00:05:49.683 17549==  Address 0xe25e298 is 2,088 bytes inside a block of size 32,768 alloc'd
==00:00:05:49.683 17549==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:05:49.683 17549==    by 0x89EB15: AllocSetAlloc (aset.c:945)
==00:00:05:49.683 17549==    by 0x8A7577: palloc (mcxt.c:848)
==00:00:05:49.683 17549==    by 0x671969: new_list (list.c:68)
==00:00:05:49.683 17549==    by 0x672859: lappend_oid (list.c:169)
==00:00:05:49.683 17549==    by 0x55330E: find_inheritance_children (pg_inherits.c:144)
==00:00:05:49.683 17549==    by 0x553447: find_all_inheritors (pg_inherits.c:203)
==00:00:05:49.683 17549==    by 0x62AC76: ExecSetupPartitionTupleRouting (execPartition.c:68)
==00:00:05:49.683 17549==    by 0x64949D: ExecInitModifyTable (nodeModifyTable.c:2232)
==00:00:05:49.683 17549==    by 0x62BBE8: ExecInitNode (execProcnode.c:174)
==00:00:05:49.683 17549==    by 0x627B53: standard_ExecutorStart (execMain.c:1043)
==00:00:05:49.683 17549==    by 0x778046: ProcessQuery (pquery.c:156)

(This is my local result, but skink's log looks about the same.)

            regards, tom lane


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Committed with a bunch of mostly-cosmetic revisions.
>
> Buildfarm member skink has been unhappy since this patch went in.
> Running the regression tests under valgrind easily reproduces the
> failure.  Now, I might be wrong about which of the patches committed
> on Friday caused the unhappiness, but the valgrind backtrace sure
> looks like it's to do with partition routing:

Yeah, that must be the fault of this patch.  We assign to
proute->subplan_partition_offsets[update_rri_index] from
update_rri_index = 0 .. num_update_rri, and there's an Assert() at the
bottom of this function that checks this, so probably this is indexing
off the end of the array.  I bet the issue happens when we find all of
the UPDATE result rels while there are still partitions left; then,
subplan_index will be equal to the length of the
proute->subplan_partition_offsets array and we'll be indexing just off
the end.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Amit Khandekar
Date:
On 22 January 2018 at 02:40, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Committed with a bunch of mostly-cosmetic revisions.
>>
>> Buildfarm member skink has been unhappy since this patch went in.
>> Running the regression tests under valgrind easily reproduces the
>> failure.  Now, I might be wrong about which of the patches committed
>> on Friday caused the unhappiness, but the valgrind backtrace sure
>> looks like it's to do with partition routing:
>
> Yeah, that must be the fault of this patch.  We assign to
> proute->subplan_partition_offsets[update_rri_index] from
> update_rri_index = 0 .. num_update_rri, and there's an Assert() at the
> bottom of this function that checks this, so probably this is indexing
> off the end of the array.  I bet the issue happens when we find all of
> the UPDATE result rels while there are still partitions left; then,
> subplan_index will be equal to the length of the
> proute->subplan_partition_offsets array and we'll be indexing just off
> the end.

Yes, right, that's what is happening. It is not failing at an Assert,
though (there is no assert in that function). It is happening when we
try to access the array here:

                if (proute->subplan_partition_offsets &&
                        proute->subplan_partition_offsets[subplan_index] == i)

Attached is a fix, where I have introduced another field,
PartitionTupleRouting.num_subplan_partition_offsets, so that in the
code above we can add another condition (subplan_index <
proute->num_subplan_partition_offsets) in order to stop accessing the
array once we are done with all the offset array elements.
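
Roughly, the guarded check in ExecCleanupTupleRouting() becomes a
sketch like this:

    /* Skip leaf partitions that double as UPDATE subplan result rels. */
    if (proute->subplan_partition_offsets &&
        subplan_index < proute->num_subplan_partition_offsets &&
        proute->subplan_partition_offsets[subplan_index] == i)
    {
        subplan_index++;
        continue;       /* ExecEndPlan() will close this rel */
    }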

I ran the update.sql test with valgrind enabled on my laptop, and the
valgrind output no longer shows errors.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Mon, Jan 22, 2018 at 2:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Yes, right, that's what is happening. It is not failing at an Assert,
> though (there is no assert in that function). It is happening when we
> try to access the array here:
>
>                 if (proute->subplan_partition_offsets &&
>                         proute->subplan_partition_offsets[subplan_index] == i)
>
> Attached is a fix, where I have introduced another field,
> PartitionTupleRouting.num_subplan_partition_offsets, so that in the
> code above we can add another condition (subplan_index <
> proute->num_subplan_partition_offsets) in order to stop accessing the
> array once we are done with all the offset array elements.
>
> I ran the update.sql test with valgrind enabled on my laptop, and the
> valgrind output no longer shows errors.

Tom, do you want to double-check that this fixes it for you?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Tom, do you want to double-check that this fixes it for you?

I can confirm that a valgrind run succeeded for me with the patch
in place.

            regards, tom lane


Re: [HACKERS] UPDATE of partition key

From
Robert Haas
Date:
On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Tom, do you want to double-check that this fixes it for you?
>
> I can confirm that a valgrind run succeeded for me with the patch
> in place.

Committed.  Sorry for the delay.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] UPDATE of partition key

From
Thomas Munro
Date:
On Thu, Jan 25, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Tom, do you want to double-check that this fixes it for you?
>>
>> I can confirm that a valgrind run succeeded for me with the patch
>> in place.
>
> Committed.  Sorry for the delay.

FYI, I'm planning to look into adding a valgrind check to the
commitfest CI thing I run, so we can catch these earlier without
committer involvement.  It's super slow because of all those pesky
regression tests, so I'll probably need to improve the scheduling
logic a bit to make it useful first (prioritising new patches or
something, since otherwise it could take multiple days to get around
to valgrind-testing any given patch...).

-- 
Thomas Munro
http://www.enterprisedb.com