Thread: a misbehavior of partition row movement (?)

a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi,

Robert forwarded me a pgsql-general thread [1] where a ON DELETE
CASCADE specified on a foreign key pointing to a partitioned table is
shown to cause a possibly surprising end result during an update of
the partitioned table.  Example from that thread:

create table parent ( id serial, constraint parent_pkey primary key
(id)) partition by range (id);
create table parent_10 partition of parent for values from (0) to (10);
create table parent_20 partition of parent for values from (11) to (20);
create table child (id serial, parent_id int constraint parent_id_fk
references parent(id) on update cascade on delete cascade);
insert into parent values(0);
insert into child values(1,0);
update parent set id = 5;  -- no row movement, so normal update
table parent;
 id
----
  5
(1 row)

table child;
 id | parent_id
----+-----------
  1 |         5
(1 row)

update parent set id = 15;  -- row movement, so delete+insert
table parent;
 id
----
 15
(1 row)

table child;  -- ON DELETE CASCADE having done its job
 id | parent_id
----+-----------
(0 rows)

Reporter on that thread says that the last update should have failed
and I don't quite see a workable alternative to that.  What we could
do is check before calling ExecDelete() that will perform the DELETE
part of the row movement if the foreign key action trigger that
implements the ON DELETE CASCADE action (an internal trigger) is among
the AR delete triggers that will run as part of that DELETE.  If yes,
abort the operation.  See attached a patch for that.  I'm not terribly
happy with the error and details messages though:

update parent set id = 15;
ERROR:  cannot move row being updated to another partition
DETAIL:  Moving the row may cause a foreign key involving the source
partition to be violated.

Thoughts?

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/flat/CAL54xNZsLwEM1XCk5yW9EqaRzsZYHuWsHQkA2L5MOSKXAwviCQ%40mail.gmail.com

Attachment

a misbehavior of partition row movement (?)

From
"David G. Johnston"
Date:
On Friday, October 2, 2020, Amit Langote <amitlangote09@gmail.com> wrote:

Reporter on that thread says that the last update should have failed
and I don't quite see a workable alternative to that. 
 
To be clear the OP would rather have it just work, the same as the non-row-movement version.  Maybe insert the new row first, execute the on update trigger chained from the old row, then delete the old row?

David J.

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Fri, Oct 2, 2020 at 11:32 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Friday, October 2, 2020, Amit Langote <amitlangote09@gmail.com> wrote:
>>
>>
>> Reporter on that thread says that the last update should have failed
>> and I don't quite see a workable alternative to that.
>
>
> To be clear the OP would rather have it just work, the same as the non-row-movement version.  Maybe insert the new
rowfirst, execute the on update trigger chained from the old row, then delete the old row?
 

I was thinking yesterday about making it just work, but considering
the changes that would need to be made to how the underlying triggers
fire, it does not seem we would be able to back-port the solution.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Tomas Vondra
Date:
On Sat, Oct 03, 2020 at 11:42:21AM +0900, Amit Langote wrote:
>On Fri, Oct 2, 2020 at 11:32 PM David G. Johnston
><david.g.johnston@gmail.com> wrote:
>> On Friday, October 2, 2020, Amit Langote <amitlangote09@gmail.com>
>> wrote:
>>>
>>>
>>> Reporter on that thread says that the last update should have failed
>>> and I don't quite see a workable alternative to that.
>>
>>
>> To be clear the OP would rather have it just work, the same as the
>> non-row-movement version.  Maybe insert the new row first, execute
>> the on update trigger chained from the old row, then delete the old
>> row?
>
>I was thinking yesterday about making it just work, but considering the
>changes that would need to be made to how the underlying triggers fire,
>it does not seem we would be able to back-port the solution.
>

I think we need to differentiate between master and backbranches. IMO we
should try to make it "just work" in master, and the amount of code
should not be an issue there I think (no opinion on whether insert and
update trigger is the way to go). For backbranches we may need to do
something less intrusive, of course.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Sat, Oct 3, 2020 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote
> On Sat, Oct 03, 2020 at 11:42:21AM +0900, Amit Langote wrote:
> >On Fri, Oct 2, 2020 at 11:32 PM David G. Johnston
> ><david.g.johnston@gmail.com> wrote:
> >> On Friday, October 2, 2020, Amit Langote <amitlangote09@gmail.com>
> >> wrote:
> >>>
> >>>
> >>> Reporter on that thread says that the last update should have failed
> >>> and I don't quite see a workable alternative to that.
> >>
> >>
> >> To be clear the OP would rather have it just work, the same as the
> >> non-row-movement version.  Maybe insert the new row first, execute
> >> the on update trigger chained from the old row, then delete the old
> >> row?
> >
> >I was thinking yesterday about making it just work, but considering the
> >changes that would need to be made to how the underlying triggers fire,
> >it does not seem we would be able to back-port the solution.
> >
>
> I think we need to differentiate between master and backbranches. IMO we
> should try to make it "just work" in master, and the amount of code
> should not be an issue there I think (no opinion on whether insert and
> update trigger is the way to go). For backbranches we may need to do
> something less intrusive, of course.

Sure, that makes sense.  I will try making a patch for HEAD to make it
just work unless someone beats me to it.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Sat, Oct 3, 2020 at 8:26 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Oct 3, 2020 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote
> > On Sat, Oct 03, 2020 at 11:42:21AM +0900, Amit Langote wrote:
> > >On Fri, Oct 2, 2020 at 11:32 PM David G. Johnston
> > ><david.g.johnston@gmail.com> wrote:
> > >> On Friday, October 2, 2020, Amit Langote <amitlangote09@gmail.com>
> > >> wrote:
> > >>>
> > >>>
> > >>> Reporter on that thread says that the last update should have failed
> > >>> and I don't quite see a workable alternative to that.
> > >>
> > >>
> > >> To be clear the OP would rather have it just work, the same as the
> > >> non-row-movement version.  Maybe insert the new row first, execute
> > >> the on update trigger chained from the old row, then delete the old
> > >> row?
> > >
> > >I was thinking yesterday about making it just work, but considering the
> > >changes that would need to be made to how the underlying triggers fire,
> > >it does not seem we would be able to back-port the solution.
> > >
> >
> > I think we need to differentiate between master and backbranches. IMO we
> > should try to make it "just work" in master, and the amount of code
> > should not be an issue there I think (no opinion on whether insert and
> > update trigger is the way to go). For backbranches we may need to do
> > something less intrusive, of course.
>
> Sure, that makes sense.  I will try making a patch for HEAD to make it
> just work unless someone beats me to it.

After working on this for a while, here is my proposal for HEAD.

To reiterate, the problem occurs when an UPDATE of a partitioned PK
table is turned into a DELETE + INSERT.  In that case, the UPDATE RI
triggers are not fired at all, but DELETE ones are, so the foreign key
may fail to get enforced correctly.  For example, even though the new
row after the update is still logically present in the PK table, it
would wrongly get deleted because of the DELETE RI trigger firing if
there's a ON DELETE CASCADE clause on the foreign key.

To fix that, I propose that we teach trigger.c to skip queuing the
events that would be dangerous to fire, such as that for the DELETE on
the source leaf partition mentioned above.  Instead, queue an UPDATE
event on the root target table, matching the actual operation being
performed.  Note though that this new arrangement doesn't affect the
firing of any other triggers except those that are relevant to the
reported problem, viz. the PK-side RI triggers.  All other triggers
fire exactly as they did before.

To make that happen, I had to:

1. Make RI triggers available on partitioned tables at all, which they
are not today.  Also, mark the RI triggers in partitions correctly as
*clones* of the RI triggers in their respective parents.

2. Make it possible to allow AfterTriggerSaveEvent() to access the
query's actual target relation, that is, in addition to the target
relation on which an event was fired.  Also, added a bunch of code to
AFTER TRIGGER infrastructure to handle events fired on partitioned
tables.  Because those events cannot contain physical references to
affected tuples, I generalized the code currently used to handle after
triggers on foreign tables by storing the tuples in and retrieving
them from a tuple store.  I read a bunch of caveats of that
implementation (such as its uselessness for deferred triggers), but
for the limited cases for which it will be used for partitioned
tables, it seems safe, because it won't be used for deferred triggers
on partitioned tables.

Attached patches 0001 and 0002 implement 1 and 2, respectively.
Later, I will post an updated version of the patch for the
back-branches, which, as mentioned upthread, is to prevent the
cross-partition updates on foreign key PK tables.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Fri, Nov 20, 2020 at 8:55 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Oct 3, 2020 at 8:26 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Sat, Oct 3, 2020 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote
> > > I think we need to differentiate between master and backbranches. IMO we
> > > should try to make it "just work" in master, and the amount of code
> > > should not be an issue there I think (no opinion on whether insert and
> > > update trigger is the way to go). For backbranches we may need to do
> > > something less intrusive, of course.
> >
> > Sure, that makes sense.  I will try making a patch for HEAD to make it
> > just work unless someone beats me to it.
>
> After working on this for a while, here is my proposal for HEAD.
>
> To reiterate, the problem occurs when an UPDATE of a partitioned PK
> table is turned into a DELETE + INSERT.  In that case, the UPDATE RI
> triggers are not fired at all, but DELETE ones are, so the foreign key
> may fail to get enforced correctly.  For example, even though the new
> row after the update is still logically present in the PK table, it
> would wrongly get deleted because of the DELETE RI trigger firing if
> there's a ON DELETE CASCADE clause on the foreign key.
>
> To fix that, I propose that we teach trigger.c to skip queuing the
> events that would be dangerous to fire, such as that for the DELETE on
> the source leaf partition mentioned above.  Instead, queue an UPDATE
> event on the root target table, matching the actual operation being
> performed.  Note though that this new arrangement doesn't affect the
> firing of any other triggers except those that are relevant to the
> reported problem, viz. the PK-side RI triggers.  All other triggers
> fire exactly as they did before.
>
> To make that happen, I had to:
>
> 1. Make RI triggers available on partitioned tables at all, which they
> are not today.  Also, mark the RI triggers in partitions correctly as
> *clones* of the RI triggers in their respective parents.
>
> 2. Make it possible to allow AfterTriggerSaveEvent() to access the
> query's actual target relation, that is, in addition to the target
> relation on which an event was fired.  Also, added a bunch of code to
> AFTER TRIGGER infrastructure to handle events fired on partitioned
> tables.  Because those events cannot contain physical references to
> affected tuples, I generalized the code currently used to handle after
> triggers on foreign tables by storing the tuples in and retrieving
> them from a tuple store.  I read a bunch of caveats of that
> implementation (such as its uselessness for deferred triggers), but
> for the limited cases for which it will be used for partitioned
> tables, it seems safe, because it won't be used for deferred triggers
> on partitioned tables.
>
> Attached patches 0001 and 0002 implement 1 and 2, respectively.
> Later, I will post an updated version of the patch for the
> back-branches, which, as mentioned upthread, is to prevent the
> cross-partition updates on foreign key PK tables.

I have created a CF entry for this:

https://commitfest.postgresql.org/31/2877/

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Arne Roland
Date:

Hi,


thanks for looking into this. I haven't yet looked at your patch in detail, yet I think we should address the general issue here. To me this doesn't seem to be a RI-trigger issue, but a more general issue like I mentioned in the pg-bugs thread https://www.postgresql.org/message-id/b1bfc99296e34725900bcd9689be8420@index.de


While I like the idea of making fks work, I'd prefer a solution that fires the appropriate row trigger for partitioned tables for non RI-cases as well.


Regards

Arne




Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi,

On Tue, Dec 15, 2020 at 12:01 AM Arne Roland <A.Roland@index.de> wrote:
> thanks for looking into this. I haven't yet looked at your patch in detail, yet I think we should address the general
issuehere. To me this doesn't seem to be a RI-trigger issue, but a more general issue like I mentioned in the pg-bugs
threadhttps://www.postgresql.org/message-id/b1bfc99296e34725900bcd9689be8420@index.de 

Hmm, maybe you're reading too much into the implementation details of
the fix, because the patch does fix the issue that you mention in the
linked report:

Quoting your original example:

drop table a, b;
create table a (id serial, primary key (id)) partition by range (id);
create table b (id serial,  primary key (id)) partition by range (id);
alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade;
create table a1 partition of a for values from (1) to (2);
create table a2 partition of a for values from (2) to (3);
create table b1 partition of b for values from (1) to (2);
create table b2 partition of b for values from (2) to (3);
insert into a (id) values (1);
insert into b (id) values (1);

-- correctly errors out instead of silently performing the ON DELETE CASCADE
update a set id=2;
ERROR:  update or delete on table "a" violates foreign key constraint
"a_fk" on table "b"
DETAIL:  Key (id)=(1) is still referenced from table "b".

select * from b;
 id
----
  1
(1 row)

Changing the example to specify ON UPDATE CASCADE:

drop table a, b;
create table a (id serial, primary key (id)) partition by range (id);
create table b (id serial,  primary key (id)) partition by range (id);
alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade;
create table a1 partition of a for values from (1) to (2);
create table a2 partition of a for values from (2) to (3);
create table b1 partition of b for values from (1) to (2);
create table b2 partition of b for values from (2) to (3);
insert into a (id) values (1);
insert into b (id) values (1);

-- correctly applies ON UPDATE CASCADE action
update a set id=2;
UPDATE 1

select * from b;
 id
----
  2
(1 row)

What am I missing about what you think is the more general problem to be solved?

> While I like the idea of making fks work, I'd prefer a solution that fires the appropriate row trigger for
partitionedtables for non RI-cases as well. 

Maybe we could do that, but I don't know of a known issue where the
root cause is our firing of a row trigger on the leaf partition
instead of on the root partitioned table.

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Tue, Dec 15, 2020 at 12:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Quoting your original example:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;
> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly errors out instead of silently performing the ON DELETE CASCADE
> update a set id=2;
> ERROR:  update or delete on table "a" violates foreign key constraint
> "a_fk" on table "b"
> DETAIL:  Key (id)=(1) is still referenced from table "b".
>
> select * from b;
>  id
> ----
>   1
> (1 row)
>
> Changing the example to specify ON UPDATE CASCADE:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;

Oops, I copy-pasted the wrong block of text from my terminal.  I meant:

alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade on update cascade;

> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly applies ON UPDATE CASCADE action
> update a set id=2;
> UPDATE 1
>
> select * from b;
>  id
> ----
>   2
> (1 row)

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Arne Roland
Date:

Hi Amit,


thanks for the quick reply! Sadly I have been busy and the second part of your patch does no longer apply in src/include/nodes/execnodes.h:497.


I'm sorry, I should have been more careful rereading my posts. The case I meant is the one below. I checked the thread again. I can barely believe, I didn't post such an example along back then. Sorry for the confusion!


create table a (id serial, primary key (id)) partition by range (id);
create table b (id serial,  primary key (id)) partition by range (id);
create table a1 partition of a for values from (1) to (2);
create table a2 partition of a for values from (2) to (3);
create table b1 partition of b for values from (1) to (2);
create table b2 partition of b for values from (2) to (3);
insert into a (id) values (1);
insert into b (id) values (1);

create or replace function del_trig_fkt()
 returns trigger
 language plpgsql
as $$
  begin
    raise notice 'Deleted!';
    return old;
  end;
$$;
create trigger a_del_trig after delete on a for each row execute function del_trig_fkt();
create or replace function public.upd_trig_fkt()
 returns trigger
 language plpgsql
as $function$
begin
  raise notice 'Updated!';
  return new;
end;
$function$;
create trigger a_upd_trig after update on a for each row execute function upd_trig_fkt();

update a set id=2;

To me the issue seems to have litte to do with the fact that the trigger is executed on the leaf node in itself, but rather we lack the infrastructure to track whether the tuple is removed or only rerouted.

Regards
Arne


From: Amit Langote <amitlangote09@gmail.com>
Sent: Tuesday, December 15, 2020 4:45:19 AM
To: Arne Roland
Cc: Tomas Vondra; David G. Johnston; PostgreSQL-development
Subject: Re: a misbehavior of partition row movement (?)
 
On Tue, Dec 15, 2020 at 12:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Quoting your original example:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;
> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly errors out instead of silently performing the ON DELETE CASCADE
> update a set id=2;
> ERROR:  update or delete on table "a" violates foreign key constraint
> "a_fk" on table "b"
> DETAIL:  Key (id)=(1) is still referenced from table "b".
>
> select * from b;
>  id
> ----
>   1
> (1 row)
>
> Changing the example to specify ON UPDATE CASCADE:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;

Oops, I copy-pasted the wrong block of text from my terminal.  I meant:

alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade on update cascade;

> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly applies ON UPDATE CASCADE action
> update a set id=2;
> UPDATE 1
>
> select * from b;
>  id
> ----
>   2
> (1 row)

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi,

On Mon, Dec 21, 2020 at 11:30 PM Arne Roland <A.Roland@index.de> wrote:
> thanks for the quick reply! Sadly I have been busy and the second part of your patch does no longer apply in
src/include/nodes/execnodes.h:497.

I don't see any problem with applying the patch.  Are you sure you're
applying the patch to the correct git branch?  The patch is meant to
be applied to the development (master) branch.

> I'm sorry, I should have been more careful rereading my posts. The case I meant is the one below. I checked the
threadagain. I can barely believe, I didn't post such an example along back then. Sorry for the confusion!
 

No worries, thanks for the follow up.

> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> create or replace function del_trig_fkt()
>  returns trigger
>  language plpgsql
> as $$
>   begin
>     raise notice 'Deleted!';
>     return old;
>   end;
> $$;
> create trigger a_del_trig after delete on a for each row execute function del_trig_fkt();
> create or replace function public.upd_trig_fkt()
>  returns trigger
>  language plpgsql
> as $function$
> begin
>   raise notice 'Updated!';
>   return new;
> end;
> $function$;
> create trigger a_upd_trig after update on a for each row execute function upd_trig_fkt();
>
> update a set id=2;

The output for this I get with (or without) the patch is:

NOTICE:  Deleted!
UPDATE 1

> To me the issue seems to have litte to do with the fact that the trigger is executed on the leaf node in itself, but
ratherwe lack the infrastructure to track whether the tuple is removed or only rerouted.
 

This behavior of partition key updates with regard to *user-defined*
AFTER triggers is documented:

https://www.postgresql.org/docs/current/trigger-definition.html

"As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER
INSERT triggers are applied; but AFTER UPDATE triggers are not applied
because the UPDATE has been converted to a DELETE and an INSERT."

I don't quite recall if the decision to implement it like this was
based on assuming that this is what users would like to see happen in
this case or the perceived difficulty of implementing it the other way
around, that is, of firing AFTER UPDATE triggers in this case.

As for the original issue of this thread, it happens to be fixed by
firing the *internal* AFTER UPDATE triggers that are involved in
enforcing the foreign key even if the UPDATE has been turned into
DELETE+INSERT, which it makes sense to do, because what can happen
today with CASCADE triggers does not seem like a useful behavior by
any measure.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Tue, Dec 22, 2020 at 4:16 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Dec 21, 2020 at 11:30 PM Arne Roland <A.Roland@index.de> wrote:
> > thanks for the quick reply! Sadly I have been busy and the second part of your patch does no longer apply in
src/include/nodes/execnodes.h:497.
>
> I don't see any problem with applying the patch.  Are you sure you're
> applying the patch to the correct git branch?  The patch is meant to
> be applied to the development (master) branch.

Sorry, it seems you are right and the 2nd patch indeed fails to apply to master.

Attached updated version.


--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Arne Roland
Date:
While I'd agree that simply deleting with "on delete cascade" on tuple rerouting is a strong enough violation of the spirit of partitioning changing that behavior, I am surprised by the idea to make a differentiation between fks and other triggers. The way user defined triggers work, is a violation to the same degree I get complaints about that on a monthly (if not weekly) basis. It's easy to point towards the docs, but the docs offer no solution to archive the behavior, that makes partitioning somewhat transparent. Most people I know don't partition because they like to complicate things, but just because they have to much data.

It isn't just a thing with after triggers. Imo before triggers are even worse: If we move a row between partitions we fire all three triggers at once (at different leaf pages obviously). It's easy to point towards the docs. Heart bleed was well documented, but I am happy that it was fixed. I don't want to imply this totally unrelated security issue has anything to do with our weird behavior. I just want to raise the question whether that's a good thing, because frankly I haven't met anyone thus far, who thought it is.

To me it seems the RI triggers are just a specific victim of the way we made triggers work in general.

What I tried to express, albeit I apparently failed: I care about having triggers, which make partitioning transparent, on the long run.

> because what can happen
> today with CASCADE triggers does not seem like a useful behavior by
> any measure.

I wholeheartedly agree. I just want to point out, that you could state the same about triggers in general.

Backwards compatibility is usually a pretty compelling argument. I would still want to raise the question, whether this should change, because I frankly think this is a bad idea.

You might ask: Why am I raising this question in this thread?

In case we can not (instantly) agree on the fact that this behavior should change, I'd still prefer to think about making that behavior possible for other triggers (possibly later) as well. One possibility would be having an entry in the catalog to mark when the trigger should fire.

I don't want to stall your definitely useful RI-Trigger fix, but I sincerely believe, that we have to do better with triggers in general.

If we would make the condition when to fire or not dependent something besides that fact whether the trigger is internal, we could at a later date choose to make both ways available, if anyone makes a good case for this. Even though I still think it's not worth it.

>I don't quite recall if the decision to implement it like this was
> based on assuming that this is what users would like to see happen in
> this case or the perceived difficulty of implementing it the other way
> around, that is, of firing AFTER UPDATE triggers in this case.

I tried to look that up, but I couldn't find any discussion about this. Do you have any ideas in which thread that was handled?

> Sorry, it seems you are right and the 2nd patch indeed fails to apply to master.

Thank you! I hope to have a more in depth look later this week.

Regards
Arne


Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi,

On Mon, Dec 28, 2020 at 10:01 PM Arne Roland <A.Roland@index.de> wrote:
> While I'd agree that simply deleting with "on delete cascade" on tuple rerouting is a strong enough violation of the
spiritof partitioning changing that behavior, I am surprised by the idea to make a differentiation between fks and
othertriggers. The way user defined triggers work, is a violation to the same degree I get complaints about that on a
monthly(if not weekly) basis. It's easy to point towards the docs, but the docs offer no solution to archive the
behavior,that makes partitioning somewhat transparent. Most people I know don't partition because they like to
complicatethings, but just because they have to much data. 
>
> It isn't just a thing with after triggers. Imo before triggers are even worse: If we move a row between partitions we
fireall three triggers at once (at different leaf pages obviously). It's easy to point towards the docs. Heart bleed
waswell documented, but I am happy that it was fixed. I don't want to imply this totally unrelated security issue has
anythingto do with our weird behavior. I just want to raise the question whether that's a good thing, because frankly I
haven'tmet anyone thus far, who thought it is. 

Just to be clear, are you only dissatisfied with the firing of
triggers during a row-moving UPDATEs on partitioned tables or all
firing behaviors of triggers defined on partitioned tables?  Can you
give a specific example of the odd behavior in that case?

> To me it seems the RI triggers are just a specific victim of the way we made triggers work in general.

That's true.

> What I tried to express, albeit I apparently failed: I care about having triggers, which make partitioning
transparent,on the long run. 
>
> > because what can happen
> > today with CASCADE triggers does not seem like a useful behavior by
> > any measure.
>
> I wholeheartedly agree. I just want to point out, that you could state the same about triggers in general.
>
> Backwards compatibility is usually a pretty compelling argument. I would still want to raise the question, whether
thisshould change, because I frankly think this is a bad idea. 
>
> You might ask: Why am I raising this question in this thread?
>
> In case we can not (instantly) agree on the fact that this behavior should change, I'd still prefer to think about
makingthat behavior possible for other triggers (possibly later) as well. One possibility would be having an entry in
thecatalog to mark when the trigger should fire. 

Sorry, I don't understand what new setting for triggers you are
thinking of here.

> I don't want to stall your definitely useful RI-Trigger fix, but I sincerely believe, that we have to do better with
triggersin general. 
>
> If we would make the condition when to fire or not dependent something besides that fact whether the trigger is
internal,we could at a later date choose to make both ways available, if anyone makes a good case for this. Even though
Istill think it's not worth it. 
>
> >I don't quite recall if the decision to implement it like this was
> > based on assuming that this is what users would like to see happen in
> > this case or the perceived difficulty of implementing it the other way
> > around, that is, of firing AFTER UPDATE triggers in this case.
>
> I tried to look that up, but I couldn't find any discussion about this. Do you have any ideas in which thread that
washandled? 

It was discussed here:

https://www.postgresql.org/message-id/flat/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com

It's a huge discussion, so you'll have to ctrl+f "trigger" to spot
relevant emails.  You might notice that the developers who
participated in that discussion gave various opinions and what we have
today got there as a result of a majority of them voting for the
current approach.  Someone also said this during the discussion:
"Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this. I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people." So what we've got is that
"halfway-reasonable" thing, YMMV. :)


--
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Peter Eisentraut
Date:
On 2021-01-08 09:54, Amit Langote wrote:
>>> I don't quite recall if the decision to implement it like this was
>>> based on assuming that this is what users would like to see happen in
>>> this case or the perceived difficulty of implementing it the other way
>>> around, that is, of firing AFTER UPDATE triggers in this case.
>> I tried to look that up, but I couldn't find any discussion about this. Do you have any ideas in which thread that
washandled?
 
> It was discussed here:
> 
> https://www.postgresql.org/message-id/flat/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com
> 
> It's a huge discussion, so you'll have to ctrl+f "trigger" to spot
> relevant emails.  You might notice that the developers who
> participated in that discussion gave various opinions and what we have
> today got there as a result of a majority of them voting for the
> current approach.  Someone also said this during the discussion:
> "Regarding the trigger issue, I can't claim to have a terribly strong
> opinion on this. I think that practically anything we do here might
> upset somebody, but probably any halfway-reasonable thing we choose to
> do will be OK for most people." So what we've got is that
> "halfway-reasonable" thing, YMMV. :)

Could you summarize here what you are trying to do with respect to what 
was decided before?  I'm a bit confused, looking through the patches you 
have posted.  The first patch you posted hard-coded FK trigger OIDs 
specifically, other patches talk about foreign key triggers in general 
or special case internal triggers or talk about all triggers.



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Wed, Jan 20, 2021 at 4:13 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> On 2021-01-08 09:54, Amit Langote wrote:
> >>> I don't quite recall if the decision to implement it like this was
> >>> based on assuming that this is what users would like to see happen in
> >>> this case or the perceived difficulty of implementing it the other way
> >>> around, that is, of firing AFTER UPDATE triggers in this case.
> >> I tried to look that up, but I couldn't find any discussion about this. Do you have any ideas in which thread that
washandled?
 
> > It was discussed here:
> >
> >
https://www.postgresql.org/message-id/flat/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com
> >
> > It's a huge discussion, so you'll have to ctrl+f "trigger" to spot
> > relevant emails.  You might notice that the developers who
> > participated in that discussion gave various opinions and what we have
> > today got there as a result of a majority of them voting for the
> > current approach.  Someone also said this during the discussion:
> > "Regarding the trigger issue, I can't claim to have a terribly strong
> > opinion on this. I think that practically anything we do here might
> > upset somebody, but probably any halfway-reasonable thing we choose to
> > do will be OK for most people." So what we've got is that
> > "halfway-reasonable" thing, YMMV. :)
>
> Could you summarize here what you are trying to do with respect to what
> was decided before?  I'm a bit confused, looking through the patches you
> have posted.  The first patch you posted hard-coded FK trigger OIDs
> specifically, other patches talk about foreign key triggers in general
> or special case internal triggers or talk about all triggers.

The original problem statement is this: the way we generally fire
row-level triggers of a partitioned table can lead to some unexpected
behaviors of the foreign keys pointing to that partitioned table
during its cross-partition updates.

Let's start with an example that shows how triggers are fired during a
cross-partition update:

create table p (a numeric primary key) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create or replace function report_trig_details() returns trigger as $$
begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
'DELETE' then return old; end if; return new; end; $$ language
plpgsql;
create trigger trig1 before update on p for each row execute function
report_trig_details();
create trigger trig2 after update on p for each row execute function
report_trig_details();
create trigger trig3 before delete on p for each row execute function
report_trig_details();
create trigger trig4 after delete on p for each row execute function
report_trig_details();
create trigger trig5 before insert on p for each row execute function
report_trig_details();
create trigger trig6 after insert on p for each row execute function
report_trig_details();

insert into p values (1);
update p set a = 2;
NOTICE:  BEFORE UPDATE on p1
NOTICE:  BEFORE DELETE on p1
NOTICE:  BEFORE INSERT on p2
NOTICE:  AFTER DELETE on p1
NOTICE:  AFTER INSERT on p2
UPDATE 1

(AR update triggers are not fired.)

For contrast, here is an intra-partition update:

update p set a = a;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  AFTER UPDATE on p2
UPDATE 1

Now, the trigger machinery makes no distinction between user-defined
and internal triggers, which has implications for the foreign key
enforcing triggers on partitions.  Consider the following example:

create table q (a bigint references p);
insert into q values (2);
update p set a = 1;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  BEFORE DELETE on p2
NOTICE:  BEFORE INSERT on p1
ERROR:  update or delete on table "p2" violates foreign key constraint
"q_a_fkey2" on table "q"
DETAIL:  Key (a)=(2) is still referenced from table "q".

So the RI delete trigger (NOT update) on p2 prevents the DELETE that
occurs as part of the row movement.  One might make the updates
cascade and expect that to prevent the error:

drop table q;
create table q (a bigint references p on update cascade);
insert into q values (2);
update p set a = 1;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  BEFORE DELETE on p2
NOTICE:  BEFORE INSERT on p1
ERROR:  update or delete on table "p2" violates foreign key constraint
"q_a_fkey2" on table "q"
DETAIL:  Key (a)=(2) is still referenced from table "q".

No luck, because again it's the RI delete trigger on p2 that gets
fired.  If you make deletes cascade too, an even worse thing happens:

drop table q;
create table q (a bigint references p on update cascade on delete cascade);
insert into q values (2);
update p set a = 1;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  BEFORE DELETE on p2
NOTICE:  BEFORE INSERT on p1
NOTICE:  AFTER DELETE on p2
NOTICE:  AFTER INSERT on p1
UPDATE 1
select * from q;
 a
---
(0 rows)

The RI delete trigger deleted 2 from q, whereas the expected result
would've been for q to be updated to change 2 to 1.

This shows that the way we've made these triggers behave in general
can cause some unintended behaviors for foreign keys during
cross-partition updates.  I started this thread to do something about
that and sent a patch to prevent cross-partition updates at all when
there are foreign keys pointing to it.  As others pointed out, that's
not a great long-term solution to the problem, but that's what we may
have to do in the back-branches if anything at all.

So I wrote another patch targeting the dev branch to make the
cross-partition updates work while producing a sane foreign key
behavior.  The idea of the patch is to tweak the firing of AFTER
triggers such that unintended RI triggers don't get fired, that is,
those corresponding to DELETE and INSERT occurring internally as part
of a cross-partition update.  Instead we now fire the AFTER UPDATE
triggers, passing the root table as the target result relation (not
the source partition because the new row doesn't belong to it).  This
results in the same foreign key behavior as when no partitioning is
involved at all.

Then, Arne came along and suggested that we do this kind of firing for
*all* triggers, not just the internal RI triggers, or at least that's
what I understood Arne as saying.  That however would be changing the
original design of cross-partition updates and will change the
documented user-visible trigger behavior.  Changing this for internal
triggers like the patch does changes no user-visible behavior, AFAIK,
other than fixing the foreign key annoyance.  So I said if we do want
to go the way that Arne is suggesting, it should be its own discussion
and that's that.

Sorry for a long "summary", but I hope it helps clarify things somewhat.



--
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Wed, Jan 20, 2021 at 7:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jan 20, 2021 at 4:13 PM Peter Eisentraut
> > Could you summarize here what you are trying to do with respect to what
> > was decided before?  I'm a bit confused, looking through the patches you
> > have posted.  The first patch you posted hard-coded FK trigger OIDs
> > specifically, other patches talk about foreign key triggers in general
> > or special case internal triggers or talk about all triggers.
>
> The original problem statement is this: the way we generally fire
> row-level triggers of a partitioned table can lead to some unexpected
> behaviors of the foreign keys pointing to that partitioned table
> during its cross-partition updates.
>
> Let's start with an example that shows how triggers are fired during a
> cross-partition update:
>
> create table p (a numeric primary key) partition by list (a);
> create table p1 partition of p for values in (1);
> create table p2 partition of p for values in (2);
> create or replace function report_trig_details() returns trigger as $$
> begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
> 'DELETE' then return old; end if; return new; end; $$ language
> plpgsql;
> create trigger trig1 before update on p for each row execute function
> report_trig_details();
> create trigger trig2 after update on p for each row execute function
> report_trig_details();
> create trigger trig3 before delete on p for each row execute function
> report_trig_details();
> create trigger trig4 after delete on p for each row execute function
> report_trig_details();
> create trigger trig5 before insert on p for each row execute function
> report_trig_details();
> create trigger trig6 after insert on p for each row execute function
> report_trig_details();
>
> insert into p values (1);
> update p set a = 2;
> NOTICE:  BEFORE UPDATE on p1
> NOTICE:  BEFORE DELETE on p1
> NOTICE:  BEFORE INSERT on p2
> NOTICE:  AFTER DELETE on p1
> NOTICE:  AFTER INSERT on p2
> UPDATE 1
>
> (AR update triggers are not fired.)
>
> For contrast, here is an intra-partition update:
>
> update p set a = a;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  AFTER UPDATE on p2
> UPDATE 1
>
> Now, the trigger machinery makes no distinction between user-defined
> and internal triggers, which has implications for the foreign key
> enforcing triggers on partitions.  Consider the following example:
>
> create table q (a bigint references p);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> ERROR:  update or delete on table "p2" violates foreign key constraint
> "q_a_fkey2" on table "q"
> DETAIL:  Key (a)=(2) is still referenced from table "q".
>
> So the RI delete trigger (NOT update) on p2 prevents the DELETE that
> occurs as part of the row movement.  One might make the updates
> cascade and expect that to prevent the error:
>
> drop table q;
> create table q (a bigint references p on update cascade);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> ERROR:  update or delete on table "p2" violates foreign key constraint
> "q_a_fkey2" on table "q"
> DETAIL:  Key (a)=(2) is still referenced from table "q".
>
> No luck, because again it's the RI delete trigger on p2 that gets
> fired.  If you make deletes cascade too, an even worse thing happens:
>
> drop table q;
> create table q (a bigint references p on update cascade on delete cascade);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> NOTICE:  AFTER DELETE on p2
> NOTICE:  AFTER INSERT on p1
> UPDATE 1
> select * from q;
>  a
> ---
> (0 rows)
>
> The RI delete trigger deleted 2 from q, whereas the expected result
> would've been for q to be updated to change 2 to 1.
>
> This shows that the way we've made these triggers behave in general
> can cause some unintended behaviors for foreign keys during
> cross-partition updates.  I started this thread to do something about
> that and sent a patch to prevent cross-partition updates at all when
> there are foreign keys pointing to it.  As others pointed out, that's
> not a great long-term solution to the problem, but that's what we may
> have to do in the back-branches if anything at all.
>
> So I wrote another patch targeting the dev branch to make the
> cross-partition updates work while producing a sane foreign key
> behavior.  The idea of the patch is to tweak the firing of AFTER
> triggers such that unintended RI triggers don't get fired, that is,
> those corresponding to DELETE and INSERT occurring internally as part
> of a cross-partition update.  Instead we now fire the AFTER UPDATE
> triggers, passing the root table as the target result relation (not
> the source partition because the new row doesn't belong to it).  This
> results in the same foreign key behavior as when no partitioning is
> involved at all.
>
> Then, Arne came along and suggested that we do this kind of firing for
> *all* triggers, not just the internal RI triggers, or at least that's
> what I understood Arne as saying.  That however would be changing the
> original design of cross-partition updates and will change the
> documented user-visible trigger behavior.  Changing this for internal
> triggers like the patch does changes no user-visible behavior, AFAIK,
> other than fixing the foreign key annoyance.  So I said if we do want
> to go the way that Arne is suggesting, it should be its own discussion
> and that's that.
>
> Sorry for a long "summary", but I hope it helps clarify things somewhat.

Here is an updated version of the patch with some cosmetic changes
from the previous version.  I moved the code being added to
AfterTriggerSaveEvent() and ExecUpdate() into separate subroutines to
improve readability, hopefully.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Masahiko Sawada
Date:
On Wed, Jan 20, 2021 at 7:04 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Wed, Jan 20, 2021 at 4:13 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
> > On 2021-01-08 09:54, Amit Langote wrote:
> > >>> I don't quite recall if the decision to implement it like this was
> > >>> based on assuming that this is what users would like to see happen in
> > >>> this case or the perceived difficulty of implementing it the other way
> > >>> around, that is, of firing AFTER UPDATE triggers in this case.
> > >> I tried to look that up, but I couldn't find any discussion about this. Do you have any ideas in which thread
thatwas handled?
 
> > > It was discussed here:
> > >
> > >
https://www.postgresql.org/message-id/flat/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com
> > >
> > > It's a huge discussion, so you'll have to ctrl+f "trigger" to spot
> > > relevant emails.  You might notice that the developers who
> > > participated in that discussion gave various opinions and what we have
> > > today got there as a result of a majority of them voting for the
> > > current approach.  Someone also said this during the discussion:
> > > "Regarding the trigger issue, I can't claim to have a terribly strong
> > > opinion on this. I think that practically anything we do here might
> > > upset somebody, but probably any halfway-reasonable thing we choose to
> > > do will be OK for most people." So what we've got is that
> > > "halfway-reasonable" thing, YMMV. :)
> >
> > Could you summarize here what you are trying to do with respect to what
> > was decided before?  I'm a bit confused, looking through the patches you
> > have posted.  The first patch you posted hard-coded FK trigger OIDs
> > specifically, other patches talk about foreign key triggers in general
> > or special case internal triggers or talk about all triggers.
>
> The original problem statement is this: the way we generally fire
> row-level triggers of a partitioned table can lead to some unexpected
> behaviors of the foreign keys pointing to that partitioned table
> during its cross-partition updates.
>
> Let's start with an example that shows how triggers are fired during a
> cross-partition update:
>
> create table p (a numeric primary key) partition by list (a);
> create table p1 partition of p for values in (1);
> create table p2 partition of p for values in (2);
> create or replace function report_trig_details() returns trigger as $$
> begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
> 'DELETE' then return old; end if; return new; end; $$ language
> plpgsql;
> create trigger trig1 before update on p for each row execute function
> report_trig_details();
> create trigger trig2 after update on p for each row execute function
> report_trig_details();
> create trigger trig3 before delete on p for each row execute function
> report_trig_details();
> create trigger trig4 after delete on p for each row execute function
> report_trig_details();
> create trigger trig5 before insert on p for each row execute function
> report_trig_details();
> create trigger trig6 after insert on p for each row execute function
> report_trig_details();
>
> insert into p values (1);
> update p set a = 2;
> NOTICE:  BEFORE UPDATE on p1
> NOTICE:  BEFORE DELETE on p1
> NOTICE:  BEFORE INSERT on p2
> NOTICE:  AFTER DELETE on p1
> NOTICE:  AFTER INSERT on p2
> UPDATE 1
>
> (AR update triggers are not fired.)
>
> For contrast, here is an intra-partition update:
>
> update p set a = a;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  AFTER UPDATE on p2
> UPDATE 1
>
> Now, the trigger machinery makes no distinction between user-defined
> and internal triggers, which has implications for the foreign key
> enforcing triggers on partitions.  Consider the following example:
>
> create table q (a bigint references p);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> ERROR:  update or delete on table "p2" violates foreign key constraint
> "q_a_fkey2" on table "q"
> DETAIL:  Key (a)=(2) is still referenced from table "q".
>
> So the RI delete trigger (NOT update) on p2 prevents the DELETE that
> occurs as part of the row movement.  One might make the updates
> cascade and expect that to prevent the error:
>
> drop table q;
> create table q (a bigint references p on update cascade);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> ERROR:  update or delete on table "p2" violates foreign key constraint
> "q_a_fkey2" on table "q"
> DETAIL:  Key (a)=(2) is still referenced from table "q".
>
> No luck, because again it's the RI delete trigger on p2 that gets
> fired.  If you make deletes cascade too, an even worse thing happens:
>
> drop table q;
> create table q (a bigint references p on update cascade on delete cascade);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> NOTICE:  AFTER DELETE on p2
> NOTICE:  AFTER INSERT on p1
> UPDATE 1
> select * from q;
>  a
> ---
> (0 rows)
>
> The RI delete trigger deleted 2 from q, whereas the expected result
> would've been for q to be updated to change 2 to 1.

Thank you for a good summary. That's helpful to catch up on this thread.

>
> This shows that the way we've made these triggers behave in general
> can cause some unintended behaviors for foreign keys during
> cross-partition updates.  I started this thread to do something about
> that and sent a patch to prevent cross-partition updates at all when
> there are foreign keys pointing to it.  As others pointed out, that's
> not a great long-term solution to the problem, but that's what we may
> have to do in the back-branches if anything at all.

I've started by reviewing the patch for back-patching that the first
patch you posted[1].

+       for (i = 0; i < trigdesc->numtriggers; i++)
+       {
+           Trigger *trigger = &trigdesc->triggers[i];
+
+           if (trigger->tgisinternal &&
+               OidIsValid(trigger->tgconstrrelid) &&
+               trigger->tgfoid == F_RI_FKEY_CASCADE_DEL)
+           {
+               found = true;
+               break;
+           }
+       }

IIUC the above checks if the partition table is referenced by a
foreign key constraint on another table with ON DELETE CASCADE option.
I think we should prevent cross-partition update also when ON DELETE
SET NULL and ON DELETE SET DEFAULT. For example, with the patch, a
tuple in a partition table is still updated to NULL when
cross-partition update as follows:

postgres=# create table p (a numeric primary key) partition by list (a);
CREATE TABLE
postgres=# create table p1 partition of p for values in (1);
CREATE TABLE
postgres=# create table p2 partition of p for values in (2);
CREATE TABLE
postgres=# insert into p values (1);
INSERT 0 1
postgres=# create table q (a int references p(a) on delete set null);
CREATE TABLE
postgres=# insert into q values (1);
INSERT 0 1
postgres=# update p set a = 2;
UPDATE 1
postgres=# table p;
 a
---
 2
(1 row)

postgres=# table q;
 a
---

(1 row)

Regards,

[1] https://www.postgresql.org/message-id/CA%2BHiwqFvkBCmfwkQX_yBqv2Wz8ugUGiBDxum8%3DWvVbfU1TXaNg%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Thank you for looking at this.

On Mon, Feb 15, 2021 at 4:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Jan 20, 2021 at 7:04 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > This shows that the way we've made these triggers behave in general
> > can cause some unintended behaviors for foreign keys during
> > cross-partition updates.  I started this thread to do something about
> > that and sent a patch to prevent cross-partition updates at all when
> > there are foreign keys pointing to it.  As others pointed out, that's
> > not a great long-term solution to the problem, but that's what we may
> > have to do in the back-branches if anything at all.
>
> I've started by reviewing the patch for back-patching that the first
> patch you posted[1].
>
> +       for (i = 0; i < trigdesc->numtriggers; i++)
> +       {
> +           Trigger *trigger = &trigdesc->triggers[i];
> +
> +           if (trigger->tgisinternal &&
> +               OidIsValid(trigger->tgconstrrelid) &&
> +               trigger->tgfoid == F_RI_FKEY_CASCADE_DEL)
> +           {
> +               found = true;
> +               break;
> +           }
> +       }
>
> IIUC the above checks if the partition table is referenced by a
> foreign key constraint on another table with ON DELETE CASCADE option.
> I think we should prevent cross-partition update also when ON DELETE
> SET NULL and ON DELETE SET DEFAULT. For example, with the patch, a
> tuple in a partition table is still updated to NULL when
> cross-partition update as follows:
>
> postgres=# create table p (a numeric primary key) partition by list (a);
> CREATE TABLE
> postgres=# create table p1 partition of p for values in (1);
> CREATE TABLE
> postgres=# create table p2 partition of p for values in (2);
> CREATE TABLE
> postgres=# insert into p values (1);
> INSERT 0 1
> postgres=# create table q (a int references p(a) on delete set null);
> CREATE TABLE
> postgres=# insert into q values (1);
> INSERT 0 1
> postgres=# update p set a = 2;
> UPDATE 1
> postgres=# table p;
>  a
> ---
>  2
> (1 row)
>
> postgres=# table q;
>  a
> ---
>
> (1 row)

Yeah, I agree that's not good.

Actually, I had meant to send an updated version of the patch to apply
in back-branches that would prevent such a cross-partition update, but
never did since starting to work on the patch for HEAD.  I have
attached it here.

Regarding the patch, I would have liked if it only prevented the
update when the foreign key that would be violated by the component
delete references the update query's main target relation.  If the
foreign key references the source partition directly, then the delete
would be harmless.  However there's not enough infrastructure in v12,
v13 branches to determine that, which makes back-patching this a bit
hard.  For example, there's no way to tell in the back-branches if the
foreign-key-enforcing triggers of a leaf partition have descended from
the parent table.  IOW, I am not so sure anymore if we should try to
do anything in the back-branches.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Rahila Syed
Date:
Hi Amit,



Here is an updated version of the patch with some cosmetic changes
from the previous version.  I moved the code being added to
AfterTriggerSaveEvent() and ExecUpdate() into separate subroutines to
improve readability, hopefully.

I tested these patches. It works as expected in case of cross partition updates, by correctly updating the
referencing table.  It works fine for ON UPDATE SET NULL and SET DEFAULT options as well. 
Also, tested by having a table reference only a partition and not the parent. In this case, the delete 
trigger is correctly called when the row is moved out of referenced partition. 

The partition-key-update-1.spec test fails with the following error message appearing in the diffs.

 step s1u3pc: UPDATE foo_range_parted SET a=11 WHERE a=7;
+ERROR:  cannot move row being updated to another partition

I think the documentation update is missing from the patches. 
 
Thank you,
Rahila Syed

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi Rahila,

Thanks for the review.

On Thu, Feb 18, 2021 at 7:08 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
>> Here is an updated version of the patch with some cosmetic changes
>> from the previous version.  I moved the code being added to
>> AfterTriggerSaveEvent() and ExecUpdate() into separate subroutines to
>> improve readability, hopefully.
>>
> I tested these patches.
>
> It works as expected in case of cross partition updates, by correctly updating the
> referencing table.  It works fine for ON UPDATE SET NULL and SET DEFAULT options as well.
> Also, tested by having a table reference only a partition and not the parent. In this case, the delete
> trigger is correctly called when the row is moved out of referenced partition.

I assume these are comments for the v3-0001 & v3-0002 patches...

> The partition-key-update-1.spec test fails with the following error message appearing in the diffs.
>
>  step s1u3pc: UPDATE foo_range_parted SET a=11 WHERE a=7;
> +ERROR:  cannot move row being updated to another partition

...whereas, this error happens with the patch I posted in my last
email (prevent-row-movement-on-pk-table.patch) that is not meant to be
considered for HEAD, but for back-branches (if at all).  I also see
that cfbot got triggered by it and shows the same failure.  I am not
going to try to take care of these failures unless we want to do
something in the back-branches.

To be clear, patches for HEAD do pass make check-world.

> I think the documentation update is missing from the patches.

Hmm, I don't think we document the behavior that is improved by the v3
patches as a limitation of any existing feature, neither of foreign
keys referencing partitioned tables nor of the update row movement
feature.  So maybe there's nothing in the existing documentation that
is to be updated.

However, the patch does add a new error message for a case that the
patch doesn't handle, so maybe we could document that as a limitation.
Not sure if in the Notes section of the UPDATE reference page which
has some notes on row movement or somewhere else.  Do you have
suggestions?

Attaching rebased version of the patches for HEAD to appease the cfbot.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Masahiko Sawada
Date:
On Mon, Feb 15, 2021 at 10:37 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Thank you for looking at this.
>
> On Mon, Feb 15, 2021 at 4:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Wed, Jan 20, 2021 at 7:04 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > This shows that the way we've made these triggers behave in general
> > > can cause some unintended behaviors for foreign keys during
> > > cross-partition updates.  I started this thread to do something about
> > > that and sent a patch to prevent cross-partition updates at all when
> > > there are foreign keys pointing to it.  As others pointed out, that's
> > > not a great long-term solution to the problem, but that's what we may
> > > have to do in the back-branches if anything at all.
> >
> > I've started by reviewing the patch for back-patching that the first
> > patch you posted[1].
> >
> > +       for (i = 0; i < trigdesc->numtriggers; i++)
> > +       {
> > +           Trigger *trigger = &trigdesc->triggers[i];
> > +
> > +           if (trigger->tgisinternal &&
> > +               OidIsValid(trigger->tgconstrrelid) &&
> > +               trigger->tgfoid == F_RI_FKEY_CASCADE_DEL)
> > +           {
> > +               found = true;
> > +               break;
> > +           }
> > +       }
> >
> > IIUC the above checks if the partition table is referenced by a
> > foreign key constraint on another table with ON DELETE CASCADE option.
> > I think we should prevent cross-partition update also when ON DELETE
> > SET NULL and ON DELETE SET DEFAULT. For example, with the patch, a
> > tuple in a partition table is still updated to NULL when
> > cross-partition update as follows:
> >
> > postgres=# create table p (a numeric primary key) partition by list (a);
> > CREATE TABLE
> > postgres=# create table p1 partition of p for values in (1);
> > CREATE TABLE
> > postgres=# create table p2 partition of p for values in (2);
> > CREATE TABLE
> > postgres=# insert into p values (1);
> > INSERT 0 1
> > postgres=# create table q (a int references p(a) on delete set null);
> > CREATE TABLE
> > postgres=# insert into q values (1);
> > INSERT 0 1
> > postgres=# update p set a = 2;
> > UPDATE 1
> > postgres=# table p;
> >  a
> > ---
> >  2
> > (1 row)
> >
> > postgres=# table q;
> >  a
> > ---
> >
> > (1 row)
>
> Yeah, I agree that's not good.
>
> Actually, I had meant to send an updated version of the patch to apply
> in back-branches that would prevent such a cross-partition update, but
> never did since starting to work on the patch for HEAD.  I have
> attached it here.

Thank you for updating the patch!

>
> Regarding the patch, I would have liked if it only prevented the
> update when the foreign key that would be violated by the component
> delete references the update query's main target relation.  If the
> foreign key references the source partition directly, then the delete
> would be harmless. However there's not enough infrastructure in v12,
> v13 branches to determine that, which makes back-patching this a bit
> hard.  For example, there's no way to tell in the back-branches if the
> foreign-key-enforcing triggers of a leaf partition have descended from
> the parent table.  IOW, I am not so sure anymore if we should try to
> do anything in the back-branches.

Hmm I'm not sure the necessity of figuring out foreign-key-enforcing
triggers of a leaf partition have descended from the parent table. I
might be missing something but even if the foreign key references the
leaf partition directly, the same problem could happen if doing a
cross-partition update, is that right?

Also, the updated patch prevents a cross-partition update even when
the foreign key references another column of itself. This kind of
cross-update works as expected for now so it seems not to need to
disallow that. What do you think? The scenario is:

create table p (a int primary key, b int references p(a) on delete
cascade) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
insert into p values (1, 1);
update p set a = 2, b = 2 where a = 1;
ERROR:  cannot move row being updated to another partition
DETAIL:  Moving the row may cause a foreign key involving the source
partition to be violated.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Sawada-san,

On Wed, Mar 10, 2021 at 4:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I looked at the 0001 patch and here are random comments. Please ignore
> a comment if it is already discussed.

Thanks a lot for the review and sorry for the delay in replying.

> ---
> @@ -9077,7 +9102,8 @@ addFkRecurseReferenced(List **wqueue, Constraint
> *fkconstraint, Relation rel,
>                                    partIndexId, constrOid, numfks,
>                                    mapped_pkattnum, fkattnum,
>                                    pfeqoperators, ppeqoperators, ffeqoperators,
> -                                  old_check_ok);
> +                                  old_check_ok,
> +                                  deleteTriggerOid, updateTriggerOid);
>
>             /* Done -- clean up (but keep the lock) */
>             table_close(partRel, NoLock);
> @@ -9126,8 +9152,12 @@ addFkRecurseReferencing(List **wqueue,
> Constraint *fkconstraint, Relation rel,
>                         Relation pkrel, Oid indexOid, Oid parentConstr,
>                         int numfks, int16 *pkattnum, int16 *fkattnum,
>                         Oid *pfeqoperators, Oid *ppeqoperators, Oid
> *ffeqoperators,
> -                       bool old_check_ok, LOCKMODE lockmode)
> +                       bool old_check_ok, LOCKMODE lockmode,
> +                       Oid parentInsTrigger, Oid parentUpdTrigger)
>  {
>
> We need to update the function comments as well.

Done.

> ---
> I think it's better to add comments for newly added functions such as
> GetForeignKeyActionTriggers() and GetForeignKeyCheckTriggers() etc.
> Those functions have no comment at all.

I've added comments.

> BTW, those two functions out of newly added four functions:
> AttachForeignKeyCheckTriggers() and DetachForeignKeyCheckTriggers(),
> have only one user. Can we past the functions body at where each
> function is called?

I made those pieces of code into functions because I thought future
patches may have a need for them.  But maybe those future patches
should do the refactoring, so I've incorporated their code into the
respective callers as you suggest.

> ---
>     /*
>      * If the referenced table is a plain relation, create the action triggers
>      * that enforce the constraint.
>      */
> -   if (pkrel->rd_rel->relkind == RELKIND_RELATION)
> -   {
> -       createForeignKeyActionTriggers(rel, RelationGetRelid(pkrel),
> -                                      fkconstraint,
> -                                      constrOid, indexOid);
> -   }
> +   createForeignKeyActionTriggers(rel, RelationGetRelid(pkrel),
> +                                  fkconstraint,
> +                                  constrOid, indexOid,
> +                                  parentDelTrigger, parentUpdTrigger,
> +                                  &deleteTriggerOid, &updateTriggerOid);
>
> The comment needs to be updated.
> ---
>      /*
>       * If the referencing relation is a plain table, add the check triggers to
>       * it and, if necessary, schedule it to be checked in Phase 3.
>       *
>       * If the relation is partitioned, drill down to do it to its partitions.
>       */
> +    createForeignKeyCheckTriggers(RelationGetRelid(rel),
> +                                  RelationGetRelid(pkrel),
> +                                  fkconstraint,
> +                                  parentConstr,
> +                                  indexOid,
> +                                  parentInsTrigger, parentUpdTrigger,
> +                                  &insertTriggerOid, &updateTriggerOid);
>
> Same as above.

Done and done.

> ---
> I think TriggerSetParentTrigger needs to free the heap tuple copied by
> heap_copytuple().

Oops, done.

> ---
> +               Assert(trigForm->tgparentid == 0);
> +               if (trigForm->tgparentid != InvalidOid)
> +                       elog(ERROR, "trigger %u already has a parent trigger",
> +                                childTrigId);
>
> I think the first condition in Assert() can be
> !OidIsValid(trigForm->tgparentid) and the second condition in 'if'
> statement can be OidIsValid(trigForm->tgparentid). So those are
> redundant?

Ah, indeed.  I've kept the if () elog(...) after applying your suggested change.

> ---
> -       if (fout->remoteVersion >= 90000)
> +       if (fout->remoteVersion >= 130000)
> +       {
> +           /*
> +            * NB: think not to use pretty=true in pg_get_triggerdef.  It
> +            * could result in non-forward-compatible dumps of WHEN clauses
> +            * due to under-parenthesization.
> +            */
> +           appendPQExpBuffer(query,
> +                             "SELECT tgname, "
> +                             "tgfoid::pg_catalog.regproc AS tgfname, "
> +                             "pg_catalog.pg_get_triggerdef(oid,
> false) AS tgdef, "
> +                             "tgenabled, tableoid, oid "
> +                             "FROM pg_catalog.pg_trigger t "
> +                             "WHERE tgrelid = '%u'::pg_catalog.oid "
> +                             "AND NOT tgisinternal "
> +                             "AND tgparentid = 0",
> +                             tbinfo->dobj.catId.oid);
> +       }
> +       else if (fout->remoteVersion >= 90000)
>
> You meant 140000 instead of 130000?

I think I used 130000 because tgparentid was added in v13, but maybe
140000 is right, because for v13 the existing condition (NOT
tgisnternal) already gets us the intended triggers.

> Or is this change really needed? This change added one condition
> "tgparentid = 0" but IIUC I think triggers that are NOT tgisinternal
> are always tgparentid = 0. Also, it seems it is true both before and
> after this patch.

Actually, as noted in the commit message, I'm intending to change
tgisnternal to only be true for triggers generated by foreign keys and
no longer for partitions' user-defined triggers that are inherited.
So whereas NOT tgisnternal would suffice to exclude partitions'
inherited triggers before, that would no longer be the case with this
patch; AND tgparentid = 0 will be needed for that.

> ---
> @@ -2973,11 +2973,7 @@ describeOneTableDetails(const char *schemaname,
>                            "       AND u.tgparentid = 0) AS parent" :
>                            "NULL AS parent"),
>                           oid);
> -       if (pset.sversion >= 110000)
> -           appendPQExpBufferStr(&buf, "(NOT t.tgisinternal OR
> (t.tgisinternal AND t.tgenabled = 'D') \n"
> -                                "    OR EXISTS (SELECT 1 FROM
> pg_catalog.pg_depend WHERE objid = t.oid \n"
> -                                "        AND refclassid =
> 'pg_catalog.pg_trigger'::pg_catalog.regclass))");
> -       else if (pset.sversion >= 90000)
> +       if (pset.sversion >= 90000)
>
> I think we cannot remove this code. For instance, in PG13 since the
> trisinternal of a user-defined trigger that has descended from its
> parent table is true, executing \d against PG13 by the patched psql
> won't show that trigger.

I think you're right.  I simply needed to change the if condition as follows:

-       if (pset.sversion >= 110000)
+       if (pset.sversion >= 110000 && pset.sversion < 140000)

Note that without this change, this code ends up revealing partitions'
foreign key triggers, because we will now be marking them dependent on
their parent trigger, which wasn't the case before this patch.

> I'll look at 0002 patch.

Actually, I found a big hole in my assumptions around deferrable
foreign constraints, invalidating the approach I took in 0002 to use a
query-lifetime tuplestore to record root parent tuples.  I'm trying to
find a way to make the tuplestore transaction-lifetime so that the
patch still works.

In the meantime, I'm attaching an updated set with 0001 changed per
your comments.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Masahiko Sawada
Date:
On Tue, Mar 23, 2021 at 6:27 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Sawada-san,
>
> On Wed, Mar 10, 2021 at 4:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I looked at the 0001 patch and here are random comments. Please ignore
> > a comment if it is already discussed.
>
> Thanks a lot for the review and sorry for the delay in replying.

No problem. Sorry for the late reply too.

> > Or is this change really needed? This change added one condition
> > "tgparentid = 0" but IIUC I think triggers that are NOT tgisinternal
> > are always tgparentid = 0. Also, it seems it is true both before and
> > after this patch.
>
> Actually, as noted in the commit message, I'm intending to change
> tgisnternal to only be true for triggers generated by foreign keys and
> no longer for partitions' user-defined triggers that are inherited.
> So whereas NOT tgisnternal would suffice to exclude partitions'
> inherited triggers before, that would no longer be the case with this
> patch; AND tgparentid = 0 will be needed for that.

Understood.

>
> Actually, I found a big hole in my assumptions around deferrable
> foreign constraints, invalidating the approach I took in 0002 to use a
> query-lifetime tuplestore to record root parent tuples.  I'm trying to
> find a way to make the tuplestore transaction-lifetime so that the
> patch still works.
>
> In the meantime, I'm attaching an updated set with 0001 changed per
> your comments.

0001 patch conflicts with 71f4c8c6f74. Could you please rebase the patchset?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Thu, Apr 1, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Mar 23, 2021 at 6:27 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Actually, I found a big hole in my assumptions around deferrable
> > foreign constraints, invalidating the approach I took in 0002 to use a
> > query-lifetime tuplestore to record root parent tuples.  I'm trying to
> > find a way to make the tuplestore transaction-lifetime so that the
> > patch still works.
> >
> > In the meantime, I'm attaching an updated set with 0001 changed per
> > your comments.
>
> 0001 patch conflicts with 71f4c8c6f74. Could you please rebase the patchset?

Thanks for the heads up.

I still don't have a working patch to address the above mentioned
shortcoming of the previous approach, but here is a rebased version in
the meantime.


--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Ibrar Ahmed
Date:


On Fri, Apr 2, 2021 at 6:09 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, Apr 1, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Mar 23, 2021 at 6:27 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Actually, I found a big hole in my assumptions around deferrable
> > foreign constraints, invalidating the approach I took in 0002 to use a
> > query-lifetime tuplestore to record root parent tuples.  I'm trying to
> > find a way to make the tuplestore transaction-lifetime so that the
> > patch still works.
> >
> > In the meantime, I'm attaching an updated set with 0001 changed per
> > your comments.
>
> 0001 patch conflicts with 71f4c8c6f74. Could you please rebase the patchset?

Thanks for the heads up.

I still don't have a working patch to address the above mentioned
shortcoming of the previous approach, but here is a rebased version in
the meantime.


--
Amit Langote
EDB: http://www.enterprisedb.com


@Amit patch is not successfully applying, can you please rebase that. 

Masahiko Sawada, it's been a bit long since you reviewed the patch, are you still interested to review that? 

--
Ibrar Ahmed

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi Ibrar, Sawada-san,

On Tue, Jul 13, 2021 at 20:25 Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:


On Fri, Apr 2, 2021 at 6:09 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, Apr 1, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Mar 23, 2021 at 6:27 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Actually, I found a big hole in my assumptions around deferrable
> > foreign constraints, invalidating the approach I took in 0002 to use a
> > query-lifetime tuplestore to record root parent tuples.  I'm trying to
> > find a way to make the tuplestore transaction-lifetime so that the
> > patch still works.
> >
> > In the meantime, I'm attaching an updated set with 0001 changed per
> > your comments.
>
> 0001 patch conflicts with 71f4c8c6f74. Could you please rebase the patchset?

Thanks for the heads up.

I still don't have a working patch to address the above mentioned
shortcoming of the previous approach, but here is a rebased version in
the meantime.


--
Amit Langote
EDB: http://www.enterprisedb.com


@Amit patch is not successfully applying, can you please rebase that. 

Thanks for the reminder.

Masahiko Sawada, it's been a bit long since you reviewed the patch, are you still interested to review that? 

Unfortunately, I don’t think I’ll have time in this CF to solve some very fundamental issues I found in the patch during the last cycle.  I’m fine with either marking this as RwF for now or move to the next CF.
--

Re: a misbehavior of partition row movement (?)

From
Andrew Dunstan
Date:
On 7/13/21 8:09 AM, Amit Langote wrote:
>
>
>
>     @Amit patch is not successfully applying, can you please rebase that. 
>
>
> Thanks for the reminder.
>
>     Masahiko Sawada, it's been a bit long since you reviewed the
>     patch, are you still interested to review that? 
>
>
> Unfortunately, I don’t think I’ll have time in this CF to solve some
> very fundamental issues I found in the patch during the last cycle. 
> I’m fine with either marking this as RwF for now or move to the next CF.


Amit, do you have time now to work on this?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi Andrew,

On Fri, Sep 3, 2021 at 6:19 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 7/13/21 8:09 AM, Amit Langote wrote:
> > Unfortunately, I don’t think I’ll have time in this CF to solve some
> > very fundamental issues I found in the patch during the last cycle.
> > I’m fine with either marking this as RwF for now or move to the next CF.
>
> Amit, do you have time now to work on this?

I will take some time next week to take a fresh look at this and post an update.

Thank you.

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Fri, Sep 3, 2021 at 12:23 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Andrew,
>
> On Fri, Sep 3, 2021 at 6:19 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > On 7/13/21 8:09 AM, Amit Langote wrote:
> > > Unfortunately, I don’t think I’ll have time in this CF to solve some
> > > very fundamental issues I found in the patch during the last cycle.
> > > I’m fine with either marking this as RwF for now or move to the next CF.
> >
> > Amit, do you have time now to work on this?
>
> I will take some time next week to take a fresh look at this and post an update.

So I started looking at this today.  I didn't make much an inroad into
the stumbling block with 0002 patch that I had mentioned back in [1],
though I decided to at least post a rebased version of the patches
that apply.

I think 0001 is independently committable on its own merits,
irrespective of the yet unresolved problems of 0002, a patch to fix
$subject, which I'll continue to work on.

0003 shows a test that crashes the server due to said problem.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqHMpNZOc2Z-zgdO9hbJ7wMCOC%3DWpJYszVusZ%3DoE2OTf8w%40mail.gmail.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Zhihong Yu
Date:


On Fri, Sep 10, 2021 at 7:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Sep 3, 2021 at 12:23 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Andrew,
>
> On Fri, Sep 3, 2021 at 6:19 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > On 7/13/21 8:09 AM, Amit Langote wrote:
> > > Unfortunately, I don’t think I’ll have time in this CF to solve some
> > > very fundamental issues I found in the patch during the last cycle.
> > > I’m fine with either marking this as RwF for now or move to the next CF.
> >
> > Amit, do you have time now to work on this?
>
> I will take some time next week to take a fresh look at this and post an update.

So I started looking at this today.  I didn't make much an inroad into
the stumbling block with 0002 patch that I had mentioned back in [1],
though I decided to at least post a rebased version of the patches
that apply.

I think 0001 is independently committable on its own merits,
irrespective of the yet unresolved problems of 0002, a patch to fix
$subject, which I'll continue to work on.

0003 shows a test that crashes the server due to said problem.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqHMpNZOc2Z-zgdO9hbJ7wMCOC%3DWpJYszVusZ%3DoE2OTf8w%40mail.gmail.com
Hi,
For patch 0001, GetForeignKeyActionTriggers:

+   if (!OidIsValid(*deleteTriggerOid) || !OidIsValid(*updateTriggerOid))
+       elog(ERROR, "could not find action triggers of foreign key constraint %u", 

I think if the error message includes whether it is the delete or update trigger that isn't found, it would be helpful.

Similar comment for error message in GetForeignKeyCheckTriggers().

Cheers

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Fri, Sep 10, 2021 at 11:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Sep 3, 2021 at 12:23 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Hi Andrew,
> >
> > On Fri, Sep 3, 2021 at 6:19 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > > On 7/13/21 8:09 AM, Amit Langote wrote:
> > > > Unfortunately, I don’t think I’ll have time in this CF to solve some
> > > > very fundamental issues I found in the patch during the last cycle.
> > > > I’m fine with either marking this as RwF for now or move to the next CF.
> > >
> > > Amit, do you have time now to work on this?
> >
> > I will take some time next week to take a fresh look at this and post an update.
>
> So I started looking at this today.  I didn't make much an inroad into
> the stumbling block with 0002 patch that I had mentioned back in [1],
> though I decided to at least post a rebased version of the patches
> that apply.
>
> I think 0001 is independently committable on its own merits,
> irrespective of the yet unresolved problems of 0002, a patch to fix
> $subject, which I'll continue to work on.
>
> 0003 shows a test that crashes the server due to said problem.

I think I found a solution to the problem with 0002.

The problem was that the tuplestore
(afterTriggers.query_stack[query_level].tuplestore) that I decided to
use to store the AFTER trigger tuples of a partitioned table that is
the target of an cross-partition update lives only for the duration of
a given query.  So that approach wouldn't work if the foreign key
pointing into that partitioned table is marked INITIALLY DEFERRED.  To
fix, I added a List field to AfterTriggersData that stores the
tuplestores to store the tuples of partitioned tables that undergo
cross-partition updates in a transaction and are pointed to by
INITIALLY DEFERRED foreign key constraints.  I couldn't understand one
comment about why using a tuplestore for such cases *might not* work,
which as follows:

 * Note that we need triggers on foreign tables to be fired in exactly the
 * order they were queued, so that the tuples come out of the tuplestore in
 * the right order.  To ensure that, we forbid deferrable (constraint)
 * triggers on foreign tables.  This also ensures that such triggers do not
 * get deferred into outer trigger query levels, meaning that it's okay to
 * destroy the tuplestore at the end of the query level.

I tried to break the approach using various test cases (some can be
seen added by the patch to foreign_key.sql), but couldn't see the
issue alluded to in the above comment.  So I've marked the comment
with an XXX note as follows:

- * Note that we need triggers on foreign tables to be fired in exactly the
- * order they were queued, so that the tuples come out of the tuplestore in
- * the right order.  To ensure that, we forbid deferrable (constraint)
- * triggers on foreign tables.  This also ensures that such triggers do not
- * get deferred into outer trigger query levels, meaning that it's okay to
- * destroy the tuplestore at the end of the query level.
+ * Note that we need triggers on foreign and partitioned tables to be fired in
+ * exactly the order they were queued, so that the tuples come out of the
+ * tuplestore in the right order.  To ensure that, we forbid deferrable
+ * (constraint) triggers on foreign tables.  This also ensures that such
+ * triggers do not get deferred into outer trigger query levels, meaning that
+ * it's okay to destroy the tuplestore at the end of the query level.
+ * XXX - update this paragraph if the new approach, whereby tuplestores in
+ * afterTriggers.deferred_tuplestores outlive any given query, can be proven
+ * to not really break any assumptions mentioned here.

If anyone reading this can think of the issue the original comment
seems to be talking about, please let me know.

I've attached updated patches.  I've addressed Zhihong Yu's comment too.

Thank you.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Mon, Sep 20, 2021 at 3:32 PM Amit Langote <amitlangote09@gmail.com> wrote:
> The problem was that the tuplestore
> (afterTriggers.query_stack[query_level].tuplestore) that I decided to
> use to store the AFTER trigger tuples of a partitioned table that is
> the target of an cross-partition update lives only for the duration of
> a given query.  So that approach wouldn't work if the foreign key
> pointing into that partitioned table is marked INITIALLY DEFERRED.  To
> fix, I added a List field to AfterTriggersData that stores the
> tuplestores to store the tuples of partitioned tables that undergo
> cross-partition updates in a transaction and are pointed to by
> INITIALLY DEFERRED foreign key constraints.  I couldn't understand one
> comment about why using a tuplestore for such cases *might not* work,
> which as follows:
>
>  * Note that we need triggers on foreign tables to be fired in exactly the
>  * order they were queued, so that the tuples come out of the tuplestore in
>  * the right order.  To ensure that, we forbid deferrable (constraint)
>  * triggers on foreign tables.  This also ensures that such triggers do not
>  * get deferred into outer trigger query levels, meaning that it's okay to
>  * destroy the tuplestore at the end of the query level.
>
> I tried to break the approach using various test cases (some can be
> seen added by the patch to foreign_key.sql), but couldn't see the
> issue alluded to in the above comment.  So I've marked the comment
> with an XXX note as follows:
>
> - * Note that we need triggers on foreign tables to be fired in exactly the
> - * order they were queued, so that the tuples come out of the tuplestore in
> - * the right order.  To ensure that, we forbid deferrable (constraint)
> - * triggers on foreign tables.  This also ensures that such triggers do not
> - * get deferred into outer trigger query levels, meaning that it's okay to
> - * destroy the tuplestore at the end of the query level.
> + * Note that we need triggers on foreign and partitioned tables to be fired in
> + * exactly the order they were queued, so that the tuples come out of the
> + * tuplestore in the right order.  To ensure that, we forbid deferrable
> + * (constraint) triggers on foreign tables.  This also ensures that such
> + * triggers do not get deferred into outer trigger query levels, meaning that
> + * it's okay to destroy the tuplestore at the end of the query level.
> + * XXX - update this paragraph if the new approach, whereby tuplestores in
> + * afterTriggers.deferred_tuplestores outlive any given query, can be proven
> + * to not really break any assumptions mentioned here.
>
> If anyone reading this can think of the issue the original comment
> seems to be talking about, please let me know.

I brought this up in an off-list discussion with Robert and he asked
why I hadn't considered not using tuplestores to remember the tuples
in the first place, specifically pointing out that it may lead to
unnecessarily consuming a lot of memory when such updates move a bunch
of rows around.  Like, we could simply remember the tuples to be
passed to the trigger function using their CTIDs as is done for normal
(single-heap-relation) updates, though in this case also remember the
OIDs of the source and the destination partitions of a particular
cross-partition update.

I had considered that idea before but I think I had overestimated the
complexity of that approach so didn't go with it.  I tried and the
resulting patch doesn't look all that complicated, with the added
bonus that the bulk update case shouldn't consume so much memory with
this approach like the previous tuplestore version would have.

Updated patches attached.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Thu, Oct 14, 2021 at 6:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Sep 20, 2021 at 3:32 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > The problem was that the tuplestore
> > (afterTriggers.query_stack[query_level].tuplestore) that I decided to
> > use to store the AFTER trigger tuples of a partitioned table that is
> > the target of an cross-partition update lives only for the duration of
> > a given query.  So that approach wouldn't work if the foreign key
> > pointing into that partitioned table is marked INITIALLY DEFERRED.  To
> > fix, I added a List field to AfterTriggersData that stores the
> > tuplestores to store the tuples of partitioned tables that undergo
> > cross-partition updates in a transaction and are pointed to by
> > INITIALLY DEFERRED foreign key constraints.  I couldn't understand one
> > comment about why using a tuplestore for such cases *might not* work,
> > which as follows:
> >
> >  * Note that we need triggers on foreign tables to be fired in exactly the
> >  * order they were queued, so that the tuples come out of the tuplestore in
> >  * the right order.  To ensure that, we forbid deferrable (constraint)
> >  * triggers on foreign tables.  This also ensures that such triggers do not
> >  * get deferred into outer trigger query levels, meaning that it's okay to
> >  * destroy the tuplestore at the end of the query level.
> >
> > I tried to break the approach using various test cases (some can be
> > seen added by the patch to foreign_key.sql), but couldn't see the
> > issue alluded to in the above comment.  So I've marked the comment
> > with an XXX note as follows:
> >
> > - * Note that we need triggers on foreign tables to be fired in exactly the
> > - * order they were queued, so that the tuples come out of the tuplestore in
> > - * the right order.  To ensure that, we forbid deferrable (constraint)
> > - * triggers on foreign tables.  This also ensures that such triggers do not
> > - * get deferred into outer trigger query levels, meaning that it's okay to
> > - * destroy the tuplestore at the end of the query level.
> > + * Note that we need triggers on foreign and partitioned tables to be fired in
> > + * exactly the order they were queued, so that the tuples come out of the
> > + * tuplestore in the right order.  To ensure that, we forbid deferrable
> > + * (constraint) triggers on foreign tables.  This also ensures that such
> > + * triggers do not get deferred into outer trigger query levels, meaning that
> > + * it's okay to destroy the tuplestore at the end of the query level.
> > + * XXX - update this paragraph if the new approach, whereby tuplestores in
> > + * afterTriggers.deferred_tuplestores outlive any given query, can be proven
> > + * to not really break any assumptions mentioned here.
> >
> > If anyone reading this can think of the issue the original comment
> > seems to be talking about, please let me know.
>
> I brought this up in an off-list discussion with Robert and he asked
> why I hadn't considered not using tuplestores to remember the tuples
> in the first place, specifically pointing out that it may lead to
> unnecessarily consuming a lot of memory when such updates move a bunch
> of rows around.  Like, we could simply remember the tuples to be
> passed to the trigger function using their CTIDs as is done for normal
> (single-heap-relation) updates, though in this case also remember the
> OIDs of the source and the destination partitions of a particular
> cross-partition update.
>
> I had considered that idea before but I think I had overestimated the
> complexity of that approach so didn't go with it.  I tried and the
> resulting patch doesn't look all that complicated, with the added
> bonus that the bulk update case shouldn't consume so much memory with
> this approach like the previous tuplestore version would have.
>
> Updated patches attached.

Patch 0001 conflicted with some pg_dump changes that were recently
committed, so rebased.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
Pushed 0001.

I had to adjust the query used in pg_dump; you changed the attribute
name in the query used for pg15, but left unchanged the one for older
branches, so pg_dump failed for all branches other than 15.  Also,
psql's describe.c required a small tweak to a version number test.
https://github.com/alvherre/postgres/commit/3451612e0fa082d3ec953551c6d25432bd725502

Thanks!  What was 0002 is attached, to keep cfbot happy.  It's identical
to your v11-0002.

I have pushed it thinking that we would not backpatch any of this fix.
However, after running the tests and realizing that I didn't need an
initdb for either patch, I wonder if maybe the whole series *is*
backpatchable.

There is one possible problem, which is that psql and pg_dump would need
testing to verify that they work decently (i.e. no crash, no
misbehavior) with partitioned tables created with the original code.
But there are few ABI changes, maybe we can cope and get all branches
fixes instead of just 15.

What do you think?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi Alvaro,

On Thu, Jan 6, 2022 at 7:27 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Pushed 0001.

Thank you.

> I had to adjust the query used in pg_dump; you changed the attribute
> name in the query used for pg15, but left unchanged the one for older
> branches, so pg_dump failed for all branches other than 15.

Oops, should've tested that.

>  Also,
> psql's describe.c required a small tweak to a version number test.

Ah, yes -- 15, not 14 anymore.

> https://github.com/alvherre/postgres/commit/3451612e0fa082d3ec953551c6d25432bd725502
>
> Thanks!  What was 0002 is attached, to keep cfbot happy.  It's identical
> to your v11-0002.
>
> I have pushed it thinking that we would not backpatch any of this fix.
> However, after running the tests and realizing that I didn't need an
> initdb for either patch, I wonder if maybe the whole series *is*
> backpatchable.

Interesting thought.

We do lack help from trigger.c in the v12 branch in that there's no
Trigger.tgisclone, which is used in a couple of places in the fix.  I
haven't checked how big of a deal it would be to back-port
Trigger.tgisclone to v12, but maybe that's doable.

> There is one possible problem, which is that psql and pg_dump would need
> testing to verify that they work decently (i.e. no crash, no
> misbehavior) with partitioned tables created with the original code.

I suppose you mean checking if the psql and pg_dump after applying
*0001* work sanely with partitioned tables defined without 0001?

Will test that.

> But there are few ABI changes, maybe we can cope and get all branches
> fixes instead of just 15.
>
> What do you think?

Yeah, as long as triggers are configured as required by the fix, and
that would be ensured if we're able to back-patch 0001 down to v12, I
suppose it would be nice to get this fixed down to v12.

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-06, Amit Langote wrote:

> On Thu, Jan 6, 2022 at 7:27 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > I have pushed it thinking that we would not backpatch any of this fix.
> > However, after running the tests and realizing that I didn't need an
> > initdb for either patch, I wonder if maybe the whole series *is*
> > backpatchable.
> 
> We do lack help from trigger.c in the v12 branch in that there's no
> Trigger.tgisclone, which is used in a couple of places in the fix.  I
> haven't checked how big of a deal it would be to back-port
> Trigger.tgisclone to v12, but maybe that's doable.

Yeah, I realized afterwards that we added tgparentid in 13 only
(b9b408c48), so we should only backpatch to that.

> > There is one possible problem, which is that psql and pg_dump would need
> > testing to verify that they work decently (i.e. no crash, no
> > misbehavior) with partitioned tables created with the original code.
> 
> I suppose you mean checking if the psql and pg_dump after applying
> *0001* work sanely with partitioned tables defined without 0001?

Yes.

> Will test that.

I looked at the backpatch at the last minute yesterday.  The tablecmds.c
conflict is easy to resolve, but the one in pg_dump.c is a giant
conflict zone that I didn't have time to look closely :-(

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Thu, Jan 6, 2022 at 9:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jan-06, Amit Langote wrote:
> > On Thu, Jan 6, 2022 at 7:27 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > > I have pushed it thinking that we would not backpatch any of this fix.
> > > However, after running the tests and realizing that I didn't need an
> > > initdb for either patch, I wonder if maybe the whole series *is*
> > > backpatchable.
> >
> > We do lack help from trigger.c in the v12 branch in that there's no
> > Trigger.tgisclone, which is used in a couple of places in the fix.  I
> > haven't checked how big of a deal it would be to back-port
> > Trigger.tgisclone to v12, but maybe that's doable.
>
> Yeah, I realized afterwards that we added tgparentid in 13 only
> (b9b408c48), so we should only backpatch to that.
>
> > > There is one possible problem, which is that psql and pg_dump would need
> > > testing to verify that they work decently (i.e. no crash, no
> > > misbehavior) with partitioned tables created with the original code.
> >
> > I suppose you mean checking if the psql and pg_dump after applying
> > *0001* work sanely with partitioned tables defined without 0001?
>
> Yes.
>
> > Will test that.
>
> I looked at the backpatch at the last minute yesterday.  The tablecmds.c
> conflict is easy to resolve, but the one in pg_dump.c is a giant
> conflict zone that I didn't have time to look closely :-(

I think I've managed to apply f4566345cf40b into v13 and v14.  Patches attached.

Patched pg_dump seems to work with existing partition triggers and
psql too with some changes to the condition to decide whether to show
tgisinternal triggers when describing a partition.

As for the fix to make cross-partition updates work correctly with
foreign keys, I just realized it won't work for the users' existing
foreign keys, because the parent table's triggers that are needed for
the fix to work would not be present.  Were you thinking that we'd ask
users of v13 and v14 to drop and recreate those constraints?

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-11, Amit Langote wrote:

> As for the fix to make cross-partition updates work correctly with
> foreign keys, I just realized it won't work for the users' existing
> foreign keys, because the parent table's triggers that are needed for
> the fix to work would not be present.  Were you thinking that we'd ask
> users of v13 and v14 to drop and recreate those constraints?

Yeah, more or less.  Also, any tables created from 13.6 onwards.

I was mainly thinking that we'll still have people creating new clusters
using pg13 for half a decade.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended."  (Gerry Pourwelle)



Re: a misbehavior of partition row movement (?)

From
Julien Rouhaud
Date:
Hi,

On Tue, Jan 11, 2022 at 05:08:59PM +0900, Amit Langote wrote:
> 
> I think I've managed to apply f4566345cf40b into v13 and v14.  Patches attached.
> 

FTR this doesn't play well with the cfbot unfortunately as it tries to apply
both patches, and obviously on the wrong branches anyway.

It means that the previous-0002-now-0001 patch that Álvaro previously sent
(https://www.postgresql.org/message-id/202201052227.bc4yvvy6lqpb%40alvherre.pgsql)
is not tested anymore, and IIUC it's not pushed yet so it's not ideal.

There's now an official documentation on how to send patches that should be
ignored by the cfbot [1], so sending backpatch versions with a .txt extension
could be useful.  Just in case I'm attaching the pending patch to this mail to
make the cfbot happy again.

[1] https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Thu, Jan 13, 2022 at 12:19 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Tue, Jan 11, 2022 at 05:08:59PM +0900, Amit Langote wrote:
> >
> > I think I've managed to apply f4566345cf40b into v13 and v14.  Patches attached.
> >
>
> FTR this doesn't play well with the cfbot unfortunately as it tries to apply
> both patches, and obviously on the wrong branches anyway.

Oops, that's right.  Thanks for the heads up.

> It means that the previous-0002-now-0001 patch that Álvaro previously sent
> (https://www.postgresql.org/message-id/202201052227.bc4yvvy6lqpb%40alvherre.pgsql)
> is not tested anymore, and IIUC it's not pushed yet so it's not ideal.

Agreed.

> There's now an official documentation on how to send patches that should be
> ignored by the cfbot [1], so sending backpatch versions with a .txt extension
> could be useful.  Just in case I'm attaching the pending patch to this mail to
> make the cfbot happy again.

Thanks and sorry I wasn't aware of the rule about sending back patch versions.

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Tue, Jan 11, 2022 at 8:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jan-11, Amit Langote wrote:
> > As for the fix to make cross-partition updates work correctly with
> > foreign keys, I just realized it won't work for the users' existing
> > foreign keys, because the parent table's triggers that are needed for
> > the fix to work would not be present.  Were you thinking that we'd ask
> > users of v13 and v14 to drop and recreate those constraints?
>
> Yeah, more or less.  Also, any tables created from 13.6 onwards.
>
> I was mainly thinking that we'll still have people creating new clusters
> using pg13 for half a decade.

Okay, I created versions of the patch series for branches 13 and 14
(.txt files).  The one for HEAD is also re-attached.

Note that the fix involves adding fields to ResultRelInfo -- v13 needs
2 additional, while v14 and HEAD need 1.  That combined with needing
new catalog entries for parent FK triggers, back-patching this does
make me a bit uncomfortable.  Another thing to consider is that we
haven't seen many reports of the problem (UPDATEs of partitioned PK
tables causing DELETEs in referencing tables), even though it can be
possibly very surprising to those who do run into it.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-17, Amit Langote wrote:

> Note that the fix involves adding fields to ResultRelInfo -- v13 needs
> 2 additional, while v14 and HEAD need 1.  That combined with needing
> new catalog entries for parent FK triggers, back-patching this does
> make me a bit uncomfortable.

Yeah, that's a good point, so I ran abidiff on the binaries in branch 13
to have some data on it.  The report does indeed have a lot of noise
about those three added members in struct ResultRelInfo; but as far as I
can see in the report, there is no ABI affected because of these
changes.

However, the ones that caught my eye next were the ABI changes for
ExecGetTriggerResultRel() and ExecAR{Delete,Update}Triggers().  These seem more
significant, if any external code is calling these.  Now, while I think
we could dodge that (at least part of it) by having a shim for
AfterTriggerSaveEvent that passes a NULL mtstate, and takes the
assumption that there is no row partition migration when that happens
... that seems like treading in dangerous territory: we would have 
code that would behave differently for an extension that was compiled
with an earlier copy of the backend.

So I see two options.  One is to introduce the aforementioned shim, but
instead of making any assumptions, we cause the shim raise an error:
"your extension is outdated, please recompile with the new postgres
version".  However, that seems even more harmful, because production
systems that auto-update to the next Postgres version would start to
fail.

The other is suggested by you:

> Another thing to consider is that we haven't seen many reports of the
> problem (UPDATEs of partitioned PK tables causing DELETEs in
> referencing tables), even though it can be possibly very surprising to
> those who do run into it.

Do nothing in the old branches.


Another thing I saw which surprised me very much is this bit, which I
think must be a bug in abidiff:

                                type of 'EPQState* EState::es_epq_active' changed:
                                  in pointed to type 'struct EPQState' at execnodes.h:1104:1:
                                    type size hasn't changed
                                    1 data member changes (3 filtered):
                                      type of 'PlanState* EPQState::recheckplanstate' changed:
                                        in pointed to type 'struct PlanState' at execnodes.h:1056:1:
                                          entity changed from 'struct PlanState' to compatible type 'typedef PlanState'
atexecnodes.h:1056:1
 


-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La grandeza es una experiencia transitoria.  Nunca es consistente.
Depende en gran parte de la imaginación humana creadora de mitos"
(Irulan)



Re: a misbehavior of partition row movement (?)

From
Zhihong Yu
Date:


On Mon, Jan 17, 2022 at 6:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Jan-17, Amit Langote wrote:

> Note that the fix involves adding fields to ResultRelInfo -- v13 needs
> 2 additional, while v14 and HEAD need 1.  That combined with needing
> new catalog entries for parent FK triggers, back-patching this does
> make me a bit uncomfortable.

Yeah, that's a good point, so I ran abidiff on the binaries in branch 13
to have some data on it.  The report does indeed have a lot of noise
about those three added members in struct ResultRelInfo; but as far as I
can see in the report, there is no ABI affected because of these
changes.

However, the ones that caught my eye next were the ABI changes for
ExecGetTriggerResultRel() and ExecAR{Delete,Update}Triggers().  These seem more
significant, if any external code is calling these.  Now, while I think
we could dodge that (at least part of it) by having a shim for
AfterTriggerSaveEvent that passes a NULL mtstate, and takes the
assumption that there is no row partition migration when that happens
... that seems like treading in dangerous territory: we would have
code that would behave differently for an extension that was compiled
with an earlier copy of the backend.

So I see two options.  One is to introduce the aforementioned shim, but
instead of making any assumptions, we cause the shim raise an error:
"your extension is outdated, please recompile with the new postgres
version".  However, that seems even more harmful, because production
systems that auto-update to the next Postgres version would start to
fail.

The other is suggested by you:

> Another thing to consider is that we haven't seen many reports of the
> problem (UPDATEs of partitioned PK tables causing DELETEs in
> referencing tables), even though it can be possibly very surprising to
> those who do run into it.

Do nothing in the old branches.


Another thing I saw which surprised me very much is this bit, which I
think must be a bug in abidiff:

                                type of 'EPQState* EState::es_epq_active' changed:
                                  in pointed to type 'struct EPQState' at execnodes.h:1104:1:
                                    type size hasn't changed
                                    1 data member changes (3 filtered):
                                      type of 'PlanState* EPQState::recheckplanstate' changed:
                                        in pointed to type 'struct PlanState' at execnodes.h:1056:1:
                                          entity changed from 'struct PlanState' to compatible type 'typedef PlanState' at execnodes.h:1056:1

Hi,
I think option 2, not backpatching, is more desirable at this stage.

If, complaints from the field arise in the future, we can consider backpatching.

Cheers

Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-17, Zhihong Yu wrote:

> On Mon, Jan 17, 2022 at 6:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
> wrote:

> > On 2022-Jan-17, Amit Langote wrote:

> > The other is suggested by you:
> >
> > > Another thing to consider is that we haven't seen many reports of the
> > > problem (UPDATEs of partitioned PK tables causing DELETEs in
> > > referencing tables), even though it can be possibly very surprising to
> > > those who do run into it.
> >
> > Do nothing in the old branches.

> I think option 2, not backpatching, is more desirable at this stage.

Preliminarly, I tend to agree.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
> @@ -3398,7 +3432,7 @@ typedef SetConstraintStateData *SetConstraintState;
>   */
>  typedef uint32 TriggerFlags;
>  
> -#define AFTER_TRIGGER_OFFSET            0x0FFFFFFF    /* must be low-order bits */
> +#define AFTER_TRIGGER_OFFSET            0x07FFFFFF    /* must be low-order bits */
>  #define AFTER_TRIGGER_DONE                0x10000000
>  #define AFTER_TRIGGER_IN_PROGRESS        0x20000000
>  /* bits describing the size and tuple sources of this event */
> @@ -3406,7 +3440,8 @@ typedef uint32 TriggerFlags;
>  #define AFTER_TRIGGER_FDW_FETCH            0x80000000
>  #define AFTER_TRIGGER_1CTID                0x40000000
>  #define AFTER_TRIGGER_2CTID                0xC0000000
> -#define AFTER_TRIGGER_TUP_BITS            0xC0000000
> +#define AFTER_TRIGGER_CP_UPDATE            0x08000000
> +#define AFTER_TRIGGER_TUP_BITS            0xC8000000

So this patch releases one bit from AFTER_TRIGGER_OFFSET and makes it
become AFTER_TRIGGER_CP_UPDATE.  As far as I can tell there is no harm
in doing so.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/



Re: a misbehavior of partition row movement (?)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> So this patch releases one bit from AFTER_TRIGGER_OFFSET and makes it
> become AFTER_TRIGGER_CP_UPDATE.  As far as I can tell there is no harm
> in doing so.

I agree that taking a bit away from AFTER_TRIGGER_OFFSET is okay
(it could spare even a couple more, if we need them).

But could we please do it in a way that is designed to keep the
code readable, rather than to minimize the number of lines of diff?
It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not
be adjacent.  So what should happen here is to renumber the symbols
in between to move their bits over one place.

(Since this data is only known within trigger.c, I don't even see
an ABI-stability argument for not changing these assignments.)

            regards, tom lane



Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-17, Tom Lane wrote:

> But could we please do it in a way that is designed to keep the
> code readable, rather than to minimize the number of lines of diff?
> It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not
> be adjacent.  So what should happen here is to renumber the symbols
> in between to move their bits over one place.

Is it typical to enumerate bits starting from the right of each byte,
when doing it from the high bits of the word?  DONE and IN_PROGRESS have
been defined as 0x1 and 0x2 of that byte for a very long time and I
found that very strange.  I am inclined to count from the left, so I'd
pick 8 first, defining the set like this:

#define AFTER_TRIGGER_OFFSET            0x07FFFFFF    /* must be low-order bits */
#define AFTER_TRIGGER_DONE            0x80000000
#define AFTER_TRIGGER_IN_PROGRESS        0x40000000
/* bits describing the size and tuple sources of this event */
#define AFTER_TRIGGER_FDW_REUSE            0x00000000
#define AFTER_TRIGGER_FDW_FETCH            0x20000000
#define AFTER_TRIGGER_1CTID            0x10000000
#define AFTER_TRIGGER_2CTID            0x30000000
#define AFTER_TRIGGER_CP_UPDATE            0x08000000
#define AFTER_TRIGGER_TUP_BITS            0x38000000

(The fact that FDW_REUSE bits is actually an empty mask comes from
7cbe57c34dec, specifically [1])

Is this what you were thinking?

[1] https://postgr.es/m/20140306033644.GA3527902@tornado.leadboat.com

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
                                 (George Orwell's The Lord of the Rings)



Re: a misbehavior of partition row movement (?)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2022-Jan-17, Tom Lane wrote:
>> It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not
>> be adjacent.  So what should happen here is to renumber the symbols
>> in between to move their bits over one place.

> Is it typical to enumerate bits starting from the right of each byte,
> when doing it from the high bits of the word?  DONE and IN_PROGRESS have
> been defined as 0x1 and 0x2 of that byte for a very long time and I
> found that very strange.  I am inclined to count from the left, so I'd
> pick 8 first, defining the set like this:

Doesn't matter to me either way, as long as the values look like they
were all defined by the same person ;-)

> (The fact that FDW_REUSE bits is actually an empty mask comes from
> 7cbe57c34dec, specifically [1])

That seemed a bit odd to me too, but this is not the patch to be
changing it in, I suppose.

            regards, tom lane



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Tue, Jan 18, 2022 at 7:15 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jan-17, Tom Lane wrote:
> > But could we please do it in a way that is designed to keep the
> > code readable, rather than to minimize the number of lines of diff?
> > It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not
> > be adjacent.  So what should happen here is to renumber the symbols
> > in between to move their bits over one place.
>
> Is it typical to enumerate bits starting from the right of each byte,
> when doing it from the high bits of the word?  DONE and IN_PROGRESS have
> been defined as 0x1 and 0x2 of that byte for a very long time and I
> found that very strange.  I am inclined to count from the left, so I'd
> pick 8 first, defining the set like this:
>
> #define AFTER_TRIGGER_OFFSET                    0x07FFFFFF      /* must be low-order bits */
> #define AFTER_TRIGGER_DONE                      0x80000000
> #define AFTER_TRIGGER_IN_PROGRESS               0x40000000
> /* bits describing the size and tuple sources of this event */
> #define AFTER_TRIGGER_FDW_REUSE                 0x00000000
> #define AFTER_TRIGGER_FDW_FETCH                 0x20000000
> #define AFTER_TRIGGER_1CTID                     0x10000000
> #define AFTER_TRIGGER_2CTID                     0x30000000
> #define AFTER_TRIGGER_CP_UPDATE                 0x08000000
> #define AFTER_TRIGGER_TUP_BITS                  0x38000000

Agree that the ultimate code readability trumps diff minimalism. :-)

Would you like me to update the patch with the above renumbering or
are you working on a new version yourself?

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Julien Rouhaud
Date:
Hi,

On Mon, Jan 17, 2022 at 08:40:54PM +0900, Amit Langote wrote:
> 
> Okay, I created versions of the patch series for branches 13 and 14
> (.txt files).  The one for HEAD is also re-attached.

FYI The patch failed today on FreeBSD, while it was previously quite stable on
all platforms (https://cirrus-ci.com/build/4551468081479680), with this error:

https://api.cirrus-ci.com/v1/artifact/task/6360787076775936/regress_diffs/src/test/recovery/tmp_check/regression.diffs
diff -U3 /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/reloptions.out
/tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/reloptions.out
--- /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/reloptions.out    2022-01-18 00:12:52.533086000 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/reloptions.out    2022-01-18 00:28:00.000524000 +0000
@@ -133,7 +133,7 @@
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 ----------
- t
+ f
 (1 row)


I'm not sure why this test failed as it doesn't seem like something impacted by
the patch, but I may have missed something as I only had a quick look at the
patch and discussion.



Re: a misbehavior of partition row movement (?)

From
Michael Paquier
Date:
On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote:
> I'm not sure why this test failed as it doesn't seem like something impacted by
> the patch, but I may have missed something as I only had a quick look at the
> patch and discussion.

This issue is discussed here:
https://www.postgresql.org/message-id/20220117203746.oj43254j5jurbneu@alap3.anarazel.de
--
Michael

Attachment

Re: a misbehavior of partition row movement (?)

From
Julien Rouhaud
Date:
Hi,

On Tue, Jan 18, 2022 at 02:33:39PM +0900, Michael Paquier wrote:
> On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote:
> > I'm not sure why this test failed as it doesn't seem like something impacted by
> > the patch, but I may have missed something as I only had a quick look at the
> > patch and discussion.
> 
> This issue is discussed here:
> https://www.postgresql.org/message-id/20220117203746.oj43254j5jurbneu@alap3.anarazel.de

Oh I missed it, thanks!  Sorry for the noise.



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Tue, Jan 18, 2022 at 2:41 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Tue, Jan 18, 2022 at 02:33:39PM +0900, Michael Paquier wrote:
> > On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote:
> > > I'm not sure why this test failed as it doesn't seem like something impacted by
> > > the patch, but I may have missed something as I only had a quick look at the
> > > patch and discussion.
> >
> > This issue is discussed here:
> > https://www.postgresql.org/message-id/20220117203746.oj43254j5jurbneu@alap3.anarazel.de
>
> Oh I missed it, thanks!  Sorry for the noise.

Thanks, it had puzzled me too when I first saw it this morning.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-18, Amit Langote wrote:

> Would you like me to update the patch with the above renumbering or
> are you working on a new version yourself?

I have a few very minor changes apart from that.  Let me see if I can
get this pushed soon; if I'm unable to (there are parts I don't fully
grok yet), I'll post what I have.

Thank you!

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: a misbehavior of partition row movement (?)

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> @@ -133,7 +133,7 @@
>  SELECT pg_relation_size('reloptions_test') = 0;
>   ?column?
>  ----------
> - t
> + f
>  (1 row)

Some machines have been showing that on HEAD too, so I doubt
it's the fault of this patch.  That reloptions test isn't stable
yet seemingly.

            regards, tom lane



Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-18, Alvaro Herrera wrote:

> On 2022-Jan-18, Amit Langote wrote:
> 
> > Would you like me to update the patch with the above renumbering or
> > are you working on a new version yourself?
> 
> I have a few very minor changes apart from that.  Let me see if I can
> get this pushed soon; if I'm unable to (there are parts I don't fully
> grok yet), I'll post what I have.

Here's v13, a series with your patch as 0001 and a few changes from me;
the bulk is just pgindent, and there are a few stylistic changes and an
unrelated typo fix and I added a couple of comments to your new code.

I don't like the API change to ExecInsert().  You're adding two output
arguments:
- the tuple being inserted.  But surely you have this already, because
it's in the 'slot' argument you passed in. ExecInsert is even careful to
set the ->tts_tableOid argument there.  So ExecInsert's caller doesn't
need to receive the inserted tuple as an argument, it can just read
'slot'.
- the relation to which the tuple was inserted.  Can't this be obtained
by looking at slot->tts_tableOid?  We should be able to use
ExecLookupResultRelByOid() to obtain it, no? (I suppose you may claim
that this is wasteful, but I think this is not a common case anyway and
it's worth keeping ExecInsert()'s API clean for the sake of the 99.99%
of its other calls).

I think the argument definition of ExecCrossPartitionUpdateForeignKey is
a bit messy.  I propose to move mtstate,estate as two first arguments;
IIRC the whole executor does it that way.

AfterTriggerSaveEvent determines maybe_crosspart_update (by looking at
mtstate->operation -- why doesn't it look at 'event' instead?) and later
it determines new_event.ate_flags.  Why can't it use
maybe_crosspart_update to simplify part of that?  Or maybe the other way
around, not sure.  It looks like something could be made simpler there.

Overall, the idea embodied in the patch looks sensible to me.

Thank you,

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Nunca confiaré en un traidor.  Ni siquiera si el traidor lo he creado yo"
(Barón Vladimir Harkonnen)

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Wed, Jan 19, 2022 at 7:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jan-18, Alvaro Herrera wrote:
> > On 2022-Jan-18, Amit Langote wrote:
> >
> > > Would you like me to update the patch with the above renumbering or
> > > are you working on a new version yourself?
> >
> > I have a few very minor changes apart from that.  Let me see if I can
> > get this pushed soon; if I'm unable to (there are parts I don't fully
> > grok yet), I'll post what I have.
>
> Here's v13, a series with your patch as 0001 and a few changes from me;
> the bulk is just pgindent, and there are a few stylistic changes and an
> unrelated typo fix and I added a couple of comments to your new code.

Thank you.

> I don't like the API change to ExecInsert().  You're adding two output
> arguments:
> - the tuple being inserted.  But surely you have this already, because
> it's in the 'slot' argument you passed in. ExecInsert is even careful to
> set the ->tts_tableOid argument there.  So ExecInsert's caller doesn't
> need to receive the inserted tuple as an argument, it can just read
> 'slot'.

Hmm, ExecInsert()'s input slot belongs to either the source partition
or the "root" target relation, the latter due to the following stanza
in ExecCrossPartitionUpdate():

    /*
     * resultRelInfo is one of the per-relation resultRelInfos.  So we should
     * convert the tuple into root's tuple descriptor if needed, since
     * ExecInsert() starts the search from root.
     */
    tupconv_map = ExecGetChildToRootMap(resultRelInfo);
    if (tupconv_map != NULL)
        slot = execute_attr_map_slot(tupconv_map->attrMap,
                                     slot,
                                     mtstate->mt_root_tuple_slot);

Though the slot whose tuple ExecInsert() ultimately inserts may be
destination partition's ri_PartitionTupleSlot due to the following
stanza in it:

    /*
     * If the input result relation is a partitioned table, find the leaf
     * partition to insert the tuple into.
     */
    if (proute)
    {
        ResultRelInfo *partRelInfo;

        slot = ExecPrepareTupleRouting(mtstate, estate, proute,
                                       resultRelInfo, slot,
                                       &partRelInfo);
        resultRelInfo = partRelInfo;
    }

It's not great that ExecInsert()'s existing header comment doesn't
mention that the slot whose tuple is actually inserted may not be the
slot it receives from the caller :-(.

> - the relation to which the tuple was inserted.  Can't this be obtained
> by looking at slot->tts_tableOid?  We should be able to use
> ExecLookupResultRelByOid() to obtain it, no? (I suppose you may claim
> that this is wasteful, but I think this is not a common case anyway and
> it's worth keeping ExecInsert()'s API clean for the sake of the 99.99%
> of its other calls).

ExecLookupResultRelByOid() is only useful when *all* relevant leaf
partitions are present in the ModifyTableState.resultRelInfo array
(due to being present in ModifyTable.resultRelations).  Leaf
partitions that are only initialized by tuple routing (such as
destination partitions of cross-partition updates) are only present in
ModifyTableState.mt_partition_tuple_routing.partitions[] that are not
discoverable by ExecLookupResultRelByOid().

> I think the argument definition of ExecCrossPartitionUpdateForeignKey is
> a bit messy.  I propose to move mtstate,estate as two first arguments;
> IIRC the whole executor does it that way.

Okay, done.

> AfterTriggerSaveEvent determines maybe_crosspart_update (by looking at
> mtstate->operation -- why doesn't it look at 'event' instead?) and later
> it determines new_event.ate_flags.  Why can't it use
> maybe_crosspart_update to simplify part of that?  Or maybe the other way
> around, not sure.  It looks like something could be made simpler there.

Actually, I remember disliking maybe_crosspart_update for sounding a
bit fuzzy and also having to add mtstate to a bunch of trigger.c
interface functions only to calculate that.

I now wonder if passing an is_crosspart_update through
ExecAR{Update|Delete}Triggers() would not be better.  Both
ExecDelete() and ExecCrossPartitionUpdateForeignKey() know for sure if
their ExecAR{Update|Delete}Triggers() invocation is for a
cross-partition update, so better to just pass that information down
to AfterTriggerSaveEvent() than pass 'mtstate' and have the latter
reverse-engineer only a fuzzy guess of whether that's the case.

I like that interface better and have implemented it in the updated version.

I've also merged your changes and made some of my own as mentioned
above to end up with the attached v14.  I'm also attaching a delta
patch over v13 (0001+0002) to easily see the changes I made to get
v14.

BTW, your tweaks patch added this:

+ *     *inserted_tuple is the tuple that's effectively inserted;
+ *     *inserted_destrel is the relation where it was inserted.
+ *     These are only set on success.  FIXME -- see what happens on
the "do nothing" cases.

If by "do nothing cases" you mean INSERT ON CONFLICT ... DO NOTHING,
then I don't think it matters, because the caller in that case would
be ExecModifyTable() which doesn't care about inserted_tuple and
inserted_destrel.

> Overall, the idea embodied in the patch looks sensible to me.

Thanks again for taking time to review this.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Wed, Jan 19, 2022 at 4:13 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jan 19, 2022 at 7:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2022-Jan-18, Alvaro Herrera wrote:
> > > On 2022-Jan-18, Amit Langote wrote:
> > >
> > > > Would you like me to update the patch with the above renumbering or
> > > > are you working on a new version yourself?
> > >
> > > I have a few very minor changes apart from that.  Let me see if I can
> > > get this pushed soon; if I'm unable to (there are parts I don't fully
> > > grok yet), I'll post what I have.
> >
> > Here's v13, a series with your patch as 0001 and a few changes from me;
> > the bulk is just pgindent, and there are a few stylistic changes and an
> > unrelated typo fix and I added a couple of comments to your new code.
>
> Thank you.
>
> > I don't like the API change to ExecInsert().  You're adding two output
> > arguments:
> > - the tuple being inserted.  But surely you have this already, because
> > it's in the 'slot' argument you passed in. ExecInsert is even careful to
> > set the ->tts_tableOid argument there.  So ExecInsert's caller doesn't
> > need to receive the inserted tuple as an argument, it can just read
> > 'slot'.
>
> Hmm, ExecInsert()'s input slot belongs to either the source partition
> or the "root" target relation, the latter due to the following stanza
> in ExecCrossPartitionUpdate():
>
>     /*
>      * resultRelInfo is one of the per-relation resultRelInfos.  So we should
>      * convert the tuple into root's tuple descriptor if needed, since
>      * ExecInsert() starts the search from root.
>      */
>     tupconv_map = ExecGetChildToRootMap(resultRelInfo);
>     if (tupconv_map != NULL)
>         slot = execute_attr_map_slot(tupconv_map->attrMap,
>                                      slot,
>                                      mtstate->mt_root_tuple_slot);
>
> Though the slot whose tuple ExecInsert() ultimately inserts may be
> destination partition's ri_PartitionTupleSlot due to the following
> stanza in it:
>
>     /*
>      * If the input result relation is a partitioned table, find the leaf
>      * partition to insert the tuple into.
>      */
>     if (proute)
>     {
>         ResultRelInfo *partRelInfo;
>
>         slot = ExecPrepareTupleRouting(mtstate, estate, proute,
>                                        resultRelInfo, slot,
>                                        &partRelInfo);
>         resultRelInfo = partRelInfo;
>     }
>
> It's not great that ExecInsert()'s existing header comment doesn't
> mention that the slot whose tuple is actually inserted may not be the
> slot it receives from the caller :-(.
>
> > - the relation to which the tuple was inserted.  Can't this be obtained
> > by looking at slot->tts_tableOid?  We should be able to use
> > ExecLookupResultRelByOid() to obtain it, no? (I suppose you may claim
> > that this is wasteful, but I think this is not a common case anyway and
> > it's worth keeping ExecInsert()'s API clean for the sake of the 99.99%
> > of its other calls).
>
> ExecLookupResultRelByOid() is only useful when *all* relevant leaf
> partitions are present in the ModifyTableState.resultRelInfo array
> (due to being present in ModifyTable.resultRelations).  Leaf
> partitions that are only initialized by tuple routing (such as
> destination partitions of cross-partition updates) are only present in
> ModifyTableState.mt_partition_tuple_routing.partitions[] that are not
> discoverable by ExecLookupResultRelByOid().
>
> > I think the argument definition of ExecCrossPartitionUpdateForeignKey is
> > a bit messy.  I propose to move mtstate,estate as two first arguments;
> > IIRC the whole executor does it that way.
>
> Okay, done.
>
> > AfterTriggerSaveEvent determines maybe_crosspart_update (by looking at
> > mtstate->operation -- why doesn't it look at 'event' instead?) and later
> > it determines new_event.ate_flags.  Why can't it use
> > maybe_crosspart_update to simplify part of that?  Or maybe the other way
> > around, not sure.  It looks like something could be made simpler there.
>
> Actually, I remember disliking maybe_crosspart_update for sounding a
> bit fuzzy and also having to add mtstate to a bunch of trigger.c
> interface functions only to calculate that.
>
> I now wonder if passing an is_crosspart_update through
> ExecAR{Update|Delete}Triggers() would not be better.  Both
> ExecDelete() and ExecCrossPartitionUpdateForeignKey() know for sure if
> their ExecAR{Update|Delete}Triggers() invocation is for a
> cross-partition update, so better to just pass that information down
> to AfterTriggerSaveEvent() than pass 'mtstate' and have the latter
> reverse-engineer only a fuzzy guess of whether that's the case.
>
> I like that interface better and have implemented it in the updated version.
>
> I've also merged your changes and made some of my own as mentioned
> above to end up with the attached v14.  I'm also attaching a delta
> patch over v13 (0001+0002) to easily see the changes I made to get
> v14.

Oops, broke the cfbot's patch-applier again.  Delta-diff reattached as txt.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Jan-19, Amit Langote wrote:

> BTW, your tweaks patch added this:
> 
> + *     *inserted_tuple is the tuple that's effectively inserted;
> + *     *inserted_destrel is the relation where it was inserted.
> + *     These are only set on success.  FIXME -- see what happens on
> the "do nothing" cases.
> 
> If by "do nothing cases" you mean INSERT ON CONFLICT ... DO NOTHING,
> then I don't think it matters, because the caller in that case would
> be ExecModifyTable() which doesn't care about inserted_tuple and
> inserted_destrel.

No, I meant a FOR EACH ROW trigger that does RETURN NULL to "abort" the
insertion.  IIRC in non-partitioned cases it is possibly to break
referential integrity by using those.  What I was wondering is whether
you can make this new code crash.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"I can see support will not be a problem.  10 out of 10."    (Simon Wittber)
      (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Wed, Jan 19, 2022 at 6:26 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jan-19, Amit Langote wrote:
> > BTW, your tweaks patch added this:
> >
> > + *     *inserted_tuple is the tuple that's effectively inserted;
> > + *     *inserted_destrel is the relation where it was inserted.
> > + *     These are only set on success.  FIXME -- see what happens on
> > the "do nothing" cases.
> >
> > If by "do nothing cases" you mean INSERT ON CONFLICT ... DO NOTHING,
> > then I don't think it matters, because the caller in that case would
> > be ExecModifyTable() which doesn't care about inserted_tuple and
> > inserted_destrel.
>
> No, I meant a FOR EACH ROW trigger that does RETURN NULL to "abort" the
> insertion.

Ah, gotcha.

>  IIRC in non-partitioned cases it is possibly to break
> referential integrity by using those.  What I was wondering is whether
> you can make this new code crash.

insert_destrel would be left set to NULL, which means
ExecCrossPartitionUpdateForeignKey() won't get called, because:

             * NULL insert_destrel means that the move failed to occur, that
             * is, the update failed, so no need to anything in that case.
             */
            if (insert_destrel &&
                resultRelInfo->ri_TrigDesc &&
                resultRelInfo->ri_TrigDesc->trig_update_after_row)
                ExecCrossPartitionUpdateForeignKey(mtstate, estate,

Moreover, trigger documentation warns of a "possibility of surprising
outcomes" if BR triggers are present in partitions that are chosen as
destinations of cross-partition updates:

"Then all row-level BEFORE INSERT triggers are fired on the
destination partition. The possibility of surprising outcomes should
be considered when all these triggers affect the row being moved."

I suppose the new code shouldn't need to take special care in such cases either.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
I rebased this patch; v15 attached.  Other than fixing the (very large)
conflicts due to nodeModifyTable.c rework, the most important change is
moving GetAncestorResultRels into execMain.c and renaming it to have the
"Exec-" prefix.  The reason is that what this code is doing is affect
struct ResultRelInfo, which is owned by execMain.c, so it seemed bogus
to do that in nodeModifyTable.c and then let execMain.c's
ExecCloseResultRelations() do the cleanup.  I added a little commentary
in the latter routine too.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"        (Andrew Morton)

Attachment

Re: a misbehavior of partition row movement (?)

From
Zhihong Yu
Date:


On Fri, Mar 18, 2022 at 9:38 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch; v15 attached.  Other than fixing the (very large)
conflicts due to nodeModifyTable.c rework, the most important change is
moving GetAncestorResultRels into execMain.c and renaming it to have the
"Exec-" prefix.  The reason is that what this code is doing is affect
struct ResultRelInfo, which is owned by execMain.c, so it seemed bogus
to do that in nodeModifyTable.c and then let execMain.c's
ExecCloseResultRelations() do the cleanup.  I added a little commentary
in the latter routine too.

--
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"        (Andrew Morton)

Hi,

+#define AFTER_TRIGGER_OFFSET           0x07FFFFFF  /* must be low-order bits */
+#define AFTER_TRIGGER_DONE             0x80000000
+#define AFTER_TRIGGER_IN_PROGRESS      0x40000000

Is it better if the order of AFTER_TRIGGER_DONE and AFTER_TRIGGER_IN_PROGRESS is swapped (for the ordinal values to be sequential) ? 

+#define AFTER_TRIGGER_CP_UPDATE            0x08000000

It would be better to add a comment for this constant, explaining what CP means (cross partition).

+   if (!partRel->rd_rel->relispartition)
+       elog(ERROR, "cannot find ancestors of a non-partition result relation");

It would be better to include the relation name in the error message.

+       /* Ignore the root ancestor, because ...?? */

Please fill out the remainder of the comment.

+               if (!trig->tgisclone &&
+                   RI_FKey_trigger_type(trig->tgfoid) == RI_TRIGGER_PK)
+               {
+                   has_noncloned_fkey = true;

The variable says fkey, but the constant is not RI_TRIGGER_FK. Maybe add a comment explaining why.

Cheers

Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Mar-18, Zhihong Yu wrote:

> +#define AFTER_TRIGGER_OFFSET           0x07FFFFFF  /* must be low-order
> bits */
> +#define AFTER_TRIGGER_DONE             0x80000000
> +#define AFTER_TRIGGER_IN_PROGRESS      0x40000000
> 
> Is it better if the order of AFTER_TRIGGER_DONE
> and AFTER_TRIGGER_IN_PROGRESS is swapped (for the ordinal values to be
> sequential) ?

They *are* sequential -- See
https://www.postgresql.org/message-id/202201172215.2tse3vjjgi2b%40alvherre.pgsql

> +#define AFTER_TRIGGER_CP_UPDATE            0x08000000
> 
> It would be better to add a comment for this constant, explaining what CP
> means (cross partition).

Sure.

> +   if (!partRel->rd_rel->relispartition)
> +       elog(ERROR, "cannot find ancestors of a non-partition result
> relation");
> 
> It would be better to include the relation name in the error message.

I don't think it matters.  We don't really expect to hit this.

> +       /* Ignore the root ancestor, because ...?? */
> 
> Please fill out the remainder of the comment.

I actually would like to know what's the rationale for this myself.
Amit?

> +               if (!trig->tgisclone &&
> +                   RI_FKey_trigger_type(trig->tgfoid) == RI_TRIGGER_PK)
> +               {
> +                   has_noncloned_fkey = true;
> 
> The variable says fkey, but the constant is not RI_TRIGGER_FK. Maybe add a
> comment explaining why.

Well, the constant is about the trigger *function*, not about any
constraint.  This code is testing "is this a noncloned trigger, and does
that trigger use an FK-related function?"  If you have a favorite
comment to include, I'm all ears.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
                http://smylers.hates-software.com/2005/09/08/1995c749.html

Attachment

Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
On Sun, Mar 20, 2022 at 5:13 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Mar-18, Zhihong Yu wrote:
>
> > +#define AFTER_TRIGGER_OFFSET           0x07FFFFFF  /* must be low-order
> > bits */
> > +#define AFTER_TRIGGER_DONE             0x80000000
> > +#define AFTER_TRIGGER_IN_PROGRESS      0x40000000
> >
> > Is it better if the order of AFTER_TRIGGER_DONE
> > and AFTER_TRIGGER_IN_PROGRESS is swapped (for the ordinal values to be
> > sequential) ?
>
> They *are* sequential -- See
> https://www.postgresql.org/message-id/202201172215.2tse3vjjgi2b%40alvherre.pgsql
>
> > +#define AFTER_TRIGGER_CP_UPDATE            0x08000000
> >
> > It would be better to add a comment for this constant, explaining what CP
> > means (cross partition).
>
> Sure.

Thanks.

> > +   if (!partRel->rd_rel->relispartition)
> > +       elog(ERROR, "cannot find ancestors of a non-partition result
> > relation");
> >
> > It would be better to include the relation name in the error message.
>
> I don't think it matters.  We don't really expect to hit this.

I tend to think maybe showing at least the OID in the error message
doesn't hurt, but maybe we don't need to.

> > +       /* Ignore the root ancestor, because ...?? */
> >
> > Please fill out the remainder of the comment.
>
> I actually would like to know what's the rationale for this myself.
> Amit?

Ah, it's just that the ancstorRels list contains *all* ancestors,
including the root one, whose triggers will actually be fired to
enforce its foreign key. The code below the line of code that this
comment is for is to check *non-root* ancestor's triggers to spot any
that look like they enforce foreign keys to flag them as
unenforceable.

I've fixed the comment as:

-       /* Ignore the root ancestor, because ...?? */
+       /* Root ancestor's triggers will be processed. */

> > +               if (!trig->tgisclone &&
> > +                   RI_FKey_trigger_type(trig->tgfoid) == RI_TRIGGER_PK)
> > +               {
> > +                   has_noncloned_fkey = true;
> >
> > The variable says fkey, but the constant is not RI_TRIGGER_FK. Maybe add a
> > comment explaining why.
>
> Well, the constant is about the trigger *function*, not about any
> constraint.  This code is testing "is this a noncloned trigger, and does
> that trigger use an FK-related function?"  If you have a favorite
> comment to include, I'm all ears.

A description of what we're looking for with this code is in the
comment above the loop:

    /*
     * For any foreign keys that point directly into a non-root ancestors of
     * the source partition,...

So finding triggers in those non-root ancestors whose function is
RI_TRIGGER_PK tells us that those relations have foreign keys pointing
into it or that it is the PK table in that relationship.  Other than
the comment, the code itself seems self-documenting with regard to
what's being done (given the function/macro/variable naming), so maybe
we're better off without additional commentary here.

I've attached a delta patch on v16 for the above comment and a couple
of other changes.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: a misbehavior of partition row movement (?)

From
Alvaro Herrera
Date:
On 2022-Mar-20, Amit Langote wrote:

> On Sun, Mar 20, 2022 at 5:13 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2022-Mar-18, Zhihong Yu wrote:

> > > +   if (!partRel->rd_rel->relispartition)
> > > +       elog(ERROR, "cannot find ancestors of a non-partition result
> > > relation");
> > >
> > > It would be better to include the relation name in the error message.
> >
> > I don't think it matters.  We don't really expect to hit this.
> 
> I tend to think maybe showing at least the OID in the error message
> doesn't hurt, but maybe we don't need to.

Since we don't even know of a situation in which this error message
would be raised, I'm hardly bothered by failing to print the OID.  If
any users complain, we can add more detail.

> I've fixed the comment as:
> 
> -       /* Ignore the root ancestor, because ...?? */
> +       /* Root ancestor's triggers will be processed. */

Okay, included that.

> A description of what we're looking for with this code is in the
> comment above the loop:
> 
>     /*
>      * For any foreign keys that point directly into a non-root ancestors of
>      * the source partition,...
> 
> So finding triggers in those non-root ancestors whose function is
> RI_TRIGGER_PK tells us that those relations have foreign keys pointing
> into it or that it is the PK table in that relationship.  Other than
> the comment, the code itself seems self-documenting with regard to
> what's being done (given the function/macro/variable naming), so maybe
> we're better off without additional commentary here.

Yeah, WFM.

> I've attached a delta patch on v16 for the above comment and a couple
> of other changes.

Merged that in, and pushed.  I made a couple of wording changes in
comments here and there as well.

I lament the fact that this fix is not going to hit Postgres 12-14, but
ratio of effort to reward seems a bit too high.  I think we could
backpatch the two involved commits if someone is motivated enough to
verify everything and come up with solutions for the necessary ABI
changes.

Thank you, Amit, for your perseverance in getting this bug fixed!  

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"



Re: a misbehavior of partition row movement (?)

From
Amit Langote
Date:
Hi Alvaro,

On Mon, Mar 21, 2022 at 2:58 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Mar-20, Amit Langote wrote:
> > On Sun, Mar 20, 2022 at 5:13 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > On 2022-Mar-18, Zhihong Yu wrote:
>
> > > > +   if (!partRel->rd_rel->relispartition)
> > > > +       elog(ERROR, "cannot find ancestors of a non-partition result
> > > > relation");
> > > >
> > > > It would be better to include the relation name in the error message.
> > >
> > > I don't think it matters.  We don't really expect to hit this.
> >
> > I tend to think maybe showing at least the OID in the error message
> > doesn't hurt, but maybe we don't need to.
>
> Since we don't even know of a situation in which this error message
> would be raised, I'm hardly bothered by failing to print the OID.  If
> any users complain, we can add more detail.

Sure.

> I lament the fact that this fix is not going to hit Postgres 12-14, but
> ratio of effort to reward seems a bit too high.  I think we could
> backpatch the two involved commits if someone is motivated enough to
> verify everything and come up with solutions for the necessary ABI
> changes.
>
> Thank you, Amit, for your perseverance in getting this bug fixed!

Thanks a lot for taking the time to review and commit.


--
Amit Langote
EDB: http://www.enterprisedb.com